Thursday 3 May 2012

Extracting Font metadata from MS Excel 2007 and MS Word 2007 files


Ever wondered how to programmatically extract a list of fonts from MS Excel 2007 or Word 2007?
Me neither ... However, thanks to a Corey Harrell Tweet I read about a case where documents could be proved false based upon the fonts used. That is, the forgers used a font which did not yet exist at the purported creation time. Corey also helpfully provided links that described which fonts are installed with the various MS Office versions (MS Office 2010 , Office 2007 & other MS Office versions).

My first stop was Exiftool (on SIFT v2.12) which extracted plenty of metadata but unfortunately not font information.
So my next stop was a Perl script. But to do this requires some background knowledge of how Excel 2007 and Word 2007 files are structured.

Background

Office 2007 uses a zipped series of XML files to construct the various file types.
These file structures are detailed in the Office Open XML Format specification.
Some helpful summaries are available here and here.
Some Word 2007 specific information is available here.

Another resource is "Digital Forensics with Open Source Tools" by Altheide & Carvey pp.201-203.

The 1 minute briefing is:
- Office Open XML files can be unzipped into a consistent file structure.
- At the top level, there is a "_rels" folder, "docProps" folder, "[ContentTypes].xml" file and an application specific folder (eg "word" or "xl"). Most sub-folders will also contain a "_rels" folder (containing various .xml.rels files) which are then used to construct relationships between the various folders/files (via an XML schema).
- Document metadata such as Subject, Creator are located in "docProps/core.xml". Additional application specific metadata (eg word count, template used) is located in "docProps/app.xml".
- "[ContentTypes].xml" lists all files (aka "document parts") which make up the ZIP archive.
- Within the application specific folder, there is a "_rels" folder and other various folders and XML files which contain the actual user content (eg headers, footers, footnotes, graphics, comments, Word document text, Excel worksheet contents).
- Any media inserted into a document has it's original filename changed to a generic filename (eg "image1.png") but these files can still retain their internal metadata (eg EXIF data).

After unzipping some sample .xlsx and .docx files, I found a list of fonts in the "xl/styles.xml" and "word/fontTable.xml" files respectively. Powerpoint 2007 .pptx files seem to be different - I could not find a central non-theme related file that lists all fonts used :(

XML documents look similar to HTML. Basically, an "element" (or "node") is considered the smallest building block of an XML schema. An element is delineated by a beginning "tag" and an end "tag".  Elements can also contain "attributes" which are indicated by the use of "=" within a "tag".

It'll probably help if I list an (edited) "word/fontTable.xml" excerpt here:

<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"> ... </w:font></w:fonts>

So we can see that there is a "w:fonts" element which in turn contains a "w:font" element. The"w:font" element has an "attribute" called "w:name" which has the value "Calibri" (which is what we're interested in).

So now we know WHAT we want to read, we need a way of programmatically reading the XML.
I found this helpful nugget of code on the StackOverflow forum.

Enter ... CPAN's XML::XPath!

The Code

#CODE BEGINS ON NEXT LINE
#!/usr/bin/perl -w

# MS Office 2010 fonts
# http://support.microsoft.com/kb/2121313

# MS Office 2k7 fonts
# http://support.microsoft.com/kb/924623

# MS Office2k3, 2k, 97 fonts
# http://support.microsoft.com/kb/837463


use strict;

use Getopt::Long;
use XML::XPath;
#use XML::XPath::XMLParser;

my $version = "docx-font-extractor.pl v2012.04.29";
my $help = 0; # help flag
my $isdocx = 0;
my $isxlsx = 0;
my $fontfile = "";

# TODO my @OFFICE2007_FONTS = ();
# TODO my @OFFICE2010_FONTS = ();


GetOptions('help|h' => \$help,
    'd' => \$isdocx,
    'x' => \$isxlsx,
    'f=s' => \$fontfile);

if ($help || $fontfile eq "" || ($isdocx eq 0 and $isxlsx eq 0) || ($isdocx eq 1 and $isxlsx eq 1) )
{
    print("\n$version\n");
    print("Perl script to list fonts used in an MS Office .docx or .xlsx file\n");
    print("Assumes .docx or .xlsx has already been unzipped to a local directory\n\n");
    print("Example: docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml\n");
    print("Example: docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml\n");
    exit;
}

my $xpath = XML::XPath->new(filename => $fontfile);
my $nodeset;
my $xmlfontfield;

if ($isdocx)
{
    $nodeset = $xpath->find("/w:fonts/w:font");
    $xmlfontfield = "w:name";
}
elsif ($isxlsx)
{
    $nodeset = $xpath->find("/styleSheet/fonts/font/name");
    $xmlfontfield = "val";
}

print "Found ".$nodeset->size." results\n";

foreach my $node ($nodeset->get_nodelist)
{
    my $fontname = $node->getAttribute($xmlfontfield);
    print "Found font = $fontname\n";

    # TODO Lookup $fontname in list of stored Office fonts

    # TODO Print "The ... font is installed on MS Office ..."

}

#END CODE


Code Summary

We start off with a "GetOptions" and Help section as usual.

Then we create a new "XML::XPath" object passing in the user specified filename (ie "$fontfile") as the argument.

If the user has specified the .docx flag ("-d"), we get our XML::XPath object ($xpath) to "find" the MS Word specific font element ("/w:fonts/w:font"). Note: the "/"s are used to seperate the element names. We then set the "$xmlfontfield" variable to the attribute name we're interested in (ie "w:name"). We'll use this later.

Similarly, if the user has specified the .xlsx flag ("-x"), we get our XML::XPath object ($xpath) to "find" the MS Excel specific font element ("/styleSheet/fonts/font/name") and then set "$xmlfontfield" to the attribute name we're interested in (ie "val").

Our various calls to "$xpath->find" should return results in the form of an XML::XPath::NodeSet object. We can then call XML::XPath::NodeSet's "get_nodelist" to return a list of search results. Next we can  iterate through each "node" and print out the relevant "attribute" value by calling the XML::XPath::Node::Element's "getAttribute" function with the "$xmlfontfield" variable we set previously.

Testing

After unzipping example .xlsx and .docx files to their respective "/home/sansforensics/" sub-folders (eg "unzip /cases/excel2k7-Book1.xlsx -d /home/sansforensics/excelbk1/") we now run our Perl script.

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml
Found 4 results
Found font = Calibri
Found font = Calibri
Found font = Calibri
Found font = Arial Black
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "xl/styles.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><fonts count="4"><font><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><u/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><sz val="11"/><color theme="1"/><name val="Arial Black"/><family val="2"/></font></fonts><fills count="2"><fill><patternFill patternType="none"/></fill><fill><patternFill patternType="gray125"/></fill></fills><borders count="1"><border><left/><right/><top/><bottom/><diagonal/></border></borders><cellStyleXfs count="1"><xf numFmtId="0" fontId="0" fillId="0" borderId="0"/></cellStyleXfs><cellXfs count="5"><xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/><xf numFmtId="14" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/><xf numFmtId="0" fontId="1" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="3" fillId="0" borderId="0" xfId="0" applyFont="1"/></cellXfs><cellStyles count="1"><cellStyle name="Normal" xfId="0" builtinId="0"/></cellStyles><dxfs count="0"/><tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/></styleSheet>


The reason for the 3 Calibri declarations appears to be due to the use of "normal", "Bold" and "Bold Underlined" versions of the Calibri font. The bold Calibri font is declared using a "<b/>" following the second "<font>" element tag. Similarly, I suspect "<b/><u/>" represents "Bold Underlined". Unfortunately, Word 2007 does not seem to implement the same XML mechanisms for denoting "Bold" and "Underline" fonts as Excel 2007.

Here's our script's results with Word 2007:

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml
Found 3 results
Found font = Calibri
Found font = Times New Roman
Found font = Cambria
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "word/fontTable.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"><w:panose1 w:val="020F0502020204030204"/><w:charset w:val="00"/><w:family w:val="swiss"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000207B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font><w:font w:name="Times New Roman"><w:panose1 w:val="02020603050405020304"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="20002A87" w:usb1="00000000" w:usb2="00000000" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/></w:font><w:font w:name="Cambria"><w:panose1 w:val="02040503050406030204"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000004B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font></w:fonts>


Note: Word's completely different XML schema compared to Excel's schema.

Conclusion

We have successfully written a "docx-font-extractor.pl" Perl script to extract the generic font names used from (unzipped) MS Office 2007 .docx and .xlsx files.

Whilst the extracted font names generally correspond to Microsoft's list of installed fonts - they do not match exactly. For example, the MS Office 2007 font table lists "Calibri (TrueType)", "Calibri Bold (TrueType)", "Calibri Italic (TrueType)", "Calibri Bold Italic (TrueType)" and our script can only report the attribute name ie "Calibri".
If we had a 1:1 mapping, then it would be simple enough to declare an array list of fonts for each MS Office release and then search them for each font found (see TODO comments in code).
Unfortunately, it looks like there's more work required to extract any further font metadata from each of the differing Excel/Word font mechanisms. If only MS had made their font attributes consistent throughout Office *sigh*. And having the full font names in the XML would have been nice too (eg "Calibri Bold (TrueType)") *double sigh*. Anyhoo, that's where I'll leave things (for now) ... I'll finish up by wishing you *Sigh O Nara* ;)