Wednesday 23 May 2012

A Perl script plays Matchmaker with ExifTool and SQLite


Imagine their cute little offspring!
- Query-able metadata = The ability to sort/query by several metadata fields (not just time).
- Metadata from a well-established/tested source library (ExifTool)
- SQLite cross platform compatibility
- And all available for the one low price on SIFT (free!)

Being Perl, it won't be as fast as a compiled language like C/C++.
Also, at this time only .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf and .jpg files are supported.
As a result, I'm pretty sure you wouldn't want to blindly point it at "C:\" as it could take a while/use a lot of memory.
Instead, say you found a directory of interest (eg a suspect's "naughty" directory). You could extract the metadata to the database, take a hash of the database and then send it on to another analyst for further analysis/reporting. Or you could analyse it yourself using standard SQL queries from either the "sqlite3" client or the Firefox "SQLite Manager" plugin.

As far as I know, this capability does not currently exist (at least for open source) so I think this script could prove handy for several different types of investigations. For example: fraud, e-discovery processing/culling, processing jpgs (eg exploitation).

Having a warped sense of humour, I'm calling this extraction tool "SquirrelGripper".
In Australia, to grab a bloke by his "nuts" (accidentally or on purpose or sometimes both) is also known as "the Squirrel Grip". This has been known to happen in the heat of battle/whilst tackling an opponent during a Rugby or Aussie Rules match. Thankfully, I have never had this done to me - I'd like to think I was just too quick but maybe they just couldn't find them LOL.
The idea behind the name is that "SquirrelGripper" (aka "SG") will lead analysts to the low hanging fruit right at the crux of the matter ;)
In polite company (eg client briefing), one could say that "SquirrelGripper" finds all the little nuts/nuggets of information.
By running this script an analyst will hopefully save time. Rather than having to launch the ExifTool exe multiple times for different files/metadata, they can launch this script once and then perform SQL queries for retrieving/presenting their metadata of interest.

Whilst developing this tool, I contacted some more experienced DFIR'ers for feedback. So I'd just like to say a big (alphabetical) "Thank you" to Brad Garnett, Cindy Murphy, Corey Harrell, Gerald Combs, "Girl, Unallocated" and Ken Pryor. Sorry for clogging your inboxes with my semi-organised/semi-chaotic thoughts. Having an audience for my email updates provided both structure and motivation.

An extra bunch of thanks to Corey Harrell (does this guy ever rest?) who was both my testing "guinea pig" and who also made some excellent suggestions. Most, if not all of his suggestions were incorporated into this latest version of SG. I can't wait to see/read how he will use it in his investigations.

Enough already! Let's get started ...

The Design

As SquirrelGripper has grown to over 1000 lines, I have decided to provide the code via my new Google Code project ("cheeky4n6monkey") and just give a general description in this blog post. Like all my scripts, it is provided "as is" and whilst I have done my best, there may still be bugs in the code. As always, perform your own validation testing before using it live.

We'll start by describing how SquirrelGripper will store metadata.
Currently, there are 9 tables in the created database - FileIndex, XLSXFiles, XLSFiles, DOCXFiles, DOCFiles, PPTXFiles, PPTFiles, PDFFiles, JPEGFiles.

FileIndex is the main index table and contains the following fields (a schema sketch follows this list):
-"AbsFileName" (primary key) = Absolute path and filename. Used as a common key to FileIndex and other file specific tables.
-"DateTimeMetaAdded" = Date and time SquirrelGripper extracted the metadata for a file.
-"FileType" = ExifTool field indicating file type. We can use this to determine which table should contain the remaining file metadata.
-"CaseRef" = Mandatory User specified case tag string.
-"UserTag" = Optional User specified tag string. Can be used for labelling directories/files for a particular SG launch.

The 8 other tables will be used to record file type specific metadata (eg "Slides" in PPTXFiles = Number of Slides for a .pptx file). I decided to use separate tables for each file type because there are simply too many fields for a single table (over 100 in total).

Most of the metadata fields are stored as TEXT (strings) in the database. Some obvious integer fields are stored as INTs (eg Words, Slides, FileSize). GPS co-ordinates from tagged JPEGs are stored in both TEXT and REAL (decimal) format. If in doubt (eg ExifToolVersion), I made the script store the data as TEXT.
Dates are stored as TEXT in the database and I have tested that they can be processed by SQLite functions. For example, "julianday(CreateDate)" works OK. I have also been able to retrieve/sort records by date.
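For instance, a date-bounded query along these lines should work (illustrative only - it assumes the 'YYYY-MM-DD HH:MM:SS' date format shown in the query examples later in this post):

SELECT AbsFileName, CreateDate FROM DOCXFiles WHERE julianday(CreateDate) > julianday('2012-01-01 00:00:00') ORDER BY CreateDate;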

In the interests of everyone's sanity (including my own) I won't detail the rest of the schema. You can see it for yourself in the code by searching for the "CREATE TABLE" strings.

The code is structured quite similar to "exif2map.pl". We are using the File::Find package to recursively search for the handled filetypes and then we call Image::ExifTool's "GetFoundTags" and "GetInfo" functions to extract the metadata. Depending on the extracted "FileType" tag, we call the appropriate handling function (eg "ProcessPPT"). These handling functions will read selected fields and if the field is defined, the function will insert/replace the metadata into the user specified SQLite database. If the metadata does not exist, a "Not Present" string will be inserted into the table instead.
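
Here's a condensed sketch of that skeleton (not the full script - the real handling functions also do the DBI inserts described above, and the extension filter below is my simplification):

use strict;
use warnings;
use File::Find;
use Image::ExifTool;

my $exiftool = Image::ExifTool->new();

sub ProcessFile
{
    my $file = $File::Find::name;
    # Simplified filter for the supported extensions
    return unless -f $file and $file =~ /\.(docx?|xlsx?|pptx?|pdf|jpg)$/i;
    $exiftool->ExtractInfo($file) or return;
    my @tags = $exiftool->GetFoundTags('File'); # tag keys in file order
    my $info = $exiftool->GetInfo(@tags);       # hash ref of tag => value
    my $type = $exiftool->GetValue('FileType'); # eg "DOCX", "PDF", "JPEG"
    print "Now processing $file (FileType = $type)\n";
    # ... dispatch on $type eg ProcessPPT($file, $info) ...
}

find(\&ProcessFile, "/home/sansforensics/squirrel-testdata/subdir1");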

Selections of metadata fields were made by estimating which of the fields would be most likely to hold information of forensic interest.
For example, MS Office 2007+ files have a variable list of "Zip" fields which I thought was of limited value so the script doesn't store these. However, the main fields such as "FileName", "CreateDate", "FileModifyDate" etc. are all included.
Having said that, if you think SquirrelGripper should include a particular field that I've left out, please let me know.

Installing SquirrelGripper

The Image::ExifTool Perl package is already installed on SIFT. We have previously used it in our "exif2map.pl" script. However, due to some internal label changes in ExifTool, we need to grab the latest version (v8.90) for our script. We can do this on SIFT by typing: "sudo cpan Image::ExifTool".

The last few posts have also detailed our travels with Perl and SQLite. We have been using the DBI Perl package to interface with SQLite databases. So if you haven't already done it, grab the latest DBI package from CPAN by typing: "sudo cpan DBI".

Next, you can download/unzip/copy "squirrelgripper.pl" (from here) to "/usr/local/bin" and make it executable (by typing "sudo chmod a+x /usr/local/bin/squirrelgripper.pl").

Now you should be ready to run SquirrelGripper on SIFT in all its glory.

To run on Windows, install ActiveState Perl and use the Perl Package Manager to download the ExifTool package (v8.90). DBI should already be installed. Next, copy the "squirrelgripper.pl" script to the directory of your choice.
You should now be able to run SG at the command prompt by typing something like:
"perl c:\squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir c:\squirrel-testdata\subdir1"
See the next section for what each of the arguments means.


Running SquirrelGripper

For this test scenario, I have various .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf files in the "/home/sansforensics/squirrel-testdata/subdir1" directory.
I have also copied various .jpg files to the "/home/sansforensics/squirrel-testdata/subdir2" directory.

It is assumed that a new database will be created for each case. However, the same database can also be used for multiple iterations of the script. Just FYI - you can get a help/usage message by typing "squirrelgripper.pl -h".

The script recursively searches through sub-directories so please ensure you've pointed it at the right level before launching. It is also possible to mark different sub-directories with different case tags. eg Launch the script with one directory using the case tag "2012-04-18-caseA-companyA", then launch it a second time pointing to another directory using the case tag "2012-04-18-caseA-companyB". SG can also handle multiple -dir arguments in case you need to extract data from more than one directory (eg "-dir naughty/pics -dir naughty/docs"). If a "-tag" argument is also specified, it will apply to files from both directories.

The first example uses "-newdb" to create the "nutz2u.sqlite" database in the current directory. It also tags all "subdir1" files with the "fraud-docs" user tag (you can see the "UserTag" value in the "FileIndex" table). Currently, the "-db", "-case" and "-dir" arguments are mandatory.
Note: the -dir directory can be an absolute path or a relative one.

squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir /home/sansforensics/squirrel-testdata/subdir1/

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir1/ is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir1/

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/word2k7.docx
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into DOCXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into DOCFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into PPTFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into PPTXFiles table
sansforensics@SIFT-Workstation:~$



The second call assumes the "nutz2u.sqlite" database already exists and tags all "subdir2" files with the "fraud-pics" tag.

squirrelgripper.pl -db nutz2u.sqlite -case caseABC -tag fraud-pics -dir /home/sansforensics/squirrel-testdata/subdir2

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir2 is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir2

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Lat/Long data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg
lat = 36.1147630001389, long = -115.172811
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into JPEGFiles table
sansforensics@SIFT-Workstation:~$


SquirrelGripper has also been tested with ActiveState Perl v5.14.2 on Win XP (by myself) and with ActiveState Perl v5.12 on Windows 7 (32/64 bit) by Corey Harrell. Please note however, that the majority of my testing was done on SIFT v2.12 with Perl 5.10.0.


Once the metadata has been extracted to the SQLite database we can use SQL queries to find specific files of interest. This can be done via the "sqlite3" client and/or more easily via the Firefox "SQLite Manager" plugin.


Extracting SquirrelGripper results

So now we have our SQLite database chock full of metadata - how can we search it?
We need some basic SQLite commands which we can learn here.
W3Schools and Wikipedia also have some useful general information on SQL queries.
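
If you prefer the command line, the same queries can be run with the "sqlite3" client (already on SIFT). For example, something like this should list what's in the main index table (the -header and -column flags just tidy the output):

sqlite3 -header -column nutz2u.sqlite "SELECT AbsFileName, FileType FROM FileIndex;"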

For our first example query we will be finding pictures from a particular camera model and ordering the results by date/time of the original.
Building upon the previous section, we have opened the "nutz2u.sqlite" database by using the Firefox "SQLite Manager" plugin on SIFT.
To do this - open Firefox on SIFT, under the "Tools" menu, select "SQLite Manager" to launch.
Then under the "Database" menu, select "Connect Database", browse to the user specified database (eg "nutz2u.sqlite") and press the "Open" button. Click on the "FileIndex" table tree item on the left hand side and then look under the "Browse & Search" tab. You should now see something like:


nutz2u.sqlite's FileIndex Table

We can see that we've got 12 files of all the different supported file types (see "File Type" column in pic). We can also see the absolute path filename ("AbsFileName"), the Date/Time the data was extracted ("DateTimeMetaAdded"), the case tag field ("CaseRef") and the user tag field ("UserTag").
If we now click on the "JPEGFiles" table tree item and look under the "Browse & Search" tab, we can see which jpeg files had their metadata extracted:

nutz2u.sqlite's JPEGFiles Table

Note: For the file specific tables, you will probably have to use the scroll bar to see all of the fields. Not shown in the pic above are the "Model" and "DateTimeOriginal" fields. We will be using these fields in our SQL query.
To execute a query against the JPEGFiles table, we click on the "Execute SQL" tab and then enter in the following:

SELECT * FROM JPEGFiles WHERE Model='Canon PowerShot SX130 IS' ORDER BY DateTimeOriginal;


We then press the "Run SQL" button and we see that we have found two jpegs that meet our search criteria:

nutz2u.sqlite's SQL Search Results

Closer inspection of the "AbsFileName" fields shows the filenames are "/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg" and "/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg". So given a bunch of .jpgs, we have quickly found multiple pictures that were taken by the same camera model. Pretty cool, eh?

Similarly, we can run other queries against the database such as:

Finding .docs by same author and sorting by page count:
SELECT * FROM DOCFiles WHERE Author='Joe Friday' ORDER BY PageCount;

Sorting Powerpoint files by Revision Number:
SELECT * FROM PPTFiles ORDER BY RevisionNumber;

Finding all "fraud-pics" user tagged .JPEGs:
SELECT * FROM JPEGFiles, FileIndex WHERE JPEGFiles.AbsFileName=FileIndex.AbsFileName AND FileIndex.UserTag='fraud-pics';

Finding the LastModified time and the user responsible (for .xlsx files) BEFORE a given date/time:
SELECT LastModifiedBy, ModifyDate from XLSXFiles where DATETIME(ModifyDate) < DATETIME('2012-05-01 06:36:54')
Note: We are using the SQLite DATETIME function to convert our string date into a format we can perform comparisons with.

Finding the Filename and Keywords (for .pptx files) where the Keywords contains "dea". This will find keywords such as "death", "deal", "idea". The % represents a wildcard:
SELECT FileName, Keywords from PPTXFiles where Keywords LIKE '%dea%'

Finding the Filename and Template names used (for .doc files) when the Template name starts with "Normal.dot":
SELECT FileName, Template FROM DOCFiles WHERE Template LIKE 'Normal.dot%'

Find Unique Company names by performing a UNION of each table's Company fields. For more information on the UNION keyword see here.
SELECT Company FROM DOCFiles UNION SELECT Company FROM DOCXFiles UNION SELECT Company FROM XLSXFiles UNION SELECT Company FROM XLSFiles UNION SELECT Company FROM PPTXFiles UNION SELECT Company FROM PPTFiles

Finding the Filename and GPS Longitude (for .jpg files) where the GPS Longitude is less than "-115.0":
SELECT FileName, GPSLongitude FROM JPEGFiles WHERE GPSLongitude < '-115.0'

There are LOTS more ways of querying the database because there's a shedload of other fields we can use. Check out the script code (search for "CREATE TABLE") or create your own example database (using SG) for more details on what metadata fields are being extracted. Where possible, I have used the same field names as the ExifTool exe prints to the command line.


Final Words

We have written a Perl script ("squirrelgripper.pl") to extract file metadata into an SQLite Database. We have also shown how to perform some basic SQL queries against that database.

In total, for the .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf, .jpeg file types there were over 100 fields to extract. Whilst I double-checked my script, it is possible I've made some errors/overlooked something. It is also possible to process other file types (eg EXE, DLL) but I thought I would wait for further feedback/specific requests before continuing.

UPDATE: Just had this thought - If/when I do add the EXE & DLL functionality, SquirrelGripper could also be used for locating known Malware/indicators of compromise as well. I've created a monster!

I suspect there will be some commonly used SQL queries across certain types of investigations. So analysts could build up a standard set of queries to run across multiple cases of the same type. That could save analysts time / help narrow down files of interest.
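
As a sketch of that idea, a canned query set could be run with Perl's DBI like so (the query strings are taken from the examples above and "nutz2u.sqlite" is from our test scenario):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=nutz2u.sqlite", "", "",
    { RaiseError => 1 });

# A "standard set" of queries (from the examples earlier in this post)
my @queries = (
    "SELECT FileName, Keywords FROM PPTXFiles WHERE Keywords LIKE '%dea%'",
    "SELECT FileName, GPSLongitude FROM JPEGFiles WHERE GPSLongitude < '-115.0'",
);

foreach my $sql (@queries)
{
    print "$sql\n";
    # Print each result row, pipe separated
    foreach my $row (@{ $dbh->selectall_arrayref($sql) })
    {
        print join(" | ", map { defined($_) ? $_ : "NULL" } @$row), "\n";
    }
    print "\n";
}

$dbh->disconnect;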

If you found this script useful and/or you have any questions, please drop me a Tweet / email / leave a comment. The more I know about if/how forensicators are using SquirrelGripper - the better I can make it. Now that the basic framework is in place, it should be pretty quick to add new fields/file types so long as they are supported by the Image::ExifTool package (v8.90).

Thursday 3 May 2012

Extracting Font metadata from MS Excel 2007 and MS Word 2007 files


Ever wondered how to programmatically extract a list of fonts from MS Excel 2007 or Word 2007?
Me neither ... However, thanks to a Corey Harrell Tweet I read about a case where documents could be proved false based upon the fonts used. That is, the forgers used a font which did not yet exist at the purported creation time. Corey also helpfully provided links that described which fonts are installed with the various MS Office versions (MS Office 2010, Office 2007 & other MS Office versions).

My first stop was ExifTool (on SIFT v2.12), which extracted plenty of metadata but unfortunately no font information.
So my next stop was a Perl script. But writing one requires some background knowledge of how Excel 2007 and Word 2007 files are structured.

Background

Office 2007 uses a zipped series of XML files to construct the various file types.
These file structures are detailed in the Office Open XML Format specification.
Some helpful summaries are available here and here.
Some Word 2007 specific information is available here.

Another resource is "Digital Forensics with Open Source Tools" by Altheide & Carvey pp.201-203.

The 1 minute briefing is:
- Office Open XML files can be unzipped into a consistent file structure (see the listing sketch after this briefing).
- At the top level, there is a "_rels" folder, a "docProps" folder, a "[Content_Types].xml" file and an application specific folder (eg "word" or "xl"). Most sub-folders will also contain a "_rels" folder (containing various .xml.rels files) which are then used to construct relationships between the various folders/files (via an XML schema).
- Document metadata such as Subject, Creator are located in "docProps/core.xml". Additional application specific metadata (eg word count, template used) is located in "docProps/app.xml".
- "[ContentTypes].xml" lists all files (aka "document parts") which make up the ZIP archive.
- Within the application specific folder, there is a "_rels" folder and other various folders and XML files which contain the actual user content (eg headers, footers, footnotes, graphics, comments, Word document text, Excel worksheet contents).
- Any media inserted into a document has its original filename changed to a generic filename (eg "image1.png") but these files can still retain their internal metadata (eg EXIF data).
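
To eyeball this structure for yourself, one quick way is to list the archive's "document parts" - here's a minimal sketch using Perl's Archive::Zip (the file path is just an example):

use strict;
use warnings;
use Archive::Zip;

# Example path - point it at any .docx / .xlsx / .pptx
my $zip = Archive::Zip->new("/cases/excel2k7-Book1.xlsx")
    or die "Cannot read archive\n";

# Prints each document part eg "[Content_Types].xml", "docProps/core.xml"
print "$_\n" for $zip->memberNames();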

After unzipping some sample .xlsx and .docx files, I found a list of fonts in the "xl/styles.xml" and "word/fontTable.xml" files respectively. Powerpoint 2007 .pptx files seem to be different - I could not find a central non-theme related file that lists all fonts used :(

XML documents look similar to HTML. Basically, an "element" (or "node") is considered the smallest building block of an XML schema. An element is delineated by a beginning "tag" and an end "tag".  Elements can also contain "attributes" which are indicated by the use of "=" within a "tag".

It'll probably help if I list an (edited) "word/fontTable.xml" excerpt here:

<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"> ... </w:font></w:fonts>

So we can see that there is a "w:fonts" element which in turn contains a "w:font" element. The "w:font" element has an "attribute" called "w:name" which has the value "Calibri" (which is what we're interested in).

So now that we know WHAT we want to read, we need a way of programmatically reading the XML.
I found this helpful nugget of code on the StackOverflow forum.

Enter ... CPAN's XML::XPath!

The Code

#CODE BEGINS ON NEXT LINE
#!/usr/bin/perl -w

# MS Office 2010 fonts
# http://support.microsoft.com/kb/2121313

# MS Office 2k7 fonts
# http://support.microsoft.com/kb/924623

# MS Office2k3, 2k, 97 fonts
# http://support.microsoft.com/kb/837463


use strict;

use Getopt::Long;
use XML::XPath;
#use XML::XPath::XMLParser;

my $version = "docx-font-extractor.pl v2012.04.29";
my $help = 0; # help flag
my $isdocx = 0;
my $isxlsx = 0;
my $fontfile = "";

# TODO my @OFFICE2007_FONTS = ();
# TODO my @OFFICE2010_FONTS = ();


GetOptions('help|h' => \$help,
    'd' => \$isdocx,
    'x' => \$isxlsx,
    'f=s' => \$fontfile);

# Require a font file plus exactly one of the -d / -x flags
if ($help || $fontfile eq "" || ($isdocx == 0 and $isxlsx == 0) || ($isdocx == 1 and $isxlsx == 1) )
{
    print("\n$version\n");
    print("Perl script to list fonts used in an MS Office .docx or .xlsx file\n");
    print("Assumes .docx or .xlsx has already been unzipped to a local directory\n\n");
    print("Example: docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml\n");
    print("Example: docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml\n");
    exit;
}

my $xpath = XML::XPath->new(filename => $fontfile);
my $nodeset;
my $xmlfontfield;

if ($isdocx)
{
    $nodeset = $xpath->find("/w:fonts/w:font");
    $xmlfontfield = "w:name";
}
elsif ($isxlsx)
{
    $nodeset = $xpath->find("/styleSheet/fonts/font/name");
    $xmlfontfield = "val";
}

print "Found ".$nodeset->size." results\n";

foreach my $node ($nodeset->get_nodelist)
{
    my $fontname = $node->getAttribute($xmlfontfield);
    print "Found font = $fontname\n";

    # TODO Lookup $fontname in list of stored Office fonts

    # TODO Print "The ... font is installed on MS Office ..."

}

#END CODE


Code Summary

We start off with a "GetOptions" and Help section as usual.

Then we create a new "XML::XPath" object passing in the user specified filename (ie "$fontfile") as the argument.

If the user has specified the .docx flag ("-d"), we get our XML::XPath object ($xpath) to "find" the MS Word specific font element ("/w:fonts/w:font"). Note: the "/"s are used to separate the element names. We then set the "$xmlfontfield" variable to the attribute name we're interested in (ie "w:name"). We'll use this later.

Similarly, if the user has specified the .xlsx flag ("-x"), we get our XML::XPath object ($xpath) to "find" the MS Excel specific font element ("/styleSheet/fonts/font/name") and then set "$xmlfontfield" to the attribute name we're interested in (ie "val").

Our various calls to "$xpath->find" should return results in the form of an XML::XPath::NodeSet object. We can then call XML::XPath::NodeSet's "get_nodelist" to return a list of search results. Next, we can iterate through each "node" and print out the relevant "attribute" value by calling the XML::XPath::Node::Element's "getAttribute" function with the "$xmlfontfield" variable we set previously.

Testing

After unzipping example .xlsx and .docx files to their respective "/home/sansforensics/" sub-folders (eg "unzip /cases/excel2k7-Book1.xlsx -d /home/sansforensics/excelbk1/") we now run our Perl script.

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml
Found 4 results
Found font = Calibri
Found font = Calibri
Found font = Calibri
Found font = Arial Black
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "xl/styles.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><fonts count="4"><font><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><u/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><sz val="11"/><color theme="1"/><name val="Arial Black"/><family val="2"/></font></fonts><fills count="2"><fill><patternFill patternType="none"/></fill><fill><patternFill patternType="gray125"/></fill></fills><borders count="1"><border><left/><right/><top/><bottom/><diagonal/></border></borders><cellStyleXfs count="1"><xf numFmtId="0" fontId="0" fillId="0" borderId="0"/></cellStyleXfs><cellXfs count="5"><xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/><xf numFmtId="14" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/><xf numFmtId="0" fontId="1" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="3" fillId="0" borderId="0" xfId="0" applyFont="1"/></cellXfs><cellStyles count="1"><cellStyle name="Normal" xfId="0" builtinId="0"/></cellStyles><dxfs count="0"/><tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/></styleSheet>


The 3 Calibri declarations appear to be due to the use of "normal", "Bold" and "Bold Underlined" versions of the Calibri font. The bold Calibri font is declared using a "<b/>" following the second "<font>" element tag. Similarly, I suspect "<b/><u/>" represents "Bold Underlined". Unfortunately, Word 2007 does not seem to implement the same XML mechanisms for denoting "Bold" and "Underline" fonts as Excel 2007.
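
As a quick illustration, XML::XPath can test for that "<b/>" child directly. Here's a hypothetical tweak to the Excel branch of our script (re-using the styles.xml path from the earlier example):

use strict;
use warnings;
use XML::XPath;

my $xpath = XML::XPath->new(filename => "/home/sansforensics/excelbk1/xl/styles.xml");

foreach my $font ($xpath->find('/styleSheet/fonts/font')->get_nodelist)
{
    # Grab the font name attribute, then check for a <b/> child element
    my $name = $xpath->find('name/@val', $font)->string_value;
    my $bold = $xpath->exists('b', $font) ? " (bold)" : "";
    print "Found font = $name$bold\n";
}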

Here's our script's results with Word 2007:

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml
Found 3 results
Found font = Calibri
Found font = Times New Roman
Found font = Cambria
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "word/fontTable.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"><w:panose1 w:val="020F0502020204030204"/><w:charset w:val="00"/><w:family w:val="swiss"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000207B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font><w:font w:name="Times New Roman"><w:panose1 w:val="02020603050405020304"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="20002A87" w:usb1="00000000" w:usb2="00000000" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/></w:font><w:font w:name="Cambria"><w:panose1 w:val="02040503050406030204"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000004B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font></w:fonts>


Note how Word's XML schema is completely different to Excel's.

Conclusion

We have successfully written a "docx-font-extractor.pl" Perl script to extract the generic font names used in (unzipped) MS Office 2007 .docx and .xlsx files.

Whilst the extracted font names generally correspond to Microsoft's list of installed fonts - they do not match exactly. For example, the MS Office 2007 font table lists "Calibri (TrueType)", "Calibri Bold (TrueType)", "Calibri Italic (TrueType)", "Calibri Bold Italic (TrueType)" and our script can only report the attribute name ie "Calibri".
If we had a 1:1 mapping, then it would be simple enough to declare an array list of fonts for each MS Office release and then search them for each font found (see TODO comments in code).
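
For example, the TODO lookup might look something like this (a toy sketch - the font list below is partial/hypothetical, see Microsoft's KB links above for the real per-release tables):

# Partial list only - see the MS KB articles for the full Office 2007 set
my @OFFICE2007_FONTS = ("Calibri", "Cambria", "Candara", "Consolas", "Constantia", "Corbel");

my $fontname = "Calibri"; # eg a font name extracted by our script
if (grep { $_ eq $fontname } @OFFICE2007_FONTS)
{
    print "The $fontname font ships with MS Office 2007\n";
}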
Unfortunately, it looks like there's more work required to extract any further font metadata from each of the differing Excel/Word font mechanisms. If only MS had made their font attributes consistent throughout Office *sigh*. And having the full font names in the XML would have been nice too (eg "Calibri Bold (TrueType)") *double sigh*. Anyhoo, that's where I'll leave things (for now) ... I'll finish up by wishing you *Sigh O Nara* ;)