Sunday 12 July 2015

Chunky4n6Monkey!

With some substantial assistance from Boss Rob and inspired by Mission Impossible  ... Enter the Chunky4n6Monkey!

This post is targeted at those particularly interested in Python programming. If you are looking for a forensic wonder-tool post, you could be bitterly disappointed (yet again!).
Special Thanks to Rob (my boss and Hex Ninja Sensei) for kindly sharing his work which was the basis for this post.

After experiencing a few reversing/carving jobs, it seems that there's common theme.
Usually, the analyst wants to search a file for a given set of values (eg magic hex number) and then process the surrounding data accordingly. Complicating matters is that as storage media grow in size (especially in mobile devices), it is not always possible or timely to read the whole contents of a file into memory for searching. While Python does allow you to read files line by line, this is not really conducive to searching for (long) strings that cross line boundaries.

Rockin' Rob's method for handling these large files is to break them up into chunks but read slightly more than a chunk size (ie chunk size + delta). This way if a hit starts at the end of one chunk and crosses the chunk boundary, we can still find it/log it for later.
Note: The delta must be at least the same size as the largest record that is being searched for. Worst case scenario, the very last byte of the chunk contains the first byte of the search term - which means making delta as big as the largest record.

OK, first we are going to look at a theoretical chunky situation, then we will look at developing a utility (chunkymonkey.py) that can help us select one of two search algorithms and also help us to optimize our chunk size. Finally, we will implement our newly selected search algorithm and chunk size in an existing Windows Phone 8 script (wp8-sms.py) and compare it with the previous un-chunkified version. You can grab the chunkymonkey.py script (and the updated wp8-sms.py script) from my GitHub page.
Hehe, "Chunkymonkey" - That's gotta be one of my all time favourite tool names :)

So now let's take a look at a chunky example:

16 Byte Chunky Example

Here we have a file which is divided into theoretical 16 byte chunks plus some extra bytes. The search term we are after is the three bytes 0x010203. They occur three times - once entirely before a chunk boundary at offset 12, once where it overlaps a chunk boundary at offset 31 and once right at the start of a chunk boundary at offset 48.
These three conditions simulate the possible chunk boundary situations. Our new chunkymonkey.py script will read a file chunk by chunk and if the search term starts before the end of a chunk boundary, it will log the file offset for later processing. If the search hit appears after the chunk boundary we ignore it. If the search hit starts after the chunk boundary but within chunk size+delta, we also ignore it as the next round of chunk processing should also pick it up.
We are also going to evaluate a couple of different search methods to see if we can speed our chunk searches up. The first one "all_indices" relies on the string.find method for finding substrings (think of the contents of a file as one big hex string). This was *ahem* "re-used" *ahem* from a recipe listed on code.activestate.com. The second method uses a compiled regular expression pattern. For more on Python regular expressions, you can read the documentation HOWTO.

Here's the code for each:


# Find all indices of a substring in a given string (using string.find)
# From http://code.activestate.com/recipes/499314-find-all-indices-of-a-substring-in-a-given-string/
def all_indices(bigstring, substring, listindex=[], offset=0):
    i = bigstring.find(substring, offset)
    while i >= 0:
        listindex.append(i)
        i = bigstring.find(substring, i + 1)
    return listindex

# Find all indices of the "pattern" regular expression in a given string (using regex)
# Where pattern is a compiled Python re pattern object (ie the output of "re.compile")
def regsearch(bigstring, pattern, listindex=[]):
    hitsit = pattern.finditer(bigstring)
    for it in hitsit:
        # iterators only last for one shot so we capture the offsets to a list
        listindex.append(it.start())
    return listindex
For benchmarking purposes, we are going to call these two functions from sliceNsearch (for "all_indices") and sliceNsearchRE (for "regsearch").
These slice functions are going to read the specified file chunk by chunk and then call their respective search function. If the file size is less than one chunk size, the entire file will be read and searched in one go.
Once the search function returns a list of hit offsets (relative to the current chunk), these offsets will be converted to the equivalent file offsets for later processing.
For comparison, our new script will then also do a full file.read (ie no chunks) and process the resultant file string using the "all_indices" and "regsearch" functions. These wholeread functions can take a while to run so we can comment out those calls (to "wholereadRE" and "wholeread") later.

The goal is to compare the times taken when searching for hits in chunks Vs searching for hits via full file reads.
The secondary aim is to figure out which search function is quicker ie "regsearch" or "all_indices".

Here's the help text for chunkymonkey.py:


c:\Python27\python.exe chunkymonkey.py -h

Running chunkymonkey.py v2015-08-19

usage: chunkymonkey.py [-h] inputfile term chunksize delta

Helps find optimal chunk sizes when searching large binary files for a known
hex string

positional arguments:
  inputfile   File to be searched
  term        Hex Search string eg 53004d00
  chunksize   Size of each chunk (in decimal bytes)
  delta       Size of the extra read buffer (in decimal bytes)

optional arguments:
  -h, --help  show this help message and exit

Now we'll call our new script (chunkymonkey.py) with our 16 byte chunk boundary hex file pictured earlier:

c:\Python27\python.exe chunkymonkey.py 16byte-chunk-with-3byte-delta.bin 010203 16 3
Running chunkymonkey.py v2015-08-19

Search term is: 010203
Chunky sliceNsearch hits = 3, Chunky sliceNsearchRE hits = 3
Wholeread all_indices hits = 3, Wholeread regsearch hits = 3

Both the sliceNsearch and sliceNsearchRE chunky functions found the same hit offsets. To save space, I have commented out the part which prints each hit offset but rest assured that all hits listed were the same.
The wholeread (one big file.read) function calls to the "all_indices" and "regsearch" functions also found the same hits.
This proves that our new chunky functions will find the same search hits as the file.read (wholeread) functions.

So now what?
Python has an inbuilt profiling module (cProfile) which provides timing information for each function call. By using this, we can see which search method and chunk size is the most time efficient.
However, as the example bin file is not very large, let's try finding the optimum search algorithm/chunk size for a 7 GB Windows Phone 8.0 image instead. The test system is a i7 3.4-3.9 GHz with 16 GB RAM and 256 GB SSD running Win7 Pro x64 and Python 2.7.6.

Note: For Python 2, there appears to be a size limitation on (chunksize + delta). It must be less than ~2147483647.
This is probably because a Python int is implemented via a C long which is limited to 2^32 bits (ie max range is +/-2147483647). See also here for further details. Python 3 apparently does not have this limitation. So that kinda limits us to 2 GB chunk sizes at this point :'(

OK so let's try running chunkymonkey.py with a 2000000000 byte (~2GB) chunk size and 1000 byte delta size. The search term is "53004d00530074006500780074000000" ie UTF-16LE for "SMStext".

c:\Python27\python.exe -m cProfile chunkymonkey.py 7GBtestbin.bin 53004d00530074006500780074000000 2000000000 1000
Running chunkymonkey.py v2015-08-19

Search term is: 53004d00530074006500780074000000
Chunky sliceNsearch hits = 21, Chunky sliceNsearchRE hits = 21
Wholeread all_indices hits = 21, Wholeread regsearch hits = 21
         2290 function calls (2229 primitive calls) in 133.211 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 argparse.py:1023(_SubParsersAction)
        1    0.000    0.000    0.000    0.000 argparse.py:1025(_ChoicesPseudoAction)
        1    0.000    0.000    0.000    0.000 argparse.py:1100(FileType)
        1    0.000    0.000    0.000    0.000 argparse.py:112(_AttributeHolder)
        1    0.000    0.000    0.000    0.000 argparse.py:1144(Namespace)
        1    0.000    0.000    0.000    0.000 argparse.py:1151(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:1167(_ActionsContainer)
        3    0.000    0.000    0.000    0.000 argparse.py:1169(__init__)
       34    0.000    0.000    0.000    0.000 argparse.py:1221(register)
       14    0.000    0.000    0.000    0.000 argparse.py:1225(_registry_get)
        5    0.000    0.000    0.000    0.000 argparse.py:1250(add_argument)
        2    0.000    0.000    0.000    0.000 argparse.py:1297(add_argument_group)
        5    0.000    0.000    0.000    0.000 argparse.py:1307(_add_action)
        4    0.000    0.000    0.000    0.000 argparse.py:1371(_get_positional_kwargs)
        1    0.000    0.000    0.000    0.000 argparse.py:1387(_get_optional_kwargs)
        5    0.000    0.000    0.000    0.000 argparse.py:1422(_pop_action_class)
        3    0.000    0.000    0.000    0.000 argparse.py:1426(_get_handler)
        5    0.000    0.000    0.000    0.000 argparse.py:1435(_check_conflict)
        1    0.000    0.000    0.000    0.000 argparse.py:147(HelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:1471(_ArgumentGroup)
        2    0.000    0.000    0.000    0.000 argparse.py:1473(__init__)
        5    0.000    0.000    0.000    0.000 argparse.py:1495(_add_action)
        1    0.000    0.000    0.000    0.000 argparse.py:1505(_MutuallyExclusiveGroup)
        1    0.000    0.000    0.000    0.000 argparse.py:1525(ArgumentParser)
        5    0.000    0.000    0.000    0.000 argparse.py:154(__init__)
        1    0.000    0.000    0.001    0.001 argparse.py:1543(__init__)
        2    0.000    0.000    0.000    0.000 argparse.py:1589(identity)
        5    0.000    0.000    0.000    0.000 argparse.py:1667(_add_action)
        1    0.000    0.000    0.000    0.000 argparse.py:1679(_get_positional_actions)
        1    0.000    0.000    0.000    0.000 argparse.py:1687(parse_args)
        1    0.000    0.000    0.000    0.000 argparse.py:1694(parse_known_args)

        1    0.000    0.000    0.000    0.000 argparse.py:1729(_parse_known_args)
        4    0.000    0.000    0.000    0.000 argparse.py:1776(take_action)
        1    0.000    0.000    0.000    0.000 argparse.py:1874(consume_positionals)
        1    0.000    0.000    0.000    0.000 argparse.py:195(_Section)
        5    0.000    0.000    0.000    0.000 argparse.py:197(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:2026(_match_arguments_partial)
        4    0.000    0.000    0.000    0.000 argparse.py:2042(_parse_optional)
        4    0.000    0.000    0.000    0.000 argparse.py:2143(_get_nargs_pattern)
        4    0.000    0.000    0.000    0.000 argparse.py:2187(_get_values)
        4    0.000    0.000    0.000    0.000 argparse.py:2239(_get_value)
        4    0.000    0.000    0.000    0.000 argparse.py:2264(_check_value)
        5    0.000    0.000    0.000    0.000 argparse.py:2313(_get_formatter)
        5    0.000    0.000    0.000    0.000 argparse.py:555(_metavar_formatter)
        5    0.000    0.000    0.000    0.000 argparse.py:564(format)
        5    0.000    0.000    0.000    0.000 argparse.py:571(_format_args)
        1    0.001    0.001    0.002    0.002 argparse.py:62(<module>)
        1    0.000    0.000    0.000    0.000 argparse.py:627(RawDescriptionHelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:638(RawTextHelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:649(ArgumentDefaultsHelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:683(ArgumentError)
        1    0.000    0.000    0.000    0.000 argparse.py:703(ArgumentTypeError)

        1    0.000    0.000    0.000    0.000 argparse.py:712(Action)
        5    0.000    0.000    0.000    0.000 argparse.py:763(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:803(_StoreAction)
        4    0.000    0.000    0.000    0.000 argparse.py:805(__init__)
        4    0.000    0.000    0.000    0.000 argparse.py:834(__call__)
        1    0.000    0.000    0.000    0.000 argparse.py:838(_StoreConstAction)

        1    0.000    0.000    0.000    0.000 argparse.py:861(_StoreTrueAction)
        1    0.000    0.000    0.000    0.000 argparse.py:878(_StoreFalseAction)

        1    0.000    0.000    0.000    0.000 argparse.py:895(_AppendAction)
        1    0.000    0.000    0.000    0.000 argparse.py:932(_AppendConstAction)
       14    0.000    0.000    0.000    0.000 argparse.py:95(_callable)
        1    0.000    0.000    0.000    0.000 argparse.py:958(_CountAction)
        1    0.000    0.000    0.000    0.000 argparse.py:979(_HelpAction)
        1    0.000    0.000    0.000    0.000 argparse.py:981(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:998(_VersionAction)
        1    0.310    0.310   11.327   11.327 chunkymonkey.py:115(sliceNsearchRE)
        1    0.000    0.000   44.736   44.736 chunkymonkey.py:167(wholeread)
        1    0.000    0.000   47.431   47.431 chunkymonkey.py:182(wholereadRE)
        1    1.023    1.023  133.211  133.211 chunkymonkey.py:34(<module>)
        5    0.000    0.000   22.229    4.446 chunkymonkey.py:44(all_indices)
        5   16.226    3.245   16.227    3.245 chunkymonkey.py:53(regsearch)
        1    0.296    0.296   28.691   28.691 chunkymonkey.py:63(sliceNsearch)
        1    0.001    0.001    0.001    0.001 collections.py:1(<module>)
        1    0.000    0.000    0.000    0.000 collections.py:26(OrderedDict)
        1    0.000    0.000    0.000    0.000 collections.py:387(Counter)
        3    0.000    0.000    0.000    0.000 gettext.py:130(_expand_lang)
        3    0.000    0.000    0.000    0.000 gettext.py:421(find)
        3    0.000    0.000    0.000    0.000 gettext.py:461(translation)
        3    0.000    0.000    0.000    0.000 gettext.py:527(dgettext)
        3    0.000    0.000    0.000    0.000 gettext.py:565(gettext)
        1    0.000    0.000    0.000    0.000 heapq.py:31(<module>)
        1    0.000    0.000    0.000    0.000 keyword.py:11(<module>)
        3    0.000    0.000    0.000    0.000 locale.py:347(normalize)
        1    0.000    0.000    0.000    0.000 ntpath.py:122(splitdrive)
        1    0.000    0.000    0.000    0.000 ntpath.py:164(split)
        1    0.000    0.000    0.000    0.000 ntpath.py:196(basename)
        5    0.000    0.000    0.000    0.000 os.py:422(__getitem__)
       12    0.000    0.000    0.000    0.000 os.py:444(get)
        1    0.000    0.000    0.000    0.000 re.py:134(match)
       15    0.000    0.000    0.001    0.000 re.py:188(compile)
       16    0.000    0.000    0.001    0.000 re.py:226(_compile)
        4    0.000    0.000    0.000    0.000 sre_compile.py:178(_compile_charset)
        4    0.000    0.000    0.000    0.000 sre_compile.py:207(_optimize_charset)
     24/5    0.000    0.000    0.000    0.000 sre_compile.py:32(_compile)
       13    0.000    0.000    0.000    0.000 sre_compile.py:354(_simple)
        5    0.000    0.000    0.000    0.000 sre_compile.py:359(_compile_info)
       10    0.000    0.000    0.000    0.000 sre_compile.py:472(isstring)
        5    0.000    0.000    0.000    0.000 sre_compile.py:478(_code)
        5    0.000    0.000    0.001    0.000 sre_compile.py:493(compile)
       61    0.000    0.000    0.000    0.000 sre_parse.py:126(__len__)
        4    0.000    0.000    0.000    0.000 sre_parse.py:128(__delitem__)
      109    0.000    0.000    0.000    0.000 sre_parse.py:130(__getitem__)
       13    0.000    0.000    0.000    0.000 sre_parse.py:134(__setitem__)
       49    0.000    0.000    0.000    0.000 sre_parse.py:138(append)
    37/18    0.000    0.000    0.000    0.000 sre_parse.py:140(getwidth)
        5    0.000    0.000    0.000    0.000 sre_parse.py:178(__init__)
       79    0.000    0.000    0.000    0.000 sre_parse.py:182(__next)
       35    0.000    0.000    0.000    0.000 sre_parse.py:195(match)
       69    0.000    0.000    0.000    0.000 sre_parse.py:201(get)
        8    0.000    0.000    0.000    0.000 sre_parse.py:257(_escape)
      9/5    0.000    0.000    0.000    0.000 sre_parse.py:301(_parse_sub)
     10/6    0.000    0.000    0.000    0.000 sre_parse.py:379(_parse)
        5    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
        5    0.000    0.000    0.000    0.000 sre_parse.py:675(parse)
        4    0.000    0.000    0.000    0.000 sre_parse.py:72(opengroup)
        4    0.000    0.000    0.000    0.000 sre_parse.py:83(closegroup)
       24    0.000    0.000    0.000    0.000 sre_parse.py:90(__init__)
        5    0.000    0.000    0.000    0.000 {_sre.compile}
        1    0.000    0.000    0.000    0.000 {binascii.hexlify}
        1    0.000    0.000    0.000    0.000 {binascii.unhexlify}
        3    0.000    0.000    0.000    0.000 {getattr}
       26    0.000    0.000    0.000    0.000 {hasattr}
      133    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {iter}
  342/327    0.000    0.000    0.000    0.000 {len}
        2    0.000    0.000    0.000    0.000 {math.ceil}
        2    0.000    0.000    0.000    0.000 {max}
        8    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
      452    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        4    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}

        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        6    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
       56   22.229    0.397   22.229    0.397 {method 'find' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'finditer' of '_sre.SRE_Pattern' objects}
       78    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'groups' of '_sre.SRE_Match' objects}
        5    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}

        3    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'lstrip' of 'str' objects}

        3    0.000    0.000    0.000    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
        6    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
       10   93.117    9.312   93.117    9.312 {method 'read' of 'file' objects}
        8    0.000    0.000    0.000    0.000 {method 'remove' of 'list' objects}
        7    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        3    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
        8    0.000    0.000    0.000    0.000 {method 'seek' of 'file' objects}
       40    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
       42    0.000    0.000    0.000    0.000 {method 'start' of '_sre.SRE_Match' objects}
        3    0.000    0.000    0.000    0.000 {method 'translate' of 'str' objects}
       17    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
       50    0.000    0.000    0.000    0.000 {min}
        2    0.001    0.000    0.001    0.000 {nt.stat}
        4    0.005    0.001    0.005    0.001 {open}
       31    0.000    0.000    0.000    0.000 {ord}
        9    0.000    0.000    0.000    0.000 {range}
        8    0.000    0.000    0.000    0.000 {setattr}
        1    0.000    0.000    0.000    0.000 {zip}

That's a LOT of output eh? Don't worry, we're only interested in the respective lines which contain the sliceNsearch, sliceNsearchRE, wholeread and wholereadRE functions. We're also going to focus on the "cumtime" column. This is the cumulative time spent in the function (not what your twisted mind first thought eh?) and thats the figure (highlighted in red) that we will use to compare the various runs with different chunk sizes.

To save space, here's a table detailing runs with different chunk sizes (rounded to nearest second):

Function processing times by chunksize

Delta was consistently set at 1000 bytes.
From the results above, we can see that the best "cumtime" is for a chunksize of 2000000000 bytes and using the chunky sliceNsearchRE function (which calls the "regsearch" function for each chunk).
Note how the wholeread times are much larger than either the sliceNsearch or sliceNsearchRE times.
Anyhoo, that's all well and good but how much difference can it make to an actual script which also has to process the hits and not just find them?

We modified our previous Windows Phone 8 SMS script (wp-sms.py) to use 2000000000 byte chunks (with 1000 byte delta) and modified it to call the sliceNsearchRE function. We then captured the cProfile stats.

First, we ran the previous unchunkified version of wp-sms.py which yielded these times:

c:\Python27\python.exe -m cProfile wp8-sms-orig.py -f 7GBtestbin.bin -o 7GBtestop.tsv
Running wp8-sms.py v2014-10-05

Skipping hit at 0x2dace9b0 - cannot find next field after SMStext
Skipping hit at 0x2dad6000 - cannot find next field after SMStext
Skipping hit at 0x31611c30 - cannot find next field after SMStext
Skipping hit at 0x4ce99bc0 - cannot find next field after SMStext
Skipping hit at 0x4ce99c00 - cannot find next field after SMStext
Skipping hit at 0x4ce9bf7c - cannot find next field after SMStext
Skipping hit at 0x66947c30 - cannot find next field after SMStext
Skipping hit at 0x6694ebc0 - cannot find next field after SMStext
Skipping hit at 0x6694ec00 - cannot find next field after SMStext
String substitution(s) due to unrecognized/unprintable characters at 0xccf26379

Processed 21 SMStext hits


Finished writing out 12 TSV entries

         21672 function calls in 55.624 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 __init__.py:49(normalize_encoding)

        2    0.000    0.000    0.001    0.001 __init__.py:71(search_function)
        1    0.000    0.000    0.000    0.000 ascii.py:13(Codec)
        1    0.000    0.000    0.000    0.000 ascii.py:20(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 ascii.py:24(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 ascii.py:28(StreamWriter)
        1    0.000    0.000    0.000    0.000 ascii.py:31(StreamReader)
        1    0.000    0.000    0.000    0.000 ascii.py:34(StreamConverter)
        1    0.000    0.000    0.000    0.000 ascii.py:41(getregentry)
        1    0.000    0.000    0.000    0.000 ascii.py:8(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
     1288    0.004    0.000    0.008    0.000 codecs.py:424(read)
       72    0.000    0.000    0.000    0.000 codecs.py:591(reset)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
     1288    0.001    0.000    0.008    0.000 codecs.py:669(read)
       72    0.000    0.000    0.000    0.000 codecs.py:702(seek)
      106    0.000    0.000    0.000    0.000 codecs.py:708(__getattr__)
        2    0.000    0.000    0.000    0.000 codecs.py:77(__new__)
        1    0.000    0.000    0.000    0.000 codecs.py:841(open)
        1    0.000    0.000    0.000    0.000 gettext.py:130(_expand_lang)
        1    0.000    0.000    0.000    0.000 gettext.py:421(find)
        1    0.000    0.000    0.000    0.000 gettext.py:461(translation)
        1    0.000    0.000    0.000    0.000 gettext.py:527(dgettext)
        1    0.000    0.000    0.000    0.000 gettext.py:565(gettext)
        1    0.000    0.000    0.000    0.000 locale.py:347(normalize)
        3    0.000    0.000    0.000    0.000 optparse.py:1007(add_option)
        1    0.000    0.000    0.000    0.000 optparse.py:1190(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:1242(_create_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1247(_add_help_option)

        1    0.000    0.000    0.000    0.000 optparse.py:1257(_populate_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1267(_init_parsing_state)
        1    0.000    0.000    0.000    0.000 optparse.py:1276(set_usage)
        1    0.000    0.000    0.000    0.000 optparse.py:1312(_get_all_options)

        1    0.000    0.000    0.000    0.000 optparse.py:1318(get_default_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1361(_get_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1367(parse_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1406(check_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1419(_process_args)
        2    0.000    0.000    0.000    0.000 optparse.py:1516(_process_short_opts)
        1    0.000    0.000    0.000    0.000 optparse.py:200(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:224(set_parser)
        1    0.000    0.000    0.000    0.000 optparse.py:365(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:560(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:579(_check_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:588(_set_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:609(_set_attrs)
        3    0.000    0.000    0.000    0.000 optparse.py:629(_check_action)
        3    0.000    0.000    0.000    0.000 optparse.py:635(_check_type)
        3    0.000    0.000    0.000    0.000 optparse.py:665(_check_choice)
        3    0.000    0.000    0.000    0.000 optparse.py:678(_check_dest)
        3    0.000    0.000    0.000    0.000 optparse.py:693(_check_const)
        3    0.000    0.000    0.000    0.000 optparse.py:699(_check_nargs)
        3    0.000    0.000    0.000    0.000 optparse.py:708(_check_callback)
        2    0.000    0.000    0.000    0.000 optparse.py:752(takes_value)
        2    0.000    0.000    0.000    0.000 optparse.py:764(check_value)
        2    0.000    0.000    0.000    0.000 optparse.py:771(convert_value)
        2    0.000    0.000    0.000    0.000 optparse.py:778(process)
        2    0.000    0.000    0.000    0.000 optparse.py:790(take_action)
        3    0.000    0.000    0.000    0.000 optparse.py:832(isbasestring)
        1    0.000    0.000    0.000    0.000 optparse.py:837(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:932(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:943(_create_option_mappings)
        1    0.000    0.000    0.000    0.000 optparse.py:959(set_conflict_handler)
        1    0.000    0.000    0.000    0.000 optparse.py:964(set_description)
        3    0.000    0.000    0.000    0.000 optparse.py:980(_check_conflict)
        1    0.000    0.000    0.000    0.000 os.py:422(__getitem__)
        4    0.000    0.000    0.000    0.000 os.py:444(get)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:18(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:22(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:25(StreamWriter)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:28(StreamReader)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:33(getregentry)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:8(<module>)
        1    0.006    0.006   55.624   55.624 wp8-sms-orig.py:101(<module>)
       72    0.003    0.000    0.012    0.000 wp8-sms-orig.py:114(read_nullterm_unistring)
      831    0.001    0.000    0.004    0.000 wp8-sms-orig.py:148(read_filetime)

        2    0.001    0.000   22.113   11.057 wp8-sms-orig.py:174(all_indices)
       21    0.000    0.000    0.003    0.000 wp8-sms-orig.py:184(find_flag)
       12    0.001    0.000    0.004    0.000 wp8-sms-orig.py:200(find_timestamp)
       33    0.000    0.000    0.000    0.000 wp8-sms-orig.py:218(goto_next_field)
       12    0.000    0.000    0.000    0.000 wp8-sms-orig.py:489(<lambda>)
        2    0.001    0.001    0.001    0.001 {__import__}
        1    0.000    0.000    0.000    0.000 {_codecs.lookup}
     2576    0.002    0.000    0.002    0.000 {_codecs.utf_16_le_decode}
     1547    0.001    0.000    0.001    0.000 {_struct.unpack}
        2    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x000000001E29D0F0}
       38    0.002    0.000    0.002    0.000 {built-in method utcfromtimestamp}

        3    0.000    0.000    0.000    0.000 {filter}
      106    0.000    0.000    0.000    0.000 {getattr}
        4    0.000    0.000    0.000    0.000 {hasattr}
       22    0.000    0.000    0.000    0.000 {hex}
        8    0.000    0.000    0.000    0.000 {isinstance}
     3881    0.001    0.000    0.001    0.000 {len}
      639    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        3    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}

        1    0.000    0.000    0.000    0.000 {method 'copy' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      634   22.113    0.035   22.113    0.035 {method 'find' of 'str' objects}
       21    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
       38    0.000    0.000    0.000    0.000 {method 'isoformat' of 'datetime.datetime' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}

        2    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
       35    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'pop' of 'list' objects}
     4124   33.484    0.008   33.484    0.008 {method 'read' of 'file' objects}
        4    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
       13    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}

     1646    0.002    0.000    0.002    0.000 {method 'seek' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
      993    0.001    0.000    0.001    0.000 {method 'tell' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {method 'translate' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
       13    0.000    0.000    0.001    0.000 {method 'write' of 'file' objects}

        3    0.001    0.000    0.001    0.000 {open}
       46    0.000    0.000    0.000    0.000 {range}
       40    0.000    0.000    0.000    0.000 {setattr}
        1    0.000    0.000    0.000    0.000 {sorted}
     1288    0.000    0.000    0.000    0.000 {unichr}

Note: Total time was 56 seconds. 33 seconds of which were spent in the file.read function.

And after chunkification, it ran a lot quicker!

c:\Python27\python.exe -m cProfile wp8-sms.py -f 7GBtestbin.bin -o 7GBtestop-chunk.tsv
Running wp8-sms.py v2015-08-19

Skipping hit at 0x2dace9b0 - cannot find next field after SMStext
Skipping hit at 0x2dad6000 - cannot find next field after SMStext
Skipping hit at 0x31611c30 - cannot find next field after SMStext
Skipping hit at 0x4ce99bc0 - cannot find next field after SMStext
Skipping hit at 0x4ce99c00 - cannot find next field after SMStext
Skipping hit at 0x4ce9bf7c - cannot find next field after SMStext
Skipping hit at 0x66947c30 - cannot find next field after SMStext
Skipping hit at 0x6694ebc0 - cannot find next field after SMStext
Skipping hit at 0x6694ec00 - cannot find next field after SMStext
String substitution(s) due to unrecognized/unprintable characters at 0xccf26379

Processed 21 SMStext hits


Finished writing out 12 TSV entries

         22706 function calls in 37.401 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 __init__.py:49(normalize_encoding)

        2    0.000    0.000    0.003    0.001 __init__.py:71(search_function)
        1    0.000    0.000    0.000    0.000 ascii.py:13(Codec)
        1    0.000    0.000    0.000    0.000 ascii.py:20(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 ascii.py:24(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 ascii.py:28(StreamWriter)
        1    0.000    0.000    0.000    0.000 ascii.py:31(StreamReader)
        1    0.000    0.000    0.000    0.000 ascii.py:34(StreamConverter)
        1    0.000    0.000    0.000    0.000 ascii.py:41(getregentry)
        1    0.000    0.000    0.000    0.000 ascii.py:8(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
     1288    0.004    0.000    0.008    0.000 codecs.py:424(read)
       72    0.000    0.000    0.000    0.000 codecs.py:591(reset)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
     1288    0.001    0.000    0.008    0.000 codecs.py:669(read)
       72    0.000    0.000    0.000    0.000 codecs.py:702(seek)
      106    0.000    0.000    0.000    0.000 codecs.py:708(__getattr__)
        2    0.000    0.000    0.000    0.000 codecs.py:77(__new__)
        1    0.000    0.000    0.002    0.002 codecs.py:841(open)
        1    0.000    0.000    0.000    0.000 gettext.py:130(_expand_lang)
        1    0.000    0.000    0.000    0.000 gettext.py:421(find)
        1    0.000    0.000    0.000    0.000 gettext.py:461(translation)
        1    0.000    0.000    0.000    0.000 gettext.py:527(dgettext)
        1    0.000    0.000    0.000    0.000 gettext.py:565(gettext)
        1    0.000    0.000    0.000    0.000 locale.py:347(normalize)
        3    0.000    0.000    0.000    0.000 optparse.py:1007(add_option)
        1    0.000    0.000    0.000    0.000 optparse.py:1190(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:1242(_create_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1247(_add_help_option)

        1    0.000    0.000    0.000    0.000 optparse.py:1257(_populate_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1267(_init_parsing_state)
        1    0.000    0.000    0.000    0.000 optparse.py:1276(set_usage)
        1    0.000    0.000    0.000    0.000 optparse.py:1312(_get_all_options)

        1    0.000    0.000    0.000    0.000 optparse.py:1318(get_default_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1361(_get_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1367(parse_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1406(check_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1419(_process_args)
        2    0.000    0.000    0.000    0.000 optparse.py:1516(_process_short_opts)
        1    0.000    0.000    0.000    0.000 optparse.py:200(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:224(set_parser)
        1    0.000    0.000    0.000    0.000 optparse.py:365(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:560(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:579(_check_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:588(_set_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:609(_set_attrs)
        3    0.000    0.000    0.000    0.000 optparse.py:629(_check_action)
        3    0.000    0.000    0.000    0.000 optparse.py:635(_check_type)
        3    0.000    0.000    0.000    0.000 optparse.py:665(_check_choice)
        3    0.000    0.000    0.000    0.000 optparse.py:678(_check_dest)
        3    0.000    0.000    0.000    0.000 optparse.py:693(_check_const)
        3    0.000    0.000    0.000    0.000 optparse.py:699(_check_nargs)
        3    0.000    0.000    0.000    0.000 optparse.py:708(_check_callback)
        2    0.000    0.000    0.000    0.000 optparse.py:752(takes_value)
        2    0.000    0.000    0.000    0.000 optparse.py:764(check_value)
        2    0.000    0.000    0.000    0.000 optparse.py:771(convert_value)
        2    0.000    0.000    0.000    0.000 optparse.py:778(process)
        2    0.000    0.000    0.000    0.000 optparse.py:790(take_action)
        3    0.000    0.000    0.000    0.000 optparse.py:832(isbasestring)
        1    0.000    0.000    0.000    0.000 optparse.py:837(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:932(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:943(_create_option_mappings)
        1    0.000    0.000    0.000    0.000 optparse.py:959(set_conflict_handler)
        1    0.000    0.000    0.000    0.000 optparse.py:964(set_description)
        3    0.000    0.000    0.000    0.000 optparse.py:980(_check_conflict)
        1    0.000    0.000    0.000    0.000 os.py:422(__getitem__)
        4    0.000    0.000    0.000    0.000 os.py:444(get)
        2    0.000    0.000    0.000    0.000 re.py:188(compile)
        2    0.000    0.000    0.000    0.000 re.py:226(_compile)
        2    0.000    0.000    0.000    0.000 sre_compile.py:32(_compile)
        2    0.000    0.000    0.000    0.000 sre_compile.py:359(_compile_info)
        4    0.000    0.000    0.000    0.000 sre_compile.py:472(isstring)
        2    0.000    0.000    0.000    0.000 sre_compile.py:478(_code)
        2    0.000    0.000    0.000    0.000 sre_compile.py:493(compile)
       24    0.000    0.000    0.000    0.000 sre_parse.py:138(append)
        2    0.000    0.000    0.000    0.000 sre_parse.py:140(getwidth)
        2    0.000    0.000    0.000    0.000 sre_parse.py:178(__init__)
       30    0.000    0.000    0.000    0.000 sre_parse.py:182(__next)
        2    0.000    0.000    0.000    0.000 sre_parse.py:195(match)
       28    0.000    0.000    0.000    0.000 sre_parse.py:201(get)
        2    0.000    0.000    0.000    0.000 sre_parse.py:301(_parse_sub)
        2    0.000    0.000    0.000    0.000 sre_parse.py:379(_parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
        2    0.000    0.000    0.000    0.000 sre_parse.py:675(parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:90(__init__)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:18(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:22(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:25(StreamWriter)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:28(StreamReader)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:33(getregentry)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:8(<module>)
        1    0.209    0.209   37.401   37.401 wp8-sms.py:113(<module>)
       72    0.002    0.000    0.011    0.000 wp8-sms.py:131(read_nullterm_unistring)
      831    0.001    0.000    0.003    0.000 wp8-sms.py:165(read_filetime)
        8   16.194    2.024   16.194    2.024 wp8-sms.py:191(regsearch)
       21    0.000    0.000    0.000    0.000 wp8-sms.py:200(find_flag)
       12    0.001    0.000    0.004    0.000 wp8-sms.py:216(find_timestamp)
       33    0.000    0.000    0.000    0.000 wp8-sms.py:234(goto_next_field)
        2    0.697    0.348   37.168   18.584 wp8-sms.py:246(sliceNsearchRE)
       12    0.000    0.000    0.000    0.000 wp8-sms.py:548(<lambda>)
        2    0.003    0.001    0.003    0.001 {__import__}
        1    0.000    0.000    0.002    0.002 {_codecs.lookup}
     2576    0.002    0.000    0.002    0.000 {_codecs.utf_16_le_decode}
        2    0.000    0.000    0.000    0.000 {_sre.compile}
     1547    0.002    0.000    0.002    0.000 {_struct.unpack}
        2    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x000000001E29D0F0}
       38    0.000    0.000    0.000    0.000 {built-in method utcfromtimestamp}

        3    0.000    0.000    0.000    0.000 {filter}
      106    0.000    0.000    0.000    0.000 {getattr}
        4    0.000    0.000    0.000    0.000 {hasattr}
       22    0.000    0.000    0.000    0.000 {hex}
       14    0.000    0.000    0.000    0.000 {isinstance}
     3977    0.001    0.000    0.001    0.000 {len}
        2    0.000    0.000    0.000    0.000 {math.ceil}
     1382    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        3    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}

        1    0.000    0.000    0.000    0.000 {method 'copy' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        4    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {method 'fileno' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {method 'find' of 'str' objects}
        8    0.000    0.000    0.000    0.000 {method 'finditer' of '_sre.SRE_Pattern' objects}
       23    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
       38    0.000    0.000    0.000    0.000 {method 'isoformat' of 'datetime.datetime' objects}
        3    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}

        2    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
       35    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'pop' of 'list' objects}
     4131   20.281    0.005   20.281    0.005 {method 'read' of 'file' objects}
        4    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
       13    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}

     1654    0.001    0.000    0.001    0.000 {method 'seek' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
      629    0.000    0.000    0.000    0.000 {method 'start' of '_sre.SRE_Match' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
      993    0.001    0.000    0.001    0.000 {method 'tell' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {method 'translate' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
       13    0.000    0.000    0.001    0.000 {method 'write' of 'file' objects}

        4    0.000    0.000    0.000    0.000 {min}
        2    0.000    0.000    0.000    0.000 {nt.fstat}
        3    0.001    0.000    0.001    0.000 {open}
       24    0.000    0.000    0.000    0.000 {ord}
       48    0.000    0.000    0.000    0.000 {range}
       40    0.000    0.000    0.000    0.000 {setattr}
        1    0.000    0.000    0.000    0.000 {sorted}
     1288    0.000    0.000    0.000    0.000 {unichr}

Note: The number of SMS extracted from the original and chunkified versions of the wp8-sms.py script were consistent.
As a result of this additional testing, I have also adjusted the chunkified version to search further for received SMS timestamps. ie it should now better detect received timestamps.

Exciting times! It took 37 seconds - much less than the previous version's 57 seconds.
Most of the time was spent in sliceNsearchRE and not in processing the hits.
Given the same input file and search/chunk parameters (ie ~2GB chunk with 1k delta), more time was spent in sliceNsearchRE for wp8-sms.py (37 s) than for the previous chunkymonkey.py time (11 s) because sliceNsearchRE is being called twice in wp8-sms.py and only once in chunkymonkey.py. Not sure why the sliceNsearchRE's time is 3 times longer and not closer to 2 times the duration ...

The revised wp8-sms.py script was also run on a single store.vol file (18 MB) and the output results matched the previous version's output. Both scripts processed almost 6000 SMS in ~7 s.

Final Thoughts

A new tool (chunkymonkey.py) was written to help determine the optimum chunk size and search algorithm for finding a hex string in large binary files. Due to Python 2 limitations, the maximum chunk size has to be less than 2147483647 minus the delta size.
Code from the tool was re-used/applied to an existing Windows Phone 8 SMS script (wp8-sms.py) and significantly reduced the processing time.
Further reduction of processing time may be possible in the future by utilizing threads (a way of concurrently calling multiple functions) however, this could make some already hack-tacular code even more complicated.

Over the next few hours/days, I plan to update/chunkify other selected Windows Phone scripts (ie wp8-callhistory.py, wp8-contacts.py) so that they can also run quicker against whole .bin files. I won't bother creating a new post for that though as the principles are already outlined here.

UPDATE (12JUL2015):
Have now updated the "wp8-sms.py", "wp8-callhistory.py" and "wp8-contacts.py" scripts to read large files in chunks.
Updated code is now available from my Github page.

UPDATE (20AUG2015):
Fixed a bug in the latest chunkymonkey.py and chunkified wp8-sms.py, wp8-callhistory.py, wp8-contacts.py scripts.
The bug prevented the last chunk from being parsed properly. Have also updated the results in the post to reflect the revised code performance.
Warning: The offsets for Windows Phone 8.10 appear to be slightly different to those for 8.0. Consequently, the updated wp8 scripts may have issues parsing WinPhone 8.10 images.
Stay tuned for further developments regarding WinPhone 8.10 versions of the scripts ...

Thanks again to Boss Rob for sharing his work!

Now where's that sundae?!

Friday 3 July 2015

Extracting Ones BLOBs From The Clutches Of SQLite

SQLite BLOB work used to be an adventure ... Not anymore!

Did you know that SQLite databases can also hold binary data? BLOB fields can contain pictures, audio, base64 encoded data and any other binary data you care to wobble your gelatinous finger at.
Monkey's recent foray into SQLite led him to the magic of the "pragma table_info" SQLite query (which returns a table's column configuration). This means that we don't have to hard code our queries and/or know the database schema before we query a table. Consequently, given a table name, we can now query that table for any BLOBs and dump them to file for further analysis. This could come in handy when analysing a browser cache history database (for images) or a voicemail database (for recorded messages) or a contact database (for images) or any other stored binary BLOB (eg binary .plists). So it also sounds like a good fit for mobile device forensic analysis ... awww, my favorite - how did you know? :)

Special Thanks to Alex Caithness of CCLForensics for inspiring/laying the groundwork for this script idea via his 2013 post. Monkey salutes you Alex (in the nice way)!

You can download the Python script (sqlite-blob-dumper.py) from my GitHub page.

The Script

Here's the script's help text:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py -h
Running sqlite-blob-dumper.py v2015-07-03

usage: sqlite-blob-dumper.py [-h] db table outputdir

Extracts BLOB fields from a given SQLite Database

positional arguments:
  db          SQLite DB filename
  table       SQLite DB table name containing BLOB(s)
  outputdir   Output directory for storing extracted BLOBs

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$


Given an SQLite database filename, the table name containing BLOBs and an output directory, this script will query the table for any BLOB columns, process the BLOB content for file type and then dump the contents to a file in the specified output directory. The file type processing is currently limited to some common mobile device file types but it should be easy to modify for other types in the future (depending on their file signatures). All the processing really does is determine the output filename extension. All BLOBs are extracted - some just have a more user friendly file extension (vs. the default .blob extension).
And before you get on Monkey's back about the lack of service, have you seen how many different file formats there are these days?! ;)

Back to the script ... There's a bit of housekeeping stuff at the beginning of the script eg checking the database file exists, creating the output directory if required.
The real fun begins when we do the "pragma table_info" query. This returns a row entry for each column type in the specified table. Each row entry (ie column type) has the following fields:
ColumnID, Name, Type, NotNull, DefaultValue, PrimaryKey

The Type field should be set to "BLOB" for BLOBs. The Name field is set to the column name.
We can also figure out which column is being used for the Primary Key (ie unique key index for each record) by looking at the PrimaryKey value (which should be set to 1 for the Primary Key).

Now that we know which columns contain BLOBs, we can formulate a SELECT query which looks something like:
SELECT primarykeyname, columnname(s) FROM tablename ORDER BY primarykeyname;
Then we iterate through each of the returned table rows and for each BLOB column, we call the user defined "calculate_filename" function to construct an output filename using the table name, the primary key value (rowid), the BLOB column name and first several bytes of the BLOB.
From the first several bytes of each BLOB, we look for certain file signatures (.jpg, .png, .zip, .bplist, .3gp, .3g2, .amr) and name the output file's extension accordingly. If the BLOB was not one of the previously mentioned types, it is given the default file extension of ".blob".
The file naming convention is:
tablename_row_rowid_columnname.ext

Where .ext can be: .jpg, .png, .zip, .blob (default), .bplist (untested), .3gp (untested), .3g2 (untested) or .amr (untested).

For Base64 encoded BLOBs - unfortunately, there does not appear to be a reliable way of determining if a field is Base64 encoded unless you actually try to base64 decode it and the output "looks valid". Counting the Base64 encoded bytes and monitoring the characters used might find some Base64 encodings but it could also catch some strings which are not necessarily Base64 encoded. So, any Base64 BLOBs will end up with a .blob extension.

For further information on file signatures, check out Gary Kessler's huge compendium of file signatures.

Testing

On Ubuntu 14.04 LTS x64 (running Python 2.7.6), we used the Firefox SQLite Manager plugin to create 3 test databases. One had no BLOB data (testnoblob.sqlite), one had one BLOB column (testblob.sqlite) and the last had three BLOB columns (testblobs.sqlite). Due to lack of test data, we only tested the script with .jpg, .png and .zip BLOBs.
Fun fact: You can also use SQLite Manager to insert existing files as BLOBs into a table.

Here's what the database which does not contain any BLOBs (testnoblob.sqlite) looks like:



Now we run the script against the database:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testnoblob.sqlite main noblobop
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
No BLOB columns detected ... Exiting
cheeky@ubuntu:~$


Here's the contents of the "noblobop" directory ... NOTHING! Because there's no BLOBs in the database, silly!




Here's what the second database containing one BLOB column (testblob.sqlite) looks like:



Now we run the script against the database containing one BLOB column:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testblob.sqlite main blobop
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
Detecting BLOB columns = blobby
Extracting ... blobop/main_row_1_blobby.jpg
Extracting ... blobop/main_row_2_blobby.zip
Extracting ... blobop/main_row_3_blobby.blob

Exiting ...
cheeky@ubuntu:~$


Here's the contents of the "blobop" output directory:



We can see the the .jpg, .zip / office document and base64 text BLOBs have all been extracted successfully.
This was also confirmed by checking the file sizes of each output file against its BLOB size in the table.

Finally, here's the database containing three BLOB columns (testblobs.sqlite):



Now we run the script against the three BLOB database:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testblobs.sqlite main blobops
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
Detecting BLOB columns = blobby, blobby2, blobby3
Extracting ... blobops/main_row_1_blobby.jpg
Extracting ... blobops/main_row_1_blobby2.jpg
Extracting ... blobops/main_row_1_blobby3.png
Extracting ... blobops/main_row_2_blobby.zip
Extracting ... blobops/main_row_2_blobby2.zip
Extracting ... blobops/main_row_2_blobby3.zip
Extracting ... blobops/main_row_3_blobby.blob
Extracting ... blobops/main_row_3_blobby2.blob
Extracting ... blobops/main_row_3_blobby3.blob

Exiting ...
cheeky@ubuntu:~$


Here's the contents of the "blobops" output directory:



Note: The .png BLOB (for row id=1, column = blobby3) has also been successfully extracted along with .jpg, .zip and base64 text BLOBs.

We have also run the script on Windows 7 x64 with Python 2.7.6 with the same results.

So there you have it, repeat after me - "Everything seems to be in order ... I think I'll go across the street for some orange sherbet ..."

Final Thoughts

It is hoped that this script (sqlite-blob-dumper.py) can be used during an SQLite database / mobile forensics analysis to quickly retrieve embedded binary data such as pictures, voicemail recordings, video, binary .plist and .zip BLOBs.

As mentioned in the last post, to avoid changing data, analysts should only interface with copies of an SQLite database and not the original file.

This script does not handle deleted BLOB data.

And ... I'm spent! (For those too young, that's another classic Austin Powers movie reference)