Friday, 13 June 2014

Monkeying around with Windows Phone 8.0

Ah, the wonders of Windows Phone 8.0 ... Failing eyesight, Frustration and Squirrel chasing
UPDATED 22OCT2015 - Updated last section with deleted record observations from a Nokia Lumia 530 device running Windows Phone 8.10.

Currently, there is not much freely available documentation on how Windows Phone 8.0 stores data so it is hoped that the information provided in this post can be used as a stepping stone for further research / possible scripting. Hopefully, analysts will also be able to use this post to help validate any future tool results.

Special Thanks to Detective Cindy Murphy (@CindyMurph), Lieutenant Jennifer Krueger Favour (@rednogn) and the Madison Police Department ("Forensicate Like A Champion!") for providing the opportunity and encouragement for this research.
Unfortunately, due to time constraints and a limited test data set, I wasn't able to write an all-singing/all-dancing script. Instead, some one-off scripts were created to extract/sort the relevant data a lot quicker than it would have taken to do manually. Rather than releasing scripts that are customized for a limited set of test data (which I don't have easy access to any more) - this post will be limited to documenting the data sources/structures.

OK, so no free tool and you're still here reading huh? In Yoda voice: "The nerd runs strong in this one" ;)

Thanks to Maggie Gaffney from Massachusetts State Patrol / Teel Technologies, the initial test data (.bin file) was sourced via JTAG from a Nokia 520 Windows Phone 8.0 phone - a "cheap" smart phone common to prepaid plans. The .bin file was then opened in X-Ways Forensics to parse the 28(!) file system partitions and to export out files of interest. The exported files were then viewed in hex view using Cellebrite Physical Analyzer (love the data interpretation and colour coded bookmarking!). Later, we were also able to get our paws on some test data from a HTC PM23300 Windows Phone 8.0 phone courtesy of JoAnn Gibb from the Ohio Attorney General's Office. UPDATE: Thanks also to Brian McGarry (Garda) for his testing feedback and help with the SMS and Call Logs. It's awesome knowing people that know people!

Note: The Nokia 520 does not display the full SMS timestamp info (threaded messages display date only).
So while we can potentially re-create the order of threaded messages as per the test phone, we can't easily validate the exact time an SMS message was sent/received. There's a good chance that other Windows Phone 8.0 phones will use the same timestamp mechanism and hopefully they will display the full timestamp.

So where's the data?!

The SMS content, MMS file attachment info and Contacts information are stored (via the 28th Partition) in:

\Users\WPCOMMSSERVICES\APPDATA\Local\Unistore\store.vol

Various .dat files containing MMS content are also stored in sub-directories of:

\SharedData\Comms\Unistore\data

The Call log is stored in:

\Users\WPCOMMSSERVICES\APPDATA\Local\UserData\Phone

The "store.vol" and "Phone" files seem to be ESE Databases (see explanations here and here) with the magic number of "xEF xCD xAB x89" present at byte offsets 4-7. Consequently, we tried opening "store.vol" using Nirsoft's ESE Database viewer but had limited success - the SMS message texts were not viewable however other data was. This suggests that maybe the "store.vol" file differs in some way from the ESE specification and/or the tool had issues reading the file.
Joachim Metz has also both documented (here  and here) and written a C library "libesedb" to extract ESE databases. Unfortunately, I didn't discover Joachim's library until after we started poking around .. Anyway, it was a pretty masochistic interesting exercise trying to reverse engineer the "store.vol" file. One possible benefit of this data diving is that it *might* also reveal unallocated/partially overwritten data records that might be ignored by libraries which read the amount of data declared (vs reading all the data present). This is pure speculation though as I don't know if old records are overwritten or just marked as invalid.
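As a quick sanity check before opening a file in a viewer, the magic number test described above can be sketched in Python. This is only a minimal sketch: the function names are made up for illustration, and a matching signature does not guarantee that a tool will parse the file correctly.

```python
# Signature bytes as quoted in this post, expected at byte offsets 4-7.
ESE_MAGIC = b"\xEF\xCD\xAB\x89"


def looks_like_ese(header: bytes) -> bool:
    """Return True if the 4 bytes starting at offset 4 match the ESE signature."""
    return len(header) >= 8 and header[4:8] == ESE_MAGIC


def file_looks_like_ese(path: str) -> bool:
    """Read only the first 8 bytes of a file and test them (hypothetical helper)."""
    with open(path, "rb") as f:
        return looks_like_ese(f.read(8))
```

For example, file_looks_like_ese("store.vol") could be run against an exported copy of the file before trying any of the viewers mentioned in this post.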

Viewing "store.vol" using Cellebrite Physical Analyzer, relevant data was observed for text strings (eg phone numbers, SMS text strings) encoded in UTF-16 LE throughout the file.
As a database file there will be tables. Each table will have columns of values (eg time, text content, flags). A single (table row) record will thus have data stored for each column.
Table data will be organized within the file somehow (eg multiple SMS records organized into page blocks). So it is likely that finding a hit for a specific SMS will lead you to the contents of other SMS messages (potentially around the same timeframe).

The Nokia 520 was actually locked with a 4 digit PIN when we started investigating. Without access to the phone, any manual inspection/validation would have been impossible. It was unknown if the phone would have been wiped if too many incorrect PINs were entered. So any guesses would have to be documented and carefully chosen. It wasn't looking good ... until a combination of thinking outside the box and a touch of luck led us to an SMS text message (in "store.vol") with the required 4 digit code. Open sesame!

Some things we tried with the data ...

To find specific SMS records we searched for unique/known strings from the SMS text (eg "Look! A Squirrel!"). A single record was found per SMS in "store.vol" and each record also contained a UTF-16-LE string set to "SMStext".

To find contact information, we searched for known phone number strings (eg +16085551234, 123456, 1234). Some numbers were observed in "store.vol" in close proximity to "SMStext" strings while other instances were located close to what appeared to be contact information (eg contact names).

To search for field markers and flags, we compared separate SMS text records and looked for patterns/commonalities in the hex. Sometimes the pattern was obvious (eg "SMStext" occurs in each SMS message) and sometimes it wasn't so obvious (sometimes there is no discernible pattern!).
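The string searches described above were done in a hex viewer, but the same idea can be sketched in Python. This is an illustrative sketch only (it is not the one-off script mentioned earlier) and the function name is hypothetical. It encodes the search term as UTF-16-LE and returns every byte offset where it occurs.

```python
def find_utf16le(data: bytes, text: str) -> list:
    """Return all byte offsets where `text` occurs encoded as UTF-16-LE."""
    needle = text.encode("utf-16-le")
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        # Resume one byte further on so that no later hit is skipped
        pos = data.find(needle, pos + 1)
    return offsets
```

Running it with "SMStext" or a known phone number string against the raw contents of "store.vol" should give a list of offsets to inspect by hand in the hex viewer.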

Figuring out the timestamp format being used was HUGE. Without it, we could not have figured out the order messages were sent/received. Using Cellebrite Physical Analyzer to view the "store.vol" hex, Eagle-eyed Cindy noticed that there were 8 byte groupings occurring before/after the SMS text content. Within a single message, these 8 byte groupings were usually around the same value range (eg in LE xFF03D2315FE1C701), which is what you'd expect if they were timestamps for the same message. Subsequent messages usually had larger values - which corresponds to messages sent/received at a later time.
Like most hex viewers, Cellebrite Physical Analyzer can interpret a predefined number of bytes from the current cursor position and print a human friendly version. Using this, Calculon Cindy showed an otherwise oblivious monkey that these 8 byte groupings could be interpreted as MS FILETIME timestamps! To be honest, I was expecting smaller 4 byte timestamps - Silly monkey!
By comparing the 8 byte values surrounding a specific SMS text message (eg "Look! A Squirrel!") with the date displayed on the phone for that message, we theorized that our mysterious timestamps were *probably* MS FILETIME timestamps (No. of 100 ns increments since 1 January 1601 in UTC). For example, xFF03D2315FE1C701 = Sat, 18 August 2007 06:15:37 UTC. As the phone did not display the exact time for each SMS, we could only use the order of threaded messages and the date displayed to somewhat confirm our theory. Various SMS sent/received dates on the phone were spot checked against a corresponding "store.vol" entry timestamp date and the date values consistently matched.
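For anyone wanting to repeat the FILETIME interpretation outside of a hex viewer, here is a minimal Python sketch. The function name is hypothetical. It reads 8 bytes as a little endian unsigned integer and adds that many 100 ns intervals to 1 January 1601 UTC (Python's datetime only stores microseconds, so the last decimal digit is dropped).

```python
import struct
from datetime import datetime, timedelta, timezone

# FILETIME counts 100 ns intervals from this point in time (UTC)
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(raw: bytes) -> datetime:
    """Interpret the first 8 bytes of `raw` as a little endian FILETIME."""
    (ticks,) = struct.unpack("<Q", raw[:8])
    # 10 ticks of 100 ns = 1 microsecond
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


example = bytes.fromhex("FF03D2315FE1C701")
print(filetime_to_datetime(example).strftime("%a, %d %B %Y %H:%M:%S UTC"))
# prints: Sat, 18 August 2007 06:15:37 UTC
```

The printed value agrees with the example conversion given above.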
UPDATE: FTK 5.4 can also be used to view the database tables in the store.vol and Phone files. Thanks to JoAnn for the tip!
OSForensics also has an ESE database viewer which can be used to view the phone's databases. As an added bonus, it also has a Windows Registry viewer for inspecting the phone's hives. Thanks to Brian for the suggestion!


What the data looks like

After some hex ray vision induced cross-eyedness (who knew that looking at hex is almost like a curse!), we think we've figured out some general data structures for SMS, MMS, Contacts and Call log records. There's still some unknowns/grey areas but it's a start.

- On the data structure diagrams below, "?" is used to denote varying/unknown number of bytes.
- FILETIMEs are LE 8 byte integers representing the number of 100 ns intervals since 1 JAN 1601.
- In general, strings are null terminated and UTF-16-LE encoded (ie 2 bytes per character).
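As a rough illustration of the string convention in the list above, the following Python sketch reads one null terminated UTF-16-LE string starting at a known offset. The function name is made up, and it assumes the string starts on the given offset and is terminated by two zero bytes on a 2 byte boundary.

```python
def read_utf16le_string(data: bytes, offset: int) -> str:
    """Read a null terminated UTF-16-LE string starting at `offset`."""
    end = offset
    # Step 2 bytes (1 character) at a time until the 00 00 terminator
    # or until there is no complete 2 byte unit left
    while end + 1 < len(data) and data[end:end + 2] != b"\x00\x00":
        end += 2
    return data[offset:end].decode("utf-16-le", errors="replace")
```

Combined with a search hit for a string such as "IPM.SMStext", this could be used to pull out neighbouring text fields once their offsets are known.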

Sent / Received SMS records

There are two types of SMS data structures which are mixed together. Each type of SMS structure contains a UTF-16-LE encoded string for "IPM.SMStext". However, one type contains phone number strings and the other does not.
For later ease of understanding, we'll say these "SMStext" records occur in "Area 1". UPDATE: Area 1 corresponds to the "Message" table.
Initially, monkey was confused about why some SMS records had phone numbers and some didn't. However, by inspecting the unlocked phone, we were able to confirm that the SMS message records with no number corresponded to sent SMS.

Sent "SMStext" record (from Area 1 in "store.vol")

Note 1: Note the lack of Phone number information. From test data, FILETIME values (in red and pink) seemed a little inconsistent. Sometimes FILETIMEs within the same record matched each other and other times they varied by seconds/minutes.
Note 2: The Sent Text string (in yellow) is null terminated and encoded in UTF-16-LE.


Received "SMStext" record (from Area 1 in "store.vol")


Note 1: Received SMS have multiple source phone number strings listed (in orange). These seem to remain constant within a given record (eg PHONE0 = PHONE1 = PHONE2 = PHONE3)
Note 2: Similar to Sent "SMStext" records, the FILETIMEs (in red and pink) within a record might/might not vary.
Note 3: The Received Text string (in yellow) is null terminated and encoded in UTF-16-LE.

To find out the destination phone number for a sent SMS, we can make use of an observation made by searching "store.vol" for the FILETIMEs from a specific Sent "SMStext" record.
It appears that FILETIMEs 1, 3 & 4 (in pink) from a given Sent "SMStext" record usually occur once in the entire "store.vol". The FILETIME2 value (in red) however, also appears in a second area ("Area 2"). UPDATE: Area 2 corresponds to the "Recipient" table. This area has a bunch of different looking data records each containing the null terminated UTF-16-LE encoded string for "SMS". Also contained in each data record is a phone number string. The "Area 2" SMS records look like:

"SMS" record (from Area 2 in "store.vol")


Note 1: Each "SMS" record contains a UTF-16-LE encoded string for "SMS".
Note 2: From both sets of test data, there seems to be a consistent number of bytes between:
- The FILETIMEX (in red) and "SMS" string (in kermit green) and
- The "SMS" string (in kermit green) and the Phone number string (in orange).

So, each sent "SMStext" FILETIME2 value (from Area 1) should have a corresponding match with an "SMS" record's FILETIMEX value (in Area 2). In this way, we can match a sent "SMStext" message with the destination phone number via the FILETIME2 value. Sounds a little crazy right? But the test data seems to confirm this. Purrr!
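The matching step described above is essentially a lookup on a shared timestamp. Here is a small Python sketch of that idea. It assumes the records have already been parsed into dictionaries, and the key names ("filetime2", "filetimex", "phone") and the function name are hypothetical labels chosen for this example rather than anything stored in "store.vol".

```python
def link_sent_sms(sent_messages, sms_records):
    """Attach destination phone numbers to sent messages via FILETIME2."""
    # Index the Area 2 "SMS" records by their FILETIMEX value
    phones_by_time = {}
    for rec in sms_records:
        phones_by_time.setdefault(rec["filetimex"], []).append(rec["phone"])

    linked = []
    for msg in sent_messages:
        # A sent message with no Area 2 match gets an empty recipient list
        recipients = phones_by_time.get(msg["filetime2"], [])
        linked.append({**msg, "recipients": recipients})
    return linked
```

A list is kept per timestamp in case more than one Area 2 record shares the same FILETIMEX value. Whether that happens in practice (eg for group messages) was not something we confirmed.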

Contacts

Contact information is also located in "store.vol". UPDATE: This area corresponds to the "Contact" table. There were 2 main observed data structure types - both contained phone number and name information however, one data type had an extra 19 digit number string. It was later discovered via phone inspection that the records with the extra digit strings corresponded with "Hotmail" address book entries. It would be interesting to see if the 19 digit number corresponded to a unique hotmail user ID of some kind.
The second type of contacts structure was a "Phonebook" entry - presumably these contact types were entered into the phone by the user rather than slurped up from a Hotmail account.
Common to both contact records were multiple occurrences of the same contact name and phone number. OCD phonebook, OCD phonebook, OCD phone book ... ;)

"Hotmail" Contacts record (from "store.vol")

"Phonebook" Contacts record (from "store.vol")

Note 1: The flag value (in red) can be used to determine if the contact record is a "Hotmail" or "Phonebook" entry.
Note 2: The potential 6 byte magic number (0xFFFFFF2A2A00) for Contact records should make it easier to find each entry. This was discovered by Sharp-eyed Cindy on the last day (by which time monkey had lost the will to live).
Note 3: There is also an "End Marker" which has the following value in hex: [01 04 00 00 00 82 00 E0 00 74 C5 B7 10 1A 82 E0 08]. This value led to a couple of extra contact records which did not have the 6 byte magic number at the beginning.
Note 4: The 19 digit string (in pink) could be a potential Hotmail ID.
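The two byte patterns from the notes above lend themselves to a simple scan. The Python sketch below only reports where each pattern occurs. It makes no assumption about how the two markers relate to each other within a record, and the function and constant names are hypothetical.

```python
# Byte patterns as quoted in the notes above
CONTACT_MAGIC = bytes.fromhex("FF FF FF 2A 2A 00")
CONTACT_END_MARKER = bytes.fromhex(
    "01 04 00 00 00 82 00 E0 00 74 C5 B7 10 1A 82 E0 08"
)


def find_all(data: bytes, pattern: bytes) -> list:
    """Return every offset at which `pattern` occurs in `data`."""
    offsets = []
    pos = data.find(pattern)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(pattern, pos + 1)
    return offsets


def find_contact_markers(data: bytes) -> dict:
    """Locate candidate Contact record markers for manual inspection."""
    return {
        "magic": find_all(data, CONTACT_MAGIC),
        "end_marker": find_all(data, CONTACT_END_MARKER),
    }
```

Any "End Marker" hit with no magic number nearby would be a candidate for the extra contact records mentioned in Note 3.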

UPDATE: Since this was originally written, new Contact test data has been observed. These have slightly different record structures but all records seem to have the same "End Marker" and the last 3 Unicode string fields. The last and 3rd last strings can thus be extracted for name/phone (and possibly email) information.

MMS data

Further research is required for MMS records (eg linking timestamps and phone numbers to sent files). But here's what we've learned so far ...
Various .dat files containing MMS content (eg there was a .dat file containing a sent JPEG and another .dat file containing the accompanying text) are stored in:

\SharedData\Comms\Unistore\data

under 3 sub-directories: "0", "2" and "7". These folders might correspond to Sent, Received and Draft???
There were multiple .dat files with similar names each seemingly containing info for different parts of the same MMS.

In "store.vol", there are records containing the UTF-16-LE encoded string for "MMS". These records also contain 3 filename strings and a filetype string (possibly the MIME type eg "image/jpeg"). From my jet-lagged memory, I want to say that the filename strings were pointing to the same filename and there were multiple "MMS" entries for a single MMS message (ie each MMS message has three separate files associated with it). But you should probably check it out for yourself ...
UPDATE: These MMS records correspond to the "Attachment" table.

MMS record (from "store.vol")

Call log

The Call log information is located in the "Phone" file. Each Call log record contains a flag (in blue) to mark whether a call record is Missed / Incoming / Outgoing. The flag values were confirmed via inspection of the phone and corresponding Call log record. There's also Start and Stop FILETIMEs, repeated contact names and repeated phone numbers.
Of potential interest is a 10 digit ASCII encoded string (in grey) and what looks to be a GUID (in light purple). Each call record had the same GUID string value enclosed by "{}".
UPDATE: The GUID appears to be consistent between 3 phones (2 x Nokia Lumia 520 and HTC PM23300). The ASCII ID string has also been observed to be greater/less than 10 digits.

Revised Call log diag (from "Phone")


Summary

So there you have it - we started off knowing very little about Windows Phone 8.0 data storage and now we know a considerable amount more especially regarding SMS records.
Due to time constraints, it was not possible to investigate the non-SMS related data areas (ie MMS, Call log, Contacts) with the same level of detail. However, it's probably better to share what we've discovered now as I don't know when I'll be able to perform further research.
The observations in this post may not be consistent for Windows Phone 8.1 and/or on other models of Windows phones but hopefully this post can still serve as a starting point. As always, check that the underlying data matches your expectations!

It was really awesome having someone else to bounce ideas off when hex-diving. I'm pretty sure I would have missed some important details (eg the FILETIME timestamp) had it not been for another set of eyes. Of course, that's not always going to be possible so I also appreciated the other opportunities to work autonomously / with minimal supervision. Someday monkey might have to do this on his lonesome! :o
Initially, it was easy to tie my idea of success with the "I have to code a solution for every scenario/data set". It would have been awesome if I could have done that but the fact was - we didn't have any SMS messages from "store.vol" at the start and after running the one-off SMS script, we had 5000+ messages sorted in chronological order with their associated phone numbers. Success doesn't have to be black and white. It sounds cliche but focusing on little wins each day made it easier to start eating the metaphorical elephant. Now please excuse me, while I adjust my pants ...

UPDATED 22OCT2015 - Deleted record test observations from a Nokia Lumia 530 running Windows Phone 8.10

Upon deletion, the original SMS and Contact records (in Data:\USERS\WPCOMMSSERVICES\APPDATA\Local\Unistore\store.vol) and Call log records (in Data:\USERS\WPCOMMSSERVICES\APPDATA\Local\Userdata\Phone) are OVERWRITTEN with 0x44 values (ASCII "D"). The same ESE database behaviour is suspected to occur with MMS records in store.vol - however, due to network issues, this has not been verified.
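One way to look for areas that may have been overwritten like this is to scan for long runs of 0x44 bytes. The Python sketch below is an assumption-laden illustration: the function name and the 16 byte minimum run length are arbitrary choices to reduce false hits, and a run of "D" characters could still be legitimate data.

```python
import re


def find_overwritten_runs(data: bytes, min_len: int = 16) -> list:
    """Return (offset, length) for each run of 0x44 bytes of at least min_len."""
    # 0x44 is ASCII "D", so this matches min_len or more consecutive "D" bytes
    pattern = re.compile(b"\x44{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]
```

Hits from a scan like this are only pointers for manual review in a hex viewer.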

However, under certain conditions, it may still be possible to recover this deleted ESE record data (eg SMS, Contacts, Call log, *MMS not tested) from a Nokia Lumia 530 running Windows Phone 8.10 because duplicate records of the deleted data can potentially be recovered from:
- ESE .log files (eg Data:\USERS\WPCOMMSSERVICES\APPDATA\Local\Userdata\UDM.log and Data:\USERS\WPCOMMSSERVICES\APPDATA\Local\Unistore\USS.log)
- Data:\pagefile.sys

Prior experience with a Nokia Lumia 520 Windows Phone 8.0 device showed that SMS data may also be contained in USStmp.log (in the same directory as the USS.log). See Det. Cindy Murphy's SANS whitepaper for further details.
Also see here for more details on ESE .log files (eg naming conventions).

It is suspected that the more the device is used before imaging, the less likely deleted data will be recoverable.
For example, general phone usage will result in pagefile.sys getting updated and ESE database modifications (eg new records) can potentially change the .log files.
This makes it difficult to state how long deleted records will be recoverable (possibly hours rather than days/weeks?).

24 comments:

  1. Did you ever find an official schematic for the data structure?

    1. Hi Christopher,

      No, unfortunately I have not :(
      The schema (eg what fields are stored in what order) is one issue affecting the data layout . How ESE databases then structure the row data is another.

      Cheeky

    2. thanks for the quick response. It seems strange that, that info is not readily available. Think it might be proprietary?


    3. I'd say so. Having said that, hopefully Windows 10 Mobile will use a similar data scheme.

  2. Moinuddin Zaki, 4 July 2015 at 13:26

    Hi.. I am new to Mobile Forensics. Have been working on my thesis on Windows Phone 8 forensics. I might sound a lil amateur, please excuse me for that.
    Regarding the 4 different timestamps found in Area 1 in "store.vol" for received SMS data. My observations have been that TimeStamp1 [TS1] is the time when the SMS was read by the user on the device. TS2 is when it was received on the mobile device. TS3 and TS4 are the sender's timestamps (could be when it was sent). I am looking into sent SMS data to figure out what TS1, 2, 3 and 4 signify.

    1. Hi Moinuddin,

      Thankyou for sharing your findings.

      Cheers,

      Cheeky

  3. @Cheeky4n6Monkey
    Thanks for the help. You have probably figured this out already, but I am the student working with the detective in the US. You are really making me look good lol. I have a few questions:

    1. Do you have any leads on how to detect more specific message statuses, such as viewed/not viewed and draft?

    2. What made you choose python? I was considering trying to refactor it into C to attempt to address the processing requirements (this should not be taken as criticism of what you did write). Any thoughts on that?

    3. I have been working on a means of parsing the output you show on the other post:

    ----------------------------------SMS Messages----------------------------------
    From: Me Sent at: 2014-10-01 19:34:57
    To: BananaMan<+1-(111)-555-1234>

    This is a sent SMS
    --------------------------------------------------------------------------------
    From: +1-(111)-557-4321 Sent at: 2014-10-01 19:37:07
    To: Me

    Here is a received SMS

    but I am having a little trouble (I am new to Python as well as forensics). The date is currently in UTC and I am having a difficult time getting Python to reformat it to the local date-time. I could just write something to manually adjust the time but I would like to know if you know how to do it using Python's datetime API. P.S. How's that output look?

    1. Hi Christopher,

      1. The best way of validating message status flags (that I can think of) would be to have a test phone with every valid combination of flags. eg received and unread, received and read etc. Then grab an image of that phone and validate that the suspected flag fields are set to what you expect them to be. To be accurate, you would also need to limit it to a specific phone and OS version. There are so many combos out there that you can't really say that a particular set of findings holds true for all cases (unless you test for all cases :).

      2. Python was the quickest way of prototyping. It has a lot of libraries you can use and it's used quite a bit in forensics so most analysts should be familiar with running Python scripts. It also has the advantage of being text so anyone can read the code and not have to speculate on what a compiled exe is doing (Open Source FTW!). With regards to the performance of the current script(s), you could also process .bin files in set sized chunks with multiple Python threads as a way of speeding up the execution. I have not coded any projects using Python threading yet.

      3. That output format looks good to me. For the Local time conversion, I think you might be able to use "datetime.astimezone(tz)". I have not looked at local time conversions much in Python but I think you should be able to get a "datetime" object returned from that call and then call "datetime.isoformat" or "datetime.strftime" to then convert it to a string. The official documentation should have further info.

      Will you be releasing your research and/or code to the forensics community?

      Cheers,

      Cheeky


    2. Just had another thought about the output format. For reporting purposes, a lot of analysts like to insert the output into a spreadsheet (then they can sort results by whatever column they want). I have found that outputting to a TAB-separated variable (TSV) file is a way of facilitating this. Comma-separated output may get confused if a message contains commas.

      Cheers,

      Cheeky

  4. thanks! we'll have to try some more experimentation on tuesday. Regarding posting my code, I would be happy to. would you like me to commit to your repo given that what I have is basically just an extension of your project?


  5. If you decide to stick with Python, I don't mind merging your new code with the existing stuff. Just be aware that I have limited access to test data.
    Do you have a timeframe/schedule in mind?

  6. That's fine, I am pretty sure that your code is going to work as is. So at this point, all I have to worry about is displaying your output data. All I really need is more sample output, and I can make do without it if it is any trouble.

    No merging is actually necessary to make this work, in its current state. Your code is essentially a preprocessor. It would just come down to one more line at the terminal. Eventually I would like to put together an actual UI. Preferably with HTML, but I can't quite get Django to work. If that ever happens I will undoubtedly need help but that isn't even on the radar yet.


    P.S. I decided against the C refactoring.


    1. OK, I'm not sure I understand the scope of your project. Is your goal to display the output in a more user friendly manner? Or are there other areas you wish to research/enhance?

      Opinions on output format will vary depending on the end user. I'm not really sure if I can say whether a particular output is better than another. That's why I try to avoid GUIs LOL.

      FYI Python has cross-platform GUI capability via a third party library - try Googling for "PyQt". You can also later bundle it all into a single Windows exe using "py2exe". But those extra complications/enhancements also drive the code away from being easy-to-read/validate Open Source.

  7. oh and I think I misunderstood one of your previous comments. This:
    ----------------------------------SMS Messages----------------------------------
    From: Me Sent at: 2014-10-01 19:34:57
    To: BananaMan<+1-(111)-555-1234>

    This is a sent SMS
    --------------------------------------------------------------------------------
    From: +1-(111)-557-4321 Sent at: 2014-10-01 19:37:07
    To: Me

    Here is a received SMS


    it should look like this: https://drive.google.com/file/d/0B-5Y1OoryQYuTWtOUjdhdHRLSkU/view?usp=sharing

    Here's my email if you want to get in touch: christopherottersen@gmail.com

  8. The advertisement was to recover deleted messages. However, I was about 2 hours into researching that when I found your blog. So since you have pretty much done that already, you kinda pigeon holed me into the nightmare which is GUI lol. To be honest I am really stumbling through all of this. I am a programmer at heart, and I can't get more than about 3.5 hours with the data 2 days per week. But if you could get me some real data I would be thrilled to get into some of the research.
    All that said it seems like the people I have been in contact with have a difficult time with the software part so the UI does appear to be important to them.


    1. OK that clears things up a bit :)

      We originally tested the scripts using Madison PD case data when I was over there. We also shared the script with other LEOs who ran it on their own test data. So, unfortunately, I don't personally have any test data that I can share.
      It might be an idea to see if you can get a test phone dump that you can "take home" from your supervisor? For validation purposes, I imagine they'd have a spare Windows Phone to test with.

      Regarding the GUI, perhaps we could separate it into 2 parts? Keep the script as command line but I'll modify it to also output to whatever output file format you need. Then your GUI can read the output file and display it?

  9. Just doing the most I can with what I've got

  10. That was the plan, but the take home phone idea got pushed aside when I found your work. Since then it has just been trying to make it run. We managed to start it about 2 hours before we needed to clear out and there is no way of knowing if it worked until Tuesday morning :p.

    My primary goal was to parse your output in such a way as to make it moldable into any UI you want. For instance, I broke the phone number into all of its base components so that they could be compared and shown in any pattern the developer wants, and it reads your date and puts it into a datetime object. So from your output format I can plug the data into any UI I want (I'm very proud of this). All that remains is doing some testing on the phone and possibly making small modifications.

    Using this I can write a GUI which is essentially a shell for your existing scripts.

    So I think everything in that last paragraph is already done.

  11. what percentage of your runtime would you say is taken up by actually finding, not parsing the data? because I reconsidered the whole C thing and I have a script which is finding the 6 byte magic number, parsing 3GB dummy data in less than 15 minutes. It is currently just incrementing a counter when it finds what it's looking for, but I imagine I could have it spit out lists of addresses as well in less than 20.


    1. I haven't tried measuring/optimising it. Originally, the script was intended to only process single store.vol files which are not that big. Running it successfully against a whole .bin file was an unexpected surprise. C should definitely be quicker to run than Python.

    2. Just ran the wp8-sms.py script against a complete 7 GB .bin image file from a Windows Phone 8 device. It parsed 6000 hits in 290 seconds.

      See the 7/7/15 update text for more details in:
      http://cheeky4n6monkey.blogspot.com.au/2014/10/windows-phone-80-sms-call-history-and.html

    3. ohhh. ssd.

      I spent a decent part of the weekend writing something to do that in c. I am testing it against a randomly generated 3GB bin file and it locates but does not store or process results in about 1:45 but its on an hdd. So that would amount to about 4 min?

    4. Update: the hdd should process 8GB in 15 seconds. My program appears to be able to locate entries in 240 (including file reading). I might still try to figure out how to use threads which could bring the execution time down to about a minute.

      Conclusion: I have wasted my weekend.

  12. Ok. In the end, about a week of my life was spent trying to solve a problem that took 1 hour's worth of modding to your code to resolve. The problem was that the OS would not give Python access to enough memory to call fb.read() on the entire file. To get around this, I broke it up and ran your mainline in a loop extracting 100000000L bytes at a time. This fixed the memory problems on my laptop and should be adjustable for any system. Note: runtime < 1:30. Fine work. Now for those drafts........
