Saturday 13 August 2016

Google S2 Mapping Scripts

Sorry Monkey - there is just no point to mapping jokes ...

Cindy Murphy's recent forensic forays into Pokemon Go (here and here)
 have inspired further monkey research into the Google S2 Mapping library. The S2 library is also used by Uber, Foursquare and Google (presumably) for mapping location data. So it would probably be useful to recognise and/or translate any S2 encoded location artifacts we might come across in our forensic travels eh? *repeats in whispered voice* travels ...
After a brief introduction to how S2 represents lat/long data, we will demonstrate a couple of multi-platform (Windows/Linux) Python conversion scripts for decoding/encoding S2 locations (using sidewalklabs' s2sphere library).

S2 Mapping (In Theory)

The main resources for this section were:
The TLDR - its possible to convert a latitude/longitude from a sphere into a 64 bit integer. The resultant 64 bit integer is known as a cellid.
But how do we calculate this 64 bit cellid? This is achieved by projecting our spherical point (lat, long) onto one of 6 faces of an enclosing cube and then using a Hilbert curve function to specify a grid location for a specified cell size. Points along the Hilbert curve that are close to each other in value, are also spatially close to one another. This diagram from Christian's blog post better illustrates the point:

Points close to each other on the Hilbert curve have similar values. Source: Christian Perone's Blog

In the above diagram, the Hilbert curve (darker gray line) goes from the bottom LHS to the bottom RHS (or vice versa). Each grid box contains one of those Y-shaped patterns. The scale at the bottom represents the curve if it was straightened out like a piece of string.
If you now imagine the grid becoming finer/smaller but still requiring one Y-shape per box, you can see how a smaller cell grid size requires a finer resolution for points on the curve. So the smaller the cell/grid size, the higher the number of bits required to store the position. For example, a level 2 cell size only needs 4 bits where as the maximum level 30 requires 60 bits. Level 30 cell sizes are approximately 0.48-0.93 square centimetres in size depending on the lat/long.

Fun fact: Uber apparently uses a level 12 cell size (approx. 3.3 to 6.4 square kilometres per cell). 
Second fun fact: The Metric system has been around for over 100 years so stop whining about all the metric measurements already *looks over sternly at the last of the Imperial Guards in the United States, Myanmar and Liberia*.

Ahem ... so here's what a level 30 cellid looks like:

Level 30 cellid structure. Source: Octavian Procopiuc's GoogleDoc

The first 3 bits represent which face of the enclosing cube to use and the remaining 60 bits are used to store the position on the Hilbert curve. Note: The last bit is set to 1 to mark the end of the Hilbert positioning bits.

When cellids are converted into hexadecimal and have their least significant zeroes removed (if present), they are in their shortened "token" form.
eg1 cellid = 10743750136202470315 (decimal) has a token id = 0x951977D377E723AB
eg2 cellid = 9801614157982728192 (decimal) = 0x8806542540000000. The 16 hex digits can then be shortened to a token value of "880654254". To convert back to the original hex number, we keep adding least significant zeroes to "880654254" until its 16 digits long (ie 64 bits).
Analysts should anticipate seeing either cellids or token ids. These might be in plaintext (eg JSON) or may be in an SQLite database.

Note: Windows Calculator sucks at handling large unsigned 64 bit numbers. According to this, its limited between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807. So a number like 10,743,750,136,202,470,315 made it return an incorrect hex representation after conversion.
This monkey spun his paws for a while trying to figure out why the token conversions didn't seem to make sense. The FFs Solution - use the Ubuntu Calculator for hex conversions of 64 bit integers instead.

The Scripts

Two Python 2.7+ scripts were written to handle S2 conversions and are available from my GitHub here. They have been tested on both Windows 7 running Python 2.7.12 and Ubuntu x64 14.04 running Python 2.7.6.

s2-latlong2cellid.py converts lat, long and cellid level to a 64 bit Google S2 cellid.

s2-cellid2latlong.py converts a 64 bit Google S2 cellid to a lat, long and S2 cellid level.

IMPORTANT: These scripts rely on the third-party s2sphere Python library. Users can install it via:
pip install s2sphere
(on Windows) and:
sudo pip install s2sphere
(on Ubuntu)

Here's the help text for s2-latlong2cellid.py:
python s2-latlong2cellid.py -h
Running s2-latlong2cellid.py v2016-08-12

usage: s2-latlong2cellid.py [-h] llat llong level

Converts lat, long and cellid level to a 64 bit Google S2 cellid

positional arguments:
  llat        Latitude in decimal degrees
  llong       Latitude in decimal degrees
  level       S2 cell level

optional arguments:
  -h, --help  show this help message and exit

Here's the help text for s2-cellid2latlong.py:
python s2-cellid2latlong.py -h
Running s2-cellid2latlong.py v2016-08-12

usage: s2-cellid2latlong.py [-h] cellid

Convert a 64 bit Google S2 cellid to a lat, long and S2 cellid level

positional arguments:
  cellid      Google S2 cellid

optional arguments:
  -h, --help  show this help message and exit


Testing

A handy online S2map visualizer was written by David Blackman - the Lead Geo Engineer at Foursquare (and formerly of Google ie he knows his maps). See also the Readme here. S2map also has several other mapping dropbox options besides the default "OSM Light". These include "Mapbox Satellite" which projects the cellid onto an aerial view.

We start our tests by going to GoogleMaps and noting the lat, long of an intersection in Las Vegas.

Pick a spot! Any spot!

We then specify that lat, long as input into the s2-latlong2cellid.py script (with level set to 24):
python s2-latlong2cellid.py 36.114574 -115.180628 24
Running s2-latlong2cellid.py v2016-08-12

S2 cellid = 9279882692622716928

We then put that cellid into s2map.com:

Level 24 test cellid plotted on s2map.com.


Note: The red arrow was added by monkey to better show the plotted cellid (its tiny).
So we can see that our s2-latlong2cellid.py gets us pretty close to where we originally specified on GoogleMaps.

What happens if we keep the same lat, long coordinates but decrease the level of the cellid from 24 to 12?
python s2-latlong2cellid.py 36.114574 -115.180628 12
Running s2-latlong2cellid.py v2016-08-12

S2 cellid = 9279882742634381312

Obviously this is a different cellid because its set at a different level, but just how far away is the plotted level 12 cellid now?

Level 12 test cellid plotted on s2map.com.

Whoa! The cell accuracy has just decreased a bunch. It appears the center of this cellid is completely different to the position we originally set in GoogleMaps. It is now centred on the Bellagio instead of the intersection. This is presumably because the cell size is now larger and the center point of the cell has moved accordingly.

To confirm these findings, we take our level 24 cellid 9279882692622716928 and use it with s2-cellid2latlong.py.

python s2-cellid2latlong.py 9279882692622716928
Running s2-cellid2latlong.py v2016-08-12

S2 Level = 24
36.114574473973924 , -115.18062802526205

We then plot those coordinates on GoogleMaps ...

Level 24 test cellid 9279882692622716928 plotted on GoogleMaps via s2-cellid2latlong.py


ie Our s2-cellid2latlong.py script seems to work OK for level 24.

Here's what it looks like when we use the level 12 cellid 9279882742634381312:
python s2-cellid2latlong.py 9279882742634381312
Running s2-cellid2latlong.py v2016-08-12

S2 Level = 12
36.11195989469266 , -115.17705862118852

Level 12 test cellid 9279882742634381312 plotted on GoogleMaps via s2-cellid2latlong.py


This seems to confirm the results from s2map.com. For the same lat, long, changing the cellid level can significantly affect the returned (centre) lat, long.

We also tested our scripts against s2map.com with a handful of other cellids and lat/long/levels and they seemed consistent. Obviously time contraints will not let us test every possible point.

Final Thoughts

Using the s2sphere library, we were able to create a Python script to convert a lat, long and level to an S2 cellid (s2-latlong2cellid.py). We also created another script to convert a S2 cellid to a lat, long and level (s2-cellid2latlong.py).
The higher the cellid level, the more accurate the location. You can find the cellid level by using the s2-cellid2latlong.py script.
Plotting a cellid with s2map.com is the easiest way of visualizing the cellid boundary on a map. Higher levels (>24) become effectively invisible however.

To locate potential S2 cellids we can use search terms like "cellid" or variations such as "cellid=". If its stored in plaintext (eg JSON), those search terms should find it. No such luck if its encrypted or stored as a binary integer though.

While there are other S2 Python libraries, this Monkey decided to use sidewalklabs s2sphere library based on its available documentation and pain-free cross platform support (pip install is supported).

Other Google S2 Python libraries include:
https://github.com/micolous/s2-geometry-library
(As demonstrated in Christian's blog and also used in the Gillware Pokemon script. This seems to be Linux only)
and
https://github.com/qedus/sphere
(Has the comment: "Needs to be packaged properly for use with PIP")

Some other interesting background stuff ...
Interview article with David Blackman (Foursquare)

Matt Ranney's (Chief Systems Architect at Uber) video presentation on "Scaling Uber's Real-time Market Platform" (see 18:15 mark for S2 content)

Uber uses drivers phone as a backup data source (claims its encrypted)


In the end, creating the Python conversion scripts was surprisingly straight forward and only required a few lines of code.
It will be interesting to see how many apps leave Google S2 cellid artifacts behind (hopefully ones with a high cellid level). Hopefully these scripts will prove useful when looking for location artifacts. eg when an analyst finds a cellid/token and wants to map it to a lat/long or when an analyst wants to calculate a cellid/token for a specific lat, long, level.