points by pdkl95 7 years ago

> how would you use that information to identify a natural person?

That depends a lot on 1) the other data submitted in the same set of analytics events, and 2) the data found in other databases that might correlate with the data in #1.

> how long does it take?

How long does it take to run a SELECT statement that JOINs a handful of large tables? This could be any amount of time, but I suspect anybody with a lot of resources like Google can probably run this kind of query (e.g. map all analytics records to personal gmail accounts) ad-hoc in minutes. A better idea would be to integrate the correlation into the handling of analytics events.

> How much does it cost?

How much does it cost to run a large query on your DB? The only real expenses would derive from the volume of analytics events you want to process per second. Mapping a single analytics event to existing databases would be approximately free.

> Is it reasonably likely to be used?

I have very little doubt that at least Google and FB do this kind of re-correlation in some situations. I have no idea how common the practice would be.

These questions suggest you might be missing just how trivial this problem is to solve. Google already has massive databases that identify a "natural person" (like a gmail account associated with a mobile telephone number for 2FA). Unrelated to GA, the databases handling regular gmail activity can store [IP addr, other TCP/IP headers, HTTP headers, accurate (~1s) timestamps] simply because your browser made an HTTP request over a TCP socket to fetch the text of your email.

With all those resources available, Google receives a GA event, notices aip=1, and dutifully sets the least significant 8 bits to 0. At that point they simply use the other 24 bits to search the recent logs for matching HTTP requests. This may already select a unique account, but in general it probably selects about 200 to 500 (256 from the ambiguity of not using 8 bits of address, multiplied by the average number of gmail users behind the same NATed address).
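To make the masking step concrete, here is a minimal sketch of zeroing the low 8 bits and then using the remaining 24 bits to pull candidates out of a log. The log rows, field names, and the `candidates` helper are hypothetical and made up for illustration; this is not how any real analytics backend is known to work.

```python
import ipaddress

def anonymize_ipv4(addr: str) -> str:
    """Zero the least significant 8 bits of an IPv4 address."""
    value = int(ipaddress.IPv4Address(addr))
    return str(ipaddress.IPv4Address(value & 0xFFFFFF00))

def candidates(masked: str, recent_logs):
    """Return log rows whose source address shares the surviving 24 bits."""
    network = ipaddress.ip_network(f"{masked}/24")
    return [row for row in recent_logs
            if ipaddress.ip_address(row["ip"]) in network]

# Hypothetical log rows from some unrelated service (invented data,
# using documentation-only address ranges).
recent_logs = [
    {"account": "alice", "ip": "203.0.113.7"},
    {"account": "bob", "ip": "203.0.113.200"},
    {"account": "carol", "ip": "198.51.100.9"},
]

masked = anonymize_ipv4("203.0.113.77")
matches = candidates(masked, recent_logs)
```

With these three invented rows, the masked address still narrows the search to the two accounts seen in 203.0.113.0/24.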

That was the easy part, which defines the real problem as finding the real account out of a selection of a few hundred. So start trying to correlate the rest of the available data. Did the GA event contain a UserAgent string that is unique with respect to the few hundred in our search space? If that wasn't unique, repeat with every other HTTP header. If still not unique, try longer tuples where the entire tuple must match. Repeat for any other available data.
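The narrowing loop described above could look roughly like this. It is only a sketch: the `narrow` function, the field names, and the candidate rows are all hypothetical, and a real system would presumably need fuzzier matching than strict equality.

```python
from itertools import combinations

def narrow(event, candidates, fields):
    """Try single fields first, then longer tuples of fields, and stop
    as soon as exactly one candidate matches the event on the whole tuple."""
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            hits = [c for c in candidates
                    if all(c.get(f) == event.get(f) for f in combo)]
            if len(hits) == 1:
                return hits[0], combo
    return None, ()  # nothing was unique; fall back to statistics

# Invented example data: an analytics event and three candidate accounts.
event = {"user_agent": "UA-1", "accept_language": "en-US"}
pool = [
    {"account": "alice", "user_agent": "UA-1", "accept_language": "en-US"},
    {"account": "bob", "user_agent": "UA-1", "accept_language": "de-DE"},
    {"account": "carol", "user_agent": "UA-2", "accept_language": "en-US"},
]

match, used = narrow(event, pool, ["user_agent", "accept_language"])
```

Here neither header is unique on its own (each matches two accounts), but the two-field tuple matches only one.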

I could get into the interesting ways you could exploit non-random IP numbers (how does your router rewrite TCP Source Port? Do your TCP Initial Sequence Numbers reveal your OS?[1]), but that level of analysis probably isn't necessary. An important question at this point is how much error is acceptable? Even if the previous searches did not result in a unique match, they probably reduced the search space down to only a handful of candidates. Start applying Bayes' Theorem[3] or other statistical analysis methods; is there a match with an acceptable confidence? What about a larger network[4] of inferences?
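As a rough illustration of the Bayes step, the sketch below turns a prior over the remaining candidates plus a per-candidate likelihood of the observed evidence into a posterior, and then checks it against an acceptable-confidence threshold. The numbers, the names, and the 0.8 threshold are assumptions invented for the example.

```python
def posterior(prior, likelihood):
    """Bayes' theorem over a finite set of candidates:
    P(candidate | evidence) is proportional to
    P(evidence | candidate) * P(candidate)."""
    unnormalized = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every candidate")
    return {k: v / total for k, v in unnormalized.items()}

# Invented numbers: three candidates left, no reason to prefer any of them.
prior = {"alice": 1 / 3, "bob": 1 / 3, "carol": 1 / 3}
# Assumed probability of seeing this event's metadata from each candidate.
likelihood = {"alice": 0.9, "bob": 0.05, "carol": 0.05}

post = posterior(prior, likelihood)
best = max(post, key=post.get)
confident = post[best] >= 0.8  # hypothetical acceptance threshold
```

Whether a posterior of about 0.9 is good enough is exactly the "how much error is acceptable?" question; the threshold is a policy choice, not something the math provides.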

There are many ways to approach the problem of finding the correct record out of a few hundred; I'm only sketching a fairly straightforward method. I'm sure Google and FB can do fancier things with better techniques such as machine learning. The point is that 24 bits of identifying entropy is a lot. It's already so close to being a unique identifier that constructing an actual unique ID only requires adding a few bits of entropy, which are probably available in the surrounding metadata and/or session data.
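For a back-of-the-envelope feel for "a few bits": picking one record out of N equally likely candidates takes log2(N) bits of additional information. The sketch below applies that to the 200 to 500 range mentioned earlier; the uniform-likelihood assumption is mine and is only a rough model.

```python
import math

def bits_to_single_out(candidates: int) -> float:
    """Bits of extra information needed to pick one item out of
    `candidates` equally likely ones (a simplifying assumption)."""
    if candidates < 1:
        raise ValueError("need at least one candidate")
    return math.log2(candidates)

low = bits_to_single_out(200)   # a little under 8 bits
high = bits_to_single_out(500)  # a little under 9 bits
```

Under that assumption the gap is on the order of 8 or 9 bits, which a reasonably distinctive UserAgent string plus one or two other headers could plausibly cover.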

[1] The ISN shouldn't reveal much in a modern OS. However, this[2] paper about how they used to be broken was really enlightening when I read it back when it was originally published. The visuals demonstrate clearly how easily your difficult/random searches can collapse into a trivial search space.

[2] http://lcamtuf.coredump.cx/oldtcp/tcpseq/print.html

[3] https://en.wikipedia.org/wiki/Bayesian_inference

[4] https://en.wikipedia.org/wiki/Bayesian_network