If we had one big map of geotags for every little expression on social media, we could track the spread of disease, get in front of major national emergencies and build living models of bustling cities and their daily events. The problem with making that big data dystopia, er, bright future a reality is that most people don’t geotag everything they do online, leaving most of the tens of billions of public tweets posted so far this decade useless for mapping purposes.
Well, problem solved: Malibu-based research firm HRL Laboratories created an algorithm that analyzes tweets to determine where the users live and where the tweets were coming from. From a fire hose of tweets from 101,846,236 Twitter users, they were able to geotag over 80 percent of all public tweets down to within a couple of miles.
When you can geotag millions of tweets, it opens a world of possibilities, including the ability to predict where a crime is likely to occur.
Though Pew polls have shown that most people consider their geographic location sensitive data, they don’t regard the makeup of their network of friends as nearly so private. Unfortunately, your list of friends is all it takes to home in on where you are.
“With respect to accuracy, using social networks to infer location makes sense only if a user’s friends are primarily located within the same geographic region,” Ryan Compton, who co-authored the study, told the Observer. “Our method is most accurate when a user’s friends are not geographically dispersed.”
In non-researcher speak, that means that the more of your friends and acquaintances live in your area, the easier it is to determine what that area is. To be clear, the researchers didn’t look at follows and un-follows. They looked at mentions, the people users are actively talking back and forth with on Twitter. And they only counted mentions that were reciprocated, which means they weren’t studying tweets in general, but public conversations between users.
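For the technically curious, here is a rough Python sketch of that idea, not HRL’s actual algorithm. It keeps only reciprocated mentions, then guesses a user’s location as the friend location closest to all the others, and declines to guess when friends are too spread out. The function names, the 100-kilometer cutoff and the use of a simple “most central friend” rule are all assumptions made for illustration; the article does not describe the study’s math.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def reciprocated_friends(mentions):
    """Map each user to the set of users they have exchanged mentions with.

    `mentions` is an iterable of (author, mentioned_user) pairs.
    One-way mentions are ignored.
    """
    pairs = set(mentions)
    friends = {}
    for author, mentioned in pairs:
        if author != mentioned and (mentioned, author) in pairs:
            friends.setdefault(author, set()).add(mentioned)
            friends.setdefault(mentioned, set()).add(author)
    return friends


def infer_location(user, friends, known, max_spread_km=100.0):
    """Guess a user's (lat, lon) from friends whose locations are known.

    Assumption for illustration: pick the friend location with the smallest
    total distance to the other friends, and give up (return None) when the
    median distance from that point exceeds `max_spread_km`.
    """
    points = [known[f] for f in friends.get(user, ()) if f in known]
    if not points:
        return None
    center = min(points, key=lambda p: sum(haversine_km(p, q) for q in points))
    distances = sorted(haversine_km(center, q) for q in points)
    if distances[len(distances) // 2] > max_spread_km:
        return None  # friends are too dispersed to say anything useful
    return center
```

Under these assumptions, a user whose mutual contacts are mostly in Los Angeles gets pinned to Los Angeles even if one friend lives in New York, while a user whose friends are scattered across continents gets no estimate at all, which mirrors Mr. Compton’s point about dispersion.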
The process here is called deanonymization, and it’s one of the fastest-growing threats to online privacy. Even if, as a consumer, you have anonymous purchases on Amazon, anonymous health records and anonymous social media accounts, someone can combine all of that information into a sophisticated profile of your habits and personal details, even though each individual service told you that you were protected.
“In the age of ubiquitous surveillance, where everyone collects data on us all the time, anonymity is fragile,” cybersecurity guru Bruce Schneier wrote in a recent op-ed for Passcode. “We either need to develop more robust techniques for preserving anonymity, or give up on the idea entirely.”
As for the data compiled by the study, HRL Laboratories has no intention of releasing the geotagged database of many millions of deanonymized tweets, but all of the math behind the creation of the database is up for grabs.
“Reproducing the work is definitely possible,” Mr. Compton said, “though it takes some time and technical skill.”
View the whole study below, if you’ve got the computing power and want to build a global map of millions of the unwittingly surveilled: