I Know Where You Were Last Summer: London's public bike data is telling everyone where you've been
I'll also explore how this dataset could be linked with other datasets to identify the actual people who made each of these journeys, and the privacy concerns this kind of linking raises.
It probably won't surprise you to learn that there is a publicly available Transport For London dataset that contains records of bike journeys for London's bicycle hire scheme. What may surprise you is that this record includes unique customer identifiers, as well as the location and date/time for the start and end of each journey. The public dataset currently covers a period of six months between 2012 and 2013.
What are the consequences of this? It means that someone who has access to the data can extract and analyse the journeys made by individual cyclists within London during that time, and with a little effort, it's possible to find the actual people who have made the journeys.
To show what's possible with this data, I built an interactive map to vizualize a handful of selected profiles.
Please note: the purpose of this article is to expose the risks that can come with open datasets. However I've held off from actually trying to find the people behind this data, mostly because of the privacy concerns but also because (thankfully) it requires a fair bit of effort to actually identify individuals from the data...
Below, you'll find a map of all journeys made by one specific cyclist (commuter X), selected because they're one of the top users of a familiar bicycle hire station near where I work:
|Bike journeys map - commuter X|
Each line represents a particular journey, the size of the line showing the number of times that journey was made. The size of the circle represents the number of different destinations that the cyclist has travelled to and from that bike station. Purple lines indicate there were journeys in both directions, while orange lines (with arrows) indicate journeys that were one-way only.
Bigger, therefore, implies the route or station has greater significance for the person.
NOTE: if you think you might be this person, and you're unhappy having your personal journey data here, please contact me and I will remove the offending map. Then contact TFL (as I have) and tell them to remove customer record numbers from the data.
So what can we tell about this person?
First impressions suggests that they probably live near Limehouse, work in Kings Cross, and have friends or family in the Bethnal Green / Mile End areas of London. This story is strengthened if we filter down to journeys made between 4.00am and 10.00am:
|Commuter X - morning journeys|
We can see that this person only travels to Kings Cross in the morning, when departing from the Limehouse area or from Bethnal Green. So a morning commute from home, and/or a partner's abode? Applying a similar filter for the afternoon and evening shows return journeys, so the commuting hypothesis becomes stronger still.
Like me, you're probably starting to feel a bit uncomfortable at this point - after all I'm putting a story to this person's data, and it's starting to sound quite personal.
What's more interesting (and worrying) is that:
- I'm not really trying very hard, and a deeper inspection of dates, times, locations etc. can reveal far more detail
- There's enough here to start thinking about putting a name to the data.
All that's needed to work out who this profile belongs to is one bit of connecting information.
A Foursquare check-in could be connected to a bike journey, though it would be difficult to connect it to the cycle scheme. More likely would be a time-stamped Facebook comment or tweet, saying that the Kings Cross boris bike station is full. Or a geo-coded Flickr photograph, showing someone riding one of the bikes...
Any seemingly innocuous personal signal would be enough to get a detailed record for someone's life in London ... travelling to work, meeting up with friends, secret trysts, drug deals - details of any of these supposedly private aspects of our lives can be exposed.
Here's another profile, chosen because of the volume of journeys made:
|Complex bike journey map|
Hopefully you can see the richness of the information that is available in the TFL dataset. Every connection on the map represents something of significance to the cyclist, each bike station has some meaning. As well as being a digital fingerprint that can be linked to personally identifiable information, the journey data is a window on this person's life.
On a final note, I'd like to point out that there are positives to releasing such data, which can be seen (for example) in the following map:
|Commuter destinations around Victoria|
The above map shows commuter journeys from a bike station near embankment to various stations around Victoria. These are journeys made between approximately 4.00pm and 5.30pm - so return commutes from work, presumably followed by a train journey from Victoria southwards. Here, there is one point of departure but three destinations, probably because Victoria Rail Station is a major transport hub, so the bike stations nearby will be popular and may often fill up.
The point is that there are benign insights that can be made by looking at individual profiles - but the question remains whether these kind of insights justify the risks to privacy that come with releasing journey data that can be associated with individual profiles.
Leaflet.js - web mapping library
Cloudmade - map tiles
Transport For London - datasets of Boris Bike data