I Know Where You Were Last Summer: London's public bike data is telling everyone where you've been
I'll also explore how this dataset could be linked with other datasets to identify the actual people who made each of these journeys, and the privacy concerns this kind of linking raises.
It probably won't surprise you to learn that there is a publicly available Transport For London dataset that contains records of bike journeys for London's bicycle hire scheme. What may surprise you is that this record includes unique customer identifiers, as well as the location and date/time for the start and end of each journey. The public dataset currently covers a period of six months between 2012 and 2013.
What are the consequences of this? It means that someone who has access to the data can extract and analyse the journeys made by individual cyclists within London during that time, and with a little effort, it's possible to find the actual people who have made the journeys.
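As a minimal sketch of what "extracting the journeys made by individual cyclists" amounts to (not the code behind the maps in this article), assume the spreadsheet has already been read into a list of dictionaries keyed by column heading. The column names follow the headings reported for the downloaded files; the rows and station names are invented for illustration.

```python
from collections import defaultdict

# Column names as reported for the downloaded spreadsheets.
CUSTOMER = "Unique ID/Customer Record Number"
START = "StartStation Name"
END = "EndStation Name"

# Invented rows for illustration only; not taken from the real dataset.
sample_rows = [
    {CUSTOMER: "1001", START: "Station A", END: "Station B"},
    {CUSTOMER: "1002", START: "Station C", END: "Station A"},
    {CUSTOMER: "1001", START: "Station B", END: "Station A"},
    {CUSTOMER: "1001", START: "Station A", END: "Station B"},
]


def journeys_by_customer(rows):
    """Group journey rows by customer record number."""
    profiles = defaultdict(list)
    for row in rows:
        profiles[row[CUSTOMER]].append((row[START], row[END]))
    return dict(profiles)


profiles = journeys_by_customer(sample_rows)
for customer, journeys in sorted(profiles.items()):
    print(customer, len(journeys), journeys)
```

Once the rows are grouped like this, each value is one person's complete journey history for the period covered, which is why a single identifier column matters so much.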
To show what's possible with this data, I built an interactive map to visualise a handful of selected profiles.
Please note: the purpose of this article is to expose the risks that can come with open datasets. However, I've held off from actually trying to find the people behind this data, mostly because of the privacy concerns but also because (thankfully) it takes a fair bit of effort to identify individuals from the data...
Below, you'll find a map of all journeys made by one specific cyclist (commuter X), selected because they're one of the top users of a familiar bicycle hire station near where I work:
|Bike journeys map - commuter X [interactive version]|
Each line represents a particular journey, the size of the line showing the number of times that journey was made. The size of the circle represents the number of different destinations that the cyclist has travelled to and from that bike station. Purple lines indicate there were journeys in both directions, while orange lines (with arrows) indicate journeys that were one-way only.
Bigger, therefore, implies the route or station has greater significance for the person.
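The map encoding described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual map code: it assumes a line's weight combines journeys in both directions, and that a station's circle size is the number of distinct stations it is connected to in either direction. The function name and the station labels are hypothetical.

```python
from collections import Counter, defaultdict


def summarise_profile(journeys):
    """Summarise (start, end) pairs into route weights and station sizes."""
    directed = Counter(journeys)

    routes = {}
    neighbours = defaultdict(set)
    for (start, end), count in directed.items():
        key = tuple(sorted((start, end)))  # same key for A->B and B->A
        route = routes.setdefault(key, {"count": 0, "directions": set()})
        route["count"] += count
        route["directions"].add((start, end))
        neighbours[start].add(end)
        neighbours[end].add(start)

    for route in routes.values():
        # Two distinct directions seen: a two-way (purple) line;
        # otherwise a one-way (orange) line with an arrow.
        route["two_way"] = len(route["directions"]) == 2

    station_size = {name: len(others) for name, others in neighbours.items()}
    return routes, station_size


# Invented journeys for one profile.
journeys = [("A", "B")] * 3 + [("B", "A")] * 2 + [("A", "C")]
routes, station_size = summarise_profile(journeys)
print(routes[("A", "B")]["count"], routes[("A", "B")]["two_way"])
print(station_size)
```

Here the A-B route would be drawn as a thick two-way line and A-C as a thin one-way line, with station A getting the largest circle.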
NOTE: if you think you might be this person, and you're unhappy having your personal journey data here, please contact me and I will remove the offending map. Then contact TFL (as I have) and tell them to remove customer record numbers from the data.
So what can we tell about this person?
First impressions suggest that they probably live near Limehouse, work in Kings Cross, and have friends or family in the Bethnal Green / Mile End areas of London. This story is strengthened if we filter down to journeys made between 4.00am and 10.00am:
|Commuter X - morning journeys [interactive version]|
We can see that this person only travels to Kings Cross in the morning, when departing from the Limehouse area or from Bethnal Green. So a morning commute from home, and/or a partner's abode? Applying a similar filter for the afternoon and evening shows return journeys, so the commuting hypothesis becomes stronger still.
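The time-of-day filter used for the morning map can be sketched like this. It is only an illustration: the timestamp layout in `FMT` is an assumption about how the `Start Date` column is formatted, the window is treated as including 4.00am and excluding 10.00am, and the rows are invented.

```python
from datetime import datetime, time

FMT = "%d/%m/%Y %H:%M"  # assumed timestamp layout, adjust to the real files


def started_between(rows, earliest, latest):
    """Keep journeys whose start time of day falls in [earliest, latest)."""
    selected = []
    for row in rows:
        started = datetime.strptime(row["Start Date"], FMT).time()
        if earliest <= started < latest:
            selected.append(row)
    return selected


# Invented rows for a single profile.
rows = [
    {"Start Date": "03/09/2012 08:15", "StartStation Name": "Station A", "EndStation Name": "Station B"},
    {"Start Date": "03/09/2012 18:05", "StartStation Name": "Station B", "EndStation Name": "Station A"},
    {"Start Date": "04/09/2012 03:59", "StartStation Name": "Station C", "EndStation Name": "Station A"},
    {"Start Date": "04/09/2012 10:00", "StartStation Name": "Station A", "EndStation Name": "Station C"},
]

morning = started_between(rows, time(4, 0), time(10, 0))
evening = started_between(rows, time(16, 0), time(22, 0))
print(len(morning), len(evening))
```

Comparing the morning and evening selections for the same profile is what turns a set of lines on a map into a commuting pattern.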
Like me, you're probably starting to feel a bit uncomfortable at this point - after all, I'm putting a story to this person's data, and it's starting to sound quite personal.
What's more interesting (and worrying) is that:
- I'm not really trying very hard, and a deeper inspection of dates, times, locations etc. can reveal far more detail
- There's enough here to start thinking about putting a name to the data.
All that's needed to work out who this profile belongs to is one bit of connecting information.
A Foursquare check-in could be connected to a bike journey, though it would be difficult to connect it to the cycle scheme. More likely would be a time-stamped Facebook comment or tweet, saying that the Kings Cross Boris bike station is full. Or a geo-coded Flickr photograph, showing someone riding one of the bikes...
Any seemingly innocuous personal signal would be enough to get a detailed record for someone's life in London ... travelling to work, meeting up with friends, secret trysts, drug deals - details of any of these supposedly private aspects of our lives can be exposed.
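To get a feel for how identifying a single observation can be, one could count how many profiles are consistent with it. The sketch below does this on invented rows: given a station and a time at which someone was seen docking a bike, it returns the customer record numbers with a journey ending there within a tolerance. The function name, the five-minute tolerance and the timestamp layout are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

CUSTOMER = "Unique ID/Customer Record Number"
FMT = "%d/%m/%Y %H:%M"  # assumed timestamp layout


def matching_profiles(rows, station, seen_at, tolerance=timedelta(minutes=5)):
    """Return customer IDs with a journey ending at `station` near `seen_at`."""
    matches = set()
    for row in rows:
        if row["EndStation Name"] != station:
            continue
        ended = datetime.strptime(row["End Date"], FMT)
        if abs(ended - seen_at) <= tolerance:
            matches.add(row[CUSTOMER])
    return matches


# Invented rows; not taken from the real dataset.
rows = [
    {CUSTOMER: "1001", "EndStation Name": "Station B", "End Date": "03/09/2012 08:41"},
    {CUSTOMER: "1002", "EndStation Name": "Station B", "End Date": "03/09/2012 09:30"},
    {CUSTOMER: "1003", "EndStation Name": "Station C", "End Date": "03/09/2012 08:40"},
]

seen_at = datetime(2012, 9, 3, 8, 43)
candidates = matching_profiles(rows, "Station B", seen_at)
print(candidates)
```

If the result contains a single ID, that one observation is enough to single out a profile; a publisher could run the same kind of check before release to judge how risky an identifier column is.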
Here's another profile, chosen because of the volume of journeys made:
|Complex bike journey map [interactive version]|
Hopefully you can see the richness of the information that is available in the TFL dataset. Every connection on the map represents something of significance to the cyclist, each bike station has some meaning. As well as being a digital fingerprint that can be linked to personally identifiable information, the journey data is a window on this person's life.
On a final note, I'd like to point out that there are positives to releasing such data, which can be seen (for example) in the following map:
|Commuter destinations around Victoria [interactive version]|
The above map shows commuter journeys from a bike station near Embankment to various stations around Victoria. These are journeys made between approximately 4.00pm and 5.30pm - so return commutes from work, presumably followed by a train journey from Victoria southwards. Here, there is one point of departure but three destinations, probably because Victoria Rail Station is a major transport hub, so the bike stations nearby will be popular and may often fill up.
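A view like this boils down to counting where journeys from one station end up during a time window. The following is a rough sketch on invented rows, with hypothetical station labels and an assumed timestamp layout; it is not the code used to produce the map.

```python
from collections import Counter
from datetime import datetime, time

FMT = "%d/%m/%Y %H:%M"  # assumed timestamp layout


def destinations_from(rows, origin, earliest, latest):
    """Count end stations for journeys leaving `origin` within the window."""
    counts = Counter()
    for row in rows:
        if row["StartStation Name"] != origin:
            continue
        started = datetime.strptime(row["Start Date"], FMT).time()
        if earliest <= started <= latest:
            counts[row["EndStation Name"]] += 1
    return counts


# Invented rows: E is the origin, V1-V3 stand in for nearby destinations.
rows = [
    {"Start Date": "03/09/2012 16:10", "StartStation Name": "E", "EndStation Name": "V1"},
    {"Start Date": "04/09/2012 16:45", "StartStation Name": "E", "EndStation Name": "V2"},
    {"Start Date": "05/09/2012 17:05", "StartStation Name": "E", "EndStation Name": "V1"},
    {"Start Date": "06/09/2012 17:20", "StartStation Name": "E", "EndStation Name": "V3"},
    {"Start Date": "06/09/2012 18:00", "StartStation Name": "E", "EndStation Name": "V1"},
    {"Start Date": "07/09/2012 16:30", "StartStation Name": "X", "EndStation Name": "V1"},
]

counts = destinations_from(rows, "E", time(16, 0), time(17, 30))
print(counts.most_common())
```

Note that this particular count reads only the station and time columns.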
The point is that there are benign insights that can be made by looking at individual profiles - but the question remains whether these kinds of insights justify the risks to privacy that come with releasing journey data that can be associated with individual profiles.
Leaflet.js - web mapping library
Cloudmade - map tiles
Transport For London - datasets of Boris Bike data
Um, what? The dataset contains a Bike ID, not a customer ID. You are tracking bikes not individual customers.
I think Siddle's point is that given enough overlapping data, it might be possible to identify certain bike users. He mentions, for instance, time-stamped and geo-coded photos.
The actual bike data that you download from the TFL website contains customer record numbers - the maps really are showing profiles for people.
It may of course be a mistake, and I've tried telling this to TFL. In the meantime, it's possible for someone to download and analyse your movements - if they can identify your profile.
Really good work James. Now if you can overlay Twitter locations and Facebook updates, you can open a dot-com and be bought out in 5 years for a billion dollars. GCHQ and the NSA are hiring as well.
Indeed, though a billion dollars seems a bit low IMO.
This is a really interesting analysis JS. The Open Data and Privacy project (of OKF+ORG) is raising similar questions of whether privacy risks are being sufficiently managed if individuals can make the sorts of inferences from open datasets such as you have done here. We propose to outline some basic guidelines which data publishers must follow in order to ensure such privacy lapses are minimized. Follow us on Twitter to stay updated @OpenDataPrivacy
I just downloaded the dataset. Bunch of xlsx spreadsheets, and I see the "Unique ID/Customer Record Number" field in there. It's not shown on the list of fields on this doc site. I'm reliably informed (by Ollie Obrien, creator of this awesome bikeshare map), that this field wasn't in earlier versions of the dataset. Maybe they've added it by mistake.
I wouldn't say it's a disastrous breach of privacy. As I tried to imagine the more nefarious use case, I thought about a stalker. He spots his target docking her boris bike and makes a note of the time. Bingo! he can correlate that to find her customer ID and see where she's been ...except the data's only published for a 6 month period in the past right? And ultimately it only reveals where somebody's been boris-biking. It's not like you have their home address, as you would if you just followed them home for example.
Even so it's pretty interesting how big databases involving people's location can be more personal than one might at first imagine, and only by adding one seemingly anonymous numeric column.
Assuming you are speaking about the data here: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
Under the Network statistics -> Barclays Cycle Hire statistics
The documentation states:
Details of all Barclays Cycle Hire journeys.
The journey information includes:
Journey ID, Bike ID, Start date, Start time, End date, End time, Start docking station, Start docking station ID, End docking station, End docking station ID
Yet in the excel file we see the headings:
Rental Id, Billable Duration, Duration, Unique ID/Customer Record Number, Subscription Id, Bike Id, End Date, EndStation Id, EndStation Logical Terminal, EndStation Name, endStationPriority_id, Start Date, StartStation Id, StartStation Logical Terminal, StartStation Name, startStationPriority_id, EndHourCategory Id, StartHourCategory Id, BikeUserType Id
Excellent article about bike hire in London and the technology which can be used to track us.
The fact that third parties might identify these customers is a secondary danger. The primary danger is that the operator of the system already knows who they are, and so does Big Brother.
The system must be redesigned not to collect this information at all. See http://gnu.org/philosophy/surveillance-vs-democracy.html.
Dr Richard Stallman
President, Free Software Foundation (gnu.org, fsf.org)
Internet Hall-of-Famer (internethalloffame.org)
Thanks for the comment. It's a good point - misuse of this data by government agencies (etc) is a risk.
Also I guess it says something about the state of the world that I'm scared to click that link you provided, in case I raise a red flag somewhere :/
What tool did you use to visualise the data?
Hi James - I note that the most recent cycle hire data from tfl offers data under the following headings:
Is the issue resolved now? Did TfL ever get back to you about this post?
I'm interested because I'm writing a report on open data and TfL at the moment, would love to hear from you - email@example.com