Harvesting Data from the Steam Community API
Introduction
The Steam community API is a web service that provides public access to information about Steam users, their games, achievements, and other related data. In this blog posting I'll describe some of the interesting data you can access, as well as how to model, retrieve, and process that data. I'll also show you how to generate a few fun, simple rankings and statistics for a group of Steam gamers.
This is primarily a technical article, but it concludes with the results of a simple analysis performed over a small number of friends and acquaintances on Steam, which may be of interest to the non-technically inclined.
The examples shown here can be reproduced using the sample code found in this GitHub repository. It's a work in progress, but hopefully provides enough insight so you can either repeat the results or build your own equivalent.
Accessing the API
The first thing to know is that Steam community data is accessed using a RESTful web service, through a number of related endpoints. Many of the endpoints don't require authentication, but some require you to register for a key which you then provide as a parameter when interacting with the API.
You'll find links to the API documentation below - see the first link for details on how to get a key:
Steam Web API Documentation (high level)
Steam Web API Reference
Steam Web API Self documenting API endpoint
How to access Community Data
The "Web API" supports both XML and JSON formats, while the closely related "Community Data" endpoints only support XML - it appears the latter are just public pages with an additional parameter of xml=1. In the rest of this posting, I provide XML examples for consistency, but the JSON resources seem to be equivalent. All of the URIs described below can be accessed using the HTTP GET verb, and in all cases appear to be browser-friendly (try clicking the examples).
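As a rough illustration of the point above, here is a minimal Java sketch of adding xml=1 to a public page URI and fetching it with a plain GET. The class and method names are hypothetical (they are not taken from the sample project), and the sketch assumes the response is UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
final class CommunityPages {

    // Appends xml=1, using '&' when the URI already carries a query string.
    static String withXmlFlag(String pageUri) {
        return pageUri + (pageUri.contains("?") ? "&" : "?") + "xml=1";
    }

    // Issues a plain HTTP GET and returns the response body as text.
    // Assumption: the body is UTF-8 encoded.
    static String get(String uri) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(uri).toURL().openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

For instance, `CommunityPages.withXmlFlag(profileUri + "/games")` would give the XML variant of a games page, where `profileUri` stands in for a real profile address.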
There are also one or two client libraries available for different languages, notably Steam Condenser, which is available for Java, PHP, and Ruby. Unfortunately I hit a bug caused (I believe) by changes to the behaviour of the Steam API, and ultimately decided to use HTTP directly, given that the API is relatively straightforward.
Available data
What kind of data can be accessed via the API? Some of the most interesting types of data are user profiles and user game lists, along with user achievements, which many users choose to make public. It's also possible to retrieve global achievement lists for games, which include percentages showing the proportion of players with the game who have a given achievement.
To make it easier to work with the data, it helps to establish a core domain model, that is, a set of concepts and relationships describing the problem domain. This helps with understanding the data, reasoning about how to process it, and further down the line, how to describe the data in code.
Given that we're interested in users, games, and achievements, the domain model is fairly simple:
|Simple UML domain model for player data. Diagram courtesy of ObjectAid.|
The above diagram was generated from core domain classes in the sample project, using a view-only UML modeling tool called ObjectAid. Aside from the three main concepts, you'll see relationships representing the fact that users have games, that games have achievements associated with them, and that users have achievements either held or yet to be achieved. You'll also see a few attributes for key data such as Steam ID, game name, etc.
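To make the model a little more concrete, here is a minimal Java sketch of the three concepts and their relationships. The class, field, and method names are assumptions chosen for illustration and are not necessarily those used in the sample project. For brevity, achievements are represented by their string identifiers rather than by a dedicated class.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical domain classes; names and fields are illustrative only.
final class Game {
    final long appId;                 // unique game identifier
    final String name;
    // Identifiers of the achievements associated with this game.
    final Set<String> achievementIds = new LinkedHashSet<>();

    Game(long appId, String name) {
        this.appId = appId;
        this.name = name;
    }
}

final class SteamUser {
    final String steamId64;           // 64 bit community ID, kept as text
    final String personaName;
    final Map<Long, Game> games = new LinkedHashMap<>();
    // appId -> (achievement identifier -> true if held, false if yet to be achieved)
    final Map<Long, Map<String, Boolean>> achievements = new LinkedHashMap<>();

    SteamUser(String steamId64, String personaName) {
        this.steamId64 = steamId64;
        this.personaName = personaName;
    }

    void addGame(Game game) {
        games.put(game.appId, game);
        achievements.computeIfAbsent(game.appId, id -> new LinkedHashMap<>());
    }

    void recordAchievement(long appId, String achievementId, boolean held) {
        achievements.computeIfAbsent(appId, id -> new LinkedHashMap<>())
                .put(achievementId, held);
    }

    // Number of achievements the user holds across all games.
    int heldCount() {
        int held = 0;
        for (Map<String, Boolean> perGame : achievements.values()) {
            for (boolean isHeld : perGame.values()) {
                if (isHeld) {
                    held++;
                }
            }
        }
        return held;
    }

    // Number of achievements recorded for the user, held or not.
    int possibleCount() {
        int total = 0;
        for (Map<String, Boolean> perGame : achievements.values()) {
            total += perGame.size();
        }
        return total;
    }
}
```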
Data retrieval
Before retrieving data from the various API endpoints, you'll need to find one or more Steam IDs. There are a few different ways of referring to Steam users, including personas (nicknames), login account names, identifiers reported by game servers that start with STEAM_, and 64 bit community IDs.
We're interested in the 64 bit Steam community IDs, which unfortunately require a little effort to obtain. A good starting point is your profile page, which can be accessed via an "id" or via the unique profile ID - the behaviour of the "id" endpoint is ambiguous, but it appears to attempt to resolve users by their registered nicknames. I ended up viewing the page source on my friend list page or on specific profiles in order to obtain community IDs. It's also worth noting that the API endpoint for retrieving friend lists may be the most reliable method.
If you have a Steam ID in one of the other formats, you might look into one of the sites dedicated to converting between the various ID formats (example). However, none of the sites I found generated 64 bit community IDs, so friend lists may be the best option.
Once you have an ID or two, you can start to pull down some data. Below, you'll find an outline of key API interactions needed to get player data, game lists, and achievement information.
First up, user profile data. You can get individual player summaries using a simple variant on the user profile link - just add xml=1 and you'll receive a computer readable version (example). Alternatively, use the Steam Web API to retrieve player summaries in batch, as follows:
key=[YOUR KEY HERE]&steamids=[STEAM 64 IDS]&format=[xml OR json]
For the sample stats shown at the end of this posting, I just needed the user's persona name which I retrieved using the second method shown above.
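The batch query above can be assembled in code along these lines. This is only a sketch: the endpoint address is not reproduced in the text, so it is passed in as a parameter, and the class and method names are hypothetical. Only the parameter names key, steamids, and format come from the query shown above. Joining several IDs with commas is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

// Hypothetical helper for building a player summary request URI.
final class PlayerSummaryQuery {

    // endpoint: base address of the player summaries resource (supplied by the caller).
    // format:   "xml" or "json".
    static String build(String endpoint, String key, List<String> steamIds, String format) {
        // Assumption: multiple 64 bit IDs are sent as one comma-separated value.
        String ids = String.join(",", steamIds);
        return endpoint
                + "?key=" + encode(key)
                + "&steamids=" + encode(ids)
                + "&format=" + encode(format);
    }

    // Percent-encodes a single query parameter value.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

Note that the encoder turns the comma separator into %2C, which a server should decode back to a comma.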
User game lists
A user's game list can also be retrieved using a simple variation on a web page URI. In this case, add /games?xml=1 to the end of a profile page URI (example) and you should have a complete list of games owned by the player.
The key data you'll need to pull out of the response is the appID - that is, the unique identifier for the game. The response also includes other player data you might find useful, including game names, play time, etc.
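Pulling the game identifiers out of the XML response might look something like the sketch below, which uses the JDK's built-in DOM parser. The appID element name comes from the description above. The rest of the document layout is not assumed: the code simply collects every appID element it finds. The class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical parser for a game list response.
final class GameListParser {

    // Returns the value of every appID element, in document order.
    static List<Long> appIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("appID");
            List<Long> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(Long.parseLong(nodes.item(i).getTextContent().trim()));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse game list XML", e);
        }
    }
}
```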
User achievement lists
Once you've retrieved a list of unique game IDs for a particular player, it's possible to start retrieving something a little more interesting - individual achievements for those games.
Again, you can obtain a player's achievements for a game in two ways: using a profile page URI, and using the Web API. Helpfully, you'll find links to a player's game achievements in the game list response. Adapt these by adding xml=1 and you're away (example - warning: possible game spoilers for XCOM: Enemy Unknown). I actually used the alternative provided by the Web API, as follows:
appid=[GAME ID]&steamid=[STEAM 64 ID]&key=[YOUR KEY HERE]&format=[xml OR json]
The key information you'll need here is the unique identifier for each achievement (apiname) and the flag indicating whether or not the player holds that achievement (achieved, values 1 or 0).
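As a sketch of how those two fields could be extracted from the XML form of the response: the apiname and achieved names come from the description above, while the assumption that each pair sits inside an element named achievement is mine and should be checked against a real response. The class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical parser for a player achievement response.
final class AchievementParser {

    // Maps each achievement identifier to true (held) or false (not yet achieved).
    static Map<String, Boolean> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Assumption: one <achievement> element per achievement.
            NodeList items = doc.getElementsByTagName("achievement");
            Map<String, Boolean> result = new LinkedHashMap<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String apiName = childText(item, "apiname");
                // The flag is 1 when the player holds the achievement, 0 otherwise.
                result.put(apiName, "1".equals(childText(item, "achieved")));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse achievement XML", e);
        }
    }

    // Text of the first descendant element with the given tag name.
    private static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```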
Global achievement data
The final endpoint that's worth reviewing retrieves global achievement lists and percentages for games. This data only appears to be available via an unauthenticated endpoint through the Web API (example), as follows:
gameid=[GAME ID]&format=[xml OR json]
This achievement data is useful both for cross-referencing with user achievement data, and for comparing individual achievements with global levels. The code in the sample project pulls down this data and uses it to validate and normalize user achievement lists.
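The text does not spell out exactly how the sample project validates and normalizes, so the following is just one plausible interpretation, sketched in Java with hypothetical names: treat the global list as the authoritative set of achievement identifiers, drop anything in the user's list that is not in it, and mark anything missing from the user's list as not held.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical normalizer; one possible reading of "validate and normalize".
final class AchievementNormalizer {

    // userAchievements: achievement identifier -> held flag, as reported for the user.
    // globalIds:        identifiers from the global achievement list for the same game.
    static Map<String, Boolean> normalize(Map<String, Boolean> userAchievements,
                                          Set<String> globalIds) {
        Map<String, Boolean> normalized = new LinkedHashMap<>();
        for (String id : globalIds) {
            // Identifiers absent from the user's list count as not held.
            normalized.put(id, Boolean.TRUE.equals(userAchievements.get(id)));
        }
        // Identifiers unknown to the global list are dropped simply by not being copied.
        return normalized;
    }
}
```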
Putting it all together
So by now, it's hopefully clear what kind of data you can access via the Steam Community API, and how to retrieve it using the various HTTP endpoints. But that's not quite enough to be able to start working with the data.
Below, you'll find an outline of the steps required to put together a cohesive data model that can then be analysed, persisted, and processed further:
For one or more Steam 64 identifiers:
- Retrieve user profile data, create a user record.
- Read the list of user games (capturing game IDs), associate them with the user.
- Read user achievements per game (capturing game ID, plus achievement ID and status), and associate them with the user.
The end result should be a collection of users, each with an associated set of games, and for each user/game pair, a set of achievements both held and yet to be achieved. In the sample project, I use Java classes to hold user, game, and achievement entities, along with a few Collection objects to record game lists and achievement outcomes. The sample code also retrieves global achievement data for validation and normalization.
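The steps above can be sketched as a small loop. The SteamSource interface below is a hypothetical stand-in for the HTTP calls described earlier, and the class names are illustrative rather than those of the sample project. The global achievement step is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction over the profile, game list, and achievement requests.
interface SteamSource {
    String personaName(String steamId64);
    List<Long> gameIds(String steamId64);
    Map<String, Boolean> achievements(String steamId64, long appId);
}

// One user, with achievements grouped per game.
final class PlayerRecord {
    final String steamId64;
    final String personaName;
    // appId -> (achievement identifier -> held flag)
    final Map<Long, Map<String, Boolean>> achievementsByGame = new LinkedHashMap<>();

    PlayerRecord(String steamId64, String personaName) {
        this.steamId64 = steamId64;
        this.personaName = personaName;
    }
}

final class ModelAssembler {

    static List<PlayerRecord> assemble(List<String> steamIds, SteamSource source) {
        List<PlayerRecord> players = new ArrayList<>();
        for (String steamId : steamIds) {
            // Step 1: profile data becomes a user record.
            PlayerRecord player = new PlayerRecord(steamId, source.personaName(steamId));
            // Steps 2 and 3: each owned game, then that game's achievements.
            for (long appId : source.gameIds(steamId)) {
                player.achievementsByGame.put(appId, source.achievements(steamId, appId));
            }
            players.add(player);
        }
        return players;
    }
}
```

Keeping the retrieval behind an interface also makes it easy to swap in canned data when testing the processing code.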
|Image courtesy of smarnad / FreeDigitalPhotos.net|
|6||Amnesia: The Dark Descent||0|
|6||Counter-Strike: Global Offensive||193|
|6||Counter-Strike source: Beta||154|
|6||Half-Life 2: Deathmatch||0|
|6||Half-Life 2: Lost Coast||0|
|6||Left 4 Dead 2||69|
|6||Super Meat Boy||49|
No surprises there then. Portal and Portal 2 being highly popular, followed by a few of the top indie games and several stock Valve games. The only slightly puzzling thing being that Half-Life
Next, who are the biggest achievers? Names have been anonymized to protect the innocent:
|Olly at home DOTT||789/6550||12%|
Congratulations Chas, the runaway winner with one thousand achievements, and the highest proportion of possible achievements held. Coming in close behind are Olly and Shreddies, Olly holding the higher number of achievements but Shreddies having achieved a higher relative proportion. Bringing up the rear, the wooden spoon award goes to Cuppa, who has both the lowest number of achievements overall and the lowest overall proportion.
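For the curious, a ranking like this boils down to two numbers per player: achievements held, and achievements possible across their games. A minimal Java sketch follows, with hypothetical class names. Rounding the percentage to the nearest whole number is my assumption about how the figures were produced.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical per-player score.
final class AchieverScore {
    final String name;
    final int held;       // achievements the player holds
    final int possible;   // achievements available across the player's games

    AchieverScore(String name, int held, int possible) {
        this.name = name;
        this.held = held;
        this.possible = possible;
    }

    // Share of possible achievements held, rounded to the nearest whole percent.
    long percent() {
        return possible == 0 ? 0 : Math.round(100.0 * held / possible);
    }

    String summary() {
        return name + " " + held + "/" + possible + " (" + percent() + "%)";
    }
}

final class AchieverRanking {

    // Returns a new list ordered by number of achievements held, highest first.
    static List<AchieverScore> rank(List<AchieverScore> scores) {
        List<AchieverScore> sorted = new ArrayList<>(scores);
        sorted.sort(Comparator.comparingInt((AchieverScore s) -> s.held).reversed());
        return sorted;
    }
}
```

With the figures from the table row above, 789 held out of 6550 possible comes out at 12%.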
These particular stats are just for fun and shouldn't be taken too seriously, but they do hint at some more compelling uses of the data. For example, it would be interesting to go beyond pure rankings and further analyse the achievements held by gamers from a particular social group. But that's the topic of a future post.
A final note on data reliability
One final thing to mention is that my experience with the quality of data exposed by the Steam Web API has been mixed. The data exposed by the community pages (i.e. the public pages, with xml=1 added) seems more reliable than the data provided by the Steam Web API. One reason might be that the Web API provides less filtering, while the profile page is designed to be human readable and thus may be subject to greater filtering or curation.
Next time, I'll be discussing those issues in more detail as well as discussing the importance of, and problems associated with, obtaining reliable data.