Friday, 16 November 2012

Anomalies in Steam Community data

In a recent post I introduced the Steam Community API, and showed how to retrieve gamer data and perform a few simple but fun analyses.

While writing the posting, I came across several problems associated with the data that's returned. If you're thinking about using Steam Community data, it's worth bearing these anomalies in mind because of the impact they'll have on downstream processing and further analysis.

Frustratingly, the quality of the data available through the Steam Community API is quite variable - in particular there are many discrepancies between global achievement data compared to achievement data for individual players. I also came across several global achievement rates that were clearly invalid, and in some cases found that global achievement records for games were totally missing.

The net result: it's hard to trust that the data that's returned. It is still possible to analyze returned data, but you're going to need strong validation, normalization (e.g. of player achievements against a 'gold standard'), and potentially multiple attempts to retrieve equivalent data to ensure what you have is accurate.

Below, you'll find a (non-exhaustive) list of data quality issues I came across, along with some examples, and a little discussion about the problems they introduced and workarounds I used.

Disclaimer: the issues described here reflect my experience while using the Steam API to retrieve data in bulk, to allow me to analyse data for a large number of gamers. Your experience may differ - if so, please let me know.

"Test" achievements

One of the most obvious problems you might find are spurious achievements associated with games. These include a few that can easily be filtered ("null" or empty names) as well as others that are more problematic - such as many 'test' achievements. For example:

TEST_ACHIEVEMENT
TestAchievement
Achievement_Test

Those readers familiar with the Valve games Portal and Portal 2 will realize why these can't be easily filtered - many valid achievement names include some variation of the word "test", e.g. portal_escape_testchambers.

I only spotted these when querying global achievement statistics, so it's possible that problem may just be a filtering issue on that particular API endpoint.

Another example can be seen when compare my (woeful) achievements for AaAaAA!!! - A Reckless Disregard for Gravity with the global achievement data. You should see one extra achievement testo2 which a vanishingly small number of people have achieved - more than likely because it's a leftover artifact from when the game was integrated into Steam.

Out of date achievement lists

Another closely related issue is that personal achievement lists can get out of step with global achievement lists, casting doubt on the reliability of any comparisons made between player achievements.

For example while processing the global record for The Legend of Grimrock, I noticed two additional achievements in comparison to my record:

FIND_ALL_TREASURES, complete_game_normal

It's worth noting that the FIND_ALL_TREASURES achievement appears in lower case in my record, but the complete_game_normal entry was missing completely. As a result, it's necessary to normalize all achievement records for players before making any comparisons, which unfortunately means making assumptions about why entries are missing (e.g. that the game hasn't been played recently) and how to fix the data.

Interestingly, since last viewing the global stats for this game, the data has become one of the ...

Missing global records

A more severe, though in some ways easier to handle issue is that global stats for some games are just missing - though this seems to be an intermittent issue.

The aforementioned Legend of Grimrock is currently one of them, as was Civilization V a couple of weeks ago. This seems to be an API specific issue, because the equivalent website for The Legend of Grimrock shows many achievements as I write this.

It seems that obtaining global stats for games is a bit of a hit-and-miss affair, so be careful with any apparently empty lists of achievements you may see and don't assume that such responses are correct when considering further processing.

Achievements with huge percentages

The final issue I've come across is that some of the global stats for achievements are simply incorrect. For example RAGE by iD Software has several achievements held by over 730,000% of players.

Thankfully this particular issue is easily detected, and offending achievements can be filtered easily.

Photo credit: wilhei55 / Foter / CC BY

No comments:

Post a Comment