Visualizing Gamer Achievement Profiles using R
In this post, I'll describe how to go about visualising and interpreting gamer achievement data using R, the open source tool for statistical computing. Specifically, I'll show how you can create gamer achievement profiles based on publicly available achievement records from the Steam community API.
The visualisations and data interpretation will hopefully be of interest to a general audience, but for the more technically inclined reader I've included the steps required to create the visualisations. If you're mainly interested in the analysis and interpretation, you might want to skip ahead to the Achievement Rate Distributions section.
If you're not a coder, don't be put off - R really is straightforward. The following histogram, for example, can be created from a data set using just two lines of code:
This histogram shows global achievement rates (in percentage points) for all Steam achievements - more on this below.
Achievement Data
So what gamer data are we talking about? The Steam community API provides both individual and global achievement records. For individual gamers, you can retrieve the lists of achievements they hold on a game-by-game basis. For the community as a whole, the API provides access to the global achievement rate of each achievement - that is, the percentage of players of the game who hold that particular achievement.
Using the approach described in a previous blog post, it's relatively easy to obtain these data sets, though reading the global achievement rates for all games is a little time-consuming.
The global achievement data set that I created looks like this:
The data is simply one line per game achievement, with three whitespace-delimited columns corresponding to the Game ID, the Achievement ID, and the global achievement rate. You'll notice that the achievement IDs are quoted using the pipe character, which is necessary because some achievement IDs include spaces or quote characters.
The achievement data for specific gamers is quite similar:
Here, each line corresponds to one achievement held by a gamer, whose identity is indicated in the first column. I also chose a different quote character here because some gamer IDs happened to include the pipe character.
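To make the two file formats concrete, here is a minimal sketch that reads a few rows in each format. Every row below is made up for illustration, and the column name Achievement and the exact column layout of the gamer file are assumptions (only Game, User and Rate appear in the commands used later in this post).

```r
# Made-up rows in the global format: Game ID, Achievement ID, global rate.
# Achievement IDs are quoted with "|" so they may contain spaces or quotes.
global.lines <- c(
  'Game Achievement Rate',
  '220 |HL2_BEAT_GAME| 35.2',
  '220 |Zombie "Chopper"| 4.1',
  '400 |Long Jump| 12.7'
)
achrates.demo <- read.table(text = global.lines, header = TRUE, quote = "|")

# Made-up rows in the per-gamer format, quoted with "~" because a gamer ID
# may itself contain a pipe character.
gamer.lines <- c(
  'User Game Achievement Rate',
  '~pipe|fan~ 220 ~HL2_BEAT_GAME~ 35.2',
  '~SomeUser~ 400 ~Long Jump~ 12.7'
)
gamerdata.demo <- read.table(text = gamer.lines, header = TRUE, quote = "~")

print(achrates.demo)
print(gamerdata.demo)
```

Because only one character is treated as a quote in each call, the double quotes and the pipe inside the IDs come through as ordinary characters.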
R Basics
R is a popular tool among data miners because, among other things, it provides an easy way to generate "publication ready" charts such as histograms and scatter plots.
Getting up and running with R is simple. You can download an installation image via the R project homepage. Once installed and started, R provides a console for issuing commands, as shown below:
To load a data set, you can use read.table:
achrates <- read.table("ach-rates-full.txt", header=TRUE, quote="|")
The above reads the contents of a data file (ach-rates-full.txt) in table format into memory, accessible via the variable name achrates in this case. The parameters indicate that the file includes a header line, and that column values are quoted using the pipe character.
To view the data, simply type the name of the variable and press Enter, and R will print out the contents. Use dim to obtain the dimensions of the data (rows, then columns), e.g.:
> dim(achrates)
[1] 30081 3
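Printing a data set with tens of thousands of rows floods the console, so a few other standard inspection functions can be handy. The sketch below uses a tiny made-up data frame as a stand-in for achrates; the values and the Achievement column name are assumptions.

```r
# Small made-up stand-in for the real data (values are illustrative only).
achrates.demo <- data.frame(
  Game = c(220, 220, 400, 400),
  Achievement = c("A1", "A2", "B1", "B2"),
  Rate = c(35.2, 4.1, 12.7, 0.9)
)

head(achrates.demo, n = 2)      # first two rows only
str(achrates.demo)              # column names and types
summary(achrates.demo$Rate)     # min, quartiles, mean and max of the rates
n.rows <- nrow(achrates.demo)   # same as dim(achrates.demo)[1]
```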
I also found the subset function to be handy. You can use it to create a new data set based on some criteria, such as user name or game ID. For example, to obtain all global achievement rates for Half-Life 2 (game ID 220), you can type:
ar.hl2 <- subset(achrates, Game == 220)
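As a rough sketch of how far subset can be pushed, conditions can be combined and columns picked out in the same call. The data frame here is made up, and the 5% cut-off for a "rare" achievement is an arbitrary assumption.

```r
# Made-up stand-in for achrates (values are illustrative only).
achrates.demo <- data.frame(
  Game = c(220, 220, 220, 400),
  Achievement = c("A1", "A2", "A3", "B1"),
  Rate = c(35.2, 4.1, 0.8, 12.7)
)

# Combine conditions with & and keep only some columns via select.
rare.hl2 <- subset(achrates.demo, Game == 220 & Rate < 5,
                   select = c(Achievement, Rate))

# The same rows using plain bracket indexing.
keep <- achrates.demo$Game == 220 & achrates.demo$Rate < 5
rare.hl2.alt <- achrates.demo[keep, c("Achievement", "Rate")]
```

The subset form is easier to read at the console, while bracket indexing is the form you will usually see inside functions and scripts.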
That's all you need in order to read a data set, view the contents, and select a subset. But let's move on to something more interesting, and generate a few histograms...
Achievement Rate Distributions
To generate a histogram of values from your data, use the hist function. The histogram shown at the start of this post (and repeated just below) was generated from the global achievement data as follows:
hist(achrates$Rate)
This generates a simple, no-frills histogram of the global achievement rates for every achievement in Steam.
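If the no-frills version is too bare, hist accepts a few optional arguments for bins and labels. This is only a sketch: the rates are randomly generated stand-ins rather than real Steam data, and the bin width, title and colour are arbitrary choices.

```r
# Made-up rates skewed towards low values (illustrative only).
set.seed(1)
rates.demo <- 100 * rbeta(500, shape1 = 0.5, shape2 = 3)

# Fixed 5-point bins from 0 to 100, plus a title and axis label.
h <- hist(rates.demo,
          breaks = seq(0, 100, by = 5),
          main = "Global achievement rates",
          xlab = "Players holding the achievement (%)",
          col = "grey")

# hist() also returns the bin counts, which can be inspected directly.
h$counts
```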
How to interpret the data? Fundamentally, the data appears to show that the vast majority of achievements in Steam are held by only a small percentage of the players of each game. This isn't so surprising, given that many games on Steam are for casual gamers. Also, many games can be bought in bundles, which can lead to many games either being left unplayed, or played just once or twice - certainly that's the case for the games in my Steam account. It's also worth noting that a few achievements seem to have been created for test purposes, so will naturally only be held by a tiny proportion of gamers (i.e. the game developers, more than likely).
Digging into the data a little deeper provides further insight into the playing habits of Steam gamers. The following lines generate a histogram for a particular user, based on individual achievement data:
gamerdata <- read.table("user-ach-rates.txt", header=TRUE, quote="~")
gd.user <- subset(gamerdata, User == "SomeUser")
hist(gd.user$Rate)
Two of the gamers in my social circle (let's call them Mario and Luigi) have quite distinct profiles of the type of achievements they tend to get.
Mario has over a thousand achievements, coming from a total of 23 games. The histogram of global rates for his achievements looks similar to the overall distribution:
So Mario holds many achievements that are not typically held by other gamers for the games he plays. Luigi, on the other hand, has about 450 achievements, coming from 35 games. His histogram looks like this:
The difference is quite apparent: the achievements that Luigi gets tend to be those held by a good proportion of other gamers, and he has fewer of the hard-to-get achievements.
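Two profiles are easier to compare when they are drawn side by side with the same bins. The sketch below shows one way to do that; the data is randomly generated to mimic the two shapes described above and is not the real data for either gamer.

```r
# Made-up per-gamer data; user names and rates are illustrative only.
set.seed(2)
gamerdata.demo <- data.frame(
  User = rep(c("Mario", "Luigi"), times = c(300, 150)),
  Rate = c(100 * rbeta(300, 0.5, 3),   # mostly rare achievements
           100 * rbeta(150, 2, 2))     # mostly common achievements
)

bins <- seq(0, 100, by = 10)
old.par <- par(mfrow = c(1, 2))        # two plots side by side
for (u in c("Mario", "Luigi")) {
  gd.user <- subset(gamerdata.demo, User == u)
  hist(gd.user$Rate, breaks = bins, main = u, xlab = "Global rate (%)")
}
par(old.par)                           # restore the previous layout
```

Using identical breaks for both plots keeps the bars comparable, although the y-axes still scale independently unless ylim is set as well.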
Interpretation
Broadly speaking, the above profiles describe two quite distinct types of gamer.
The first - Mario - has a few games that he plays all the time. Mario gets most or all of the achievements, clocks up lots of game hours, and perhaps tends towards the e-sports end of the gaming spectrum - playing multi-player games with friends or adversaries over the net.
The second - Luigi - has more games, and tends to dip in and out of them. This type of gamer is perhaps more interested in the game experience or story, rather than obtaining every achievement or exploring every area of a game. A Luigi gamer fits more in the category of casual gamer.
Of course, these are my interpretations of the data from a few simple data plots, and would need to be backed up with further data capture and analysis to hold any serious weight.
But hopefully they hint at what might be possible with such data. One can imagine, for example, building classification systems that are able to categorise gamers based on their achievement profile. Such categorisations could be used to generate recommendations, targeted adverts, friend suggestions etc. There may also be other rich sources of related data available to further enhance the gaming ecosystem.
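As a very rough illustration of what a first step towards such a categorisation might look like, the sketch below reduces each gamer to the median global rate of their achievements and applies a threshold. The data, the 10% threshold and the label names are all hypothetical; a real classifier would need far more features and proper validation.

```r
# Made-up per-gamer achievement rates (illustrative only).
gamerdata.demo <- data.frame(
  User = c("Mario", "Mario", "Mario", "Luigi", "Luigi", "Luigi"),
  Rate = c(1.5, 4.0, 20.0, 45.0, 60.0, 8.0)
)

# Median global rate of the achievements each gamer holds.
median.rate <- tapply(gamerdata.demo$Rate, gamerdata.demo$User, median)

# Hypothetical rule: a median below 10% marks a "completionist" profile.
profile <- ifelse(median.rate < 10, "completionist", "casual")
print(profile)
```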
Note on data quality
In a previous blog post, I drew attention to a few issues present in data retrieved from the Steam community API, and some of these cropped up again while I was creating the visualisations here. As such, the set of global achievement rates may not be complete, or may include some spurious entries from test achievements, which may slightly increase the skew towards low achievement rates.
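One simple way to gauge how much such entries matter is to filter out obviously suspect rows and redraw the histogram. The sketch below is an assumption about what "suspect" might mean (missing rates, or rates outside the range above 0 and up to 100), not a rule taken from the Steam API, and a rate of 0 could of course be genuine.

```r
# Made-up stand-in for achrates with two suspect rows (illustrative only).
achrates.demo <- data.frame(
  Game = c(220, 220, 400, 400),
  Achievement = c("A1", "TEST_ACH", "B1", "B2"),
  Rate = c(35.2, 0, 12.7, NA)
)

# Keep rows whose rate is present, strictly above 0 and at most 100.
clean <- subset(achrates.demo, !is.na(Rate) & Rate > 0 & Rate <= 100)

# How many rows the filter dropped.
n.dropped <- nrow(achrates.demo) - nrow(clean)
```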