A few weeks ago, a friend of mine, Michael
Cook, retweeted a request for up-to-date
discounts on the Steam store:
I want to pay someone to help me automatically scrape steam for all
the data here, and then format it correctly.
— Lewie Procter (@LewieP) November 8,
I was interested, so I tweeted back. As part of the deal, though, I
asked Lewie if I could open source both the code and the data using
ScraperWiki. He was all for it.
Let me tell you a bit about ScraperWiki. It’s a simple idea: you write a
script that scrapes something (usually a web site), and they run it
every day for you. You then have a database you can query using SQL, as
well as all sorts of meta-information like how long the last run took
and how many pages it hit. It’s not perfect, but it does the job.
So I wrote a scraper in Ruby. It’s not the best code I’ve ever written,
and it has no tests. You use the web browser to write your script, and
testing is mostly limited to running it once in a while to see if it
works. I’m actually quite partial to this
approach, as web sites can easily change from underneath you, and so
unit tests would be pointless for data extraction. For manipulation and
storage, they’d be fairly useful. I could write the code separately and
run the tests on my own continuous integration server, but I couldn’t
see an easy way to hook ScraperWiki up to a GitHub repository or
something. Unless I tell it to scrape GitHub and execute the code that
it finds, but that’s way too meta, even for me.
I ended up scraping all prices for all games, not just the discounted
ones. I was hitting the pages anyway, and I figured it’d be more useful
for someone, even if we didn’t actually need the data. This means we’re
pulling in approximately 1800 games in three different countries—5400
price records per day. ScraperWiki handles this pretty well. Hopefully
it’ll keep it up over time as this database grows to a massive scale.
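
For a rough sense of how quickly that adds up, here is a back-of-the-envelope sketch in Ruby. The game and country counts are the approximate figures above; the constant and method names are made up for illustration, and it assumes the catalogue stays the same size, which it certainly won't.

```ruby
# Approximate figures from this post; the real catalogue changes over time.
GAMES = 1800
COUNTRIES = 3

# One price record per game per country on each daily run.
def records_per_day(games = GAMES, countries = COUNTRIES)
  games * countries
end

# Total rows after a number of daily runs, assuming a constant catalogue.
def records_after(days)
  records_per_day * days
end

puts records_per_day     # 5400
puts records_after(365)  # 1971000
```

So a year of daily runs would come to roughly two million rows under those assumptions.
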
The code runs pretty quickly, partially because the Ruby installation
that ScraperWiki provides has Typhoeus and Nokogiri installed (for
downloading web pages and parsing HTML respectively). These libraries
are ridiculously fast compared to most of their peers, and they’re two
of the many tools that make Ruby excellent for any software development
pertaining to the web.
Once you’ve got the data into a database, the next step is to turn it
into something useful. For Lewie, that was a list of all the discounted
games with their latest prices so he could post it up on SavyGamer. I
wrote this in Ruby too, querying my database and spitting out HTML. It
was pretty simple to do: you’re given a single function,
ScraperWiki::select, and you just trade SQL for rows. You don’t have to
just spit out HTML either. Give it
your own content type, and dump whatever you like to standard output.
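
To give a feel for the shape of such a view, here is a minimal sketch. It is not the code from the actual view. The column names, the sample rows, and the `fetch_discounts` and `render_html` helpers are all assumptions for illustration. On ScraperWiki the rows would come back from `ScraperWiki::select`; here that call is replaced with invented sample data so the sketch runs anywhere.

```ruby
require 'erb'

# Stand-in for the ScraperWiki::select call. The rows and column names
# are invented sample data, not real Steam prices.
def fetch_discounts
  [
    { 'name' => 'Example Game',   'price' => 4.99, 'original_price' => 9.99 },
    { 'name' => 'Another <Game>', 'price' => 2.50, 'original_price' => 10.00 }
  ]
end

# Turn rows into an HTML list, escaping names so odd titles can't break
# the markup.
def render_html(rows)
  items = rows.map do |row|
    name = ERB::Util.html_escape(row['name'])
    format('<li>%s: %.2f (was %.2f)</li>', name, row['price'], row['original_price'])
  end
  "<ul>\n#{items.join("\n")}\n</ul>"
end

# The view's output is simply whatever gets written to standard output.
puts render_html(fetch_discounts)
```

Keeping the query and the rendering in separate methods means the HTML part can be checked locally with made-up rows, even though the query itself only works on ScraperWiki.
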
I’m writing this post mostly because Michael had never heard of
ScraperWiki before, and I was shocked. It’s an excellent tool for
scraping something, and by default, the code and data are all public. If
you take a look at the scraper I wrote, you can use their API to query
all Steam prices for the last four days, and over time, that database
will grow. Perhaps in 2013, this will be a really useful data source,
and I’m really glad to have been a part of building it.