I took a week's vacation from Standoutjobs, which of course meant spending some time hacking on personal projects.
One task was to scrape the SAQ site to get its entire database of wines. Downloading the pages with Ruby and wget / curl is child’s play. Building a scraper to retrieve data from those files should be just as easy - see e.g. Ruby Screen-Scraper in 60 Seconds. Basically, copy the XPath for the data you want from Firebug, paste it into your script, and you’re pretty much done.
This is where I hit gotcha #1. Firebug shows Firefox’s normalized HTML, which results in an extra tbody in the XPath even when it’s not there in the HTML received by curl. “Oh, no problem, I’ll just remove the tbody from the XPath.” That was wishful thinking: sometimes nested tables on saq.com have a tbody on only one of the tables, so you really have to check the raw HTML.
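Since a Firebug XPath can contain several tbody steps and only some of them may exist in the raw HTML, one way to "check" is to generate every variant of the path and try each one. The following is only a sketch of that idea, not what my import script actually did; the helper name tbody_variants and the sample path are made up for illustration.

```ruby
# Hypothetical helper: given an XPath copied from Firebug, return every
# variant obtained by dropping some, all, or none of its tbody steps.
# The untouched path comes first, the path with no tbody at all comes last.
def tbody_variants(xpath)
  steps = xpath.split('/')
  # Positions of the steps that are exactly "tbody" or "tbody[n]".
  tbody_idx = steps.each_index.select { |i| steps[i] =~ /\Atbody(\[\d+\])?\z/ }

  variants = []
  (0..tbody_idx.size).each do |n|
    tbody_idx.combination(n) do |drop|
      kept = steps.each_with_index.reject { |_, i| drop.include?(i) }.map(&:first)
      variants << kept.join('/')
    end
  end
  variants
end

# Illustrative path only, not taken from saq.com.
puts tbody_variants('/html/body/table/tbody/tr/td/table/tbody/tr[2]/td')
```

Each variant can then be run against the document parsed from the curl'd file (with Hpricot, something like doc.search(variant)), keeping the first one that returns any elements.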
So now I’m happily going through the 7195 files that were downloaded. Some throw nil exceptions for elements that aren’t present. I fix the script, re-run the import. Rinse, repeat.
Until of course, I hit gotcha #2. Unlike missing data or XPath issues, I get this befuddling error:
TypeError: can't convert nil into String
    from /usr/local/lib/ruby/gems/1.8/gems/hpricot-0.6/lib/hpricot/parse.rb:51:in `scan'
A bit of google-fu reveals it is a known bug: Hpricot can’t handle some files whose size is an exact multiple of 16384 bytes. The fix is easy once you know it: pad the file so its size is no longer a multiple of 16384.
echo ' ' >> my_specially_sized_file
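With thousands of downloaded files, the same padding can be applied from Ruby before parsing. This is a rough sketch rather than the script I ran: the method name pad_if_chunk_aligned and the 'pages/*.html' glob are assumptions, and it appends a single space where the echo above appends a space plus a newline.

```ruby
# Hypothetical helper: if the file size is an exact multiple of `chunk`
# bytes, append one space so it no longer is. Returns true when the file
# was padded, false otherwise.
def pad_if_chunk_aligned(path, chunk = 16_384)
  size = File.size(path)
  # Skipping empty files is my own assumption; there is nothing to parse anyway.
  return false if size.zero? || size % chunk != 0

  File.open(path, 'ab') { |f| f.write(' ') }
  true
end

# Assumed location of the downloaded pages.
Dir.glob('pages/*.html').each do |path|
  puts "padded #{path}" if pad_if_chunk_aligned(path)
end
```

Trailing whitespace is harmless to the HTML, so padding every affected file up front is simpler than rescuing the TypeError one file at a time.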
Along the way I saved links to del.icio.us with ‘gotcha’ and ‘hpricot’ as keywords, and noticed others had done the same thing for other projects. That could be a very handy resource when starting a project with a new set of tools.
2 comments ↓
Hi Daniel, I ran into the same problem you did: I used Firebug to get an XPath, then tried to retrieve data using Hpricot, and failed to get any results. I understand from your article that I can’t use the Firebug-generated XPaths as they are. I was wondering how you actually got the XPaths you needed. Is there a Firebug-esque program out there? Did you manually find the XPath by examining the HTML? Any help you can give would be greatly appreciated, as I am working on a senior thesis designing a website with Rails.
thanks so much,
lawrence
Hi Lawrence,
Well, I still ended up using Firebug to do most of the heavy lifting - only I had to check its XPaths against the HTML as it actually is in the document (which you can get via curl or wget). Usually, removing ‘tbody’ did the trick.
Some interactive debugging with script/console or irb lets you isolate most problems - start with partial XPaths to see what you can retrieve.
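To make the "partial XPaths" idea concrete, here is a small sketch you could paste into irb. The helper names xpath_prefixes and first_failing_prefix are invented for this example, and it assumes an absolute path of the kind Firebug copies (starting with a single slash).

```ruby
# Hypothetical helper: split an absolute XPath into progressively longer
# prefixes, e.g. "/html/body/table" gives "/html", "/html/body",
# "/html/body/table".
def xpath_prefixes(xpath)
  steps = xpath.split('/').reject(&:empty?)
  (1..steps.size).map { |n| '/' + steps.first(n).join('/') }
end

# Returns the shortest prefix for which the block reports zero matches,
# or nil if every prefix matches something. The block receives a prefix
# and must return a match count.
def first_failing_prefix(xpath)
  xpath_prefixes(xpath).find { |prefix| yield(prefix).zero? }
end

# With Hpricot loaded and a parsed document in `doc`, the call might be:
#   first_failing_prefix(path) { |prefix| doc.search(prefix).size }
```

The prefix it returns is the step where the Firebug path and the raw HTML diverge, which is usually where a tbody needs to come out.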
I’ve sometimes resorted to rather ugly hacks to get scraping jobs done. Combinations of CSS + XPath selectors, up/down/next from easily selectable elements: anything that gets data out with the least amount of work. In particularly nasty and malformed documents, I’ll even use regular expressions.
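As an illustration of the regular-expression fallback, here is a sketch that pulls a price out of a fragment too broken to parse reliably. The markup, the "price" class name and the helper extract_price are all made up for the example, not taken from any real site.

```ruby
# Hypothetical helper: find the first number that follows an element with
# class "price", skipping whitespace and any tags in between. Returns the
# number as a string, or nil when nothing matches.
def extract_price(html)
  pattern = /class="?price"?[^>]*>(?:\s|<[^>]+>)*\$?(\d+(?:[.,]\d{2})?)/i
  match = html.match(pattern)
  match && match[1]
end

# Invented fragment with unclosed cells and inconsistent attribute quoting.
fragment = '<tr><td class=label>Price<td class="price"> <b>$19.95</b>'
puts extract_price(fragment)
```

A pattern like this is brittle by design: it only makes sense once the proper selectors have failed, and it needs re-checking whenever the page layout changes.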
Good luck!