manicwave

Surf the wave

NewsHeap - Parsing RSS in Ruby


So you want to build an RSS Feed Reader. First we need to be able to parse RSS. Not to spec (0.91, 1.0, or 2.0), but the real-world stuff that is distributed as RSS.

What we need is a liberal parser that gets RSS enough to extract content and normalize it into a usable form. Mark Pilgrim has done some great work in Python with his ultra-liberal RSS parser.

Pilgrim's RSS parser builds upon a Python module for SGML parsing. The Ruby world provides a port of this as HTML/SGML Parser.

Good. It's fairly easy to port Python to Ruby. My colon hurt a little, but the experience was all good until I got to the open_resource method. The Python code provided a uniform way to access the data to parse. The comment said:

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.

Ruby 1.6 doesn't provide a uniform stream interface across URLs, files, and strings. Ruby 1.7 introduces StringIO, but I decided to factor the acquisition of data out of the parsing of data. The Python interface is: def parse(uri, etag=None, modified=None, agent=None, referrer=None): and the new Ruby interface is simply def parse(uri):
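
To make the split between acquiring data and parsing it concrete, here is a minimal sketch in current Ruby. The names acquire and extract_titles are hypothetical and do not appear in rss-parser.rb, and the regular expression is only a stand-in for a real parser. The point is just that the parsing step receives a plain string and never needs to know where it came from.

```ruby
require "stringio"

# Hypothetical helper (not part of rss-parser.rb): reduce any supported
# source to a plain string before the parser sees it.
def acquire(source)
  if source.respond_to?(:read)
    source.read   # IO-like objects such as File, StringIO, or Tempfile
  else
    source.to_s   # the caller already holds the raw data
  end
end

# Hypothetical stand-in for the real parser: it only ever receives a string.
def extract_titles(data)
  data.scan(%r{<title>(.*?)</title>}m).flatten
end

xml = "<rss><channel><title>Example</title>" \
      "<item><title>First post</title></item></channel></rss>"

titles_from_string = extract_titles(acquire(xml))
titles_from_stream = extract_titles(acquire(StringIO.new(xml)))
```

Both calls end up handing the same string to the parsing step, so fetching concerns such as HTTP headers can live entirely in the acquisition code.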

Here's the RSS Parser - it contains a class called HttpGetter that does etag, last-modified and gzip handling. A typical usage of the RSS parser would be:

    urls = ['http://www.pocketsoap.com/rssTests/rss1.0withModules.xml',
            'http://www.pocketsoap.com/rssTests/rss1.0withModulesNoDefNS.xml',
            'http://www.pocketsoap.com/rssTests/rss1.0withModulesNoDefNSLocalNameClash.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0noNSwithModules.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0noNSwithModulesLocalNameClash.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0NSwithModules.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0NSwithModulesNoDefNS.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0NSwithModulesNoDefNSLocalNameClash.xml']
    r = RssParser.new
    getter = HttpGetter.new
    urls.each { |url|
      print "#{url}\n"
      result = {}
      data = getter.readData(url, result)
      result = r.parse(data, result)
      pp(result)
    }

Of course, readData supports a variety of parameters that make it a good RSS netizen, e.g. def readData(source, result, etag=nil, modified=nil, agent=nil, referrer=nil)
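
For readers wondering what those parameters typically turn into on the wire, here is a rough sketch using Ruby's standard net/http. It is not the readData implementation from rss-parser.rb. The helper name build_feed_request and the agent string are made up for illustration, and I am assuming modified is passed as a Time. It only builds the request, so nothing is actually fetched.

```ruby
require "net/http"
require "time"
require "uri"

# Hypothetical helper, not the readData from rss-parser.rb: map the
# optional arguments onto the usual conditional-GET request headers.
def build_feed_request(url, etag = nil, modified = nil, agent = nil, referrer = nil)
  request = Net::HTTP::Get.new(URI.parse(url).request_uri)
  request["Accept-Encoding"]   = "gzip"             # ask for a compressed body
  request["If-None-Match"]     = etag if etag       # ETag from the last fetch
  request["If-Modified-Since"] = modified.httpdate if modified
  request["User-Agent"]        = agent if agent
  request["Referer"]           = referrer if referrer
  request
end

request = build_feed_request("http://some.url/rss.xml",
                             '"abc123"', Time.at(0), "NewsHeap-sketch")
```

A server that honours these headers can answer 304 Not Modified with an empty body when the feed has not changed, which is what makes the reader polite. Because Accept-Encoding is set by hand here, the caller is expected to gunzip the response body itself.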

If you just run ruby rss-parser.rb it will run a series of tests from Simon Fell's RSS Tests. To test a single feed, run ruby rss-parser.rb http://some.url/rss.xml or similar.

The parser has been tested with Ruby 1.6.7 and 1.7.3 on Windows. There are some differences between 1.6 and 1.7 - the notable ones are the introduction of StringIO and pp (pretty printing) in 1.7. Both of these are available with the shim, a library of post-1.6 enhancements backported to 1.6.

I've decided to support 1.6 natively, so you will see code like:

    # Abstract the differences between 1.7 and 1.6 without requiring the shim library.
    begin
      require "stringio" unless defined?(StringIO)
      body = StringIO.new(data)
    rescue LoadError
      # Ruby 1.6 has no StringIO, so spool the data to a temporary file instead.
      require "tempfile"
      body = Tempfile.new("CGI")
      body.binmode
      body.write(data)
      body.flush
      body.pos = 0
    end
    stream = body

    gzReader = Zlib::GzipReader.new(stream)
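
As a self-contained illustration of that last step, the following sketch round-trips a small document through gzip entirely in memory. It assumes a current Ruby where StringIO is always available and Zlib.gzip exists (2.4 or later), so it leaves out the Tempfile fallback. The sample XML is made up.

```ruby
require "stringio"
require "zlib"

original = '<rss version="2.0"><channel><title>Example</title></channel></rss>'

# Stand-in for a gzip-encoded HTTP response body.
compressed = Zlib.gzip(original)

# Wrap the bytes in a stream, as the parser code does, and decompress.
stream   = StringIO.new(compressed)
gz       = Zlib::GzipReader.new(stream)
restored = gz.read
gz.close
```

GzipReader only needs an object that responds to read, which is why either a StringIO or a rewound Tempfile works as the stream.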

Step 1 is complete. We can parse RSS, and we get back a hash with several entries: items, an array of items, each of which is a hash containing a title, description, and link; channel, a hash of channel information including the title, description, and link; modified, the last modification timestamp; and possibly an etag.
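
To picture that structure, here is a hand-written example of what such a hash might look like and how a caller could walk it. The values are invented, and the use of string keys is my assumption for the sketch. The real parser may use different key types or carry extra entries.

```ruby
# Invented sample data in the shape described above (string keys assumed).
result = {
  "channel" => {
    "title"       => "Example Feed",
    "description" => "A made-up channel",
    "link"        => "http://example.com/"
  },
  "items" => [
    { "title" => "First post",  "description" => "Hello", "link" => "http://example.com/1" },
    { "title" => "Second post", "description" => "World", "link" => "http://example.com/2" }
  ],
  "modified" => "Thu, 01 Jan 1970 00:00:00 GMT",
  "etag"     => '"abc123"'
}

# Build one display line per item.
headlines = result["items"].map { |item| "#{item["title"]} <#{item["link"]}>" }

puts result["channel"]["title"]
puts headlines
```

The modified and etag entries are the ones worth saving between runs, since they are what a later conditional fetch would send back to the server.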

In the next installment, we need to add some support for OPML. What better way to test our parser than to consume the OPML file from your current aggregator? We'll wrap some kind of command-line driver around OPML and add a control file to maintain etags and modified timestamps, paving the way to slap an initial UI on NewsHeap.