Hoopla!

now with extra whiz-bang!

Scraping the Craigslist blog: FAIL

April 03, 2008 · 1 comment

I heard from Jason Calacanis this morning that Craigslist has a new blog. The downside? No feed. A little searching turned up that Josh Catone (a great guy I met through RailsForum) pieced together a feed that just has the title and date of the blog entries for those of us who’re feed-reader dependent.

But I don’t want to have to visit the site; I just want it to appear in Google Reader like TechCrunch and RobotWalrus do. So I tried to piece together a Ruby script that would scrape the blog and turn it into a feed. Conclusion? Total failure.

Here’s the code that should work:
require 'rubygems'
require 'hpricot'
require 'active_support' # newer Active Support releases no longer ship the 'activesupport' alias
require 'rss/maker'
require 'net/http'

blog = Hpricot.parse(Net::HTTP.get(URI.parse('http://blog.craigslist.org')))
main_table_cell = (blog / 'td').find {|td| td.attributes['width'] == '625' }

feed = RSS::Maker.make('1.0') do |rss|
  rss.channel.about         = "Craigslist Blog" 
  rss.channel.title         = "Craigslist Blog" 
  rss.channel.description   = "Craigslist Blog" 
  rss.channel.link          = "http://blog.craigslist.org/" 
  # Each post appears to start with an empty <a> anchor; the siblings after it hold the post.
  (main_table_cell / 'a').select {|a| '' == a.inner_text }.each do |anchor|
    intro         = anchor.next_sibling
    header        = (intro / 'h2').first
    date          = Date.parse(header.next.inner_text.scan(/Posted (.*) by/).flatten.first)
    author_link   = header.next_sibling
    comments_link = author_link.next_sibling
    permalink     = comments_link.next_sibling

    contents = []
    paragraph = intro
    while paragraph = paragraph.next_sibling do
      contents << paragraph.inner_html
    end
    contents << "<a href='#{comments_link.attributes['href']}'>Comments</a>" 

    item          = rss.items.new_item
    item.author   = author_link.inner_text
    item.title    = header.inner_text
    item.link     = permalink.attributes['href']
    item.date     = date
    item.description = '<p>' + contents.join('</p><p>') + '</p>'
  end
end

puts feed

But the Craigslist blog has the least valid HTML I’ve seen since the ’90s. This script barfs pretty quickly because of wildly inconsistent placement of <p>, <a>, and even <hr> tags. It would take a much bigger script to outsmart the CHTML (Crappy Hypertext Markup Language) on the Craigslist blog.
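
One cheap way to make a script like this less brittle, short of actually outsmarting the markup, is to skip any post whose nodes aren't where the script expects them instead of letting the whole run die. This is only a rough sketch: `safely_collect` is a made-up helper name, and the list of rescued errors is my guess at what a missing sibling or an unparseable date would raise.

```ruby
# Hypothetical helper (not part of the original script): runs the given
# block once per item and keeps the results, skipping any item whose
# extraction raises one of the errors a missing node or bad date would cause.
def safely_collect(items)
  items.each_with_object([]) do |item, results|
    begin
      results << yield(item)
    rescue NoMethodError, TypeError, ArgumentError => e
      # e.g. calling next_sibling on nil, or Date.parse on garbage
      warn "skipping malformed post: #{e.message}"
    end
  end
end
```

In the script above, the body of the `each do |anchor|` loop would move into the block passed to `safely_collect`, and the feed items would be built from whatever comes back. Posts with broken markup would silently go missing from the feed, so this is a band-aid, not a fix.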

So here’s my petition to Craigslist: let me read! Please! Even something as simple as wrapping each post in a <div class='post'> would fix everything.
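
To show how little code it would take with that wrapper in place, here's a sketch run against made-up markup. Everything in `SAMPLE_HTML` is hypothetical (the structure, the URLs, the post text), and `posts_from` is just a name I picked. It uses REXML instead of Hpricot so it runs without installing extra gems, which only works because the sample is well-formed; real-world HTML would still want a forgiving parser.

```ruby
require 'rexml/document'

# Hypothetical markup: what the blog might look like if each post were
# wrapped in <div class='post'>. None of this is Craigslist's real HTML.
SAMPLE_HTML = <<~HTML
  <div id='main'>
    <div class='post'>
      <h2><a href='http://blog.example.com/first'>First post</a></h2>
      <p>Hello, world.</p>
      <p>Second paragraph.</p>
    </div>
    <div class='post'>
      <h2><a href='http://blog.example.com/second'>Second post</a></h2>
      <p>Another entry.</p>
    </div>
  </div>
HTML

# Returns one hash per post: the title, the permalink, and the post's
# paragraphs joined back together as an HTML string.
def posts_from(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//div[@class='post']").map do |post|
    link = post.elements['h2/a']
    {
      title: link.text,
      link:  link.attributes['href'],
      body:  post.get_elements('p').map(&:to_s).join
    }
  end
end

posts_from(SAMPLE_HTML).each do |post|
  puts "#{post[:title]} (#{post[:link]})"
end
```

Each hash maps straight onto an RSS item's title, link, and description, with no sibling-hopping required.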

  • 1 feed43 // Apr 04, 2008 at 01:14 AM

    just use feed43.com