Hoopla! - now with extra whiz-bang home

I heard from Jason Calacanis this morning that Craigslist has a new blog. The downside? No feed. A little searching turned up that Josh Catone (a great guy I met through RailsForum) pieced together a feed that just has the title and date of the blog entries for those of us who're feed-reader dependent.

But I don't want to have to visit the site; I just want it to appear in Google Reader like TechCrunch and RobotWalrus do. So I tried to piece together a Ruby script that would scrape the blog and turn it into a feed. Conclusion? Total failure.

Here's the code that should work:

require 'rubygems'
require 'hpricot'
require 'activesupport'
require 'rss/maker'
require 'net/http'

blog = Hpricot.parse(Net::HTTP.get(URI.parse('http://blog.craigslist.org')))
main_table_cell = (blog / 'td').find {|td| td.attributes['width'] == '625' }

feed = RSS::Maker.make('1.0') do |rss|
  rss.channel.about = "Craigslist Blog"
  rss.channel.title = "Craigslist Blog"
  rss.channel.description = "Craigslist Blog"
  rss.channel.link = "http://blog.craigslist.org/"

  (main_table_cell / 'a').select {|a| '' == a.inner_text }.each do |anchor|
    intro         = anchor.next_sibling
    header        = (intro / 'h2').first
    date          = Date.parse(header.next.inner_text.scan(/Posted (.*) by/).flatten.first)
    author_link   = header.next_sibling
    comments_link = author_link.next_sibling
    permalink     = comments_link.next_sibling

    contents = []
    paragraph = intro
    while paragraph = paragraph.next_sibling do
      contents << paragraph.inner_html
    end
    contents << "<a href='#{comments_link.attributes['href']}'>Comments</a>"

    item          = rss.items.new_item
    item.author   = author_link.inner_text
    item.title    = header.inner_text
    item.link     = permalink.attributes['href']
    item.date     = date
    item.description = '<p>' + contents.join('</p><p>') + '</p>'
  end
end

puts feed

But the Craigslist blog has the least valid HTML I've seen since the '90s. This script barfs out pretty quickly because of wildly inconsistent tag placement. It would have to be a much bigger script to try to outsmart the CHTML (Crappy Hypertext Markup Language) on the Craigslist blog.
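One spot where a script like this can fall over is the date line: if the "Posted ... by" text isn't where it's expected, the `scan(...).flatten.first` call hands `nil` to `Date.parse`, which raises. Here's a minimal sketch of a more forgiving version, assuming the same "Posted <date> by" wording. The helper name `posted_date` is made up for illustration and isn't part of the script above.

```ruby
require 'date'

# Hypothetical helper (not in the original script): pull the date out of a
# "Posted <date> by <author>" line. Returns nil instead of raising when the
# text doesn't match or the captured text isn't a parseable date.
def posted_date(text)
  match = text.to_s[/Posted (.*?) by/, 1]
  return nil if match.nil?
  Date.parse(match)
rescue ArgumentError
  nil
end

p posted_date("Posted March 3, 2008 by Jim")
p posted_date("no date on this line")
```

A post with no usable date could then be skipped rather than killing the whole run. It doesn't fix the markup problem, it just keeps one bad entry from taking everything else down with it.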

So here's my petition to Craigslist: let me read! Please! Even something as simple as wrapping each post in its own container element would fix everything.
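To show why a per-post wrapper would help, here's a rough sketch. The markup is invented: the `<div class="post">` wrapper, the `h2` title, and the sample HTML are all assumptions for illustration, not what Craigslist actually serves. With one predictable element around each post, even a plain regular expression can split the page into posts.

```ruby
# Hypothetical markup: every post wrapped in <div class="post">...</div>.
sample_html = <<~HTML
  <div class="post"><h2>First post</h2><p>Hello.</p></div>
  <div class="post"><h2>Second post</h2><p>Hello again.</p></div>
HTML

# Split the page on the (assumed) wrapper and pull out each title and body.
# This naive version only works if posts contain no nested <div>.
def extract_posts(html)
  html.scan(%r{<div class="post">(.*?)</div>}m).map do |(inner)|
    title = inner[%r{<h2>(.*?)</h2>}m, 1]
    body  = inner.sub(%r{<h2>.*?</h2>}m, '')
    { title: title, body: body }
  end
end

posts = extract_posts(sample_html)
puts posts.map { |post| post[:title] }
```

A real version would more likely hand the page to an HTML parser and select the wrapper elements directly, but the point stands: one consistent element per post turns the scraping into a few lines.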