I heard from Jason Calacanis this morning that Craigslist has a new blog. The downside? No feed. A little searching turned up that Josh Catone (a great guy I met through RailsForum) pieced together a feed with just the title and date of each blog entry, for those of us who depend on a feed reader.
But I don’t want to have to visit the site. I just want the full posts to appear in Google Reader like TechCrunch and RobotWalrus do. So I tried to piece together a Ruby script that would scrape the blog and turn it into a feed. Conclusion? Total failure.
Here’s the code that should work:
    require 'rubygems'
    require 'hpricot'
    require 'activesupport'
    require 'date'      # Date.parse is used below
    require 'rss/maker'
    require 'net/http'

    # Fetch the blog's front page and parse it
    blog = Hpricot.parse(Net::HTTP.get(URI.parse('http://blog.craigslist.org')))

    # The posts are expected inside the table cell with width='625'
    main_table_cell = (blog / 'td').find { |td| td.attributes['width'] == '625' }

    feed = RSS::Maker.make('1.0') do |rss|
      rss.channel.about       = "Craigslist Blog"
      rss.channel.title       = "Craigslist Blog"
      rss.channel.description = "Craigslist Blog"
      rss.channel.link        = "http://blog.craigslist.org/"

      # Treat each empty anchor as the start of a post
      (main_table_cell / 'a').select { |a| '' == a.inner_text }.each do |anchor|
        intro  = anchor.next_sibling
        header = (intro / 'h2').first
        date   = Date.parse(header.next.inner_text.scan(/Posted (.*) by/).flatten.first)

        # The author, comments, and permalink anchors are assumed to follow the header in order
        author_link   = header.next_sibling
        comments_link = author_link.next_sibling
        permalink     = comments_link.next_sibling

        # Collect every sibling after the intro as the body of the post
        contents  = []
        paragraph = intro
        while paragraph = paragraph.next_sibling do
          contents << paragraph.inner_html
        end
        contents << "<a href='#{comments_link.attributes['href']}'>Comments</a>"

        item = rss.items.new_item
        item.author      = author_link.inner_text
        item.title       = header.inner_text
        item.link        = permalink.attributes['href']
        item.date        = date
        item.description = '<p>' + contents.join('</p><p>') + '</p>'
      end
    end

    puts feed
But the Craigslist blog has the least valid HTML I’ve seen since the ’90s. The script barfs pretty quickly because of wildly inconsistent placement of <p>, <a>, and even <hr> tags: as soon as one post’s markup differs from what the script expects, a sibling lookup comes back empty and everything after it falls over. It would take a much bigger script to outsmart the CHTML (Crappy Hypertext Markup Language) on the Craigslist blog.
So here’s my petition to Craigslist: let me read! Please! Even something as simple as wrapping each post in a <div class='post'> would fix everything.
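To give a feel for how small the scraper could be with that one change, here is a rough sketch. Everything in it is an assumption: the sample markup is invented (it is not what blog.craigslist.org serves), the example.com URLs are placeholders, and I’m using REXML from Ruby’s standard distribution as a stand-in for Hpricot so the snippet runs on its own. REXML only accepts well-formed markup, so this shows the idea and would not rescue the real page.

```ruby
require 'rexml/document'

# Hypothetical markup: NOT the real Craigslist blog. This is what the page
# might look like if every post were wrapped in <div class='post'>.
SAMPLE_HTML = <<~HTML
  <html><body>
    <div class='post'>
      <h2><a href='http://blog.example.com/first'>First post</a></h2>
      <p>Hello from the first post.</p>
    </div>
    <div class='post'>
      <h2><a href='http://blog.example.com/second'>Second post</a></h2>
      <p>Paragraph one.</p>
      <p>Paragraph two.</p>
    </div>
  </body></html>
HTML

# Returns one hash per post: the title, the permalink, and the post's
# paragraphs as HTML strings. No sibling-walking required.
def extract_posts(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//div[@class='post']").map do |post|
    anchor = post.elements['h2/a']
    {
      title: anchor.text,
      link: anchor.attributes['href'],
      paragraphs: post.get_elements('p').map(&:to_s)
    }
  end
end

POSTS = extract_posts(SAMPLE_HTML)
POSTS.each { |post| puts "#{post[:title]} -> #{post[:link]}" }
```

Each hash could then be handed to rss.items.new_item the same way the script above does it. The point is that one wrapper element replaces all the guessing about which sibling comes next.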
1 feed43 // Apr 04, 2008 at 01:14 AM
just use feed43.com