A few weeks ago I wrote my first two gems and released them on rubygems.org. The first one, which I will talk about today, is messie. I often need to crawl web pages in my small projects, so I thought: “Why not write a gem that I can use in every project and that handles crawling web pages in an abstract manner?”
Features
messie is still below 1.0 (currently 0.3.2), so it surely isn’t perfect yet, but it already has a lot of features that might be very useful to you:
- set your own User Agent header
- set other custom headers via a fancy API
- messie follows HTTP redirects up to 3 levels deep
- HTTPS is supported
- get the full HTML or just a text version
- records the response time of the page
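To give an idea of what these features involve, here is a minimal sketch written directly against Net::HTTP. It is not messie's implementation or API: the names http_get and fetch, the transport parameter and the returned hash are assumptions made for illustration only. It sends custom headers, switches on TLS for https URLs, follows at most 3 redirects and measures the total response time.

```ruby
require 'net/http'
require 'uri'

# Performs a single GET request; the only part that touches the network.
def http_get(uri, headers)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https') # HTTPS support
  http.request(Net::HTTP::Get.new(uri.request_uri, headers))
end

# Hypothetical helper (not part of messie): fetches +url+, following up to
# +max_redirects+ redirects, and returns the final URL, body and elapsed time.
def fetch(url, headers: {}, max_redirects: 3, transport: method(:http_get))
  uri = URI.parse(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = transport.call(uri, headers)
  redirects = 0

  # A 3xx status with a Location header is treated as a redirect.
  while response.code.to_i.between?(300, 399) && response['location']
    raise "more than #{max_redirects} redirects" if redirects >= max_redirects
    uri = URI.join(uri.to_s, response['location']) # resolves relative targets
    response = transport.call(uri, headers)
    redirects += 1
  end

  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { url: uri.to_s, body: response.body, response_time: elapsed }
end
```

A call such as fetch("http://example.com", headers: { "User-Agent" => "my-crawler" }) would then return the body together with the time it took. The transport argument exists only so that the redirect logic can be exercised without a network connection.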
Roadmap
- get an array of links on the page to continue crawling recursively
- enable caching support (304 Not Modified)
- read binary data and convert it to text (e.g. PDFs)
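As a rough illustration of the first roadmap item, link extraction could look something like the sketch below. The helper extract_links is hypothetical and not part of messie. It uses a simple regular expression to stay within the standard library; a real implementation would more likely rely on an HTML parser.

```ruby
require 'uri'

# Hypothetical helper (not part of messie): returns the absolute http(s)
# links found in +html+, resolved against +base_url+, without duplicates.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i).flatten
  links = hrefs.map do |href|
    begin
      uri = URI.join(base_url, href.strip) # resolve relative links
      next unless uri.is_a?(URI::HTTP)     # skip mailto:, javascript:, ...
      uri.fragment = nil                   # "#section" points to the same page
      uri.to_s
    rescue URI::Error
      nil # ignore hrefs that are not valid URIs
    end
  end
  links.compact.uniq
end
```

The regular expression only handles quoted href attributes on a tags, which is enough for a sketch but will miss unquoted attributes and the base element.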
Try it!
To install it, just use gem:
gem install messie
Read the README.md on GitHub to get on from here. If you need help, just file an issue on GitHub, or fork the repository and send me a pull request. I’d be happy about it.
You should probably add a link to the GitHub page.
Yes, you’re right, I added the link. Thanks.
Hi, have you looked at anemone (https://github.com/chriskite/anemone)?
Yes, I first thought of using anemone, but then decided it’s not as lightweight as I need it to be. It relies on a complete stack with Redis, MongoDB and so on to run, and I just didn’t want that. Messie is a lot more flexible: it just gives you a nice API and you can do whatever you want with it. Anemone is a full stack, while messie is just a lightweight API on top of Net::HTTP.
Hey, this actually looks really cool and useful for a side project I wanted to complete. One question though: does it automatically visit all the links? Or, if I wanted to spider a page, would I have to do something like this?
require 'set'
require 'messie'

def visit_page(url, seen = Set.new)
  seen << url  # mark the page before recursing so it is never visited twice
  page = Messie::Page.crawl(url) do
    …
  end
  # page.links is assumed here; link extraction is listed on the roadmap
  page.links.each { |link| visit_page(link, seen) unless seen.include?(link) }
end

visit_page("google.com")
Hey Dan F,
At the moment you have to do this, but I will add support for such scenarios in the future.
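Until then, the pattern from the question can also be written as a loop with a queue, which avoids deep recursion on large sites. This is only a sketch: crawl, max_pages and the links_for block are made-up names, and the block stands in for whatever fetches a page and returns its links.

```ruby
require 'set'

# Hypothetical crawl loop. The block receives a URL and must return the
# URLs linked from that page. Returns the URLs in the order they were visited.
def crawl(start_url, max_pages: 100, &links_for)
  seen = Set.new([start_url]) # everything that was ever queued
  queue = [start_url]
  visited = []

  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    visited << url
    links_for.call(url).each do |link|
      queue << link if seen.add?(link) # Set#add? returns nil for known links
    end
  end

  visited
end
```

The max_pages limit is there so that a crawl of a large site stops at some point.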
Why not use mechanize instead of reinventing the wheel? Looking through your code, mechanize supports the following features that you’ll spend days reimplementing:
- Persistent connections for improved speed
- Validated SSL connections for proper security (you’re lacking VERIFY_PEER)
- robots.txt parsing (mechanize uses webrobots)
- Correct handling of links (normalization, base element, HTTPS scheme capitalization)
- Broken page encodings and content encodings
- The refresh header
- And many, many more workarounds for minor bugs in servers and pages
Hi Eric,
Yes, you are right, but I only found mechanize a couple of days after I started developing messie. I’ll have a look at it.
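For reference, the VERIFY_PEER point from the list above comes down to a single setting when working with Net::HTTP directly. The helper name verified_client is made up for this sketch, and nothing here is taken from messie's code.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Hypothetical helper: builds a Net::HTTP client for +url+ that validates
# the server certificate whenever the URL uses https.
def verified_client(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  if uri.scheme == 'https'
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER # reject invalid certificates
  end
  http
end
```

No connection is opened until a request is actually made on the returned client.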
Hey, good stuff making your own scraper gem, they’re fun to build. I built http://github.com/dchuk/Arachnid; you might want to check it out to see how I implemented bloom filters in my recursive crawling system. Hope it helps.
You might also want to check out wombat (http://github.com/felipecsl/wombat) that I built on top of Nokogiri. It is a Ruby DSL to specify web crawlers.