A few weeks ago I wrote my first two gems and released them on rubygems.org. The first one, which I will talk about today, is messie. I often need to crawl web pages in the tiny little projects I do, so I thought: “Why not write a gem that I can use for every project and that handles crawling of web pages in an abstract manner?”
messie is still below 1.0 (currently 0.3.2), so it surely isn’t perfect yet, but it already has a lot of features that might be very useful to you:
- set your own User Agent header
- set other custom headers via a fancy API
- messie follows HTTP redirects up to 3 levels deep
- HTTPS is supported
- get the full HTML or just a text version
- records the response time of the page
- get an array of links on the page to continue crawling recursively
- enable caching support (304 Not Modified)
- read binary data and convert it to text (e.g. PDFs)
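To give a rough idea of what a few of these features involve, here is a minimal sketch that uses only Ruby's standard library. It is not messie's code or its API: the names `http_get`, `fetch`, `getter` and the `example-crawler/0.1` User Agent are made up for illustration, and only the redirect limit of 3 comes from the list above. It sets a custom User Agent, supports HTTPS, follows redirects up to the limit, and records the response time.

```ruby
require 'net/http'
require 'uri'

# Redirect limit taken from the feature list; every name below is hypothetical.
MAX_REDIRECTS = 3

# Performs a single GET request with a custom User-Agent header.
# Net::HTTP speaks HTTPS when use_ssl is true.
def http_get(uri, user_agent)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Follows redirects up to MAX_REDIRECTS levels deep and returns
# [final_response, seconds_elapsed]. The getter is injectable so the
# redirect handling can be exercised without touching the network.
def fetch(url, user_agent: 'example-crawler/0.1', getter: method(:http_get))
  uri = URI.parse(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  redirects = 0
  response = getter.call(uri, user_agent)

  while response.is_a?(Net::HTTPRedirection)
    raise "more than #{MAX_REDIRECTS} redirects" if redirects >= MAX_REDIRECTS

    redirects += 1
    # The Location header may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, response['location'])
    response = getter.call(uri, user_agent)
  end

  [response, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end
```

Calling `response, seconds = fetch('https://example.com/')` would then give you the final response object and the time the whole request chain took. The gem wraps this kind of plumbing so you don't have to repeat it in every project.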
To install it, just use gem:
gem install messie
Read the README.md on GitHub to get going from here. If you need help, just file an issue on GitHub, or fork the repository and send me a pull request. I’d be happy about it.