messie – a web crawler written in Ruby

on thewebdev.de

A few weeks ago I wrote my first two gems and released them on rubygems.org. The first one, which I will talk about today, is messie. I often need to crawl web pages in the small projects I do, so I thought: “Why not write a gem that I can use in every project and that handles crawling web pages in an abstract manner?”

Features

Right now, messie is still below 1.0 (currently 0.3.2), so it surely isn’t perfect yet, but it already has a lot of features that might be useful to you.

  • set your own User Agent header
  • set other custom headers via a fancy API
  • follows HTTP redirects up to 3 levels deep
  • HTTPS is supported
  • get the full HTML or just a text version
  • records the response time of the page
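
To give an idea of how these features fit together on top of Net::HTTP, here is a minimal sketch. It is not messie's actual code: `fetch_with_redirects`, `NET_HTTP_FETCHER` and the `fetcher:` parameter are names made up for this illustration, and the limit of 3 redirects is simply taken from the list above.

```ruby
require 'net/http'
require 'uri'

MAX_REDIRECTS = 3 # limit taken from the feature list

# Hypothetical default fetcher: one GET request with custom headers,
# switching on TLS when the URL scheme is https.
NET_HTTP_FETCHER = lambda do |uri, headers|
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri, headers)
  end
end

# Returns [response, elapsed_seconds].
# Raises once more than MAX_REDIRECTS redirects have been seen.
def fetch_with_redirects(url, headers = {}, fetcher: NET_HTTP_FETCHER)
  uri = URI.parse(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  redirects = 0

  loop do
    response = fetcher.call(uri, headers)

    unless response.is_a?(Net::HTTPRedirection)
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      return [response, elapsed]
    end

    redirects += 1
    raise "too many redirects (limit: #{MAX_REDIRECTS})" if redirects > MAX_REDIRECTS

    # merge also resolves relative Location headers against the current URL
    uri = uri.merge(response['location'])
  end
end
```

Calling `fetch_with_redirects("https://example.com/", { "User-Agent" => "my-crawler" })` would return the final response together with the total time spent, including any redirects. The fetcher is injectable only so the redirect logic can be tried out without a network connection.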

Roadmap

  • get an array of links on the page to continue crawling recursively
  • enable caching support (304 Not Modified)
  • read binary data and convert it to text (e.g. PDFs)
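
The first roadmap item could look roughly like the sketch below. This is only an illustration using the standard library, not how messie does or will do it: `extract_links` and `HREF_PATTERN` are hypothetical names, and the regular expression is a deliberately naive stand-in for a real HTML parser.

```ruby
require 'uri'

# Naive pattern for quoted href attributes in <a> tags (assumption: a real
# HTML parser would handle unquoted or unusual markup far better).
HREF_PATTERN = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i

# Returns a de-duplicated array of absolute http(s) links found in html.
def extract_links(html, base_url)
  base = URI.parse(base_url)

  links = html.scan(HREF_PATTERN).flatten.map do |href|
    begin
      absolute = base.merge(href.strip) # resolve relative links
      absolute.fragment = nil           # "#top" points at the same page
      absolute.to_s if absolute.is_a?(URI::HTTP) # skip mailto:, javascript:, ...
    rescue URI::Error
      nil # ignore hrefs that are not valid URIs
    end
  end

  links.compact.uniq
end
```

An array like this is what a recursive crawl would iterate over, skipping URLs that have already been visited.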

Try it!

To install it, just use gem:


gem install messie

Read the README.md on GitHub to continue from here. If you need help, just file an issue on GitHub, or fork the repository and send me a pull request. I’d be happy about it :)

11 Responses to “messie – a web crawler written in Ruby”

    • Dominik Liebler

      Yes, I first thought of using anemone, but then decided it’s not as lightweight as I need it to be. It relies on a complete stack with Redis, MongoDB and so on to run, and I just didn’t want that. Messie is a lot more flexible: it just gives you a nice API and you can do whatever you want with it. Anemone is a full stack, while messie is just a lightweight API on top of Net::HTTP.

  1. Dan F

    Hey, this actually looks really cool & useful for a side project I wanted to complete. One question though: does it automatically visit all the links? Or if I wanted to spider a page, would I have to do something like this?

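    require 'set'

    # seen is passed along so every recursive call shares the same set
    def visit_page(url, seen = Set.new)
      seen << url # mark before recursing so link cycles terminate
      page = Messie::Page.crawl(url) do
      end

      # visit only links that have not been seen yet
      page.links.each { |x| visit_page(x, seen) unless seen.include?(x) }
    end

    visit_page("google.com")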

    • Dominik Liebler

      Hey Dan F,

      at the moment you have to do it this way, but I will add support for such scenarios in the future :)

  2. Eric Hodel

    Why not use mechanize instead of reinventing the wheel? Looking through your code, mechanize supports the following features that you’ll spend days reimplementing:

    Persistent connections for improved speed

    Validated SSL connections for proper security (you’re lacking VERIFY_PEER)

    robots.txt parsing (mechanize uses webrobots)

    Correct handling of links (normalization, base element, HTTPS scheme capitalization)

    Broken page encodings and content encodings.

    The refresh header

    And many, many more workarounds for minor bugs in servers and pages.

    • Dominik Liebler

      Hi Eric,

      yes, you are right, but I only found mechanize a couple of days after I started developing messie. I’ll have a look at it :)
