highscore – a lightweight ruby library that finds and ranks keywords in a string
I wrote about messie here the other day. What I use it for is automatically downloading pages and finding keywords in their content. For instance, this article should have the keywords “keywords, highscore, ruby” and so on: just a list of tags that describe the page’s content.
Highscore does exactly that. Give it a string and a blacklist (you don’t want words like “like, you, want” in your keywords, do you?) and it gives you all the important words back. You can configure highscore to rate words based on their characteristics, e.g. upper case words should have double weight, long words (say, the threshold is 20 chars) should have half, and so on.
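To make the weighting idea concrete, here is a minimal plain-Ruby sketch of this kind of ranking. It is not highscore's actual implementation: the method name `rank_words`, the tiny blacklist and the weight values are assumptions chosen for illustration, and "upper case" is taken to mean "starts with an upper case letter".

```ruby
# Illustrative sketch only, not highscore's real code.
# BLACKLIST, the weights and rank_words are hypothetical names/values.
BLACKLIST = %w[like you want the a is and to].freeze

def rank_words(text, blacklist: BLACKLIST, long_threshold: 20)
  scores = Hash.new(0.0)

  text.scan(/[[:alpha:]]+/).each do |word|
    next if blacklist.include?(word.downcase)

    weight = 1.0
    weight *= 2.0 if word =~ /\A[[:upper:]]/       # words starting upper case count double
    weight *= 0.5 if word.length >= long_threshold # very long words count half

    scores[word.downcase] += weight
  end

  # Highest score first; ties are broken alphabetically for stable output.
  scores.sort_by { |word, score| [-score, word] }
end

ranked = rank_words("Ruby is fun and you will like Ruby, the ruby gem")
ranked.each { |word, score| puts "#{word}: #{score}" }
```

Scores are accumulated per lowercased word, so "Ruby" and "ruby" end up in the same bucket while the capitalised occurrences still contribute more.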
Features
- get the top n words
- blacklist words via a file, array or string
- default blacklist in the gem
- merge different Keywords objects (e.g. from different sources)
- configurable to rank different types of words differently (uppercase, long words, etc.)
- String has its own keywords method that you can use
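The "top n" and "merge" features can be pictured with plain hashes of word scores. This is only a sketch under assumptions, not the gem's Keywords class: `merge_scores` and `top` are hypothetical helpers, and the scores are made up.

```ruby
# Illustrative sketch only; merge_scores and top are hypothetical helpers,
# not methods of highscore's Keywords class.

# Combine two word => score hashes, adding the scores of shared words.
def merge_scores(first, second)
  first.merge(second) { |_word, a, b| a + b }
end

# Return the n highest-scoring [word, score] pairs.
def top(scores, n)
  scores.sort_by { |word, score| [-score, word] }.first(n)
end

# Made-up scores from two different sources of the same page.
page_title = { "ruby" => 4.0, "keywords" => 2.0 }
page_body  = { "ruby" => 3.0, "gem" => 1.0, "highscore" => 2.5 }

merged = merge_scores(page_title, page_body)
top(merged, 2).each { |word, score| puts "#{word}: #{score}" }
```

Adding scores on merge means a word that shows up in several sources rises in the combined ranking, which is probably what you want when merging, say, a title and a body.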
Roadmap
- detect the language of a string via “indicator words”
- define a blacklist file per language (based on language detection)
- fine-tune the default blacklist (just a few words atm)
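Language detection via indicator words could look roughly like the sketch below. Nothing here exists in highscore yet: `INDICATORS` and `detect_language` are hypothetical names, and the word lists are far too small for real use.

```ruby
# Illustrative sketch only; INDICATORS and detect_language are hypothetical,
# and the word lists are far too small for real use.
INDICATORS = {
  english: %w[the and is of with],
  german:  %w[der die das und ist mit]
}.freeze

# Pick the language whose indicator words occur most often in the text.
# Returns nil when no indicator word is found at all.
def detect_language(text)
  words = text.downcase.scan(/[[:alpha:]]+/)

  hits = INDICATORS.map do |language, indicators|
    [language, words.count { |word| indicators.include?(word) }]
  end

  language, count = hits.max_by { |_language, n| n }
  count.zero? ? nil : language
end

puts detect_language("Das ist ein Satz und der Hund ist mit dabei").inspect
puts detect_language("The gem is small and ships with a default blacklist").inspect
```

The returned symbol could then be used to pick a blacklist file for that language, falling back to the default blacklist when the result is nil.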
Example
keywords = "This is just a very basic example, look at the readme to see what's possible using highscore.".keywords
keywords.rank.each do |word|
  puts word.text
end
Try it!
gem install highscore
You can find highscore on rubygems.org, and the sources and bug tracker are on GitHub. If you want to add features, feel free to fork the repository and send me a pull request.
7 Responses to highscore – a lightweight ruby library that finds and ranks keywords in a string
Great work! I’ve been intending to build something like this for a few years but haven’t ever tried. Can’t wait to try it out.
NoMethodError: undefined method `upcase’ for 98:Fixnum
from /Users/roberthead/.rvm/gems/ree-1.8.7-2011.03@shopdragon/gems/highscore-0.4.0/lib/highscore/content.rb:52:in `keywords’
Thanks for the bug report, will fix that as soon as possible and release a new version.
Nice, but I miss a ‘whitelist’ a little bit. This is because I need to search only for the keywords on the ‘whitelist’ instead of skipping the words on the blacklist.
Hi Jeroen, I already filed an issue on GitHub for this and will implement this feature as soon as possible. Thanks for your feedback.
Hey Jeroen, I just released highscore 0.5.0 and you can now use a whitelist instead of a blacklist! You can simply get it via RubyGems.
[...] of pre-existing tools. The rather aging mediawiki library for Ruby nonetheless still works, and Highscore does good enough text summarization. Combine that with the Crunchbase API snapshot in 2009 and we [...]