Don’t forget your stems, smokey

Maybe I’ve been in this industry too long. Maybe I’m a complete moron. Maybe it’s a little of both and some of a third thing. Who knows. The point is that everywhere I look, nobody seems to know about one of the most useful word matching algorithms. Not text matching, as that’s a horse of a different color. Words. Word. The difference is subtle, but it has come up a few times a year for me over the past decade, and if I had known about it earlier in my career I would probably have more hair now. The difference is this: with word matching, you want to match near-exact grammatical terms. Not necessarily something that rhymes or sounds similar (that’s something else entirely), but words that are one and the same with different suffixes. Yeah, it’s a little weird and out there, but it’s a problem that rears its head every once in a while, and knowing about Word Stemming will make your life just a little easier. Again, maybe everyone learned about this in their infancy and I’m just an idiot. Though, every time I use it in a solution, I surprise at least one person… hence why I bring it up now.

WTF is word stemming

If you haven’t already figured it out, a word stem is the root of a word with the suffix removed. For example, “plumb” is the root (or word stem) of “plumbing”, “plumber” and “plumbers”. On the surface, it seems simple enough, right? Simply nothing more than applying:

s/(?:ing|ers?|s)$//

(with some added suffixes) to every word… right? Well, sure. But, this is called Suffix Stripping and I’M NOT TALKING ABOUT SUFFIX STRIPPING, I’M TALKING ABOUT WORD STEMMING. Shit man, pay attention.
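For the record, here is roughly what that suffix stripping looks like, and why it is not good enough. This is a minimal sketch in Python (not the Perl used elsewhere in this post). The suffix list and the `strip_suffix` name are made up for illustration, and it is emphatically not a stemmer.

```python
import re

# Hypothetical suffix list for illustration; a real list would be much longer.
SUFFIXES = re.compile(r"(?:ing|ers?|s)$")

def strip_suffix(word):
    """Chop one known suffix off the end of a word, no questions asked."""
    return SUFFIXES.sub("", word.lower())

for word in ("plumbing", "plumber", "plumbers", "sing", "bus"):
    print(word, "->", strip_suffix(word))
# plumbing -> plumb
# plumber -> plumb
# plumbers -> plumb
# sing -> s
# bus -> bu
```

The “plumb” family comes out fine, but “sing” and “bus” get mangled because the stripper has no idea whether what is left over is still a word.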

Word stemming is really just an exercise in grammatical heuristics, nothing more advanced than undergraduate material. There is quite a bit of interesting nuance when you sit down and start sketching out what a good implementation looks like. The “plumb” example from earlier is pretty straightforward. However, what about “probe” and “probate”? They have unique and unrelated definitions (in the English dictionary), yet “ate” is a perfectly appropriate word suffix, so a naive stripper would happily treat them as relatives. *shakes his head* I don’t want to go down this road. I’ve been there and it hurts. Even better, Martin Porter already went down it in 1979 (published in 1980). From this was born the Porter Stemming Algorithm, on which nearly every variant or similar approach in the last 30 years has been based. Imperfect, yes, but the best, sort of (it’s arguably the best when using English… but we won’t get into internationalization).
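To give a taste of how the Porter algorithm dodges the “probate” trap, here is a sketch of one of its ideas: the “measure” of a stem, roughly the number of vowel-then-consonant runs in it. One of the published rules only removes “ate” when the measure of what remains is greater than 1. The code below is in Python (not the Perl used in this post), the function names are my own, and it covers this single rule only. The real algorithm has several more steps, so treat it as an illustration and not as an implementation.

```python
def is_consonant(word, i):
    """Porter's notion of a consonant: not a, e, i, o, u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' counts as a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-to-consonant transitions (the m in [C](VC)^m[V])."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def strip_ate(word):
    """Apply only the '(m > 1) ATE -> nothing' rule; every other rule is omitted."""
    if word.endswith("ate"):
        stem = word[:-3]
        if measure(stem) > 1:
            return stem
    return word

print(strip_ate("activate"))  # activ
print(strip_ate("probate"))   # probate ("prob" has measure 1, so "ate" stays)
print(strip_ate("probe"))     # probe
```

The point is that the rule looks at what would be left behind before cutting, which is exactly what plain suffix stripping never does.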

Code it!

Who cares about its history, how can I use it? Well, the greatest part about this algorithm is that it was designed by a total geek, and one who knows how to program in multiple languages. Furthermore, as a fan of open source, he has blessed the world with implementations of his algorithm in numerous languages (with many more from other contributors). Personally, I’ve implemented the algorithm in both Java and Perl; there was little sweat lost in doing so in either language. Here, since I seem to have a trend going, I will demonstrate with some publicly released Perl modules I have.

First, you gotsta start with Text::English. At the time of my research, this was the best implemented representation of the algorithm (and didn’t suck too bad). From there, it’s nothing more than a simple function call:

use Text::English;
my ( $stem ) = Text::English::stem( $word );

Yeah. That’s it. And, well, now you know. And knowing is at least 64 198ths of the battle. At least.

Other Stuff YOU NEED TO KNOW.

Something about capital letters really gets me going. Also, here are some other random references I have for you:

  • Bot::BasicBot::Pluggable::Module::Retort
    • This is a Perl module I created and “maintain” which examines words as they fly through IRC channels and fires back a retort for any word keyed in the system. The module uses the stem of the word for all evaluation, rather than the literal word. IT’S SUPER AWESOME BECAUSE IT DOES THAT. IT MAKES ME LOOK MAGIC AND SMART. It could do the same for you. Magic.
  • http://www.google.com/search?q=plumber
    • Yeah, that’s totally Google. You can search my Internet with the Google. Also, there is some derivative of word stemming happening in their searches and bolding as well. Hard to pick out all of the details, but it’s pretty apparent they are doing so. And, you know, when Google does something, it must be cool. Well, except for Google Buzz. Everything but that.
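The stem-instead-of-literal-word trick mentioned for the Retort module can be sketched in a few lines. This is not that module’s code or API. It is a Python illustration (not the Perl used in this post) with a made-up retort table, and `toy_stem` is a crude stand-in that you would swap for a real Porter implementation.

```python
import re

def toy_stem(word):
    # Stand-in for a real stemmer (hypothetical; NOT the Porter algorithm).
    return re.sub(r"(?:ing|ers?|s)$", "", word.lower())

# Hypothetical retort table, keyed by stem rather than by literal word.
RETORTS = {toy_stem("plumber"): "Did somebody say leaky pipes?"}

def retorts_for(message):
    """Return each retort whose keyed stem matches a word in the message."""
    hits = []
    for word in re.findall(r"[a-z']+", message.lower()):
        retort = RETORTS.get(toy_stem(word))
        if retort and retort not in hits:
            hits.append(retort)
    return hits

print(retorts_for("My plumbing is broken"))
# ['Did somebody say leaky pipes?']
```

Only “plumber” was keyed, yet “plumbing” triggers the retort because both reduce to the same stem.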
