[Xapian-devel] Adding Support for Krovetz Stemmer Algo in Xapian

James Aylett james-xapian at tartarus.org
Sun Nov 30 18:33:23 GMT 2014

[Please keep replies on the mailing list so that everyone can help and benefit.]

On 30 Nov 2014, at 17:51, Abhishek Singh Kushwah <abhishek18kushwah at gmail.com> wrote:

> Two implementations have already been rejected because of their licenses, both being implementations of Porter, but Xapian uses the implementation in Snowball, which I assume is under the GPL.

Snowball (and the stemmer implementations shipped with it) is under a BSD license.

> Tell me how a stemming algorithm this lengthy could possibly be incorporated as a bite-sized project if we have to code it from scratch.

Krovetz isn’t actually particularly lengthy for a hand-coded algorithm; it’s about 1000 lines (and then almost another 6000 lines of dictionaries). I think the problem here is that Krovetz doesn’t seem amenable to implementing directly in Snowball, which means more work. The original paper <http://people.scs.carleton.ca/~armyunis/projects/KAPI/Krovetz.pdf> doesn’t describe the algorithm particularly concisely, but it doesn’t seem hugely difficult or time-consuming to implement, although there are always concerns about efficiency in stemming algorithms.

We’d need a dictionary or dictionaries from somewhere; I’m not clear from a quick skim of the paper what we’d need to do to construct useful ones. Also note from <http://www.comp.lancs.ac.uk/computing/research/stemming/general/krovetz.htm> that Krovetz, in IR, is often combined with other stemmers; at the moment we don’t provide a way of “chaining” stemmers together. (This could separately be a bite-sized project, however, as it doesn’t sound terribly complex.)
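To give a rough idea of what “chaining” might mean, here is a minimal sketch in Python rather than anything resembling Xapian’s actual API. The names chain_stemmers, strip_plural_s and strip_ing are hypothetical, and the two toy functions are stand-ins for real stemmers (they are not Krovetz or Porter). The only point is that each stemmer receives the previous stemmer’s output.

```python
from typing import Callable, Iterable

# A "stemmer" here is just any function mapping a word to its stem
# (an assumption for this sketch; Xapian's real stemmers are C++ objects).
Stemmer = Callable[[str], str]


def chain_stemmers(stemmers: Iterable[Stemmer]) -> Stemmer:
    """Return a stemmer that applies each given stemmer in order."""
    steps = list(stemmers)

    def chained(word: str) -> str:
        for step in steps:
            word = step(word)  # feed each result into the next stemmer
        return word

    return chained


# Toy stand-ins for real stemming algorithms (hypothetical, illustration only).
def strip_plural_s(word: str) -> str:
    """Drop a trailing 's' from longer words, leaving 'ss' endings alone."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def strip_ing(word: str) -> str:
    """Drop a trailing 'ing' from longer words."""
    if len(word) > 5 and word.endswith("ing"):
        return word[:-3]
    return word


stem = chain_stemmers([strip_plural_s, strip_ing])
print(stem("meetings"))
```

Order matters in a chain like this: with the two toy steps reversed, "meetings" would only lose its final "s", because the "ing" check would run before the plural had been removed.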

If you can get an explicit license grant from the copyright holder of the Krovetz stemmer (which seems to be either the University of Massachusetts or the Applied Computer Systems Institute of Massachusetts, Inc.; it’s unclear from the Krovetz source code), then (providing it’s compatible) we could accept a derived version directly into Xapian. The problem is the ambiguity about licensing, which is made worse by the pointer to <http://www.lemurproject.org/license.html>, which asserts yet another copyright holder (albeit also asserting BSD, so if the other two claims are taken care of cleanly then it’ll work out).


 James Aylett, occasional trouble-maker