[Xapian-devel] contribution to "Add more stemming algorithms"
Olly Betts
olly at survex.com
Tue Feb 18 22:41:53 GMT 2014
On Tue, Feb 18, 2014 at 10:08:20PM +0800, Hurricane Tong wrote:
> I am trying to contribute to the "bite-size" project, "Add more
> stemming algorithms".
> I implemented the Lancaster (Paice/Husk) stemming algorithm by building
> a class named StemLancaster that extends StemImplementation, following
> the guide at
> http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm.
> I think this class could be added to the default API for potential
> users who are interested in this algorithm.
> The source code is at https://github.com/HurricaneTong/Xapian.
> Could you give me some suggestions about it, and could this code be
> added to the Xapian source after any necessary modifications?
| This class is implemented based on an ANSI C implementation by Andy
| Stark
Unfortunately there's no licence provided for that implementation, which
sadly means we can't use it in Xapian. I had a quick look and I think
your code is pretty clearly a derivative work of Andy Stark's.
Last year another student provided a Paice/Husk implementation based on
this same code, so I think we need to add a warning to the project idea
that we can't use this code unless someone is able to contact Andy Stark
and get an explicit licence (which looks hard as there are no contact
details for him on the download, and it's a relatively common name).
> Besides, I indexed about 5000 documents from Wikipedia with Brass and
> Chert, and executed about 40000 single-term searches.
> With the brass database it took 5.66s, and with the chert database
> it took 5.57s (in a VBox virtual machine). It seems that brass is
> slower in this case.
It's expected that brass is currently slower to index, due to the
positional data storage changes. I'm hopeful we can regain that
loss (and more) by optimising how data is stored in memory while
indexing.
Cheers,
Olly