[Xapian-discuss] Stemming

James Aylett james-xapian at tartarus.org
Wed Feb 9 15:28:11 GMT 2005


On Tue, Feb 08, 2005 at 07:40:25PM +0100, Jean-Francois Dockes wrote:

> Given that the database volumes are not going to be gigantic, it would be
> easy to build the stem->SetOfWords database at the end of indexing, by
> extracting and stemming the whole term list from the Xapian db (it takes a
> few seconds for my 300,000 terms db). 

Right.
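
As a rough sketch of that post-indexing pass (plain Python, not the
Xapian API; `toy_stem` is a hypothetical stand-in for a real stemmer,
and the hard-coded `terms` list stands in for the term list you would
read out of the Xapian db):

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_map(terms, stem=toy_stem):
    """Map each stem to the set of unstemmed words that reduce to it."""
    stem_map = defaultdict(set)
    for term in terms:
        stem_map[stem(term)].add(term)
    return dict(stem_map)


# Assumed sample data; in practice this would be the db's full term list.
terms = ["index", "indexed", "indexing", "indexes", "search", "searches"]
stem_map = build_stem_map(terms)
```

Each key of `stem_map` is then a stem, and its value is the word list
you would store somewhere for lookup at search time.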

> I am wondering though if I could use the xapian backend to handle
> the storage. Would it be absurd, for example, to have pseudo
> documents indexed by something like a unique STM:stemvalue term, and
> to store the word list in the document data ? Or would you suggest
> another way ?

I'd advise /either/ having a different database for it (so you don't
need STM:stemvalue, just 'stemvalue') /or/ indexing the pseudo
documents by the stemmed terms directly, but adding another marker
term to them, so that normal searches can filter on the /lack/ of it.

The reason the second one might be worth considering is that putting
it within the same database might compress the termlist better -
although I can't actually remember how termlist compression works, so
it might not. (At the least, it will help where stemmed terms exactly
match unstemmed words indexing the 'regular' documents.)
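
To make the second option concrete, here is a toy model using plain
Python sets as posting lists (not the Xapian API). The marker term
name `XSTEMDOC`, the document ids, and the helper functions are all
made up for illustration. In Xapian terms, the normal search would be
roughly an AND_NOT against the marker term.

```python
MARKER = "XSTEMDOC"  # hypothetical marker term carried only by pseudo documents


def build_postings(docs):
    """Build a term -> set of document ids map (a toy inverted index)."""
    postings = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)
    return postings


def normal_search(postings, term):
    """Regular search: documents matching the term that LACK the marker."""
    return postings.get(term, set()) - postings.get(MARKER, set())


def stem_lookup(postings, stem):
    """Stem lookup: only pseudo documents, i.e. those carrying the marker."""
    return postings.get(stem, set()) & postings.get(MARKER, set())


docs = {
    1: {"index", "search"},     # regular document
    2: {"indexing", "xapian"},  # regular document
    100: {"index", MARKER},     # pseudo document for the stem "index"
}
postings = build_postings(docs)

print(normal_search(postings, "index"))  # {1}
print(stem_lookup(postings, "index"))    # {100}
```

Note that "index" appears once in the term dictionary even though it
indexes both a regular document and a pseudo document, which is the
sharing mentioned above.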

> Or is this all just wrong, and I should stem during indexing like
> omindex ?

It probably depends on what machines these are designed to run
on. Stemming at index time will probably chew less disk space, so on
low (ish :-) memory machines that will probably work better than the
larger database you'll get by not stemming (not just because stemming
conflates terms, but also because the terms will be shorter on
average). This is particularly important if you see people typing in quick
queries regularly, but not constantly (so they use another application
in the meantime, pushing some of the Xapian database out of file
buffers).

On the other hand, search-time stemming and query expansion give you
the advantage of not needing to detect the language of everything at
the time you index it. For a personal search tool, that might be a big
bonus.
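
A minimal sketch of what search-time expansion could look like, again
in plain Python rather than the Xapian API. The `toy_stem` function,
the `expand_query` helper, and the sample `stem_map` are assumptions
for illustration; a real setup would use a proper stemmer, chosen at
query time for whatever language the user is searching in.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(word, stem_map, stem=toy_stem):
    """Return the unstemmed index terms to OR together for one query word.

    Falls back to the word itself when its stem is not in the map.
    """
    return sorted(stem_map.get(stem(word), {word}))


# Assumed stem -> words map, built after indexing from the unstemmed terms.
stem_map = {
    "index": {"index", "indexed", "indexing"},
    "search": {"search", "searches"},
}

print(expand_query("indexing", stem_map))  # ['index', 'indexed', 'indexing']
print(expand_query("xapian", stem_map))    # ['xapian']
```

The index itself stays unstemmed here; only the query is stemmed, so
the language decision is deferred until someone actually searches.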

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


