[Xapian-discuss] Stemming

Jean-Francois Dockes jean-francois.dockes at wanadoo.fr
Tue Feb 8 18:40:25 GMT 2005


Hello,
I am building a personal search tool, based on xapian-core and Qt. I am
experimenting with not stemming at indexing time (for a personal system,
the database size will not usually be an issue), and handling stemming at
query time instead.

The idea is to stem the user's query term and find the set of database
terms that stem to the same value (more or less like what is in the "Using
stemming in IR" paragraph in the stemming page on xapian.org). The query
can then be (optionally) expanded to the stem siblings.

Given that the database volumes are not going to be gigantic, it would be
easy to build the stem->SetOfWords database at the end of indexing, by
extracting and stemming the whole term list from the Xapian db (it takes a
few seconds for my 300,000-term db).
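
To make the idea concrete, here is a minimal sketch in Python of building
the stem->SetOfWords map from a term list and expanding a query term to its
stem siblings. Everything in it is an assumption made for illustration:
toy_stem is a hypothetical stand-in for a real stemmer, the term list is
made up, and in the real tool the terms would come from the Xapian db.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_map(terms, stem=toy_stem):
    """Group all terms under their stem: stem -> set of unstemmed words."""
    stem_map = defaultdict(set)
    for term in terms:
        stem_map[stem(term)].add(term)
    return stem_map


def expand_term(term, stem_map, stem=toy_stem):
    """Return the query term plus every indexed term sharing its stem."""
    siblings = stem_map.get(stem(term), set())
    return sorted(siblings | {term})


# Made-up term list standing in for the terms extracted from the index.
terms = ["index", "indexed", "indexing", "indexes", "search", "searches", "query"]
stem_map = build_stem_map(terms)
expanded = expand_term("indexing", stem_map)
```

The expanded list could then be combined with an OR into the final query,
or the expansion could be skipped when the user asks for an exact match.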

I could then store the result using any indexed file manager like gdbm or
whatever.
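
As a rough sketch of that storage option, the following uses Python's
standard dbm module as a stand-in for gdbm. The function names, the file
name and the space-separated value format are assumptions for illustration
(the format would break for terms that contain spaces).

```python
import dbm
import os
import tempfile


def save_stem_map(path, stem_map):
    """Write stem -> space-separated word list into a fresh dbm file."""
    with dbm.open(path, "n") as db:
        for stem, words in stem_map.items():
            db[stem.encode("utf-8")] = " ".join(sorted(words)).encode("utf-8")


def lookup_siblings(path, stem):
    """Return the list of words stored for a stem, or [] if it is unknown."""
    with dbm.open(path, "r") as db:
        raw = db.get(stem.encode("utf-8"))
    return raw.decode("utf-8").split() if raw else []


# Made-up map; in practice it would be rebuilt at the end of each indexing run.
stem_map = {
    "index": {"index", "indexed", "indexing"},
    "search": {"search", "searches"},
}
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "stemdb")
save_stem_map(db_path, stem_map)
siblings = lookup_siblings(db_path, "index")
```

Opening with the "n" flag recreates the file each time, which matches
rebuilding the whole map after indexing rather than updating it in place.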

I am wondering, though, whether I could use the Xapian backend to handle
the storage. Would it be absurd, for example, to have pseudo-documents
indexed by something like a unique STM:stemvalue term, and to store the word
list in the document data? Or would you suggest another way?

Or is this all just wrong, and should I stem during indexing as omindex does?

Incidentally, if somebody is interested in taking a look at the software
(it is still very incomplete, but may already be somewhat useful in some
cases), it is at http://perso.wanadoo.fr/dockes/recoll/

Regards,
Jean-Francois Dockes
