[Xapian-discuss] best method for stemming

Wed Feb 8 12:18:28 GMT 2006

On Tue, Feb 07, 2006 at 05:15:49PM -0500, Alex Deucher wrote:
> I'm willing to sacrifice disk space for faster lookups.

For a large corpus you're likely to be I/O bound, so a larger database
will tend to mean slower lookup because Xapian is more likely to need
to actually read information from disk rather than find it in the OS
disk cache.

> Is it better to stem while indexing or to stem the query and treat it
> like a wildcard or am I off all together?

Stemming while indexing is a pretty common approach.  At query time you
also stem all query terms.  The advantage is that you have fewer terms
(and so a smaller database).  The main downside is that you can't
search for unstemmed terms, which can sometimes be a problem,
especially if a proper name is conflated with a common word.
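
To make the trade-off concrete, here is a minimal sketch in plain
Python.  It is a toy model, not Xapian code: toy_stem() is a crude,
made-up suffix stripper standing in for a real stemmer, and the index
is just a dict.  The same stemmer is applied at index time and at
query time, and the example shows a proper name being conflated with
a common word.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Index every word under its stem only."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Stem each query word, then AND the matching document sets."""
    result = None
    for word in query.split():
        hits = set(index.get(toy_stem(word), set()))
        result = hits if result is None else result & hits
    return result or set()


docs = {
    1: "Mr Banks opened the meeting",
    2: "river banks flood in spring",
    3: "the bank raised rates",
}
index = build_index(docs)

# The surname "Banks" can no longer be told apart from "bank"/"banks".
print(search(index, "Banks"))
```

With only stems in the index, a search for the name matches all three
documents, which is exactly the conflation problem described above.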

Alternatively, you could stem nothing at index time and then for search
terms which you want to stem, stem them, and then run them through an
"unstemming" algorithm to produce a list of terms they could have come
from.  Then OR this list together.  Unfortunately nobody has written
the "unstemmer" yet.  Also this means more work at search time than
the first approach, but that may not really matter.  I've not tried
the idea, so I can't say for sure.
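
One possible shape for such an "unstemmer", sketched in plain Python
under stated assumptions: it does not invert the stemming algorithm,
it simply groups the unstemmed terms already present in the index by
their stem.  toy_stem() is a made-up stand-in for a real stemmer, and
index_terms is a hypothetical term list that would in practice have
to be read from the database.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_unstem_map(index_terms):
    """Map each stem to the unstemmed index terms that reduce to it."""
    by_stem = defaultdict(set)
    for term in index_terms:
        by_stem[toy_stem(term)].add(term)
    return by_stem


def expand(query_word, by_stem):
    """Return the list of terms to OR together for one query word."""
    fallback = {query_word.lower()}
    return sorted(by_stem.get(toy_stem(query_word), fallback))


# Hypothetical set of unstemmed terms found in the index.
index_terms = {"index", "indexes", "indexing", "indexed", "banks"}
by_stem = build_unstem_map(index_terms)

print(expand("indexing", by_stem))
```

The extra search-time work mentioned above is the lookup plus a wider
OR query; building the stem map is a one-off cost per index update.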

Omega's approach is to store everything stemmed and selected terms
unstemmed (those from capitalised words).  Unstemmed terms have an
"R" prefixed.  This allows name searches to work.
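
As a rough illustration of that scheme (a toy sketch, not Omega's
actual code), the function below emits a stemmed term for every word
and an additional "R"-prefixed unstemmed term for capitalised words.
toy_stem() is a made-up stand-in for a real stemmer, and lowercasing
the word after the "R" is an assumption of this sketch.

```python
def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def omega_style_terms(text):
    """Stem every word; also keep capitalised words unstemmed with "R"."""
    terms = []
    for word in text.split():
        word = word.strip(".,;:!?\"'()")
        if not word:
            continue
        terms.append(toy_stem(word))
        if word[0].isupper():
            # Assumption: the raw term is lowercased after the prefix.
            terms.append("R" + word.lower())
    return terms


print(omega_style_terms("Mr Banks visits banks"))
```

A search for the name can then use the term "Rbanks", which only the
capitalised occurrence generated, while "bank" still matches both.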

> Right now as I iterate through the document, I stem the words and add
> the stem to the index at the same position as the non-stemmed word.

I've not tried this approach, but it seems plausible.  It means more
term postings for R terms than Omega generates, but storing a term with
many postings is more efficient per posting so it may not be so bad.

You can probably get away with not storing positional information for
the stemmed terms (use add_term instead of add_posting).  That means
that NEAR and phrase searches will only work on unstemmed terms, but
that's not a big restriction and will save a lot of disk space - the
position table is generally the largest table in the database.
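
The difference can be sketched with a toy model in plain Python.
ToyDocument below is a hypothetical stand-in, not the Xapian class;
only its method names are borrowed from add_term and add_posting.
toy_stem() is a made-up stemmer, and the "R" prefix on unstemmed
terms is assumed here so that the two kinds of term cannot collide.

```python
def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class ToyDocument:
    """Stand-in for an index document; NOT the Xapian class."""

    def __init__(self):
        self.terms = {}  # term -> list of positions (empty if none stored)

    def add_term(self, term):
        """Record the term without any positional information."""
        self.terms.setdefault(term, [])

    def add_posting(self, term, pos):
        """Record the term together with its word position."""
        self.terms.setdefault(term, []).append(pos)


def index_text(text):
    doc = ToyDocument()
    for pos, word in enumerate(text.lower().split(), start=1):
        doc.add_posting("R" + word, pos)  # unstemmed: position kept
        doc.add_term(toy_stem(word))      # stemmed: no position stored
    return doc


doc = index_text("indexing indexes quickly")
stored_positions = sum(len(p) for p in doc.terms.values())
print(stored_positions)
```

Only the three unstemmed terms carry positions here; storing a
position for each stemmed term as well would double that to six.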

Note that the QueryParser currently only really supports Omega's
approach, stemming everything, or stemming nothing.  I regard that
as a bug, but some thought is needed as to how best to make it
more configurable.

Cheers,
    Olly