[Xapian-discuss] Giving a choice of stemming

Olly Betts olly@survex.com
Thu, 27 May 2004 04:01:06 +0100


On Wed, May 26, 2004 at 04:04:17PM +0100, Francis Irving wrote:
> In particular, I'd like to be able to make it so a plain query:
>     fox hunting
> would do stemming, but one in quotes (using QueryParser) would match
> only exact words:
>     "Mark Fisher"
> 
> Do I need to add postings with a prefix "stemmed:" or "unstemmed:" for
> every word, and munge inputs, or is there a cleverer way?

The usual convention is that no prefix means stemmed, while a capital R
prefix means unstemmed.  All words are lowercased, so there's no
ambiguity.

By default, capitalised words are indexed to both an unstemmed term
(R-prefix) and a stemmed term (no prefix).  Uncapitalised words just
produce a stemmed term.  A capitalised word in a query searches for the
unstemmed (R-prefix) term.  This means that searches for names work as
users expect (no problems with a search for Mark Fisher matching
documents about fish markings).
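
To make the convention concrete, here is a minimal sketch in Python.  It
is not Xapian or Omega code: toy_stem, index_terms and query_term are
made-up names, and toy_stem is a crude stand-in for a real stemmer.  It
only illustrates which terms the scheme above would produce.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ings", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(word):
    """Terms generated for one word at index time."""
    lower = word.lower()
    terms = [toy_stem(lower)]      # stemmed term, no prefix
    if word[:1].isupper():
        terms.append("R" + lower)  # unstemmed term, R prefix
    return terms


def query_term(word):
    """Term searched for when the word appears in a query."""
    lower = word.lower()
    if word[:1].isupper():
        return "R" + lower         # capitalised: unstemmed term only
    return toy_stem(lower)         # otherwise: stemmed term


# A document containing "fish markings" yields only stemmed terms,
# so a query for "Mark Fisher" (R-prefixed terms) does not match it.
doc_terms = {t for w in "fish markings".split() for t in index_terms(w)}
query_terms = {query_term(w) for w in "Mark Fisher".split()}
print(doc_terms, query_terms, doc_terms & query_terms)
```

Because the R prefix is upper case and every word is lowercased before
being turned into a term, an R-prefixed term can never collide with a
stemmed one.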

You could generate R terms for all words, though your databases would
swell quite a bit.

> More generally, what are you guys' experiences of users' feelings about
> stemming?

Most seem to like it (or don't notice!).

Pretty much all the negative comments I've heard are about searching for
names; the scheme described above is aimed at addressing that.

The other negative stemming comment is when using relevance feedback:

http://www.xapian.org/search.php?P=postlist

People don't like the stemmed terms which appear (e.g. "calcul." in this
example).  Omega tries quite hard to avoid these (notice that many of
the suggested words don't have a trailing dot, so they aren't stemmed
forms) but without building an "unstem" map from the source data, it
can't always manage to avoid stemmed forms.
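
As a rough illustration of what an "unstem" map could look like, here is
a sketch in Python that maps each stem to the most frequent source word
that produced it.  This is not how Omega works: toy_stem and
build_unstem_map are made-up names, and the suffix list is a crude
stand-in for a real stemmer.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ations", "ation", "ates", "ated", "ate", "us", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_unstem_map(words):
    """Map each stem to the most frequent unstemmed form seen for it."""
    forms = defaultdict(Counter)
    for word in words:
        lower = word.lower()
        forms[toy_stem(lower)][lower] += 1
    return {stem: counts.most_common(1)[0][0]
            for stem, counts in forms.items()}


source_words = ["calculation", "calculate", "calculation", "calculus"]
unstem = build_unstem_map(source_words)
# Show a real word to the user in place of the bare stem "calcul".
print(unstem["calcul"])
```

The cost is an extra pass over the source data at index time, plus
storage for the map.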

Academic studies also favour stemming.

Cheers,
    Olly