[Xapian-discuss] query time stemming and term weights

Wed Nov 16 21:51:55 GMT 2005

On Wed, Nov 16, 2005 at 08:14:31PM +0100, Jean-Francois Dockes wrote:
> The problem is with term frequencies. When doing the stemming at index
> time, the term frequency will be for the stem, more or less the sum of derived
> terms frequencies.

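Strictly speaking "same or less" rather than "more or less", since a
document containing several of the derived terms only counts once
towards the frequency of the stem.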

> My concern is that, when doing the stemming at search time, each derived
> term will have its own frequency, and the results are going to be biased
> towards those that occur less often (which is not desired because the user
> did not explicitly search for them).
> 
> Maybe I don't understand the issue and this is not a problem ?

The derivation of the probabilistic weights (which are the ones which
you get when you use TradWeight) assumes that terms are independent.
You really have to, or the maths just becomes intractable.

Now even in general, this assumption isn't exactly true, but it's
particularly shaky for terms which have the same stem.

So that perhaps suggests that it's better to use the frequency of the
stem as you suggest.  Or at least that you are right to consider the
issue!
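
To make the concern concrete, here is a rough Python sketch.  It assumes
the usual inverse-document-frequency shape of the probabilistic term
weight with no relevance information, log((N - n + 0.5) / (n + 0.5)),
which is not necessarily exactly what TradWeight computes.  The
collection size and the per-form frequencies are invented purely for
illustration.

```python
import math


def trad_term_weight(n, N):
    """IDF-style weight for a term found in n of N documents.

    Assumed form, with no relevance information; a sketch only.
    """
    return math.log((N - n + 0.5) / (n + 0.5))


N = 10000  # hypothetical collection size

# Hypothetical document frequencies for forms sharing the stem "connect".
form_freqs = {
    "connect": 400,
    "connected": 250,
    "connecting": 90,
    "connection": 300,
    "connects": 12,
}

# Query-time stemming: each form is weighted by its own frequency.
form_weights = {t: trad_term_weight(n, N) for t, n in form_freqs.items()}

# Index-time stemming: the stem's frequency is at least that of the most
# common form and at most the sum of all the forms ("same or less").
stem_low = max(form_freqs.values())
stem_high = min(N, sum(form_freqs.values()))
stem_weight_range = (
    trad_term_weight(stem_high, N),  # weight if no document shares forms
    trad_term_weight(stem_low, N),   # weight if the forms overlap fully
)

for term, w in sorted(form_weights.items(), key=lambda kv: kv[1]):
    print(f"{term:12s} n={form_freqs[term]:5d}  weight={w:.2f}")
print("stem weight between %.2f and %.2f" % stem_weight_range)
```

With these made-up numbers the rarest form gets the largest weight,
while the stem's weight is never above that of the most common form,
which is the bias described above.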

The pragmatic view is that the derivation of the probabilistic weights
is just giving you a plausible candidate weighting formula to compare to
other approaches - the reason it is a good formula to use isn't really
because of the maths behind it, but because it performs well in
practical trials.  A theoretical justification is reassuring, but
doesn't make up for poor actual results!

BM25 weights (the default in Xapian) build upon the traditional
probabilistic weighting formula in ways which do have some theoretical
justification, but it seems to me that the guiding principle in the
progression from the original probabilistic formula through BM11 and
BM15 to BM25 is "gives better results".

For example, everybody now ignores the constant c (the power to which
f and K are raised), which featured in the original BM25 - Stephen
Robertson et al remarked in an early paper that powers other than 1 were
"not helpful" (section 3.2, page 3):

http://trec.nist.gov/pubs/trec3/papers/city.ps.gz
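
As a rough illustration of what dropping c means, the sketch below
assumes the within-document frequency component has the shape
f^c / (K^c + f^c), so that c = 1 reduces to the familiar f / (K + f).
The function name and the sample values are made up, and any constant
scaling factor is left out.

```python
def tf_component(f, K, c=1.0):
    """Within-document frequency component with an optional power c.

    f is the within-document frequency, K the length-dependent constant.
    Assumed shape: f**c / (K**c + f**c); c == 1 gives f / (K + f).
    """
    fc = f ** c
    return fc / (K ** c + fc)


K = 1.5  # hypothetical value
for f in (1, 2, 3, 5, 10):
    print(f, round(tf_component(f, K), 3), round(tf_component(f, K, c=2.0), 3))
```

Whatever the value of c, the component still rises with f and stays
below 1; a power other than 1 only changes how quickly it saturates.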

So my suggestion would be to do some tests and see if retrieval
effectiveness is actually made better/worse or left unchanged by
stemming at search vs index time.  I'd definitely be interested to hear
the results of any such tests.

> Else would there be a way so that the aggregate term frequency is used
> for each of the derived terms ?

Not directly I think - if it's useful, a "SynonymPostList" isn't too
hard to write.  You can probably even use the "correct" combined term
frequency by generating terms from the stems and using those to get
the term frequency.  Or just use an estimated term frequency based on
probabilities:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=50
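
The two options can be sketched in Python as follows.  This is not
Xapian code: the posting lists are plain sets of made-up document ids,
and the estimate simply assumes the forms occur independently, giving
N * (1 - product of (1 - n_i / N)).

```python
def exact_combined_freq(postings):
    """Documents containing at least one form: size of the union."""
    return len(set().union(*postings.values()))


def estimated_combined_freq(freqs, N):
    """Estimate the combined frequency assuming independent forms."""
    p_none = 1.0  # probability that a document contains none of the forms
    for n in freqs:
        p_none *= 1.0 - n / N
    return N * (1.0 - p_none)


N = 10  # hypothetical collection size
postings = {  # hypothetical posting lists for forms of one stem
    "run": {1, 2, 3, 4},
    "running": {3, 4, 5},
    "runs": {4, 6},
}
freqs = [len(docs) for docs in postings.values()]

exact = exact_combined_freq(postings)
estimate = estimated_combined_freq(freqs, N)
print(exact, round(estimate, 2), sum(freqs))
```

The exact figure needs the posting lists of every form to be walked,
whereas the estimate only needs their individual frequencies, which is
why the estimate may be attractive despite being approximate.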

Longer term, I'm interested in the idea of stemming at search time
(at least as an option).  It has several benefits such as allowing an
exact word search without having to index "raw" terms too, and allowing
choice of stemming language at search time.
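
A minimal sketch of the idea, assuming the index holds only unstemmed
terms and the query is expanded to every indexed term sharing the query
word's stem.  toy_stem is a deliberately crude stand-in for a real
stemmer and expand is a hypothetical helper, not part of Xapian.

```python
def toy_stem(word):
    """Crude suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(query_word, index_terms, stem=toy_stem):
    """Return the indexed terms which share query_word's stem."""
    target = stem(query_word)
    return sorted(t for t in index_terms if stem(t) == target)


# Hypothetical set of unstemmed terms present in the index.
index_terms = {"walk", "walked", "walking", "walks", "wall", "talk"}

# Stemmed search: OR together every form with the same stem.
print(expand("walking", index_terms))

# Exact word search: just use the term as typed, no extra raw terms needed.
print([t for t in ("walking",) if t in index_terms])
```

Passing a different stem function to expand corresponds to choosing the
stemming language at search time.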

Cheers,
    Olly