[Xapian-discuss] query time stemming and term weights

Wed Nov 16 19:14:31 GMT 2005

I am developing a personal/desktop search tool, and I am experimenting
with doing no stemming at indexing time. Instead, a stem database (or
several, for different languages) would be used to expand the query
terms at search time.
 (i.e. user query: flooring -> stem: floor
     -> final query for: [floored flooring floorings floors])
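
As a minimal sketch of this expansion step, in Python: the stem table, the
reverse lookup and the function names below are hypothetical, made up for
illustration, and no real stemmer or Xapian call is involved.

```python
# Hypothetical stem database: stem -> word forms seen in the index.
# A real one would be built while indexing; this is a toy table.
STEM_DB = {
    "floor": ["floored", "flooring", "floorings", "floors"],
}

# Reverse lookup (word form -> stem), derived from the table above.
WORD_TO_STEM = {
    form: stem for stem, forms in STEM_DB.items() for form in forms
}


def expand_term(term):
    """Return every known form sharing the term's stem.

    Falls back to the term itself when the stem database has no entry.
    """
    stem = WORD_TO_STEM.get(term.lower())
    if stem is None:
        return [term]
    return sorted(STEM_DB[stem])


def expand_query(query):
    """Expand each whitespace-separated term of the user query."""
    return [expand_term(term) for term in query.split()]


print(expand_query("flooring"))
```

Each inner list would then be searched as an OR of its members.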

I have thought of a possible weighting problem with this approach. I am
not really confident in my knowledge of how the weights are computed, so
I am not sure that this is an actual issue.

The problem is with term frequencies. When stemming is done at index
time, the term frequency is that of the stem, which is more or less the
sum of the derived terms' frequencies.

My concern is that, when stemming is done at search time, each derived
term will have its own frequency, and the results will be biased towards
the derived terms that occur less often (which is not desired, because
the user did not explicitly search for them).
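
To make the concern concrete, here is a rough sketch in Python. It uses a
simplified inverse-document-frequency weight, log(N / df), which is an
assumption for illustration and not the formula Xapian actually applies.
The collection size and the per-term document counts are made up.

```python
import math

# Hypothetical collection size and document counts (made-up numbers).
N = 10000
doc_freq = {"floored": 40, "flooring": 300, "floorings": 5, "floors": 655}


def idf(df, n=N):
    """Simplified inverse document frequency: rarer terms weigh more."""
    return math.log(n / df)


# Query-time expansion: every derived term is weighted on its own.
per_term = {term: idf(df) for term, df in doc_freq.items()}

# Index-time stemming: one frequency for the stem, approximated here by
# the sum of the derived terms' counts (an upper bound, since a document
# containing several forms is counted more than once).
aggregate_df = min(N, sum(doc_freq.values()))
aggregate = idf(aggregate_df)

rarest = max(per_term, key=per_term.get)
for term, weight in sorted(per_term.items()):
    print(f"{term:10s} df={doc_freq[term]:4d} weight={weight:.2f}")
print(f"stem 'floor' df={aggregate_df} weight={aggregate:.2f}")
print("highest individual weight:", rarest)
```

With these numbers the rarest form, "floorings", gets the largest
individual weight, well above the single weight the stem would get.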

Maybe I misunderstand the issue and this is not a problem? Otherwise,
would there be a way to have the aggregate term frequency used for each
of the derived terms?

Or should I go back to performing stemming during indexing?

Cheers,
J.F. Dockes

-- 
Recoll: desktop search for Unix. http://www.recoll.org