[Xapian-discuss] about stemming

Olly Betts olly at survex.com
Sun Apr 2 16:07:31 BST 2006


On Sun, Apr 02, 2006 at 10:27:37AM +0530, durga bidaye wrote:
> Suppose "footballer" and "footballs" are given as terms to be indexed
> and both are stemmed to "footbal". If we then give "footballs" as the
> query, we get both the document containing "footballs" and the document
> containing "footballer" as search results with equal ranking (in the
> absence of other factors such as within-document frequency).

Correct.

> But ideally it should give the document containing "footballs" a
> higher ranking and the one containing "footballer" a lower ranking.

I don't follow why.  Both "footballs" and "footballer" indicate that a
document is "about terms that stem to 'footbal'".

Perhaps you think that "footballer" indicates less "aboutness" than
"footballs"?  I think that's a highly subjective judgement - it may
be true sometimes but in other cases the reverse is true.  For example,
consider the query: footballers' wives - there "footballer" indicates
more relevance than "footballs".

> Isn't there a mechanism in Xapian which makes this kind of ranking
> possible?

If you really want to do that, you can set a higher "wdfinc" when adding
postings for "footbal" when it comes from "footballs" than when it comes
from "footballer".

But you'll need to compile a list of which unstemmed forms indicate more
"aboutness" than others, and I'm unconvinced it's really a sensible
approach.
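
For what it's worth, here is a rough sketch of the idea in Python. The
weights in WDFINC_BY_FORM are made-up values purely for illustration, and
the helper names (weighted_postings, add_to_document) are hypothetical, not
part of Xapian. The only Xapian call relied on is Document.add_posting(),
whose third argument is the wdf increment. The `stem` argument is assumed
to be any callable mapping a word to its stem; a xapian.Stem object should
work there, but treat that as an assumption and check it against your
bindings.

```python
# Hypothetical list of which unstemmed forms indicate more "aboutness".
# The numbers are illustrative assumptions, not recommendations.
WDFINC_BY_FORM = {"footballs": 2, "footballer": 1}
DEFAULT_WDFINC = 1


def weighted_postings(words, stem):
    """Return a (stemmed_term, position, wdfinc) tuple for each word.

    `stem` is any callable that maps a lowercased word to its stem.
    """
    postings = []
    for pos, word in enumerate(words, start=1):
        form = word.lower()
        # Look up the increment by the unstemmed form, then index the stem.
        wdfinc = WDFINC_BY_FORM.get(form, DEFAULT_WDFINC)
        postings.append((stem(form), pos, wdfinc))
    return postings


def add_to_document(doc, postings):
    """Add the postings to a xapian.Document (requires the Xapian bindings)."""
    for term, pos, wdfinc in postings:
        doc.add_posting(term, pos, wdfinc)
```

With weights like these, a document containing "footballs" ends up with a
higher wdf for "footbal" than one containing "footballer", so it will tend
to rank higher for that term, all else being equal.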

Cheers,
    Olly


