[Xapian-discuss] Re: document weight

Thu May 26 02:07:17 BST 2005

On Thu, May 26, 2005 at 12:29:03AM +0000, Sabrina Shen wrote:
> Olly Betts <olly <at> survex.com> writes:
>  
> > If the highest ranking document doesn't match all terms, we simply
> > multiply by less than 100%.  The score to multiply by is determined
> > by looking at which terms match.

> I'm a little confused. More specifically, what do you mean by "by looking
> at which terms match"? For example, if we search with terms t1, t2, t3,
> and a document D1 contains t1 and t2, we get term weight tw1 for D1 with
> t1 and tw2 for D1 with t2. Using BM25, we finally get document weight DW1.
> Similarly, a document D2 contains t2 and t3, and we get term weight tw2'
> for D2 with t2 and tw3' for D2 with t3. Using BM25, we finally get
> document weight DW2. How could we estimate the final percent score?

Assuming document D1 has the highest weight (i.e. DW1 >= DWi for any i)
then we look at the termweight ceilings for the terms (the values
returned by Weight::get_maxpart()).  Writing maxpartN for the ceiling
for term tN, and since D1 matches t1 and t2, we assign D1 a score of:

(maxpart1 + maxpart2) / (maxpart1 + maxpart2 + maxpart3) * 100%

So if D1 matches all terms, that's 100%.  If it matches fewer, it gets
a score which reflects the maximum weight the terms it matches could have
given it, relative to the maximum weight a document matching all the
terms could have got.

All other documents get scaled accordingly, so D2 gets a score of:

DW2 / DW1 * (maxpart1 + maxpart2) / (maxpart1 + maxpart2 + maxpart3) * 100%

There's no particular theoretical justification for this, but it seems a
reasonable approach, and seems to give appropriate scores.

There's actually not much theoretical justification for the percentage scores
at all - the probabilistic formulae say a document scoring 60% is better than
one scoring 30% but the manipulations which happen during their derivation
aren't linear, so there's no reason to think the 60% document is twice as good
in any useful sense.  But percentage scores are something users can visualise
much better than raw numeric weights and empirically the relative percentage
scores do seem to provide useful information.

> Is MSet::convert_to_percent where I should look?

Only if you're prepared to trace back a lot!

The place to look is in matcher/multimatch.cc.  There's a monster
function called get_mset(), and towards the end (line 877 in the
latest SVN trunk, and probably in 0.9.0 too) there's this code:

        if (matching_terms < termfreqandwts.size()) {
            // OK, work out weight corresponding to 100%
            double denom = 0;
            for (i = termfreqandwts.begin(); i != termfreqandwts.end(); ++i)
                denom += i->second.termweight;

            DEBUGLINE(MATCH, "denom = " << denom << " percent_scale = " << percent_scale);
            Assert(percent_scale <= denom);
            denom *= greatest_wt;
            Assert(denom > 0);
            percent_scale /= denom;
        } else {
            // If all the terms match, the 2 sums of weights cancel
            percent_scale = 1.0 / greatest_wt;
        }

All terms matching the top document is a pretty common case, so we short-cut
that.

Cheers,
    Olly