[Xapian-discuss] Lucene ranking formula

Olly Betts olly at survex.com
Thu Nov 11 04:10:47 GMT 2004


On Wed, Nov 10, 2004 at 11:29:24AM +0000, James Aylett wrote:
> For the record, Lucene's scoring algorithm is, roughly:
> 
>   score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

Implementing a Xapian::Weight subclass for this would be pretty easy.
I'm not sure if there's much point, though it might make a good worked
example for documenting how to implement your own weighting scheme in
Xapian.
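
As a rough illustration of the formula quoted above, here is a minimal
sketch in plain Python. It does not use the Xapian API, and all of the
names (term_weight, score, and the dictionary inputs) are hypothetical.
It assumes tf, idf and the norms have already been computed somehow, since
the formula as quoted doesn't say how they are derived. The point it shows
is that the score is a plain sum of independent per-term contributions.

```python
def term_weight(tf_q, idf_t, norm_q, tf_d, norm_d_t):
    """Contribution of one term: (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d_t)."""
    return (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d_t)


def score(query_tf, doc_tf, idf, norm_q, norm_d):
    """Sum the per-term contributions over terms found in both query and document.

    query_tf, doc_tf: term -> frequency (assumed precomputed)
    idf:              term -> inverse document frequency (assumed precomputed)
    norm_q:           query normalisation factor
    norm_d:           term -> normalisation factor for this document (norm_d_t)
    """
    total = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        if tf_d == 0:
            continue  # a term missing from the document contributes nothing
        total += term_weight(tf_q, idf[term], norm_q, tf_d, norm_d[term])
    return total
```

Because each term's contribution depends only on that term, this part maps
naturally onto a scheme which computes weights one term at a time.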

> Make that:
>   
>   score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
>                   boost_t) * coord_q_d
> 
> where
> 
>   boost_t    : the user-specified boost for term t
>   coord_q_d  : number of terms in both query and document / number of
>                terms in query

I suspect you'd need to tweak the matcher to allow coord_q_d to be used
like this in Xapian.  The matcher handles the components of the weight
individually, and it needs to know them before it knows how many query
terms match a particular document.  It can sometimes reject a document
based on partial weight information before it has even looked at whether
all of the terms match (because it's possible that even if they all
match, they can't give the document enough score to beat the best 10,
or however many, already seen).

Actually, you can return a very large value for the maximum weights,
which will disable this optimisation.  Matches will run a bit more
slowly, but it would provide an easy way to evaluate Lucene's weighting
scheme against BM25 and any other weighting scheme you can implement for
Xapian.
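
To make the optimisation and the workaround concrete, here is a much
simplified sketch in plain Python. It is not Xapian's matcher code; the
function name can_reject and its parameters are made up for illustration.
It only shows the idea: a document can be dropped once its partial weight
plus the maximum possible weight of the terms not yet checked can't beat
the weight needed to get into the current top results.

```python
import math


def can_reject(partial_weight, remaining_max_weights, min_weight_needed):
    """Return True if the document cannot beat min_weight_needed.

    partial_weight:        weight accumulated from the terms checked so far
    remaining_max_weights: upper bound on the weight of each unchecked term
    min_weight_needed:     weight of the lowest-ranked result currently kept
    """
    upper_bound = partial_weight + sum(remaining_max_weights)
    return upper_bound <= min_weight_needed


# With honest upper bounds the document is rejected early: 1.0 + 0.7 <= 2.0.
early = can_reject(1.0, [0.5, 0.2], 2.0)

# With very large maximum weights the bound is never tight enough, so the
# document always has to be evaluated in full.
never = can_reject(1.0, [math.inf, math.inf], 2.0)
```

Here early is True and never is False, which is the sense in which
reporting a very large maximum weight turns the optimisation off.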

Alternatively, for an AND query you can just ignore this as it's then a
constant factor.
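
A small sketch of why that holds, again in plain Python with a
hypothetical helper name (coord) and assuming a non-empty query: every
document matched by an AND query contains all of the query terms, so
coord_q_d is the same for all of them and can't change the ranking.

```python
def coord(query_terms, doc_terms):
    """Number of terms in both query and document / number of terms in query."""
    query = set(query_terms)  # assumed non-empty
    return len(query & set(doc_terms)) / len(query)


# A document matched by "a AND b" contains both terms, so coord is 1.0.
and_match = coord(["a", "b"], ["a", "b", "c"])

# Under "a OR b" a document may contain only some of the terms, so coord
# varies between documents and does affect the ranking.
or_match = coord(["a", "b"], ["a", "c"])
```

Here and_match is 1.0 and or_match is 0.5.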

Cheers,
    Olly


