[Xapian-discuss] floating-point issues with set_sort_by_relevance_then_value? (1.2.3, BM25 k1=0)

Marinos Yannikos mjy at pobox.com
Mon Nov 1 00:59:42 GMT 2010


I am using BM25 with k1=0 and min_normlen=1 so that weights are unaffected by 
document length and within-document term frequency (min_normlen=1 probably 
isn't necessary), and I expect single-term weights to be identical for all 
matches. I have added a document value to steer the ordering of such general 
search queries. This works fine, except that for some search terms I get 
results like:

             weight (BM25)               value
-----------------------------------------------
1. xxx (6.3564210045800955128925125 + 4.000000)
2. xxx (6.3564210045800955128925125 + 4.000000)
3. xxx (6.3564210045800955128925125 + 3.500000)
4. xxx (6.3564210045800946247140928 + 7.000000)
5. xxx (6.3564210045800946247140928 + 6.500000)
6. xxx (6.3564210045800946247140928 + 6.000000)
7. xxx (6.3564210045800946247140928 + 6.000000)
8. xxx (6.3564210045800946247140928 + 6.000000)
9. xxx (6.3564210045800946247140928 + 6.000000)
10. xxx (6.3564210045800946247140928 + 6.000000)
	
The weights always seem to differ after the 14th/15th fractional digit, and 
only a small number of results are affected (3 out of ~16000 with a slightly 
higher weight in one case, 4 out of ~70000 with a slightly higher one in 
another). The platform is Debian Lenny 64-bit on AMD Opteron CPUs, with 
core-1.2.3 patched to r15140 and the chert backend. This also happens with 
complex queries where groups of results are expected to have identical weights.

FIX: I found a simple fix for this issue, at least for my test cases:

I added

     if (param_k1 == 0) RETURN(termweight);

to the beginning of BM25Weight::get_sumpart in 
trunk/xapian-core/weight/bm25weight.cc:166

This apparently prevents floating point precision issues in the last line of 
get_sumpart() [which calculates termweight * wdf_double * 1 / wdf_double]. It 
also speeds up my case slightly. ;-)
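
To illustrate the rounding effect described above, here is a minimal 
standalone sketch. It does not use Xapian: sumpart_k1_zero() and 
count_mismatches() are hypothetical helper names, and the expression is copied 
from the description above rather than from the actual bm25weight.cc source. 
It assumes plain IEEE double arithmetic (no -ffast-math).

```cpp
#include <cassert>

// Hypothetical stand-in for the last line of get_sumpart() when k1 == 0,
// using the expression quoted above: termweight * wdf * 1 / wdf.
// The product termweight * wdf is rounded to a double before the division,
// so the result is not guaranteed to be bit-identical to termweight.
double sumpart_k1_zero(double termweight, unsigned wdf) {
    assert(wdf > 0);
    double wdf_double = wdf;
    return termweight * wdf_double * 1 / wdf_double;
}

// Count the wdf values in [1, max_wdf] for which the round trip
// through multiply-then-divide changes the weight.
unsigned count_mismatches(double termweight, unsigned max_wdf) {
    unsigned mismatches = 0;
    for (unsigned wdf = 1; wdf <= max_wdf; ++wdf) {
        if (sumpart_k1_zero(termweight, wdf) != termweight) ++mismatches;
    }
    return mismatches;
}
```

Calling count_mismatches(6.3564210045800955, 1000) shows how many wdf values 
in that range perturb the weight. Powers of two never do, since multiplying 
and dividing by them is exact, which would fit with only a small fraction of 
documents being affected. Returning termweight directly when k1 is 0 skips the 
round trip altogether.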

In order to prevent more such issues, it might be a good idea to round weights 
to a few fractional digits (10 should be enough) before using them as sort keys.
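
A rough sketch of that rounding idea, again standalone and not Xapian code: 
Hit, round_weight() and sort_by_rounded_weight_then_value() are hypothetical 
names chosen for illustration, and the comparator assumes that higher weight 
and higher value should both sort first.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result record: a relevance weight plus a document value.
struct Hit {
    double weight;
    double value;
};

// Round a weight to `digits` fractional digits (10 by default).
double round_weight(double w, int digits = 10) {
    assert(digits >= 0);
    double scale = 1.0;
    for (int i = 0; i < digits; ++i) scale *= 10.0;
    return std::round(w * scale) / scale;
}

// Sort by rounded weight (descending), then by value (descending).
void sort_by_rounded_weight_then_value(std::vector<Hit>& hits, int digits = 10) {
    std::stable_sort(hits.begin(), hits.end(),
        [digits](const Hit& a, const Hit& b) {
            double wa = round_weight(a.weight, digits);
            double wb = round_weight(b.weight, digits);
            if (wa != wb) return wa > wb;
            return a.value > b.value;
        });
}
```

With the two weights from the listing above, both round to the same key, so 
the document value decides the order. Rounding only narrows the problem: two 
weights that straddle a rounding boundary can still end up with different 
keys.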

Regards,
  Marinos



