[Xapian-discuss] BM25

Sat Oct 30 08:11:04 BST 2004

On Fri, Oct 29, 2004 at 10:30:25AM +0100, James Aylett wrote:
> That's just a wild stab in the dark - I'm assuming that
> Xapian::Enquire::set_weighting_scheme() does actually work, and that
> the variables in Xapian::BM25Weight are used;

This isn't covered by the testsuite (which is bug#8) so you might want to
test that assumption.  The problem is knowing what to test - I guess we
can check that changing the parameters gives different weights, and that
BM25Weight with the parameters which should give TradWeight does actually
give the same results as TradWeight.  That'd be better than no tests at
all.  I'll look into it.
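
As a rough sketch of those two checks, here is a stand-alone Python
version that works on a textbook BM25 formula rather than going through
Xapian's API.  The function names, the parameter names k1/k3/b, and the
traditional weight formula w*f/(k*L + f) are assumptions made for
illustration; a real test would drive Xapian::Enquire instead.

```python
def bm25_term(w, f, q, L, k1=1.0, k3=1.0, b=0.5):
    """Per-term BM25 weight (textbook form, without the extra length term).

    w: term weight, f: within-document frequency,
    q: within-query frequency, L: normalised document length.
    """
    K = k1 * ((1.0 - b) + b * L)
    doc_part = (k1 + 1.0) * f / (K + f)
    query_part = (k3 + 1.0) * q / (k3 + q)
    return w * doc_part * query_part


def trad_term(w, f, L, k=1.0):
    """Traditional probabilistic weight, assumed here to be w*f/(k*L + f)."""
    return w * f / (k * L + f)


def params_change_weight(w=2.0, f=3, q=1, L=1.5):
    """Check 1: changing a parameter (here b) should change the weight."""
    return bm25_term(w, f, q, L, b=0.5) != bm25_term(w, f, q, L, b=1.0)


def matches_trad(w=2.0, f=3, q=1, L=1.5, k=1.0):
    """Check 2: with b=1 and k3=0 the BM25 weight should equal the
    traditional weight, up to the constant factor (k + 1)."""
    bm25 = bm25_term(w, f, q, L, k1=k, k3=0.0, b=1.0)
    return abs(bm25 - (k + 1.0) * trad_term(w, f, L, k)) < 1e-12
```

The constant factor (k + 1) doesn't affect ranking, so a test on real
results would compare document order rather than raw weights.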

> True. I was just wondering, if it turns out that SWIG doesn't play
> well with inlined constructors (can't think why it wouldn't, and
> there's no explicit mention in the manual).

It's easy to move methods into matcher/bm25weight.cc to test your
hypothesis.

> > IIRC, the formula is adjusted by a constant factor to make sure
> > something is never negative.  But yes, that should be documented.
> 
> The BM11 term is effectively divided by (1-L), so that makes
> sense.

(1-L) isn't constant (L is the normalised length of a document).
We actually add C.s which is constant for a given query (C is a
constant, s is the query length).  Rather oddly that has the effect of
dividing that term by (1-L)/2 !
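
To spell out the arithmetic, here is a small Python sketch.  It assumes
the extra term has the usual BM25 form k2*s*(1-L)/(1+L) and that the
constant added is k2*s (the C.s above, with C written as k2); the
function names are made up for the example.

```python
def extra_term(k2, s, L):
    """Textbook BM25 length-correction term; negative when L > 1."""
    return k2 * s * (1.0 - L) / (1.0 + L)


def shifted_extra_term(k2, s, L):
    """The same term after adding the per-query constant k2 * s."""
    return extra_term(k2, s, L) + k2 * s


def closed_form(k2, s, L):
    """Equivalent closed form 2*k2*s/(1+L), which is never negative.

    This is extra_term() divided by (1 - L) / 2.
    """
    return 2.0 * k2 * s / (1.0 + L)
```

For example, with k2 = 0.5, s = 4 and L = 3 the unshifted term is -1,
while the shifted term and the closed form both give 1.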

I've updated the docs a bit:

http://cvs.xapian.org/*checkout*/xapian/xapian-core/docs/intro_ir.html?content-type=text/html
http://cvs.xapian.org/*checkout*/xapian/xapian-core/docs/bm25.html?content-type=text/html

I think it's unhelpful that our "C" incorporates a factor of 2 from our
tweak to the extra term, so our parameters don't match those of BM25.
I think we should change that.

It'd also be nice to be able to undo the addition of C.s so we actually
returned true BM25 weights to the user.  IIRC, the addition of C.s is
needed for the matcher, but a Weight object could probably have a method
to adjust the final weight of a document (which for BM25 would subtract
C.s).  Worth investigating anyway.
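
Something along these lines, as a Python sketch only.  The class and
method names are hypothetical and not part of Xapian's Weight
interface, and C is written as k2 with the extra term assumed to be the
shifted form 2*k2*s/(1+L).

```python
class ShiftedBM25Extra:
    """Hypothetical weight object that shifts the extra term for the
    matcher and undoes the shift on the final document weight."""

    def __init__(self, k2, query_length):
        # The per-query constant that was added (C.s in the text above).
        self.offset = k2 * query_length

    def matcher_extra(self, L):
        """Non-negative extra term as seen by the matcher."""
        return 2.0 * self.offset / (1.0 + L)

    def adjust_final_weight(self, matcher_weight):
        """Subtract the constant so the caller sees the unshifted weight."""
        return matcher_weight - self.offset
```

Because the offset is the same for every document in a query,
subtracting it at the end leaves the ranking unchanged.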

I also noticed that Xapian defaults C to 0, which it didn't originally.
Here's where it was changed:

http://cvs.xapian.org/xapian/xapian-core/matcher/bm25weight.cc.diff?r1=1.17&r2=1.18

I think I remember the issue.  Webtop had problems with some queries
being very slow.  Digging inside the matcher we found that the
additional term was causing problems because it created extra "slack"
between the actual weights and maximum weights in the matcher, which in
a few cases could stop optimisations firing and cause massive slowdowns.

Now Webtop used the Muscat 3.6 backend, which didn't store document
lengths, so L was always 1 and the extra term therefore constant.  So
in this case the slack would be constant, whereas with Quartz it would
be less for some documents, and more for others.  It's also possible
that other changes in the past 4 years will have an effect on this.
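
A toy illustration of that difference in Python.  It assumes the
matcher's upper bound for the extra term is its value at some minimum
normalised length min_L, which is a guess made for the example and not
a description of the actual matcher; C is written as k2.

```python
def shifted_extra(k2, s, L):
    """Shifted extra term, 2*k2*s/(1+L)."""
    return 2.0 * k2 * s / (1.0 + L)


def slack(k2, s, L, min_L=0.0):
    """Gap between the assumed upper bound (term at min_L) and the
    actual term for a document of normalised length L."""
    return shifted_extra(k2, s, min_L) - shifted_extra(k2, s, L)
```

With L fixed at 1 every document gets the same slack; once L varies,
shorter documents (L < 1) get less slack and longer ones get more.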

We ought to do some testing to see whether our choice of BM25 parameters
is good or not.  I don't recall how we chose them originally - it may be
they came from Stephen Robertson, but it rather looks like we just used
the parameters which would give TradWeight, but with 0.5 for D, which
gives the halfway point between BM11 and BM15.
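
To show what "halfway" means here, a short Python sketch.  It assumes D
plays the role of b in the textbook BM25 denominator K = k1*((1-b) +
b*L), so that b=0 is BM15 (no length normalisation) and b=1 is BM11
(full length normalisation); the function name is made up.

```python
def length_normaliser(L, b, k1=1.0):
    """K in the BM25 denominator K + f, i.e. k1 * ((1 - b) + b * L)."""
    return k1 * ((1.0 - b) + b * L)


def halfway_gap(L, k1=1.0):
    """Difference between K at b=0.5 and the mean of the BM15 (b=0)
    and BM11 (b=1) values; expected to be zero up to rounding."""
    bm15 = length_normaliser(L, 0.0, k1)
    bm11 = length_normaliser(L, 1.0, k1)
    return length_normaliser(L, 0.5, k1) - (bm15 + bm11) / 2.0
```

Note that it is K which sits exactly halfway; the resulting weight is
a non-linear function of K, so it only lies between the BM11 and BM15
weights, not exactly at their midpoint.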

Andy MacFarlane did some evaluation work in the BrightStation days, but
I don't recall if he got any results before they shut down development.
Results with BM25 in other systems are probably applicable if we can
find some.

I also notice Mikael Johansson has attached some evaluation results
to bug#8:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=8

I must have missed the bugzilla email when he did.  Sounds like he's
interested in doing evaluation work with different parameters and
weighting schemes, which would be very useful.  I'll contact him about
it.

Cheers,
    Olly