[Xapian-discuss] Lucene ranking

Olly Betts olly at survex.com
Thu Oct 28 20:08:06 BST 2004


On Thu, Oct 28, 2004 at 06:01:39PM +0100, James Aylett wrote:
> On Thu, Oct 28, 2004 at 04:25:57PM +0100, Olly Betts wrote:
> 
> > > Kevin Burton has posted about poor ranking in Lucene preferring
> > > shorter documents over longer ones[1].
> > 
> > It's not totally clear he's right actually.  After all, which would you
> > prefer - a document which tells you what you want to know?  Or 3 copies
> > of the same document appended to each other?
> > 
> > The example is just a bit too artificial.
> 
> Yes, but the fact that it's been bugging him for a while suggests he's
> encountered it in more meaningful cases. (Perhaps.)

It does sound like he's encountered it in real-world use.  I was just
saying it's hard to reason reliably based on this example.  Anyway,
Xapian seems to get this right and fixing Lucene is somebody else's
problem!

> Having now looked at the BM25 documentation again, and almost
> understood it (:-), I think I see what's going on here. (I just tried
> fiddling with the constructor parameters of Xapian::BM25Weight to no
> avail - this was through wrappers, which may be something to do with
> the fact that much of the important bits of this class including the
> constructor are inline.

Why should that cause problems with the bindings?

> How often are we constructing these things that inline constructors
> are needed?)

Not especially often, but since one constructor simply initialises to
fixed values and the other clips parameters to valid ranges and
initialises members with them, they're good candidates for inlining.
The range checks will disappear if you initialise with constant values,
which is a common case.
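
To illustrate the kind of constructor pair I mean, here's a minimal
sketch.  The class name, members, default values and valid ranges are
all made up for the example; this is not the actual BM25Weight code.

```cpp
#include <algorithm>

// Hypothetical stand-in for a weighting class (NOT Xapian's real code).
class SketchWeight {
    double k1, b;

  public:
    // Default constructor: simply initialises to fixed (illustrative) values.
    SketchWeight() : k1(1.0), b(0.5) { }

    // Clips each parameter to an assumed valid range, then stores it.
    // k1 is assumed to be >= 0, and b is assumed to lie in [0, 1].
    SketchWeight(double k1_, double b_)
        : k1(std::max(k1_, 0.0)),
          b(std::min(std::max(b_, 0.0), 1.0)) { }

    double get_k1() const { return k1; }
    double get_b() const { return b; }
};
```

With the body visible in the header, a call such as SketchWeight(1.2, 0.75)
lets an optimising compiler evaluate the min/max calls at compile time, so
nothing but the stores need remain.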

The other 3 inlined methods are virtual, so there's probably little
point having them in the header, since the object will almost always
be used as a Weight rather than a BM25Weight once it is constructed.
So the compiler will rarely, if ever, be able to inline them.
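
A small sketch of the situation, using hypothetical class and method
names rather than the real Weight hierarchy:

```cpp
// Hypothetical classes, named only for illustration.
class SketchBase {
  public:
    virtual ~SketchBase() { }
    virtual double get_sumpart(double wdf) const = 0;
};

class SketchBM25 : public SketchBase {
  public:
    // Defined in the class body, so it is implicitly inline...
    double get_sumpart(double wdf) const override {
        return wdf / (wdf + 1.0);  // placeholder formula, not BM25
    }
};

// ...but the caller only sees a SketchBase, so the call is dispatched
// through the vtable at run time and the inline body generally can't be
// substituted at the call site.
double score(const SketchBase& w, double wdf) {
    return w.get_sumpart(wdf);
}
```

Only code which knows it is holding a SketchBM25 by value could benefit
from the inline definition, and in the case described above that's
unusual.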

> Incidentally, it would help if someone edited intro_ir.html so that it
> was more easily readable alongside bm25.html; the latter would further
> benefit from (a) talking about Xapian in preference to
> Muscat3.6/Euroferret

I think Martin Porter wrote these a long time ago and they've never been
updated.  I'll take a look soon.

> (b) making it obvious how we get from BM25
> as we document it to the formula that Xapian::BM25Weight
> implements. After staring at bm25.html for about ten minutes I've
> finally figured out that it is actually telling me that BM25 /is/ what
> we're using (with E=1), but with the BM11 term frobbed so that it
> doesn't disappear on L=1. That could be a little clearer :-)

IIRC, the formula is adjusted by a constant factor to make sure
something is never negative.  But yes, that should be documented.
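
As a sketch of how such an adjustment could work (this is an assumption
on my part, not checked against bm25.html or the code): suppose the
length-correction term has the usual BM25 form k2 * nq * (1 - L) / (1 + L),
where L is the normalised document length and nq is the query length.
Adding the constant k2 * nq to it gives:

```latex
k_2\, n_q\, \frac{1 - L}{1 + L} \;+\; k_2\, n_q
  \;=\; k_2\, n_q\, \frac{(1 - L) + (1 + L)}{1 + L}
  \;=\; \frac{2\, k_2\, n_q}{1 + L}
```

The original term is zero at L=1 and negative for L>1, while the
shifted version is positive for every L >= 0.  Under these assumptions
the shift is the same for every document matching a given query, so it
wouldn't change the ranking.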

Cheers,
    Olly


