[Xapian-discuss] Lucene ranking

Thu Oct 28 18:01:39 BST 2004

On Thu, Oct 28, 2004 at 04:25:57PM +0100, Olly Betts wrote:

> > Kevin Burton has posted about poor ranking in Lucene preferring
> > shorter documents over longer ones[1].
> 
> It's not totally clear he's right actually.  After all, which would you
> prefer - a document which tells you what you want to know?  Or 3 copies
> of the same document appended to each other?
> 
> The example is just a bit too artificial.

Yes, but the fact that it's been bugging him for a while suggests he's
encountered it in more meaningful cases. (Perhaps.)

> > Anyone know what Lucene is doing here? Their FAQ doesn't mention what
> > weighting scheme they use, and I don't have time to investigate
> > further right now ...
> 
> I'd guess there's a mechanism in Lucene's weighting scheme to counteract
> the natural tendency of weighting schemes to prefer long documents (as
> there is in BM25 which Xapian uses).  Without such a mechanism long
> documents will tend to rank highly because they tend to have high
> within-document-frequency.  Long documents match disproportionately many
> queries anyway (because they typically contain more distinct words).  It
> sounds like perhaps their mechanism is a little too aggressive.

Having now looked at the BM25 documentation again, and almost
understood it (:-), I think I see what's going on here. (I just tried
fiddling with the constructor parameters of Xapian::BM25Weight to no
avail - this was through wrappers, which may have something to do with
the fact that many of the important bits of this class, including the
constructor, are inline. How often are we constructing these things
that inline constructors are needed?)
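
To make the length-normalisation effect concrete, here is a rough sketch
of just the term-frequency part of a textbook BM25 weight. It is not
claimed to be exactly what Xapian::BM25Weight or Lucene computes. The
function names are made up for illustration, k1 and b are the usual
textbook parameters, and I'm assuming the original document is of exactly
average length and that adding the tripled copy doesn't shift the average.

```python
def bm25_tf_component(wdf, norm_len, k1=1.0, b=0.5):
    """Term-frequency part of a textbook BM25 weight (illustrative only).

    wdf      -- within-document frequency of the query term
    norm_len -- document length divided by the average document length
    k1, b    -- usual BM25 parameters; b=0 ignores length, b=1 normalises fully
    """
    # K grows with document length when b > 0, damping the effect of a high wdf.
    K = k1 * ((1.0 - b) + b * norm_len)
    return wdf * (k1 + 1.0) / (wdf + K)


def compare(b):
    """Score a document with one occurrence against three appended copies of it."""
    single = bm25_tf_component(wdf=1, norm_len=1.0, b=b)
    # Assumption: tripling the document triples both wdf and normalised length.
    tripled = bm25_tf_component(wdf=3, norm_len=3.0, b=b)
    return single, tripled


if __name__ == "__main__":
    for b in (0.0, 0.5, 1.0):
        single, tripled = compare(b)
        print(f"b={b}: single={single:.2f} tripled={tripled:.2f}")
```

Under those assumptions the tripled document wins clearly with b=0, wins
by less with b=0.5, and ties with the single copy at b=1. A scheme that
ranks the tripled copy *below* the original would be normalising harder
than even b=1 does in this sketch.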

Incidentally, it would help if someone edited intro_ir.html so that it
was more easily readable alongside bm25.html; the latter would further
benefit from (a) talking about Xapian in preference to
Muscat3.6/Euroferret, and (b) making it obvious how we get from BM25
as we document it to the formula that Xapian::BM25Weight
implements. After staring at bm25.html for about ten minutes I've
finally figured out that it is actually telling me that BM25 /is/ what
we're using (with E=1), but with the BM11 term frobbed so that it
doesn't disappear on L=1. That could be a little clearer :-)

> Incidentally, it's odd that a document containing only one occurrence of
> the only search term doesn't match at 100% in Lucene.  How much better
> could it be?  (Well, actually "foo" is arguably a totally useless result
> since I already know it - but I doubt that idea is built into their
> weighting scheme).

:)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org