[Xapian-discuss] Lucene ranking

Thu Oct 28 16:25:57 BST 2004

On Thu, Oct 28, 2004 at 03:46:38PM +0100, James Aylett wrote:
> Kevin Burton has posted about poor ranking in Lucene preferring
> shorter documents over longer ones[1].

It's not totally clear he's right, actually.  After all, which would you
prefer - a document which tells you what you want to know?  Or 3 copies
of the same document appended to each other?

The example is just a bit too artificial.

> Anyone know what Lucene is doing here? Their FAQ doesn't mention what
> weighting scheme they use, and I don't have time to investigate
> further right now ...

I'd guess there's a mechanism in Lucene's weighting scheme to counteract
the natural tendency of weighting schemes to prefer long documents (as
there is in BM25 which Xapian uses).  Without such a mechanism long
documents will tend to rank highly because they tend to have high
within-document-frequency.  Long documents match disproportionately many
queries anyway (because they typically contain more distinct words).  It
sounds like perhaps their mechanism is a little too aggressive.
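
To make the length-normalisation point concrete, here is a minimal sketch
of the textbook BM25 term weight applied to the "3 copies appended" case.
The function name, the fixed idf of 1.0, and the document lengths are
illustrative assumptions, not Xapian's or Lucene's actual code (Xapian's
implementation has further parameters and may differ in detail).

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (illustrative only).

    b controls length normalisation: b=0 ignores document length,
    b=1 normalises fully by doc_len / avg_doc_len.
    """
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)

# Hypothetical documents: one of average length with a single occurrence,
# and the same text appended to itself three times.
single = dict(tf=1, doc_len=100, avg_doc_len=100)
tripled = dict(tf=3, doc_len=300, avg_doc_len=100)

for b in (0.0, 0.75, 1.0):
    s = bm25_term_weight(b=b, **single)
    t = bm25_term_weight(b=b, **tripled)
    print(f"b={b:.2f}  single={s:.3f}  tripled={t:.3f}")
```

With no normalisation (b=0) the tripled document wins purely on its higher
within-document-frequency; with full normalisation (b=1) the two come out
equal; values in between still favour the longer copy slightly.  A scheme
that ranks the tripled document below the single one is penalising length
more strongly than even b=1 does in this formula.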

Incidentally, it's odd that a document containing only one occurrence of
the only search term doesn't match at 100% in Lucene.  How much better
could it be?  (Well, actually "foo" is arguably a totally useless result
since I already know it - but I doubt that idea is built into their
weighting scheme).

Cheers,
    Olly