[Xapian-discuss] Strange Weighting issue

Richard Boulton richard at tartarus.org
Tue Sep 8 15:05:21 BST 2009


2009/9/8 John Wards <jwards at whiteoctober.co.uk>

> Our client has asked us for the ability to boost documents on a
> document type basis. So we have a scale of 1-5 for the client to pick
> from and in the background we had a sort of logarithmic scale between
> 1 and 100.
>

(I assume that $indexer in your code is an instance of
Xapian::TermGenerator)

The issue here is that the second parameter to TermGenerator's index_text()
method, although it is called "weight", is actually a multiplier applied to
the frequency of each term within the document (its wdf).  In other words,
setting the weight in this situation to 100 is equivalent to repeating the
title 100 times (except that the positional information generated will be
different).  This increases the frequency of the term within the document,
but also increases the length of the document.

The default Xapian weighting formula (BM25) has a component which reduces the
weight of large documents; the theory being that a small document with relevant
information in it is better than a large one, because the small one is
likely to be more tightly focussed on the topic.  As you're seeing, this
compensation can mean that the same document repeated 100 times is
considered less good than a single repetition of the document.
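
To get a feel for the length compensation, here is a rough sketch using a
simplified BM25-style weight for a single term.  It is not Xapian's exact
formula: the function name, the parameter values (k1 = 1, b = 0.5) and the
document sizes are all assumptions picked for illustration, and the idf factor
is left out because it is the same in both cases.

```python
def term_weight(wdf, doc_len, avg_len, k1=1.0, b=0.5):
    """Simplified BM25-style weight for one term (idf factor omitted)."""
    norm_len = doc_len / avg_len              # length relative to the average
    k = k1 * ((1 - b) + b * norm_len)         # longer documents get a larger k
    return (k1 + 1) * wdf / (k + wdf)         # saturates as wdf grows


AVG_LEN = 200.0  # assumed average document length across the collection

# Unboosted document: 10 title terms + 190 body terms, each query term seen once.
plain = term_weight(wdf=1, doc_len=200, avg_len=AVG_LEN)

# Same document with the title indexed at "weight" 100: the title term's wdf
# becomes 100, and the document length grows from 200 to 190 + 10 * 100.
boosted_len = 190 + 10 * 100
boosted_title = term_weight(wdf=100, doc_len=boosted_len, avg_len=AVG_LEN)
boosted_body = term_weight(wdf=1, doc_len=boosted_len, avg_len=AVG_LEN)

print(plain, boosted_title, boosted_body)
```

With these made-up numbers, the weight of the title term less than doubles
despite the 100x multiplier, while a query term that only occurs in the body
of the same document loses more than half its weight, because the document
now looks several times longer than average.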

While the "weight" parameter you're using can be useful for giving
particular fields a bonus weight compared to others, in general it is too
blunt an instrument to be used to reflect document-wide weights.

Instead, if you're using the 1.0 release series, I recommend using a single
extra term, added to all the documents, but with a weight chosen according
to the document's importance.  Using a single term shouldn't disrupt the
document length so much, but should allow you to weight your searches
appropriately (you'll need to include that term in all your queries, of
course: combine your existing queries with the weight term using the
AND_MAYBE operator).
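
A minimal sketch of that approach with the Python bindings might look like
the following.  The term name "XBOOST", the helper names and the mapping from
the 1-5 scale onto 1-100 are my own assumptions, not anything Xapian
prescribes; the only Xapian calls used are Document.add_term() and the Query
constructor with OP_AND_MAYBE.

```python
BOOST_TERM = "XBOOST"  # hypothetical term name; pick one your queries never generate


def boost_wdf(level, max_wdf=100):
    """Map an importance level 1-5 onto a logarithmic scale from 1 to max_wdf."""
    if not 1 <= level <= 5:
        raise ValueError("level must be between 1 and 5")
    return round(max_wdf ** ((level - 1) / 4))


def add_boost_term(doc, level):
    """Add the single boost term to a xapian.Document, with wdf set by the level."""
    doc.add_term(BOOST_TERM, boost_wdf(level))


def with_boost(query):
    """Combine an existing xapian.Query with the boost term using AND_MAYBE."""
    import xapian  # imported here so the helpers above work without the bindings
    # AND_MAYBE: match on the left-hand query only, add the right-hand weight if present.
    return xapian.Query(xapian.Query.OP_AND_MAYBE, query, xapian.Query(BOOST_TERM))
```

You would call add_boost_term(doc, level) for every document at index time,
and pass each parsed query through with_boost() before handing it to the
Enquire object.  How far the extra term actually moves a document up the
ranking depends on the weighting scheme in use, so it is worth checking the
result against real data.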

If you're using the 1.1 development series, you can store an arbitrary
numeric weight for each document, and use a ValueWeightPostingSource to add
this weight to the standard weight of the document.  Using this technique,
you can specify exactly what weight bonus you want to be added to each
document.

> If I re-index every document with a weight of 0 and leave the other
> document as a weight of 100 it appears as the top document.
>

This makes sense - the documents indexed with a weight of 0 will be given a
total weight of zero, so the one with 100 weight will come top.

-- 
Richard