[Xapian-devel] Participation in GSOC

Olly Betts olly at survex.com
Mon Apr 4 03:22:45 BST 2011


Dan's given a good general answer, but to pick up on a few details of
your suggestions:

On Wed, Mar 30, 2011 at 12:07:26AM +0200, Michael Thomas wrote:
> * word-distance weighting: so documents which contain the query terms
> with close distance to each other get higher scores

The tricky part of this is doing it efficiently - if you have to read
all the positional data for every term in the query for every potential
match, this isn't likely to scale to really large databases.  So you
want to be able to cull as many candidates as you can based on other
factors before considering this.  There are similar issues for phrase
searches.
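
As a rough illustration of the "cull first, read positions later" idea, here
is a minimal Python sketch. It is not how Xapian's matcher works internally;
the function names, the `keep` cutoff, and the `bonus / (1 + span)` formula
are all made up for the example. The point is only that positional data is
looked up for the shortlisted candidates and for nothing else.

```python
from itertools import product


def min_span(positions_per_term):
    """Smallest window covering one occurrence of each term (brute force)."""
    return min(max(combo) - min(combo) for combo in product(*positions_per_term))


def rerank_by_proximity(candidates, positions, keep=2, bonus=1.0):
    """Re-rank the best `keep` candidates using term proximity.

    candidates: {doc_id: cheap_score} from a weighting that needs no positions.
    positions:  {doc_id: [[pos, ...] for each query term]} (hypothetical layout).
    """
    # Stage 1: cull on the cheap score alone; no positional data is read here.
    shortlist = sorted(candidates, key=candidates.get, reverse=True)[:keep]

    # Stage 2: read positions only for the survivors and add a proximity bonus.
    rescored = {}
    for doc in shortlist:
        span = min_span(positions[doc])
        rescored[doc] = candidates[doc] + bonus / (1 + span)
    return sorted(rescored, key=rescored.get, reverse=True)


# "c" is culled in stage 1, so its positions are never needed.
cheap_scores = {"a": 2.0, "b": 1.9, "c": 0.5}
term_positions = {"a": [[1, 40], [90]], "b": [[3], [4]]}
ranking = rerank_by_proximity(cheap_scores, term_positions)
```

Here "b" overtakes "a" because its two terms are adjacent, while the entry
for "c" can be left out of `term_positions` entirely.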

> * location based weighting: terms, that appear in the top of the
> document are generally more important

This is already possible by giving terms at the start a wdf boost -
like in the second example here:

http://trac.xapian.org/wiki/FAQ/ExtraWeight

It's pretty common to apply this technique to the title, and (with a
smaller boost) to any summary or abstract.
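
To make the effect of a wdf boost concrete, here is a small standalone Python
sketch that only models the bookkeeping: each field is indexed with its own
wdf increment, and the increments add up per term. In Xapian itself this is
what the wdf_inc argument of TermGenerator::index_text() is for; the
`index_fields` helper and the boost values 5 and 2 below are invented for
the example, not recommended settings.

```python
from collections import Counter


def index_fields(fields):
    """Accumulate per-term wdf from (text, wdf_inc) pairs.

    Hypothetical helper: splits on whitespace and lowercases, with no
    stemming or stopword handling.
    """
    wdf = Counter()
    for text, wdf_inc in fields:
        for term in text.lower().split():
            wdf[term] += wdf_inc
    return wdf


doc_wdf = index_fields([
    ("Xapian weighting", 5),                                  # title, large boost
    ("notes on weighting schemes", 2),                        # abstract, smaller boost
    ("the body mentions weighting once and xapian once", 1),  # body, no boost
])
```

A term that appears once each in title, abstract and body ends up with a wdf
of 8 here, against 1 for a term that only appears once in the body.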

> * size based weighting: longer documents tend to be more important, than
> shorter ones, as they contain more words

Document length is already factored in - the b parameter of BM25Weight
tunes this.

You don't need to explicitly give extra weight to longer documents,
as they get it already by virtue of being longer - a 12 page document
will naturally have a higher wdf for relevant terms than a 1 page
document.  So in fact you want to counter this effect if anything.
In BM25Weight, b=0 means "no adjustment", while b=1 scales wdf down
proportional to document length.  The default is 0.5.
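
The effect of b can be seen with the textbook BM25 term-frequency factor,
sketched below in Python. This is a simplified stand-in rather than Xapian's
exact code (the real BM25Weight has further parameters that are not modelled
here), and the document lengths and wdf values are made-up numbers chosen so
that both documents have the same term density.

```python
def bm25_wdf_component(wdf, doc_len, avg_len, k1=1.0, b=0.5):
    """Textbook BM25 term-frequency factor with length normalisation.

    b=0 ignores document length; b=1 normalises fully by doc_len / avg_len.
    """
    norm_len = doc_len / avg_len
    return wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * norm_len))


AVG_LEN = 650.0
SHORT = dict(wdf=2, doc_len=100)    # 1 occurrence per 50 terms
LONG = dict(wdf=24, doc_len=1200)   # same density, 12 times longer

for b in (0.0, 0.5, 1.0):
    short_score = bm25_wdf_component(avg_len=AVG_LEN, b=b, **SHORT)
    long_score = bm25_wdf_component(avg_len=AVG_LEN, b=b, **LONG)
    print(f"b={b}: short={short_score:.3f} long={long_score:.3f}")
```

With b=0 the longer document scores higher purely because of its larger wdf;
with b=1 the two documents, having the same density, come out equal.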

Cheers,
    Olly


