[Xapian-discuss] word pair indexing and querying

richard at lemurconsulting.com
Thu Sep 21 18:11:48 BST 2006


This is interesting, and something which has come up a couple of times now
(though I'm still not entirely sure why people want it - I'll come to that
later).

What Xapian does (for the simple case of a search which is a set of terms
ORred together) is to search through the database for all documents
containing the terms in the query, and calculate a score for each document
based on the terms in the query which are contained in the document.
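To make that concrete, here is a rough sketch in Python of scoring an OR query over a toy in-memory corpus. It is not Xapian code: the corpus, the function name or_query_scores, and the simple idf-style weight are all assumptions made for illustration, standing in for a real weighting scheme such as BM25.

```python
import math

# Toy corpus: document id -> list of terms (hypothetical data for illustration).
DOCS = {
    1: ["job", "centre"],
    2: ["centre"],
    3: ["job", "advert", "listing"],
    4: ["bus", "timetable"],
}


def or_query_scores(query_terms, docs):
    """Score every document that contains at least one query term.

    Each matching term contributes an idf-style weight. This is a
    stand-in for a real weighting scheme, not Xapian's BM25.
    """
    n_docs = len(docs)
    scores = {}
    for term in set(query_terms):
        # Find the documents that contain this query term.
        matching = [doc_id for doc_id, terms in docs.items() if term in terms]
        if not matching:
            continue
        # Rarer terms get a larger weight.
        idf = math.log(1 + n_docs / len(matching))
        for doc_id in matching:
            scores[doc_id] = scores.get(doc_id, 0.0) + idf
    return scores


scores = or_query_scores(["job", "centre"], DOCS)
```

Note that document 3 is still returned even though most of its terms are not in the query: only the query terms it does contain affect its score, and document 4, which contains none of them, is not returned at all.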

The document length is taken into account to some extent, but what you
appear to want is for a document containing terms which are _not_ in the
query to be heavily penalised.

This isn't the usual intention with a Xapian search - the idea is that even
if only some parts of the document are relevant, those parts are worth
returning.

I think we _could_ produce the kind of result you want using a custom
weighting object (or, possibly just using the right parameters to the
standard BM25Weight object).  However, the document weights are normalised
after the match process by checking how many of the query terms occur in
the top ranked document: thus a query which contains only one term will
always give a score of 100% to its top ranked document.

In the test case you're working on, if there were a document in the database
which contained _only_ the term "centre", this document would be returned
by the search at 100%, and the "job centre" document would be returned with
a (slightly) lower score.
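A simplified model of that normalisation, as a Python sketch. The function to_percentages, the document names, and the raw weights are hypothetical, and the scaling rule (top document gets 100% times the fraction of query terms it contains) is my reading of the behaviour described above rather than Xapian's actual code.

```python
def to_percentages(weights, doc_terms, query_terms):
    """Convert raw weights to percentages relative to the top document.

    Simplified model: the top-ranked document is given 100% scaled by
    the fraction of distinct query terms it contains, and every other
    document is scaled in proportion to its raw weight.
    """
    if not weights:
        return {}
    query = set(query_terms)
    top_doc = max(weights, key=weights.get)
    matched = len(query & set(doc_terms[top_doc]))
    factor = 100.0 * matched / len(query) / weights[top_doc]
    return {doc_id: w * factor for doc_id, w in weights.items()}


# Hypothetical raw weights for the single-term query "centre", assuming the
# weighting scheme gives the shorter document a slightly higher weight.
doc_terms = {"centre-only": ["centre"], "job-centre": ["job", "centre"]}
weights = {"centre-only": 1.0, "job-centre": 0.9}

percentages = to_percentages(weights, doc_terms, ["centre"])
```

With a one-term query the top document necessarily contains every query term, so it always comes out at 100%, whatever else it contains.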

I have to ask though: what is wrong with a search for "centre" returning a
document about "job centre"?  Would you object to a search for "job"
returning the document?  Do you think your users would really be confused
by a search for "centre" returning a document about "job centre"?  If you
can explain why this is a problem, we'll be more able to do something about
it.

-- 
Richard


