[Xapian-discuss] Matching exact phrases only

James Aylett james-xapian at tartarus.org
Wed Aug 9 11:06:19 BST 2006


On Tue, Aug 08, 2006 at 06:49:32PM +0000, Chris Good wrote:

> Let's say we have records containing the following:
> 
> big red bus
> red letter day
> sent red bill
> blue
> red sky at night shepherds delight
> blue taxi
> empty blue taxi
> sky consulting
> blue sky consulting
> sky blue
> 
> A search for "Red" would match 'big red bus', 'red letter day' and
> 'sent red bill', all of which would yield a 100% match.
> 
> A search for "blue" meanwhile would have a range of scores, from 100% down.
> Likewise "sky consulting" would yield 'sky consulting' downwards. 

Okay, I think I understand - the three red results are given the same
percentage score because the documents are the same length.

If so, you want to either reconfigure the BM25 weighting scheme, or
write your own (probably the former). That will require some hacking
in Omega, but not much - the main thing is figuring out what
parameters you want. I'd do that by plumbing it into python so you can
quickly change variables and test.

I think you want to downplay the termweight and term wdf values in
BM25, and play up term wqf (possibly), normalised document length and
query length. Changing the k2_ parameter of BM25Weight's constructor
to a non-zero value may be enough for you - that introduces an
n_q / (1 + L_d) component into the weight calculation, where n_q is
the size of the query and L_d is the normalised document length.
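
To get a feel for how the n_q / (1 + L_d) expression behaves on the
example records, here is a rough pure-Python sketch. It does not call
Xapian at all; the function names are made up for illustration, the
expression is copied from the description above, and treating it as a
separate per-document contribution scaled by k2 is an assumption (the
real implementation may scale or combine it differently).

```python
# Rough sketch only: evaluates k2 * n_q / (1 + L_d) for the example records.
# Function names are hypothetical; this is not Xapian's code.

RECORDS = [
    "big red bus",
    "red letter day",
    "sent red bill",
    "blue",
    "red sky at night shepherds delight",
    "blue taxi",
    "empty blue taxi",
    "sky consulting",
    "blue sky consulting",
    "sky blue",
]


def normalised_length(doc, records):
    """L_d: length of doc in terms, divided by the average record length."""
    average = sum(len(r.split()) for r in records) / len(records)
    return len(doc.split()) / average


def k2_component(doc, query, records, k2):
    """The n_q / (1 + L_d) expression, scaled by k2 (assumed scaling)."""
    n_q = len(query.split())  # size of the query, in terms
    return k2 * n_q / (1.0 + normalised_length(doc, records))


# Shorter records get a larger value, so 'blue' is favoured over
# 'blue taxi', which is favoured over 'empty blue taxi'.
for doc in ("blue", "blue taxi", "empty blue taxi"):
    print(doc, round(k2_component(doc, "blue", RECORDS, k2=1.0), 3))
```

With k2 left at zero the component vanishes, which is consistent with
it having no effect until you change the parameter.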

The b parameter, which fiddles with the relative importance of term
wdf and document length, may help you as well. You may get fewer
unexpected results by playing with that, but you need to do the tuning
with a reasonable set of queries to figure out what the right approach
is for you.
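
As a rough illustration of what b does, the sketch below uses the
textbook form of the BM25 wdf component rather than Xapian's actual
code. The function name and the default values for k1 and b are
assumptions chosen for the example, and the termweight factor is left
out entirely.

```python
# Rough sketch only: textbook BM25 wdf component, not Xapian's implementation.
# wdf_component is a hypothetical name; k1 and b defaults are illustrative.


def wdf_component(wdf, norm_len, k1=1.0, b=0.5):
    """wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * L_d))."""
    return wdf * (k1 + 1.0) / (wdf + k1 * ((1.0 - b) + b * norm_len))


def length_gap(b, short_len=0.5, long_len=2.0, wdf=1):
    """How much more a short document scores than a long one for a given b."""
    return wdf_component(wdf, short_len, b=b) - wdf_component(wdf, long_len, b=b)


# b = 0 ignores document length; larger b makes length matter more.
for b in (0.0, 0.5, 1.0):
    print("b =", b, "gap =", round(length_gap(b), 3))
```

Running something like this over a set of representative queries, with
the real weights from Xapian in place of this formula, is one way to do
the tuning.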

(I don't actually have any practical experience of this; if Richard is
around he may be able to provide a more helpful and definitive
answer.)

<http://xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html>

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


