[Xapian-discuss] word pair indexing and querying

James Aylett james-xapian at tartarus.org
Thu Sep 21 14:29:29 BST 2006


On Thu, Sep 21, 2006 at 02:17:40PM +0100, Mark Hagger wrote:

> The problem with the DEFAULTOP=AND approach is that then a query for
> "garden centres bristol" or indeed "bristol garden centres that sell red
> plants" will not match the "garden centres" record, for obvious reasons.

True.

> In essence there are a number of cases where I'd like to add boolean
> keywords to the index for a record that are actually multi-word
> keywords, ie any of the individual words in isolation of the multi-word
> sequence are not enough to give a (good) match, but still allow an
> overall OR type query.

You could do this by adding terms generated from the multi-word
keywords and upweighting them (wdf > 1 in Document::add_posting()),
then adding the same terms to your queries behind the scenes. You'd
need to customise both indexing and query generation. (You don't need
boolean terms there; indeed, boolean will give you precisely the wrong
behaviour, I think.)
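
A minimal sketch of that idea, in plain Python without the Xapian
bindings so that it stays self-contained. The "XK" prefix, the wdf of
5 and all the function names are hypothetical choices for illustration;
in a real indexer the (term, wdf) pairs would be handed to something
like Document::add_posting(), and the extra query terms would be ORed
into the query you build.

```python
# Hypothetical prefix keeping compound terms apart from ordinary word terms.
COMPOUND_PREFIX = "XK"
# Hypothetical wdf for compound terms; any value > 1 upweights them.
COMPOUND_WDF = 5


def compound_term(phrase):
    """Turn a multi-word keyword such as 'garden centres' into one term."""
    return COMPOUND_PREFIX + "_".join(phrase.lower().split())


def index_terms(keywords):
    """Return {term: wdf} for the keywords attached to one record.

    Multi-word keywords become a single upweighted compound term;
    single-word keywords are kept as ordinary terms with wdf 1.
    """
    terms = {}
    for keyword in keywords:
        words = keyword.lower().split()
        if len(words) > 1:
            terms[compound_term(keyword)] = COMPOUND_WDF
        elif words:
            terms[words[0]] = 1
    return terms


def extra_query_terms(query, known_phrases):
    """Find known multi-word keywords inside a query string.

    Returns the compound terms to add to the query, in order of first
    appearance. Only contiguous runs of two or more words are checked.
    """
    words = query.lower().split()
    known = {tuple(p.lower().split()) for p in known_phrases}
    found = []
    for start in range(len(words)):
        for end in range(start + 2, len(words) + 1):
            if tuple(words[start:end]) in known:
                term = compound_term(" ".join(words[start:end]))
                if term not in found:
                    found.append(term)
    return found
```

With this, a query for "bristol garden centres that sell red plants"
picks up the compound term for "garden centres", while "wi cake
sellers" picks up nothing for a "wi fi hotspot" keyword.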

Have you considered playing with the parameters to the BM25 weighting
scheme? Omega doesn't currently expose these in a configurable way,
although it could fairly easily.
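
To get a feel for what those parameters do, here is a rough sketch of
the within-document part of the textbook BM25 formula. This is the
commonly published form, not necessarily what Xapian's BM25Weight
computes internally (it takes more parameters than the two shown), and
the function name and default values are my own assumptions.

```python
def bm25_tf_component(wdf, doc_len, avg_doc_len, k1=1.0, b=0.5):
    """Textbook BM25 term-frequency factor for one term in one document.

    k1 controls how quickly extra occurrences (higher wdf) stop adding
    weight; b controls how strongly long documents are penalised.
    The defaults here are illustrative only.
    """
    norm_len = doc_len / avg_doc_len
    return wdf * (k1 + 1) / (wdf + k1 * (1 - b + b * norm_len))


if __name__ == "__main__":
    # How much a wdf of 5 gains over a wdf of 1 for a few values of k1,
    # in a document of average length.
    for k1 in (0.5, 1.0, 2.0):
        low = bm25_tf_component(1, 100, 100, k1=k1)
        high = bm25_tf_component(5, 100, 100, k1=k1)
        print(k1, round(high / low, 3))
```

The point for upweighting is that a larger k1 lets a high wdf count for
more, while k1 = 0 makes wdf irrelevant; b = 0 switches off the length
normalisation entirely.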

> Consider the example of a "wifi hotspot" record, I'd notionally like the
> 5 keywords:
> 
> wifi
> wi fi
> wi fi hotspot
> wi fi hot spot
> wifi hot spot
> 
> But clearly it would be less than useful for a query for "wi cake
> sellers" to match this record, nor indeed a search for "red spot on
> chin" to match.

There are two ways of approaching this; the above is one. The other is
not to try to exclude the less helpful matches at all, and merely to
ensure that the more relevant matches rank higher (i.e. that Xapian
actually gives them a higher relevance in the match set).

If you have a very specific query domain, you could identify
significant keywords and upweight them (you need a modified indexer
for this). Again, there may also be something you can do by tuning the
BM25Weight parameters, or perhaps by combining the two.
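
As a sketch of what the modified indexer might do, again in plain
Python rather than against the Xapian API: the function name, the boost
of 3 and the whitespace tokenisation are all assumptions made for the
example, and a real indexer would also stem and strip punctuation.

```python
from collections import Counter


def term_wdfs(text, significant, boost=3):
    """Count terms in text, giving significant keywords a larger wdf.

    Each occurrence of a word in `significant` adds `boost` to its wdf;
    every other word adds 1. The resulting wdf values are what an
    indexer would pass on when adding each term to the document.
    """
    wdfs = Counter()
    for word in text.lower().split():
        wdfs[word] += boost if word in significant else 1
    return dict(wdfs)
```

For a record about a wifi hotspot you might treat "wifi" and "hotspot"
as significant, so that they outweigh incidental words like "spot"
when the results are ranked.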

It's difficult to suggest general approaches to this, of
course. Others may have some more useful comments at this stage :-)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


