[Xapian-discuss] word pair indexing and querying
olly at survex.com
Fri Sep 22 08:50:55 BST 2006
On Fri, Sep 22, 2006 at 07:06:06AM +0000, Chris Good wrote:
> Olly Betts wrote:
> > Is it OK (indeed desirable) to return "mediocre" matches provided they
> > have low scores?
> That would be fine, low scoring results are filtered out anyway.
How about literally generating (probably in a semi-automated way) a list
of all the queries which should match a particular document in the
database, and adding terms for each of them (with normalisation - I'd
suggest at least dropping case and punctuation, and probably stemming;
perhaps also sort the list of words, though there are examples where
word order matters, such as "bath oil" vs "oil bath", and you can always
include all permutations of the words where it's useful). So a document
would have terms like "job centr" and "footbal pitch".
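As a rough sketch of the normalisation step described above (not tested against a real database), something like the following would turn each listed query into a single term string. The function name phrase_to_term and the toy_stem helper are made up for illustration; toy_stem is only a crude stand-in, and a real stemmer (for example the English stemmer Xapian provides) would be passed in instead.

```python
import re

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: just drops a trailing "e".
    # A real setup would plug in a proper English stemmer here.
    return word[:-1] if word.endswith("e") else word

def phrase_to_term(phrase, stem=lambda w: w, sort_words=False):
    """Normalise one listed query into a single term string."""
    # Lower-case and keep only runs of letters/digits, which drops punctuation.
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    words = [stem(w) for w in words]
    if sort_words:
        # Optional: ignore word order ("oil bath" and "bath oil" collapse).
        words = sorted(words)
    return " ".join(words)

print(phrase_to_term("Job Centre!", stem=toy_stem))   # job centr
print(phrase_to_term("Oil Bath", sort_words=True))    # bath oil
```

Each resulting string would then be added to the document as one term. Leaving sort_words off keeps "bath oil" and "oil bath" distinct, which is the case where word order matters.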
Then process the query by generating all the consecutive groups of words
within it (including the whole query, and also single words if you chose
any in the previous step).
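The query side could be sketched along these lines, assuming the query has already been split into words and normalised the same way as the document terms. The name consecutive_groups and the include_single flag are illustrative only.

```python
def consecutive_groups(words, include_single=True):
    """Return every run of consecutive words, longest first.

    The whole query is always included; single words are included only
    if include_single is true (or if the query is just one word).
    """
    n = len(words)
    groups = []
    for length in range(n, 0, -1):
        if length == 1 and not include_single and n > 1:
            break
        for start in range(n - length + 1):
            groups.append(" ".join(words[start:start + length]))
    return groups

print(consecutive_groups(["footbal", "pitch", "hire"]))
# ['footbal pitch hire', 'footbal pitch', 'pitch hire',
#  'footbal', 'pitch', 'hire']
```

Each string in the result is then looked up as a term, with the lookups combined using OR so that a document matching several groups scores higher than one matching only a single word.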
Xapian should take care of ranking multiple matches in a reasonable way
and you should get scores which discriminate between good and mediocre
matches in the kind of ways you want.
The really nice feature is you can say exactly why a particular document
does or doesn't match, and add/remove an entry in the underlying query
list to correct for undesirable false positives and false negatives.