[Xapian-discuss] word pair indexing and querying

Olly Betts olly at survex.com
Thu Sep 21 18:00:48 BST 2006


On Thu, Sep 21, 2006 at 05:42:38PM +0100, Mark Hagger wrote:
> Except that's not really going to work very well, here's an example query
> on one of our development databases:
> 
> http://staging.gjm.info/cgi-bin/omega?P=centre&DB=Business52-GB&FMT=xml
> 
> This gives a 100% relevance hit against "job centre", so not much scope
> for a cut-off there, and for the record in this application I'd need
> this query to produce either nothing or at worst a low relevance hit
> against "job centre".

TBH, for a query as vague as "centre", "job centre" seems a good match
to me.  As a user, I'm not sure what I'd be expecting to get for a query
for "centre"...

Currently, if we have a match which includes all the terms in the query,
we peg that as 100% and scale other matches proportionally.  If the
highest scoring match doesn't include all the terms in the query, we
make its percentage score depend on the weights of the terms which
do and don't match.
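
To make that concrete, here is a rough sketch of the idea in Python.
It is not the actual Xapian code: the function name, the input shapes
and the exact formula for the partial-match case (matched term weight
divided by total term weight) are assumptions for illustration only.

```python
def scale_percentages(hits, query_term_weights):
    """Simplified model of percentage scaling (illustrative only).

    hits: list of (weight, matched_terms) pairs, best match first.
    query_term_weights: dict mapping each query term to its weight.
    Returns one percentage per hit, in the same order.
    """
    if not hits:
        return []
    top_weight, top_terms = hits[0]
    total = sum(query_term_weights.values())
    if total <= 0 or top_weight <= 0:
        return [0.0 for _ in hits]
    # Assumed rule: the top hit gets 100% only if it matches every
    # query term; otherwise it gets the share of term weight it matched.
    matched = sum(query_term_weights[t] for t in top_terms)
    top_percent = 100.0 * matched / total
    # Every other hit is scaled in proportion to the top hit's weight.
    return [top_percent * weight / top_weight for weight, _ in hits]


# A one-term query: any document containing the term matches "all the
# terms", so the best hit is pegged at 100%.
print(scale_percentages([(0.8, {"centre"})], {"centre": 0.8}))
```

This also shows why a single-term query such as "centre" gives 100% for
its best hit: that hit trivially contains every term in the query.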

For some weighting schemes, working out percentages of the
"max_possible" weight (MSet::get_max_possible()) might be a better
approach, but for BM25, max_possible is generally substantially higher
than any weight you get in real situations, which is why we use the
scheme above.

> (I would point out that this database has very little in it, just under
> 100 records.)
> 
> It is starting to look suspiciously as if Xapian just isn't going to be
> the way to go here, in truth even the biggest dataset that I'd be
> playing with here won't be more than about 100k records.

You seem to have very short documents and a very particular idea of what
constitutes a good match.

With so few records, you can afford to run multiple Xapian queries
or perform post-processing of results, which might help.  If you're
happy with the ordering of the hits, you could, for example, look at
the weights returned, how many terms match, and max_possible, then
compute your own relevance percentage and decide when to stop showing
matches.
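
As a rough illustration of that kind of post-processing, here is a
sketch in Python.  It works on plain tuples rather than on Xapian
objects, and the scoring formula, the function name and the default
cut-off are all made up for the example; you would feed it the
per-hit weight, the number of matching terms and max_possible taken
from your own results.

```python
def custom_relevance(hits, max_possible, num_query_terms,
                     cutoff_percent=30.0):
    """Compute a home-made relevance percentage and apply a cut-off.

    hits: list of (doc_id, weight, num_matched_terms), in the order
          returned by the search (best first).
    max_possible: upper bound on the weight any document could get.
    num_query_terms: number of terms in the query.
    Returns (doc_id, percent) pairs, stopping at the first hit whose
    percentage falls below cutoff_percent.
    """
    if max_possible <= 0 or num_query_terms <= 0:
        return []
    shown = []
    for doc_id, weight, num_matched in hits:
        # Hypothetical formula: share of the maximum possible weight,
        # scaled down by the fraction of query terms that matched.
        coverage = num_matched / num_query_terms
        percent = 100.0 * (weight / max_possible) * coverage
        if percent < cutoff_percent:
            break  # keep the original ordering, just stop showing hits
        shown.append((doc_id, percent))
    return shown


hits = [("a", 4.0, 2), ("b", 3.0, 1), ("c", 1.0, 2)]
print(custom_relevance(hits, max_possible=8.0, num_query_terms=2,
                       cutoff_percent=25.0))
```

Because the percentage is measured against max_possible rather than
against the best hit, a lone weak match no longer comes out as 100%.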

Cheers,
    Olly


