[Xapian-discuss] word pair indexing and querying
Daniel Ménard
Daniel.Menard at Bdsp.tm.fr
Fri Sep 22 09:21:09 BST 2006
Mark Hagger wrote:
>Consider the example of a "wifi hotspot" record, I'd notionally like the
>5 keywords:
>
>wifi
>wi fi
>wi fi hotspot
>wi fi hot spot
>wifi hot spot
>
>But clearly it would be less than useful for a query for "wi cake
>sellers" to match this record, nor indeed a search for "red spot on
>chin" to match.
>
>
Perhaps computing the distance (the similarity) between the user's
request and the keywords of the hits would help?
The hits would be post-processed with something like this (pseudo code
using the API, not Omega):
- restrict the number of hits returned by Xapian, by using
enquire->set_cutoff() and perhaps forcing the first term of the request
to be present (add a '+' sign before it).
- get the number of terms in the user request (R = query->get_length()).
Then iterate over the mset (ordered by relevance) and for each hit:
- get the number of matching terms for this hit (call it M). Xapian does
not seem to have something like get_matching_terms_count(), but
iterating with enquire->get_matching_terms_begin/end and counting will
do the job.
- compute M/R: the fraction of the terms of the request which matched
this hit. If this is 'too low', forget this hit.
Otherwise, go on doing the same, but with the document keywords:
- compute K, the number of terms in the keywords fields of the current
hit (you will have to tokenize and count them yourself, perhaps storing
the count during indexing).
- compute R/K: the number of terms in the request relative to the number
of terms in the keywords of the hit (roughly, how much of the keywords
the request covers). If this is 'too low', forget this hit.
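The two tests above can be sketched in a few lines of Python. This is
only an illustration under some assumptions: the terms are passed in as
plain lists so that it runs without Xapian (in a real program the
matching terms would come from the Enquire matching-terms iterators),
K is counted as the number of distinct keyword terms, and the names
score_hit and filter_hits are made up for the example. The 0.65 default
is simply the figure quoted below.

```python
def score_hit(request_terms, matching_terms, keyword_terms, too_low=0.65):
    """Apply the M/R and R/K tests to one hit.

    Returns (keep, m_over_r, r_over_k); r_over_k is None when the hit
    has already been rejected by the M/R test.
    """
    request = set(request_terms)
    r = len(request)                        # R: terms in the request
    m = len(request & set(matching_terms))  # M: request terms matching this hit
    k = len(set(keyword_terms))             # K: distinct keyword terms (assumption)
    if r == 0 or k == 0:
        return (False, 0.0, None)
    m_over_r = m / r
    if m_over_r < too_low:
        return (False, m_over_r, None)
    r_over_k = r / k
    return (r_over_k >= too_low, m_over_r, r_over_k)


def filter_hits(request_terms, hits, too_low=0.65):
    """Keep the ids of hits that pass both tests.

    hits is a list of (doc_id, matching_terms, keyword_terms) tuples,
    assumed to be already ordered by relevance.
    """
    kept = []
    for doc_id, matching, keywords in hits:
        keep, _, _ = score_hit(request_terms, matching, keywords, too_low)
        if keep:
            kept.append(doc_id)
    return kept


# "red spot on chin" only matches "spot": M/R = 1/4, so the hit is dropped.
hits = [("hotspot-record", ["spot"], ["wi", "fi", "hot", "spot"])]
print(filter_hits(["red", "spot", "on", "chin"], hits))
```

Note that R/K can be greater than 1 when the request is longer than the
keywords field; the sketch does not cap it.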
This is similar to something I did before (but not with Xapian) for
detecting potential duplicates between records. For our needs, 'too
low' was around 65%.
This is a crude approach, and it won't work if you use wildcards in the
query, but it worked reasonably well for us, and you can compute a new
score for the hit by combining the Xapian score, M/R and R/K.
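How to combine the three values is left open here. One possible
combination, purely as an assumption for illustration, is a plain
product of the Xapian percentage score (scaled to 0..1), M/R, and R/K
capped at 1. The function name combined_score is hypothetical.

```python
def combined_score(xapian_percent, m_over_r, r_over_k):
    """Combine the Xapian score with M/R and R/K into one value in 0..1.

    The product used here is an assumption, not a recommended formula:
    any of the three factors being low pulls the whole score down.
    """
    xapian_part = max(0.0, min(xapian_percent / 100.0, 1.0))
    coverage = min(r_over_k, 1.0)  # R/K may exceed 1, so cap it
    return xapian_part * m_over_r * coverage


# A hit scored 80% by Xapian, with every request term matching and the
# request half as long as the keywords field.
print(combined_score(80, 1.0, 0.5))
```

A weighted sum would work too; the product is just the simplest choice
that penalises a hit as soon as one of the ratios is poor.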
Perhaps computing the Levenshtein distance between the request and the
keywords field would do the same on a more scientific basis...
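For reference, a minimal sketch of the Levenshtein (edit) distance on
plain strings, using the usual row-by-row dynamic programming. The
similarity helper, which normalises the distance by the longer string's
length, is an assumption added for the example and not part of the
distance itself.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: 1.0 for identical strings, 0.0 for
    completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Only the two inserted spaces differ, so the distance is 2.
print(levenshtein("wifi hotspot", "wi fi hot spot"))
```

This works on characters; the same function could be applied to lists
of terms instead of strings if a word-level distance is preferred.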
my two cents...
Daniel