[Xapian-discuss] word pair indexing and querying

Fri Sep 22 09:21:09 BST 2006

Mark Hagger wrote:

>Consider the example of a "wifi hotspot" record, I'd notionally like the
>5 keywords:
>
>wifi
>wi fi
>wi fi hotspot
>wi fi hot spot
>wifi hot spot
>
>But clearly it would be less than useful for a query for "wi cake
>sellers" to match this record, nor indeed a search for "red spot on
>chin" to match.
>  
>
Perhaps computing the distance (the similarity) between the user's
request and the keywords of the hits would help?

The hits would be post-processed with something like this (pseudocode
using the API, not Omega):

- restrict the number of hits returned by Xapian, by using
enquire->set_cutoff() and perhaps forcing the first term of the request
to be present (add a '+' sign before it).

- get the number of terms in the user request (R = query->get_length())

then, iterate over the mset (ordered by relevance) and for each hit:
- get the number of matching terms for this hit (call it M). Xapian does
not seem to have something like get_matching_terms_count(), but
iterating with enquire->get_matching_terms_begin()/get_matching_terms_end()
and counting will do the job.
- compute M/R: the fraction of the request's terms that match this hit.
If this is 'too low', forget this hit.
Otherwise, do the same with the document keywords:
- compute K, the number of terms in the keywords field of the current
hit (you will have to tokenize and count them yourself, perhaps storing
the count during indexing)
- compute R/K: the ratio of the request's term count to the number of
keyword terms, i.e. roughly how much of the hit's keywords the request
covers. If this is 'too low', forget this hit.
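
The two tests above can be sketched in plain Python, without Xapian.
This is only an illustration: keep_hit() and its too_low parameter are
hypothetical names, and in a real application M would come from counting
the matching terms reported by Xapian and K from the stored keyword
count.

```python
def keep_hit(m, r, k, too_low=0.65):
    """Decide whether a hit survives the M/R and R/K tests.

    m: number of request terms that matched this hit (M)
    r: number of terms in the user request (R)
    k: number of terms in the hit's keywords field (K)
    too_low: threshold below which the hit is dropped (assumed value)
    """
    if r == 0 or k == 0:
        return False              # nothing to compare against
    if m / r < too_low:
        return False              # too few request terms matched
    return r / k >= too_low       # request covers enough of the keywords


# Stand-in data: here M is simply the number of request terms found in
# the keywords, which only approximates what Xapian would report.
keywords = ["wi", "fi", "hot", "spot"]
for request in (["wi", "cake", "sellers"],
                ["red", "spot", "on", "chin"],
                ["wi", "fi", "hot", "spot"]):
    m = sum(1 for term in request if term in keywords)
    print(" ".join(request), "->", keep_hit(m, len(request), len(keywords)))
```

With these stand-in keywords, the "wi cake sellers" and "red spot on
chin" requests are dropped by the M/R test, while "wi fi hot spot" is
kept.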

This is similar to something I did before (but not with Xapian) for
detecting potential duplicates between records. For our needs, 'too
low' was around 65%.

This is a crude approach, and it won't work if you use wildcards in the
query, but it worked reasonably well for us, and you can compute a new
score for the hit by combining the Xapian score, M/R and R/K.
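
One possible way to combine the three values is a simple product. This
is an assumption, not what we actually used: combined_score() is a
hypothetical helper, the Xapian score is assumed to be the percentage
score (0 to 100), and R/K is capped at 1 so that a request longer than
the keywords is not rewarded.

```python
def combined_score(xapian_percent, m_over_r, r_over_k):
    """Combine the Xapian percentage score with M/R and R/K.

    xapian_percent: percentage score of the hit, 0..100 (assumed input)
    m_over_r: fraction of request terms matching the hit
    r_over_k: request term count divided by keyword term count
    """
    # Cap R/K at 1.0: it exceeds 1 when the request has more terms
    # than the keywords field.
    return (xapian_percent / 100.0) * m_over_r * min(r_over_k, 1.0)


print(combined_score(80, 1.0, 0.5))
```

A weighted sum would work just as well; the product simply makes any
low factor pull the whole score down.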

Perhaps computing the Levenshtein distance between the request
and the keywords field would do the same on a more scientific basis...
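
For reference, a minimal sketch of the Levenshtein distance in plain
Python (the standard dynamic-programming version, nothing specific to
Xapian). The similarity() helper is a hypothetical addition that
normalizes the distance to the 0..1 range so that it can be compared
with a threshold.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    if len(a) < len(b):
        a, b = b, a               # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical inputs, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("wifi hotspot", "wi fi hot spot"))  # 2 (two spaces inserted)
```

The same function also accepts lists of words instead of strings, if a
word-level distance is preferred to a character-level one.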

my two cents...

Daniel