[Xapian-devel] Xapian Indexer problem.

Olly Betts olly at survex.com
Fri Nov 14 16:12:01 GMT 2008


On Fri, Nov 14, 2008 at 11:06:55AM +0800, liminghit wrote:
> Only "0 001a heathrow taxis" can have 100% matching.
> 
> A shorter or longer query should get less than 100% matching, right?

A longer query would, since (unless you repeat terms) it must contain
words which aren't in the document.

But otherwise no, and this behaviour is as intended.  The percentage is
not "percentage of document text matched"; it's a measure of "how well
your query matches this document".

If all the query terms match the highest scoring document, we give it
100%.  If not all the terms match the highest scoring document, we give
it a proportion of 100% based on the term weights.

And then we calculate percentage scores for all other documents based on
this assigned percentage value.
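
As a rough illustration of the scaling described above, here is a minimal
Python sketch.  It is a simplified model, not Xapian's actual code: the
function name, its parameters and the weights are all hypothetical, and it
assumes every weight is positive.

```python
def percentages(doc_weights, top_matched_term_weights, all_term_weights):
    """Simplified model of the percentage scaling (hypothetical helper).

    doc_weights: mapping of document id -> total weight for the query.
    top_matched_term_weights: weights of the query terms that the
        highest scoring document actually matches.
    all_term_weights: weights of every term in the query.
    """
    top_weight = max(doc_weights.values())
    # The highest scoring document gets 100% if it matches every query
    # term, otherwise a share of 100% based on the term weights.
    top_percent = 100.0 * sum(top_matched_term_weights) / sum(all_term_weights)
    # Every other document is scaled relative to that assigned value.
    return {doc: top_percent * w / top_weight for doc, w in doc_weights.items()}


# Made-up weights: the top document "A" matches all three query terms.
print(percentages({"A": 6.0, "B": 3.0}, [3.0, 2.0, 1.0], [3.0, 2.0, 1.0]))
# → {'A': 100.0, 'B': 50.0}
```

If the top document matched only the first two terms, it would be assigned
5/6 of 100% in this model, and the other documents would scale down from
there.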

Your definition seems unhelpful to me - in most uses the query is quite
a lot shorter than the document, and a 3-word query would score at most
0.3% (3/1000) for a 1000-word document.

> If I want to achieve this, how to do indexing?

You might be able to achieve something like what you describe at search
time by writing your own weighting scheme and making get_sumpart()
return 1/(unnormalised document length).
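
To show what such a scheme would compute, here is a minimal standalone
Python model.  It does not use the Xapian API: a real scheme would be a
Xapian::Weight subclass with further methods to implement, and the
signatures differ between releases.  The names sumpart() and score() are
hypothetical, and the unnormalised document length is assumed to be the
number of term occurrences in the document.

```python
def sumpart(doc_length):
    """Contribution of one matching term: 1 / (unnormalised document length)."""
    if doc_length == 0:
        return 0.0  # an empty document cannot match anything
    return 1.0 / doc_length


def score(query_terms, doc_terms):
    """Sum the per-term contributions over distinct matching query terms."""
    doc_length = len(doc_terms)  # assumed: one entry per term occurrence
    present = set(doc_terms)
    return sum(sumpart(doc_length) for t in set(query_terms) if t in present)


doc = ["0", "001a", "heathrow", "taxis"]
print(score(["0", "001a", "heathrow", "taxis"], doc))  # → 1.0
print(score(["001a", "heathrow", "taxis"], doc))       # → 0.75
```

In this model only a query containing every term of the document reaches
1.0, and a 3-word query against a 1000-word document scores about 0.003.
A document which repeats terms can never reach 1.0, since each distinct
query term is counted once.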

Cheers,
    Olly


