[Xapian-discuss] Revision 11671 cursory observations wrt sort performance
Henry
henka at cityweb.co.za
Sat Dec 6 15:42:58 GMT 2008
Hello all,
I've been trying out revision 11671 with Perl Search::Xapian and noted
the following interesting points:
a) The more items you add to MultiValueSorter() the slower it
performs. I presume this is to be expected, but this could be a
target for optimisation. I'm only using serialised numeric values to
sort on.
On a small (~55k docs, ~562MB) test corpus I get the following
(Keywords==number of search items):
Keywords   Hits     Time   Sort columns
-----------------------------------------
1          11,600   0.17   0
1          11,600   0.58   1  (~241% slower)
1          11,600   0.82   2
1          11,600   1.17   3
2          15,527   0.23   0
2          15,527   0.77   1  (~234% slower)
2          15,527   1.05   2
2          15,527   1.54   3
As you can see, performance drops _precipitously_ as soon as you start
sorting on your own fields and not only the internal score.
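For reference, the "% slower" figures above can be reproduced as the relative increase over the unsorted (0 sort columns) time. This is only a small sketch of that arithmetic; the function name is made up for illustration, and the timings are copied from the table.

```python
def percent_slower(baseline, measured):
    """Relative increase of `measured` over `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Timings from the table: 0 sort columns vs. 1 sort column.
one_keyword = percent_slower(0.17, 0.58)   # 1 keyword, 11,600 hits
two_keywords = percent_slower(0.23, 0.77)  # 2 keywords, 15,527 hits

print(f"1 keyword:  {one_keyword:.1f}% slower")
print(f"2 keywords: {two_keywords:.1f}% slower")
```

The figures quoted in the table are these values with the fractional part dropped.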
BTW, a minor point: if you call MultiValueSorter() with no args, the
process consumes 99% of CPU and doesn't return.
b) *Is* Xapian sorting through all 11-15k results above? With
performance an issue when sorting, I wonder. I vaguely recall an
index search approach which roughly did the following: since the
user will only ever possibly view (say) 1000 results, why bother
grinding through all 1 million results (or 11-15k in my tests above)
to sort them? I.e., only gather and collate those results (say, 1000)
with the highest scores (or those which have a particular 'field'
above a certain threshold), discard the rest, but still return a
"hit" total of X for display/informational purposes only. Or is
Xapian already doing this?
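To make the idea in (b) concrete, here is a rough sketch in plain Python. It is not Xapian code and makes no claim about how Xapian's matcher works internally. It assumes a hypothetical stream of (doc_id, score) pairs, keeps only the k best in a bounded min-heap, and still counts every hit so a total can be reported.

```python
import heapq


def top_k_with_total(scored_docs, k):
    """Return (total_hits, best) where best holds the k highest-scoring
    (doc_id, score) pairs in descending score order.

    scored_docs is a hypothetical iterable of (doc_id, score) pairs.
    """
    heap = []   # min-heap of (score, doc_id); heap[0] is the weakest kept hit
    total = 0
    for doc_id, score in scored_docs:
        total += 1  # every hit is counted, even if it is discarded below
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Better than the weakest kept hit: swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    best = sorted(heap, reverse=True)
    return total, [(doc_id, score) for score, doc_id in best]


hits = [(1, 0.5), (2, 0.9), (3, 0.1), (4, 0.7)]
total, best = top_k_with_total(hits, 2)
print(total, best)
```

With n hits this does O(n log k) work rather than a full O(n log n) sort, and only k entries are ever held for ordering. Sorting on several value columns would fit the same pattern by using a tuple of keys in place of the single score.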
Cheers
Henry