[Xapian-discuss] Revision 11671 cursory observations wrt sort performance

Sat Dec 6 15:42:58 GMT 2008

Hello all,

I've been trying out revision 11671 with Perl Search::Xapian and noted  
the following interesting points:

a)  The more items you add to MultiValueSorter(), the slower the
search performs.  I presume this is to be expected, but it could be a
target for optimisation.  I'm only sorting on serialised numeric
values.

On a small (~55k docs, ~562MB) test corpus I get the following
(Keywords = number of search terms in the query; Sort columns =
number of items added to MultiValueSorter()):

Keywords   Hits     Time     Sort columns
-----------------------------------------
1          11,600   0.17     0
1          11,600   0.58     1  (~241% slower)
1          11,600   0.82     2
1          11,600   1.17     3

2          15,527   0.23     0
2          15,527   0.77     1  (~234% slower)
2          15,527   1.05     2
2          15,527   1.54     3

As you can see, performance drops _precipitously_ as soon as you sort
on your own fields rather than only on the internal score.

Incidentally (not important): if you use MultiValueSorter() with no
args, the process consumes 99% of CPU and never returns.

b)  *Is* Xapian sorting through all 11-15k results above?  With
sorting performance an issue, I vaguely recall an index search
approach which roughly did the following: since the user will only
ever view (say) 1000 results, why bother grinding through all
1 million results (or the 11-15k in my tests above) to sort them?
That is, only gather and collate the (say) 1000 results with the
highest scores (or those which have a particular 'field' above a
certain threshold) and discard the rest, while still returning a
"hit" total of X for display/informational purposes only.  Or is
Xapian already doing this?
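
To make the idea in (b) concrete, here is a minimal Python sketch of
what I mean.  It is not Xapian code and says nothing about how Xapian
works internally; the function name top_k_hits, the (score, doc_id)
pairs and the random scores are all made up for illustration.  It
counts every hit but only keeps the best k in a small heap, so the
full result set is never sorted:

```python
import heapq
import random


def top_k_hits(scored_docs, k):
    """Return (total_hits, best_k) without sorting every hit.

    scored_docs is an iterable of (score, doc_id) pairs; both the
    name and the data layout are hypothetical.
    """
    total = 0
    heap = []  # min-heap holding at most k of the best pairs seen so far
    for item in scored_docs:
        total += 1  # every hit is counted for the displayed total
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Better than the weakest kept hit: replace it.
            heapq.heapreplace(heap, item)
    # Only the k survivors are fully sorted, best first.
    return total, sorted(heap, reverse=True)


# Made-up scores for 11,600 hits, matching the size of my first test.
random.seed(0)
hits = [(random.random(), doc_id) for doc_id in range(11600)]
total, best = top_k_hits(hits, 1000)
print(total, len(best))
```

Each hit costs at most one heap operation on a heap of size k, so
the work is roughly N log k rather than N log N for a full sort,
and the "hit" total is still available for display.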

Cheers
Henry