[Xapian-discuss] Revision 11671 cursory observations wrt sort performance

Sun Dec 7 22:35:02 GMT 2008

On Sat, Dec 06, 2008 at 05:42:58PM +0200, Henry wrote:
> I've been trying out revision 11671 with Perl Search::Xapian and noted  
> the following interesting points:

You don't seem to say whether you're using flint or chert, which is a
pretty important distinction.

Benchmarking flint on this stuff probably won't reveal anything very
interesting as we know it's slow when using a lot of values.

The issues which make flint slow here have been addressed in chert.  The
value streams still need to be plumbed more directly to the matcher to
take full advantage of this though.  So relative comparisons are
probably informative, but absolute ones aren't yet.

> a)  The more items you add to MultiValueSorter() the slower it  
> performs.  I presume this is to be expected, but this could be a  
> target for optimisation.  I'm only using serialised numeric values to  
> sort on.

I think this is to be expected - if you sort on two values, that's twice
as much data to fetch (less true for flint, which has to fetch all the
value data for a document together anyway, but it doesn't have to decode
it all still).

> On a small (~55k docs, ~562MB) test corpus I get the following  
> (Keywords==number of search items):
> 
> Keywords   Hits     Time     Sort columns
> -----------------------------------------
> 1          11,600   0.17     0
> 1          11,600   0.58     1  (~241% slower)
> 1          11,600   0.82     2
> 1          11,600   1.17     3
> 
> 2          15,527   0.23     0
> 2          15,527   0.77     1  (~234% slower)
> 2          15,527   1.05     2
> 2          15,527   1.54     3

What does "sort columns" == 0 mean?  Sort by relevance?

> As you can see, performance drops _precipitously_ as soon as you start  
> sorting on your own fields and not only the internal score.
> 
> btw, of no importance, but if you use MultiValueSorter() with no args,  
> the process consumes 99% of CPU and doesn't return.

That'll be a bug then!  Thanks for spotting it.

> b)  *Is* Xapian sorting through all 11-15k results above?  With  
> performance an issue when sorting, I wonder:  I seem to vaguely recall  
> an index search approach which roughly did the following:  since the  
> user will only ever possibly view (say) 1000 results, why bother  
> grinding through all 1 million results (or 10-15k in my tests above)  
> to sort, etc?  ie, only gather and collate those results (say, 1000)  
> with the highest scores (or those which have a particular 'field'  
> above a certain threshold), discarding the rest, but still returning a  
> "hit" total of X for display/informational purposes only... or is  
> Xapian already doing this?

I'm not totally sure what you're describing here.

We currently use nth_element() (O(n)) and then only sort the top
documents.  So the sorting part is O(m.log(m)) where m is the size of
the requested M-set.

The code is in matcher/multimatch.cc if you want to look for yourself.

Cheers,
    Olly