[Xapian-discuss] Random ordering from Python

Shane Evans shane at 3continents.net
Thu Jan 22 06:24:05 GMT 2009


Olly Betts wrote:
> On Wed, Jan 21, 2009 at 11:34:54PM +0100, amix wrote:
>   
>> I have tried to implement my own random weight, but that did not work
>> out. I would also like this random sorting to perform well and work on
>> big result sets.
>>     
>
> Implementing a random weighting scheme in Python should be possible,
> though the overhead of the callbacks might be an issue if you're working
> with a lot of data (I've never profiled, but it's a potential issue as
> there's at least one per query term per matching document).
>
> If you're happy using SVN trunk, then BoolWeight plus a PostingSource
> which returns a random weight boost between 0 and some fixed value
> should do the job.  That's one callback per matching document, which
> is better for long queries.
>   
I have a similar requirement and I discussed implementing a 
RandomPostingSource class with Richard last week. I'd also be calling it 
from Python, but was thinking of implementing the posting source in C++.

When we get around to implementing it, I'll happily make it available. 
In this particular case I don't expect a lot of documents to be ranked 
(always fewer than 100,000 and usually far fewer), so my performance 
requirements are different.

Perhaps you could reduce the size of the result set that you pick your 
random selection from? You could, for example, add a term to each 
document representing a fixed bucket that the document belongs to. 
When the estimated result size looks too large, restrict the query to a 
few randomly selected buckets.
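
A rough sketch of the bucket idea in Python, in case it helps. The
"XBUCKET" prefix, the bucket count of 64, and the helper names are all
made up for illustration. The bucket is derived from a CRC32 of some
unique document id so that it stays fixed across reindexing. I have not
tried this against a real database.

```python
import random
import zlib

BUCKET_PREFIX = "XBUCKET"  # hypothetical term prefix, pick any unused one
NUM_BUCKETS = 64           # hypothetical; tune to your collection size


def bucket_term(unique_id, num_buckets=NUM_BUCKETS):
    """Return the fixed bucket term for a document.

    Uses a stable hash (CRC32) of the document's unique id, so the same
    document always lands in the same bucket.
    """
    bucket = zlib.crc32(unique_id.encode("utf-8")) % num_buckets
    return "%s%d" % (BUCKET_PREFIX, bucket)


def buckets_to_search(estimated_matches, max_matches, num_buckets=NUM_BUCKETS):
    """Return how many buckets to keep so that the expected number of
    matches is at most roughly max_matches (always at least one bucket)."""
    if estimated_matches <= max_matches:
        return num_buckets
    return max(1, (num_buckets * max_matches) // estimated_matches)


def random_bucket_terms(count, num_buckets=NUM_BUCKETS, rng=random):
    """Pick `count` distinct buckets at random and return their terms."""
    chosen = rng.sample(range(num_buckets), count)
    return ["%s%d" % (BUCKET_PREFIX, b) for b in chosen]
```

At index time you would add bucket_term(...) to each document as a
boolean term. At search time, if get_matches_estimated() on a first MSet
looks too big, OR the terms from random_bucket_terms(...) together and
apply them to the original query with OP_FILTER.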


>> I could do random selects easily if counts were exact counts and not
>> estimates - so returning exact counts would also solve my problem. I
>> need performance though, so setting check_at_least to 1 million is
>> not a solution (unless it performs really well).
>>     
>
> It's probably worth investigating.  High check_at_least prevents various
> terminate early optimisations, but then it seems to me that so will
> anything which is picking random matches.  This would also avoid calling
> back to Python code.
>   
I'll be interested to hear how you get on.
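
For what it's worth, if the count does turn out to be exact (which I
believe is the case once check_at_least is at least the number of
matches), the random select itself is short. A minimal sketch, with a
helper name I made up:

```python
import random


def random_offsets(exact_count, how_many, rng=random):
    """Return up to `how_many` distinct result offsets, chosen uniformly
    at random from 0 .. exact_count - 1, in ascending order."""
    how_many = min(how_many, exact_count)
    return sorted(rng.sample(range(exact_count), how_many))
```

Each offset could then be used as the first argument to
Enquire.get_mset() with a page size of 1, at the cost of one match run
per selected document.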

Cheers,

Shane


