[Xapian-discuss] Random ordering from Python
Shane Evans
shane at 3continents.net
Thu Jan 22 06:24:05 GMT 2009
Olly Betts wrote:
> On Wed, Jan 21, 2009 at 11:34:54PM +0100, amix wrote:
>
>> I have tried to implement my own random weight, but that did not work
>> out. I would also like this random sorting to perform well and work on
>> big result sets.
>>
>
> Implementing a random weighting scheme in Python should be possible,
> though the overhead of the callbacks might be an issue if you're working
> with a lot of data (I've never profiled, but it's a potential issue as
> there's at least one per query term per matching document).
>
> If you're happy using SVN trunk, then BoolWeight plus a PostingSource
> which returns a random weight boost between 0 and some fixed value
> should do the job. That's one callback per matching document, which
> is better for long queries.
>
I have a similar requirement, and I discussed implementing a
RandomPostingSource class with Richard last week. I'd also be calling it
from Python, but was thinking of implementing the posting source in C++.
When we get around to implementing it, I'll happily make it available.
In this particular case I don't expect a lot of documents to be ranked
(always fewer than 100,000 and usually far fewer), so my performance
requirements are different.
Perhaps you could reduce the size of the result set that you randomly
select from? You could, for example, add a term to each document
representing a fixed bucket that the document belongs to. When the
estimated result size looks too large, restrict the query to a few
randomly selected buckets.
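
A rough sketch of the bucket idea in plain Python, in case it helps. The
"XBUCKET" prefix, the bucket count of 100 and both function names are made
up for illustration; nothing here is part of Xapian itself.

```python
import random
import zlib

BUCKET_PREFIX = "XBUCKET"  # hypothetical term prefix, pick your own
NUM_BUCKETS = 100          # hypothetical number of fixed buckets


def bucket_term(doc_key, num_buckets=NUM_BUCKETS):
    """Return the fixed bucket term for a document, derived from a stable key."""
    # crc32 gives the same value on every run, unlike hash() on strings.
    bucket = zlib.crc32(str(doc_key).encode("utf-8")) % num_buckets
    return "%s%d" % (BUCKET_PREFIX, bucket)


def choose_bucket_terms(estimated_matches, target_size,
                        num_buckets=NUM_BUCKETS, rng=random):
    """Return bucket terms to restrict the query to, or None if no
    restriction is needed."""
    if estimated_matches <= target_size:
        return None
    # Each bucket should hold roughly 1/num_buckets of the matches, so
    # keep just enough buckets to cover about target_size documents
    # (integer ceiling division).
    wanted = (num_buckets * target_size + estimated_matches - 1) // estimated_matches
    wanted = max(1, min(wanted, num_buckets))
    buckets = rng.sample(range(num_buckets), wanted)
    return ["%s%d" % (BUCKET_PREFIX, b) for b in sorted(buckets)]
```

At index time each document would get its bucket term with something like
doc.add_term(bucket_term(key)). At query time, if choose_bucket_terms()
returns a list, I'd expect the restriction to look roughly like
xapian.Query(xapian.Query.OP_FILTER, query,
xapian.Query(xapian.Query.OP_OR, terms)), but I haven't tested that here.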
>> I could do random selects easily if counts were exact counts and not
>> estimates - so returning exact counts would also solve my problem. I
>> need performance though, so setting check_at_least to 1 million is
>> not a solution (unless it performs really well).
>>
>
> It's probably worth investigating. A high check_at_least prevents various
> early-termination optimisations, but then it seems to me that so will
> anything which is picking random matches. This would also avoid calling
> back to Python code.
>
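For the exact-count route, the selection step itself is small. A minimal
sketch, assuming the count really is exact; random_offsets is a
hypothetical helper name, not a Xapian call.

```python
import random


def random_offsets(exact_count, how_many, rng=random):
    """Pick distinct result positions uniformly at random from
    0 .. exact_count - 1."""
    # Never ask for more positions than there are matches.
    how_many = min(how_many, exact_count)
    return sorted(rng.sample(range(exact_count), how_many))
```

Each offset could then be fetched with something like
enquire.get_mset(offset, 1, check_at_least), with check_at_least set high
enough that the match count is exact. How that performs on a large
database is exactly the open question above.
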
I'll be interested to hear how you get on.
Cheers,
Shane