sorting large msets

Olly Betts olly at survex.com
Tue Apr 3 21:59:07 BST 2018


On Sat, Mar 31, 2018 at 12:58:19AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > If you're just wanting the 200 newest, it'll be faster not to calculate
> > weights, so:
> > 
> > $enquire->set_sort_by_value(0, 1);
> > $enquire->set_weighting_scheme(new Xapian::BoolWeight());
> > 
> > For me, this drops the time from ~0.075 seconds to ~0.067 seconds (with
> > xapian-core 1.4.5).
> 
> Thanks, I can see how that helps.
> 
> > But even 0.075 seconds doesn't really seem "slow" to me.  What times
> > are you seeing?  If it's much slower, I'd make sure you're at least
> > using the latest 1.4.x release.
> 
> Roughly what you saw with $n = 100 (the default in my sample
> script).  The problem is time increases with DB size.  Setting
> $n to 1000 makes it roughly 0.750s.

That's what I'd expect - currently sorting by value will tend to a
linear limit as the total number of matches increases.

> > If you do want faster, the simplest solution is to arrange that the
> > document id order matches the document age order, and then you can
> > specify to just sort by that:
> > 
> > $enquire->set_weighting_scheme(new Xapian::BoolWeight());
> > $enquire->set_docid_order(Search::Xapian::ENQ_DESCENDING);
> 
> That would be tricky with emails being delivered out-of-order;
> not to mention old archives being imported + indexed.

This was the trick we used with search.gmane.org.  We mostly ignored
the archive import issue there, which wasn't ideal but a periodic
reindex cleaned it up.

You can probably assume that out-of-order delivery won't be more than a
certain amount out of order (and that small infelicities in date order
aren't a big deal, especially as "Date:" headers may come from skewed
clocks and threads give you ordering information that can contradict
"Date:" header order).

> > That's more like 0.053 seconds for 1.4.5 and 0.021 seconds for git
> > master with glass.
> > 
> > The reverse order (ENQ_ASCENDING) is really fast - about 0.0001 seconds.
> > This is because in that case we can just stop once we've found 200
> > matches.
> 
> So that sounds like it's O(1) and independent of how many
> documents are in the mset?

It won't be independent of the number of documents in the MSet -
something needs to be done for each document added to the MSet, so
clearly it'll be at least O(n) in that, but if n is fixed at 200
then the O() complexity isn't relevant.  It'll be fairly independent of
the total number of documents that match (and so of $n in your script).

> Would it be possible to teach Xapian to optimize its storage for
> certain queries so it can stop once it's found 200 matches?
> From what I recall, SQL implementations are pretty good at that.

Probably - e.g. tracking some sort of buckets for values would help here
as well as for some other uses of values.

Cheers,
    Olly



More information about the Xapian-discuss mailing list