[Xapian-discuss] Speed of range queries

Olly Betts olly at survex.com
Wed Oct 24 21:32:18 BST 2012


On Tue, Oct 23, 2012 at 10:00:07PM +0200, Dmitry Karasik wrote:
> > > 
> > > Xapian::Query((VALUE_RANGE 70 stringA stringB AND KEYvalue:(pos=1)))
> > > 
> > > takes 0.6 seconds, whereas this one
> > > 
> > > Xapian::Query((KEYvalue:(pos=1) AND VALUE_RANGE 70 stringA stringB))
> > > 
> > > takes 20 seconds. I'm in the process of isolating the problem, so that I can
> > 
> > The order shouldn't make a difference - internally the subqueries of a
> > tree of AND-like operators are gathered up and rearranged into a shape
> > which is likely to work most efficiently.
> > 
> > So if that's working properly, the only thing I can think is that the
> > two subqueries have the same estimated term frequency.  Does term
> > KEYvalue match a lot of documents?
> 
> Indeed yes, KEYvalue matches a lot, around 1/10th of the whole base, which is
> about 10^5-10^6 documents. The value #70 on the contrary is more or less
> unique, and the range captures about 50 documents. Does that qualify for the
> same estimated term frequency?

For a (non-degenerate) value range, we can't really estimate how many
documents the range matches, so we currently guess at N/2 (where N =
db.get_doccount()).

We can check if a given document is in the value range in O(log N) (or
often better amortised), but the cost of iterating through a value range
to find the matches is O(N) rather than O(how many are in the range).
So unless the term is very common, we want to scan the postlist for the
term and check the value range for each occurrence (rather than the
other way around).

So the crude N/2 estimate actually works well for the purposes of
optimising the query tree.

If the term frequency is exactly N/2, I can see that the order of the
original query might make a difference, as it's arbitrary which subquery
gets picked.  But that doesn't sound like what you have.

Can you try applying the attached patch and running the two cases?  This
will show the PostList tree which the query optimiser produces.

If rebuilding is a pain, you could try profiling the two cases and see
where the time is spent - that might give enough of a clue to work out
what is happening.  There are some tips on profiling on the wiki:

http://trac.xapian.org/wiki/ProfilingXapian

Cheers,
    Olly
-------------- next part --------------
A non-text attachment was scrubbed...
Name: postlist-debug.patch
Type: text/x-diff
Size: 627 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-discuss/attachments/20121024/5e0709a7/attachment.patch>


More information about the Xapian-discuss mailing list