[Xapian-discuss] get_matches_estimated and value range

Olly Betts olly at survex.com
Wed Oct 6 08:54:39 BST 2010


On Tue, Oct 05, 2010 at 06:06:03PM +0200, Luca Barbieri wrote:
> 2010/10/5 Olly Betts <olly at survex.com>
> 
> >  If you ask for 0 matches, no documents will be considered - the
> > estimates given in this case are based only on the statistics available
> > about the terms and values involved.
> 
> Can you explain please what kind of stats on values are involved? I'm not
> sure I understand..

If you use chert (the default database backend in 1.2.x) then for each
value slot it tracks the number of documents which have a (non-empty) value
set, and upper and lower bounds on the (non-empty) values.

Flint (the default in 1.0.x) doesn't track any per-slot statistics.
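
As an illustration of the per-slot statistics described above, here is a
minimal sketch in plain Python.  It is not Xapian's implementation: the
SlotStats class and its field names are hypothetical, chosen to mirror the
three figures mentioned (count of non-empty values, lower bound, upper
bound).  Values are compared as byte strings, and deletions are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotStats:
    """Hypothetical per-slot statistics (illustrative only, not Xapian code)."""
    value_freq: int = 0                  # documents with a non-empty value in the slot
    lower_bound: Optional[bytes] = None  # smallest non-empty value seen so far
    upper_bound: Optional[bytes] = None  # largest non-empty value seen so far

    def add(self, value: bytes) -> None:
        """Update the statistics for one document's value in this slot."""
        if not value:
            return  # empty values are not counted
        self.value_freq += 1
        if self.lower_bound is None or value < self.lower_bound:
            self.lower_bound = value
        if self.upper_bound is None or value > self.upper_bound:
            self.upper_bound = value


stats = SlotStats()
for v in [b"1286264160", b"", b"1286264249", b"1286264200"]:
    stats.add(v)
print(stats.value_freq, stats.lower_bound, stats.upper_bound)
```

Note that nothing here records how the values are distributed between the
two bounds, which is why these figures alone can't give a good estimate for
an arbitrary range.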

> If I set "VALUE_RANGE 0 1286264160..1286264249", is this checked for the
> estimate as the terms are?

No, since we have no idea how many times values in that range are set.
All we can tell from the statistics is that a range which falls entirely
outside the known bounds can't match anything, and that a range can't match
more than the number of non-empty values in slot 0.  These are the two
optimisations I was talking about: they aren't done in current releases,
but I'm in the process of adding them.
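
The two optimisations can be sketched as follows.  This is plain Python
written for illustration, not the code being added to Xapian; the function
name value_range_bounds and its parameters are assumptions, and values are
compared as byte strings with an inclusive range.

```python
def value_range_bounds(begin, end, value_freq, lower_bound, upper_bound):
    """Return (min, max) documents that could match begin..end on one slot.

    value_freq, lower_bound and upper_bound are the per-slot statistics;
    the names are hypothetical.
    """
    if value_freq == 0:
        return (0, 0)  # no document has a value in this slot
    if end < lower_bound or begin > upper_bound:
        return (0, 0)  # range lies entirely outside the known bounds
    # Otherwise we only know that at most value_freq documents can match.
    return (0, value_freq)


# Range entirely above the slot's upper bound: nothing can match.
outside = value_range_bounds(b"1286264160", b"1286264249",
                             5000, b"1200000000", b"1286000000")
# Range inside the bounds: capped by the number of non-empty values.
inside = value_range_bounds(b"1286264160", b"1286264249",
                            5000, b"1200000000", b"1290000000")
print(outside, inside)
```

When the range does overlap the bounds, the result stays as loose as
(0, value_freq), which is the case described in the question.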

We'd have to track occurrence statistics for particular subranges of values to
be able to do what you're after.  That might be worth doing, but we don't
currently do it.
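
To make the idea of subrange statistics concrete, here is one possible
shape for them: a fixed-bucket histogram per slot.  This is purely a
hypothetical sketch and does not exist in Xapian; the RangeHistogram class,
the choice of numeric values, and the bucket edges are all assumptions made
for the example.

```python
from bisect import bisect_right


class RangeHistogram:
    """Hypothetical per-slot histogram; bucket i covers [edges[i], edges[i+1])."""

    def __init__(self, edges):
        self.edges = list(edges)                    # sorted bucket boundaries
        self.counts = [0] * (len(self.edges) - 1)   # documents per bucket

    def add(self, value):
        """Count one document's value in the bucket that contains it."""
        i = bisect_right(self.edges, value) - 1
        if not 0 <= i < len(self.counts):
            raise ValueError("value outside the histogram's edges")
        self.counts[i] += 1

    def upper_bound(self, begin, end):
        """Max documents with begin <= value <= end: sum of overlapping buckets."""
        total = 0
        for i, count in enumerate(self.counts):
            lo, hi = self.edges[i], self.edges[i + 1]
            if lo <= end and begin < hi:
                total += count
        return total


hist = RangeHistogram([0, 100, 200, 300])
for v in [10, 50, 150, 250, 260]:
    hist.add(v)
print(hist.upper_bound(120, 180), hist.upper_bound(90, 210))
```

The bound is only as tight as the buckets are narrow, so there would be a
trade-off between the space used for the statistics and the quality of the
estimate.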

For a term, we know exactly how many documents it occurs in, and for
expressions we can calculate pretty good bounds and estimates based on
those frequencies for the terms involved.
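
As a rough illustration of how bounds and estimates can be derived from
term frequencies, consider two terms combined with AND or OR.  The bounds
below follow from simple counting; the estimate assumes the two terms occur
independently, which is an assumption made for this sketch and not
necessarily the exact calculation Xapian performs.  The function names are
hypothetical.

```python
def and_stats(freq_a, freq_b, doccount):
    """(min, estimate, max) documents matching A AND B; doccount must be > 0."""
    lo = max(0, freq_a + freq_b - doccount)   # overlap forced by pigeonhole
    hi = min(freq_a, freq_b)                  # can't exceed the rarer term
    est = freq_a * freq_b / doccount          # assumes independent terms
    return lo, est, hi


def or_stats(freq_a, freq_b, doccount):
    """(min, estimate, max) documents matching A OR B; doccount must be > 0."""
    lo = max(freq_a, freq_b)                  # at least the commoner term
    hi = min(doccount, freq_a + freq_b)       # no overlap at all
    est = freq_a + freq_b - freq_a * freq_b / doccount
    return lo, est, hi


print(and_stats(200, 50, 1000))
print(or_stats(200, 50, 1000))
```

With 1000 documents and term frequencies of 200 and 50, AND gives bounds of
0 to 50 and OR gives 200 to 250, which is far tighter than anything that
can be said about a value range from the slot statistics alone.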

> If I ask for more than 0 matches (or if I use checkatleast) the query
> slows down noticeably, and it seems that Xapian does a linear search over
> the documents matched by the terms.

Yes, it has to check the value for every document considered.  Values work
best for filtering, sorting, etc of results which are already restricted
by a subexpression involving terms.
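
The cost of the value check can be pictured with a small simulation in
plain Python (no Xapian calls; filtered_matches and the sample data are
made up for the example).  The work done is one value lookup per candidate
document, so the fewer candidates the term subexpression lets through, the
cheaper the range check becomes.

```python
def filtered_matches(candidate_docids, slot_values, begin, end):
    """Check the slot value of every candidate document.

    Returns (matching docids, number of value lookups performed).
    """
    matches, lookups = [], 0
    for docid in candidate_docids:
        lookups += 1
        value = slot_values.get(docid, b"")   # b"" means no value set
        if value and begin <= value <= end:
            matches.append(docid)
    return matches, lookups


slot_values = {1: b"1286264100", 2: b"1286264200",
               3: b"1286264230", 4: b"1286264300"}
begin, end = b"1286264160", b"1286264249"

# Candidates already restricted by a term: only two lookups are needed.
print(filtered_matches([2, 4], slot_values, begin, end))
# No term restriction: every document's value has to be checked.
print(filtered_matches([1, 2, 3, 4], slot_values, begin, end))
```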

> > The estimate (and min/max) should be 0 when the value range falls
> > completely outside the [lower bound, upper bound] range.  Currently that
> > isn't checked for, but I'm just testing a fix, and will commit it
> > shortly assuming the rest of the testsuite passes.
> >
> > The bounds also don't make use of the count of set values which the
> > chert backend stores - if there's no value set in a slot for some
> > documents, that can be used to reduce the maximum number of documents
> > which can match a value range on that slot.  I'll take a look at making
> > use of this once the above change is committed.
> 
> OK, but my query does some checks on the requested value ranges, and I'm
> sure that I am searching for values within the [lower bound, upper bound]
> range of the database

Yes, this probably won't help your case, but it's a missing easy optimisation
which will help some cases.

Cheers,
    Olly


