[Xapian-discuss] Very far out and static get_matches_estimated

Olly Betts olly at survex.com
Thu Jun 11 06:00:33 BST 2009


On Thu, Jun 11, 2009 at 12:29:06AM +0100, Matthew Somerville wrote:
> So the estimate is wildly out for all pages until we get to the actual 
> number of results. Changing the sort to relevance instead of reverse 
> date gives a different far out number, but the effect is the same. 
> Without the date range limiting, the initial estimate is 43,612, and 
> this slowly changes as I up the page count until it gets to the correct 
> result of 43,537 (good initial estimate!), as I'd expect.

Well, the estimate is an estimate, and may be far from the true value.
While it might not be helpful, if it's >= lower_bound and <=
upper_bound, then it's "working".  You can look at the bounds to see
how wrong it might be:

http://trac.xapian.org/wiki/FAQ/MoreAccurateEstimates

If you don't want it to be way out when there are 362 matches, setting
checkatleast to 363 or more will address that.  By default, Xapian
assumes you are more interested in getting the result fast than having
a very accurate estimate of how many there are.
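
As a rough sketch of the above using the Python bindings: the helper name
page_with_bounds is hypothetical, and it assumes `enquire` is an already
configured xapian.Enquire object.  It passes checkatleast as the third
argument to get_mset() and returns the estimate together with its bounds,
so you can see how wrong the estimate might be.

```python
def page_with_bounds(enquire, first, pagesize, checkatleast=0):
    """Fetch one page of results and report the estimate with its bounds.

    `enquire` is assumed to be a configured xapian.Enquire object
    (query, sort order and so on already set by the caller).
    """
    # The third argument asks the matcher to consider at least this many
    # matching documents before giving up on counting.
    mset = enquire.get_mset(first, pagesize, checkatleast)
    return {
        "lower": mset.get_matches_lower_bound(),
        "estimated": mset.get_matches_estimated(),
        "upper": mset.get_matches_upper_bound(),
    }
```

For the case above, something like page_with_bounds(enquire, 0, 10, 363)
would be the call to try; a larger checkatleast costs more time per query.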

The particular problem here is that we don't have a good way to estimate
what proportion of a value range will match, so currently we just guess
arbitrarily that it will match half the documents it sees.

In 1.0 it's not easy to do better.  We could monitor what proportion of
the documents the value range is checked against actually match, but
unfortunately using that information would need some big changes to how
things work.  Perhaps assuming that it matches 1/10 of the documents
would be better - in most cases, underestimating is better than
overestimating, I suspect.

In 1.1, chert keeps lower and upper bounds on the values in each slot and
knows how many documents have a value set there.  Together with the bounds
of the value range, that could be used to make an estimate assuming an
even spread of values, but we don't currently do so.  The key thing needed
is a function to efficiently calculate how far through a string range a
given string is (e.g. "b" is 0.5 of "a".."c").
Essentially, base 256 fixed point arithmetic...
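
A minimal sketch of that idea in Python, for illustration only (this is
not Xapian code, and all the function names and the `digits` parameter are
made up here).  Each string is read as a base 256 fixed point fraction,
truncated or zero-padded to a fixed number of bytes, and the position in
the range is then a simple ratio.  The second function applies the
even-spread assumption to turn slot statistics into a match estimate.

```python
def _as_fixed_point(s: bytes, digits: int) -> int:
    """Read the first `digits` bytes of s as a big-endian base 256 number.

    Shorter strings are padded with zero bytes, so b"a" and b"a\\0" get
    the same value; that is an approximation of true string ordering.
    """
    return int.from_bytes(s[:digits].ljust(digits, b"\0"), "big")


def position_in_range(s: bytes, lo: bytes, hi: bytes, digits: int = 8) -> float:
    """Return how far through lo..hi the string s is, clamped to [0, 1]."""
    a = _as_fixed_point(lo, digits)
    b = _as_fixed_point(hi, digits)
    x = _as_fixed_point(s, digits)
    if b <= a:
        # Empty or degenerate range: nothing sensible to interpolate.
        return 0.0
    return min(max((x - a) / (b - a), 0.0), 1.0)


def estimate_range_matches(value_freq: int, slot_lo: bytes, slot_hi: bytes,
                           range_lo: bytes, range_hi: bytes) -> int:
    """Estimate matches for range_lo..range_hi assuming evenly spread values.

    value_freq is the number of documents with a value in the slot, and
    slot_lo/slot_hi are the bounds on the values stored there.
    """
    start = position_in_range(range_lo, slot_lo, slot_hi)
    end = position_in_range(range_hi, slot_lo, slot_hi)
    return round(value_freq * max(end - start, 0.0))


print(position_in_range(b"b", b"a", b"c"))                    # 0.5
print(estimate_range_matches(1000, b"a", b"e", b"b", b"c"))   # 250
```

Using only the first 8 bytes keeps the arithmetic cheap but means strings
sharing a long common prefix are not distinguished; stripping the common
prefix of the bounds first would be one way round that.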

Cheers,
    Olly


