[Xapian-discuss] double counting

Olly Betts olly at survex.com
Mon Nov 3 01:20:42 GMT 2014


On Sun, Nov 02, 2014 at 02:45:40PM -0800, Mir Siadaty wrote:
> A question regarding the count of matching documents returned by
> Xapian:
> When I use OR to create a query of a term and some of its
> related synonyms, it appears to me xapian double-counts some docs.
> The extreme example is the count of query ‘term1’ versus ‘term1
> OR term1’. I assumed the two queries should return the same count, but the
> second query returns twice as many.
> Have I missed anything?

I assume you're talking about the numbers returned by
MSet::get_matches_estimated()?  If so, you should note the word
"estimated" in the method name.

In this case, the estimated number of occurrences of a OR b is evaluated
based on the assumption that a and b occur independently of one another
(and then it may get clamped based on information we get from actually
running the match).

The assumption of independence is clearly particularly untrue when a and
b are both the same term, but we don't currently try to detect this
special case.  It could be done, though I think it would be more useful
to handle a broader class of situations via a form of
common-subexpression elimination (CSE).

Incidentally, it's not a double count in general - e.g. if you have 100
documents and term1 occurs in 50, the estimate for term1 OR term1 would
be 75.  But when term frequency << collection size, it will be just
under double, and rounding may often make it exactly double.
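
To make the arithmetic concrete, here is a minimal sketch of the
independence assumption in Python.  It is not Xapian's actual code: the
function name estimate_or is made up for illustration, and it ignores the
clamping against information gathered while running the match.

```python
def estimate_or(freq_a, freq_b, collection_size):
    """Estimate the matches for `a OR b`, assuming a and b occur independently.

    Hypothetical helper for illustration only; not part of the Xapian API.
    """
    p_a = freq_a / collection_size
    p_b = freq_b / collection_size
    # Under independence: P(a OR b) = P(a) + P(b) - P(a) * P(b)
    return collection_size * (p_a + p_b - p_a * p_b)


# 100 documents, term1 in 50 of them: the estimate for term1 OR term1 is 75.
print(estimate_or(50, 50, 100))

# Term frequency << collection size: just under double, rounds to double.
print(round(estimate_or(50, 50, 1_000_000)))
```

The subtracted P(a) * P(b) term is the expected overlap; when both terms
are rare it becomes negligible, which is why the estimate looks like a
plain sum of the two term frequencies.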

You may also find the discussion in the FAQ useful:

http://trac.xapian.org/wiki/FAQ/MoreAccurateEstimates

Cheers,
    Olly


