[Xapian-discuss] ValueCountMatchSpy with collapse_key

Olly Betts olly at survex.com
Tue Oct 9 03:01:10 BST 2012

On Sun, Oct 07, 2012 at 02:24:52PM -0400, Matthew Story wrote:
> Don't mind submitting a patch if the behavior is indeed undesirable.

I'm not sure it's desirable, but I think it's going to be hard to avoid.

The last candidate MSet entry considered could knock out any earlier
entry with a non-empty collapse key, so we'd have to buffer up the best
entry seen for each collapse key in order to be able to pass them all to
the MatchSpy.  If every considered document has a unique collapse key
(i.e. no collapsing happens) that means buffering *every* document.
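
To make the buffering cost concrete, here is a rough sketch in plain Python. It is not Xapian's matcher or API: the function name, the (docid, weight, collapse_key) tuples and the `spy` callable are all hypothetical, and it assumes the highest-weighted entry wins for each collapse key.

```python
# Illustrative only: plain Python, not Xapian's matcher.
# Each candidate is a hypothetical (docid, weight, collapse_key) tuple;
# an empty collapse key means the candidate is never collapsed.

def collapse_then_spy(candidates, spy):
    """Buffer the best candidate per collapse key, then pass survivors to spy.

    Returns the number of buffered entries, i.e. the memory cost.
    """
    best = {}  # collapse_key -> best candidate seen so far
    for cand in candidates:
        docid, weight, key = cand
        if not key:
            # No later candidate can knock this one out, so pass it on now.
            spy(cand)
            continue
        if key not in best or weight > best[key][1]:
            best[key] = cand
    # Only once every candidate has been considered do we know the survivors.
    for cand in best.values():
        spy(cand)
    return len(best)
```

With one distinct non-empty key per document, the buffer ends up holding every candidate, which is the worst case described above.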

We could buffer a limited number, but that would have to be >=
checkatleast (if that's set), since we document that setting checkatleast
to the number of documents in the database causes you to see them all.

Changing this documented behaviour is an option, but it's a useful
feature, so it seems unhelpful to break it.

So I wonder if it's best to clearly document that the MatchSpy operates
before collapsing (i.e. current reality).
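
A toy illustration of what "operates before collapsing" means for the counts, again in plain Python rather than Xapian's API. The `match` function and the (docid, weight, collapse_key, value) tuples are made up for the example, and it assumes the highest-weighted document is kept for each key.

```python
from collections import Counter

def match(candidates):
    """Return (spy_counts, collapsed_docids).

    candidates: hypothetical (docid, weight, collapse_key, value) tuples.
    The spy tallies `value` for every candidate, before any collapsing.
    """
    spy_counts = Counter()
    best = {}
    for docid, weight, key, value in candidates:
        spy_counts[value] += 1  # the spy sees the document pre-collapse
        # Documents with an empty key each get a bucket of their own.
        bucket = (key,) if key else (None, docid)
        if bucket not in best or weight > best[bucket][1]:
            best[bucket] = (docid, weight)
    ranked = sorted(best.values(), key=lambda entry: -entry[1])
    return spy_counts, [docid for docid, _ in ranked]
```

For two "red" documents sharing a collapse key plus one "blue" document, the spy reports red=2 while the collapsed result only contains one red document.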

> Question is, what is the right approach to resolving this.  Should the
> ValueCountMatchSpy be provided with the ability to ignore or respect
> collapse, and then internally to itself track the collapse state based
> on a collapse key provided to operator?

In general, tracking this inside ValueCountMatchSpy is going to have the
same problems with potentially needing to buffer vast quantities of data.

If you're working in a situation where the collapsed documents are
simply duplicates, things are much easier - you can just take the
first with a particular collapse key and so you only need to track
which collapse keys you have seen, so you can ignore subsequent
documents with the same key.
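
For the duplicates case, a minimal sketch of the "remember which keys you've seen" approach, in plain Python. The class name and its call signature are hypothetical; a real version would presumably subclass Xapian::MatchSpy and read both the collapse key and the counted value from the document's value slots.

```python
from collections import Counter

class FirstPerKeyValueCounter:
    """Count values, but only for the first document seen with each collapse key.

    Illustrative only: assumes documents sharing a collapse key are
    duplicates, so it does not matter which of them gets counted.
    """

    def __init__(self):
        self.seen_keys = set()   # memory grows with distinct keys, not documents
        self.counts = Counter()

    def __call__(self, collapse_key, value):
        if collapse_key:
            if collapse_key in self.seen_keys:
                return           # duplicate of a document already counted
            self.seen_keys.add(collapse_key)
        # Documents with an empty key are never collapsed, so always count them.
        self.counts[value] += 1
```

This only needs to store the set of keys, not the documents themselves, but it gives the wrong answer if documents sharing a key can carry different values.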

