[Xapian-discuss] ValueCountMatchSpy with collapse_key
Olly Betts
olly at survex.com
Tue Oct 9 03:01:10 BST 2012
On Sun, Oct 07, 2012 at 02:24:52PM -0400, Matthew Story wrote:
> Don't mind submitting a patch if the behavior is indeed undesirable.
I'm not sure it's desirable, but I think it's going to be hard to avoid.
The last candidate MSet entry considered could knock out any earlier
entry with a non-empty collapse key, so we'd have to buffer up the best
entry seen for each collapse key in order to be able to pass them all to
the MatchSpy. If every considered document has a unique collapse key
(i.e. no collapsing happens) that means buffering *every* document
considered.
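To illustrate the buffering described above, here is a rough standalone sketch. It is plain C++, not Xapian's matcher code, and the names Candidate and CollapseBuffer are hypothetical. It assumes "best" simply means highest weight. Entries with an empty collapse key can go to the spy at once, but one entry per distinct non-empty key has to be held until matching ends.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a candidate MSet entry; not a Xapian type.
struct Candidate {
    unsigned docid;
    double weight;
    std::string collapse_key;  // empty means the entry is never collapsed
};

// Holds the best (highest weight, by assumption) candidate seen so far
// for each non-empty collapse key.
class CollapseBuffer {
  public:
    // Returns true if the candidate could be passed to a spy immediately
    // (empty collapse key). Otherwise it is buffered, because a later
    // candidate with the same key may still knock it out.
    bool add(const Candidate& c) {
        if (c.collapse_key.empty()) return true;
        auto it = best_.find(c.collapse_key);
        if (it == best_.end()) {
            best_.emplace(c.collapse_key, c);
        } else if (c.weight > it->second.weight) {
            it->second = c;  // later entry replaces the earlier one
        }
        return false;
    }

    // Number of entries held back: one per distinct non-empty key.
    std::size_t buffered() const { return best_.size(); }

    // Buffered survivors once matching has finished, in no particular order.
    std::vector<Candidate> survivors() const {
        std::vector<Candidate> out;
        for (const auto& kv : best_) out.push_back(kv.second);
        return out;
    }

  private:
    std::unordered_map<std::string, Candidate> best_;
};
```

If every candidate has a different non-empty key, buffered() ends up equal to the number of candidates considered, which is the worst case mentioned above.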
We could buffer a limited number, but that limit would have to be >=
checkatleast (if that's set), since we document that setting
checkatleast to the number of documents in the database causes you to
see them all:
http://xapian.org/docs/apidoc/html/classXapian_1_1Enquire.html#43b54489c53d26a98d3fde3f1d3aa14f
Changing this documented behaviour is an option, but it's a useful
feature, so it seems unhelpful to break it.
So I wonder if it's best to clearly document that the MatchSpy operates
before collapsing (i.e. current reality).
> Question is, what is the right approach to resolving this. Should the
> ValueCountMatchSpy be provided with the ability to ignore or respect
> collapse, and then internally to itself track the collapse state based
> on a collapse key provided to operator?
In general, tracking this inside ValueCountMatchSpy is going to have the
same problems with potentially needing to buffer vast quantities of
results.
If you're working in a situation where the collapsed documents are
simply duplicates, things are much easier - you can just take the
first document with a particular collapse key, so you only need to
track which collapse keys you have seen and ignore subsequent
occurrences.
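A minimal sketch of that duplicates-only approach, again as standalone C++ rather than real Xapian code. DedupValueCounter and observe() are hypothetical names. In Xapian itself the same logic would presumably sit in a MatchSpy subclass, with both strings read from the document's value slots via Document::get_value(), but that wiring is an assumption and is not shown here. Skipping empty facet values is a choice made for this sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>

// Counts facet values, but only for the first document seen with each
// non-empty collapse key. The caller is assumed to have already read
// both strings from the document.
class DedupValueCounter {
  public:
    void observe(const std::string& collapse_key,
                 const std::string& facet_value) {
        // insert().second is false when the key was already present,
        // i.e. this document duplicates one we have counted.
        if (!collapse_key.empty() &&
            !seen_keys_.insert(collapse_key).second) {
            return;
        }
        // Sketch-level choice: documents with no facet value are skipped.
        if (!facet_value.empty()) ++counts_[facet_value];
    }

    const std::map<std::string, std::size_t>& counts() const {
        return counts_;
    }

  private:
    std::unordered_set<std::string> seen_keys_;
    std::map<std::string, std::size_t> counts_;
};
```

Memory use here grows with the number of distinct collapse keys seen, not with the number of documents, and nothing needs to be held back until the end of the match. It is only correct when it doesn't matter which of the collapsed documents gets counted.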
Cheers,
Olly