[Xapian-tickets] [Xapian] #670: double set_collapse_key with different values
Xapian
nobody at xapian.org
Mon May 4 06:16:28 BST 2015
#670: double set_collapse_key with different values
-------------------------+---------------------------
Reporter: Sadrak | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Matcher | Version: 1.2.20
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+---------------------------
Comment (by olly):
Sorry for failing to respond for ages.
I read this when it was reported, and thought there was a reason why we
don't already support this, but couldn't quite remember all the details.
So I left it a while to think about and it slipped my mind until I saw
this in the ticket list today.
I'm not sure I can remember all the details still, but this is probably
enough to be useful:
For a normal collapse, we just maintain a "prototype" MSet, containing the
best N documents we've seen so far. This prototype MSet contains a most
`count` documents with any particular collapse key. Any document which
isn't as good as the worst entry in this prototype MSet can simply be
ignored, and if we're sorting primarily by relevance, there are various
short-cuts we can perform based on that minimum weight (e.g. an `OR` can
be turned into an `AND` once the maximum weight either side can return is
less than the minimum required to make the prototype MSet).
Each time we have a new candidate document which is better ranked the the
lowest ranked document in the prototype MSet, we look at its collapse key.
If we have fewer than `count` documents with that collapse key, we can
just add it to the prototype MSet and drop the lowest ranked document
instead. If we have `count` documents already, if it ranks lower than all
of them we just discard it - otherwise we add the document and discard the
lowest ranked of those we already have with the collapse key.
One problem with collapsing on two different aspects is that we lose this
one-in/one-out property of the process. With the double collapse you
want, each candidate document can knock out two existing documents which
are in the prototype MSet, so we'd have to keep a much larger prototype
set to make sure we haven't discarded a document we might end up needing.
I think we would need to keep as many extra documents as there are
potential candidates left to consider. In general, that's up to
`Database.get_doccount()` less the number of documents already considered.
For some queries we can probably bound that more tightly, but it's still a
lot more than we usually keep.
And because we're keeping a lot more documents around, the minimum weight
required to make it into the prototype MSet will stay much lower, which
will limit the ability of the matcher to short-cut.
So there's more data to keep around, and less chance to short-cut the
process, so this is likely to incur quite a performance hit compared to a
single collapse.
But your situation is actually a special one - one of your collapse keys
is a prefix of the other, so the "collapse sets" of one will always be
subsets of collapse sets of the other. I've a feeling that means it's
possible to handle this case more efficiently than the general multi-
collapse case, though I've not looked at it in detail. I'm also not sure
what the API for it might look like.
--
Ticket URL: <http://trac.xapian.org/ticket/670#comment:2>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list