[Xapian-tickets] [Xapian] #670: double set_collapse_key with different values

Mon May 4 06:16:28 BST 2015

#670: double set_collapse_key with different values
-------------------------+---------------------------
 Reporter:  Sadrak       |             Owner:  olly
     Type:  enhancement  |            Status:  new
 Priority:  normal       |         Milestone:
Component:  Matcher      |           Version:  1.2.20
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+---------------------------

Comment (by olly):

 Sorry for failing to respond for ages.

 I read this when it was reported, and thought there was a reason why we
 don't already support this, but couldn't quite remember all the details.
 So I left it a while to think about and it slipped my mind until I saw
 this in the ticket list today.

 I'm not sure I can remember all the details still, but this is probably
 enough to be useful:

 For a normal collapse, we just maintain a "prototype" MSet, containing the
 best N documents we've seen so far.  This prototype MSet contains a most
 `count` documents with any particular collapse key.  Any document which
 isn't as good as the worst entry in this prototype MSet can simply be
 ignored, and if we're sorting primarily by relevance, there are various
 short-cuts we can perform based on that minimum weight (e.g. an `OR` can
 be turned into an `AND` once the maximum weight either side can return is
 less than the minimum required to make the prototype MSet).

 Each time we have a new candidate document which is better ranked the the
 lowest ranked document in the prototype MSet, we look at its collapse key.
 If we have fewer than `count` documents with that collapse key, we can
 just add it to the prototype MSet and drop the lowest ranked document
 instead.  If we have `count` documents already, if it ranks lower than all
 of them we just discard it - otherwise we add the document and discard the
 lowest ranked of those we already have with the collapse key.

 One problem with collapsing on two different aspects is that we lose this
 one-in/one-out property of the process.  With the double collapse you
 want, each candidate document can knock out two existing documents which
 are in the prototype MSet, so we'd have to keep a much larger prototype
 set to make sure we haven't discarded a document we might end up needing.
 I think we would need to keep as many extra documents as there are
 potential candidates left to consider.  In general, that's up to
 `Database.get_doccount()` less the number of documents already considered.
 For some queries we can probably bound that more tightly, but it's still a
 lot more than we usually keep.

 And because we're keeping a lot more documents around, the minimum weight
 required to make it into the prototype MSet will stay much lower, which
 will limit the ability of the matcher to short-cut.

 So there's more data to keep around, and less chance to short-cut the
 process, so this is likely to incur quite a performance hit compared to a
 single collapse.

 But your situation is actually a special one - one of your collapse keys
 is a prefix of the other, so the "collapse sets" of one will always be
 subsets of collapse sets of the other.  I've a feeling that means it's
 possible to handle this case more efficiently than the general multi-
 collapse case, though I've not looked at it in detail.  I'm also not sure
 what the API for it might look like.

--
Ticket URL: <http://trac.xapian.org/ticket/670#comment:2>
Xapian <http://xapian.org/>
Xapian