[Xapian-tickets] [Xapian] #394: Speed up phrase queries with a "settling pond"

Sat May 28 22:17:19 BST 2022

#394: Speed up phrase queries with a "settling pond"
-------------------------+-------------------------------
 Reporter:  Olly Betts   |             Owner:  Olly Betts
     Type:  enhancement  |            Status:  assigned
 Priority:  normal       |         Milestone:  1.5.0
Component:  Matcher      |           Version:  git master
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Comment (by Olly Betts):

 I found a somewhat newer version of the patch from 2013 on my dev box so
 attaching in case it's useful.  I think this is probably the version of
 the patch Gustavo tested.

 It would be good to put this ticket to bed, either by concluding that the
 idea is no longer useful in the light of the numerous improvements to
 phrase search performance we've made in the meantime, or by merging an
 updated implementation of the idea.

 Looking over the 2013 patch, some things come to mind.

 * It's a shame that it only helps when the positional check is "and-like
 with the root" - e.g. `"some phrase" OR "other words"` isn't helped at
 all.  I can see that a subquery could read ahead a bit and have its own
 pond, but it gets more complicated with things like
 `count_matching_subqs()` needing to work for the "current" entry and I'm
 not sure how feasible it would actually be.  Probably best to retry the
 idea at the top level and if it's helps in that case think harder about
 whether it's doable for a subquery.

 * It could probably still use `read_position_list()` instead of creating
 new `PositionList` objects every time, with a little rejigging so that the
 docid to open for is specifiable.  That would reduce memory allocation
 overhead and mean that we might not have to deal in the term names.

 * The data structure for the pond could be improved.  Comments in the
 patch suggest a min/max heap.  Essentially we need to be able to
 efficiently add items, find and remove the entry with the highest weight,
 and remove all entries below a specified weight; the data structure has a
 fixed max number of entries which is known in advance
-- 
Ticket URL: <https://trac.xapian.org/ticket/394#comment:18>
Xapian <https://xapian.org/>
Xapian