[Xapian-tickets] [Xapian] #682: Collapsing when sorting by value unnecessarily slow

Xapian nobody at xapian.org
Wed Jun 10 02:59:23 BST 2015


#682: Collapsing when sorting by value unnecessarily slow
----------------------------+--------------------
        Reporter:  olly     |      Owner:  olly
            Type:  defect   |     Status:  new
        Priority:  normal   |  Milestone:  1.3.x
       Component:  Matcher  |    Version:  1.2.20
        Severity:  normal   |   Keywords:
      Blocked By:           |   Blocking:
Operating System:  All      |
----------------------------+--------------------
 Gareth Rees reported on IRC that collapsing when sorting by value could be
 very slow, and this got worse as `offset` was increased (approximately
 linear in `offset` from a graph he showed).

 = Background =

 Once the matcher has found enough matching results (where enough is
 `offset + msize`), it keeps the best `offset + msize` in a heap, with the
 worst one at the tip of the heap (sometimes called a "min heap").

 Each new matching document is compared to the tip of that heap - if it's
 worse, it's discarded; if it's better, the current tip is popped and the
 new document added to the heap.

 When collapsing, to replace a document with one with the same collapse key
 but which ranks more highly, we currently do a linear scan of the vector
 the heap is over to find the old document, and replace it in the vector
 with the better one (which means the vector is no longer a valid heap),
 and flag the heap as needing a rebuild before we next use it as a heap.

 So if collapsing replaces a lot documents, the heap gets rebuilt a lot.

 If `offset` or `msize` is set high, the heap is large (since it has `msize
 + offset` entries).

 Sorting by value makes this worse - if you sort by relevance the match can
 often terminate early, which it can't for sorting by value.  Also sorting
 by value probably makes the comparisons while rebuilding the heap more
 expensive.

 = Solution =

 There's a FIXME comment in matcher/multimatch.cc which notes this (in less
 detail):

 {{{
 #!cpp
                             // We can replace an arbitrary element in
 O(log N)
                             // but have to do it by hand (in this case the
 new
                             // elt is bigger, so we just swap down the
 tree).
                             // FIXME: implement this, and clean up is_heap
                             // handling
                             *i = new_item;
                             pushback = false;
                             is_heap = false;
 }}}


 As the comment notes, the replacement is possible in `O(log N)`
 operations, but this is not an operation which the heap in the C++ STL
 provides.

 But if there's a suitably licensed heap template (boost may have
 something, or maybe one of the open source C++ STLs is suitably licensed)
 we could copy that in and add the operation we want to it.

 Marking as found in 1.2.20, but the code has been this way for a long
 time.  I don't think anyone has reported this as causing performance
 issues in practice before now.

 Setting milestone 1.3.x initially.

--
Ticket URL: <http://trac.xapian.org/ticket/682>
Xapian <http://xapian.org/>
Xapian



More information about the Xapian-tickets mailing list