[Xapian-tickets] [Xapian] #682: Collapsing when sorting by value unnecessarily slow
Xapian
nobody at xapian.org
Wed Jun 10 02:59:23 BST 2015
#682: Collapsing when sorting by value unnecessarily slow
----------------------------+--------------------
Reporter: olly | Owner: olly
Type: defect | Status: new
Priority: normal | Milestone: 1.3.x
Component: Matcher | Version: 1.2.20
Severity: normal | Keywords:
Blocked By: | Blocking:
Operating System: All |
----------------------------+--------------------
Gareth Rees reported on IRC that collapsing when sorting by value could be
very slow, and this got worse as `offset` was increased (approximately
linear in `offset` from a graph he showed).
= Background =
Once the matcher has found enough matching results (where enough is
`offset + msize`), it keeps the best `offset + msize` in a heap, with the
worst one at the tip of the heap (sometimes called a "min heap").
Each new matching document is compared to the tip of that heap - if it's
worse, it's discarded; if it's better, the current tip is popped and the
new document added to the heap.
When collapsing, to replace a document with one with the same collapse key
but which ranks more highly, we currently do a linear scan of the vector
the heap is over to find the old document, and replace it in the vector
with the better one (which means the vector is no longer a valid heap),
and flag the heap as needing a rebuild before we next use it as a heap.
So if collapsing replaces a lot documents, the heap gets rebuilt a lot.
If `offset` or `msize` is set high, the heap is large (since it has `msize
+ offset` entries).
Sorting by value makes this worse - if you sort by relevance the match can
often terminate early, which it can't for sorting by value. Also sorting
by value probably makes the comparisons while rebuilding the heap more
expensive.
= Solution =
There's a FIXME comment in matcher/multimatch.cc which notes this (in less
detail):
{{{
#!cpp
// We can replace an arbitrary element in
O(log N)
// but have to do it by hand (in this case the
new
// elt is bigger, so we just swap down the
tree).
// FIXME: implement this, and clean up is_heap
// handling
*i = new_item;
pushback = false;
is_heap = false;
}}}
As the comment notes, the replacement is possible in `O(log N)`
operations, but this is not an operation which the heap in the C++ STL
provides.
But if there's a suitably licensed heap template (boost may have
something, or maybe one of the open source C++ STLs is suitably licensed)
we could copy that in and add the operation we want to it.
Marking as found in 1.2.20, but the code has been this way for a long
time. I don't think anyone has reported this as causing performance
issues in practice before now.
Setting milestone 1.3.x initially.
--
Ticket URL: <http://trac.xapian.org/ticket/682>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list