xapian 1.4 performance issue
Olly Betts
olly at survex.com
Thu Dec 7 22:03:22 GMT 2017
On Thu, Dec 07, 2017 at 10:29:09AM +0100, Jean-Francois Dockes wrote:
> Recoll builds snippets by partially reconstructing documents out of index
> contents.
>
[...]
>
> The specific operation which has become slow is opening many term position
> lists, each quite short.
The difference will actually be chert vs glass, rather than 1.2 vs 1.4
as such (glass is the new backend in 1.4 and now the default).
This is a consequence of the change to the ordering within the position
list table. In chert, it was keyed on (documentid, term), so all the
position lists for a document were together - good spatial locality for
what you are doing. In glass, it is keyed on (term, documentid), so
all the position lists for a term are now together, which gives good
spatial locality for queries: a phrase search wants the positional data
for a small number of terms in (potentially) many documents, and the
more documents positional data is wanted for, the better the locality
of access. And indeed this change delivered a big improvement for
previously very slow phrase search cases.
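To make the access pattern concrete: reconstructing a document means
opening one (typically short) position list per distinct term, roughly
like this (a minimal untested sketch, not Recoll's actual code, and
ignoring term prefixes and stemmed forms):

    #include <xapian.h>
    #include <map>
    #include <string>

    // Rebuild an approximation of the document text by walking the
    // term list and recording each term at each of its positions.
    std::string reconstruct(Xapian::Database& db, Xapian::docid did) {
        std::map<Xapian::termpos, std::string> by_pos;
        for (Xapian::TermIterator t = db.termlist_begin(did);
             t != db.termlist_end(did); ++t) {
            // Under glass each of these opens a position list keyed
            // (term, docid), so consecutive iterations can seek to
            // distant parts of the table - poor locality when cold.
            for (Xapian::PositionIterator p = db.positionlist_begin(did, *t);
                 p != db.positionlist_end(did, *t); ++p) {
                by_pos[*p] = *t;
            }
        }
        std::string text;
        for (std::map<Xapian::termpos, std::string>::const_iterator i =
                 by_pos.begin(); i != by_pos.end(); ++i) {
            if (!text.empty()) text += ' ';
            text += i->second;
        }
        return text;
    }

Under chert all those position lists sat next to each other in the
table; under glass each can live somewhere different, hence the
cold-index I/O cost you're seeing.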
I'm not sure there's much we can do about this directly - changing the
order back would undo the speed-up for slow phrase searches. And while
recreating documents like this is a neat idea, making positional data
faster for how it's intended to be used seems more critical.
> In a quite typical example, the abstract generation time has gone from 100 ms
> to 90 s (SECONDS), on a cold index. Users don't like it: they think that
> the application is dead, which is what triggered the user reports.
>
> The TL;DR is that Recoll is unusable with Xapian 1.4.
>
> I don't know why I had not seen it earlier - probably because I always work
> with warm indexes, as this is an I/O issue.
>
> Any idea how I can work around this?
Some options:
* Explicitly create chert databases (see the first snippet after this
  list). This isn't a good long-term option (chert is already removed
  in git master) but would ease the immediate user pain while you work
  on a better solution.
* Store the extracted text (e.g. in the document data, which will be
  compressed using zlib for you - see the second snippet below).
  That's more data to store, but a benefit is that you get
  capitalisation and punctuation back. You can reasonably limit the
  amount stored, as the chances of finding a better snippet tend to
  tail off for large documents.
* Store the extracted text, but compressed using the document's term
  list as a dictionary - either with an existing algorithm which
  supports a dictionary (e.g. zlib) or with a custom algorithm. This
  potentially gives better compression than the previous option,
  though large documents will benefit less. I'd suggest a quick
  prototype (such as the zlib sketch below) to see if it's worth the
  effort.
* If you're feeding the reconstructed text into something which
  dynamically selects a snippet, then that could potentially be driven
  from a lazy merge of the positional data (e.g. a min-heap of
  PositionIterators - see the last sketch below). If the snippet
  selector terminates once it's found a decent snippet then this
  avoids decoding all the positional data, but it still has to read
  the first position of every term, so it doesn't really address the
  data locality issue. It also has to handle positional data for many
  terms in parallel, and the larger working set may not fit in the CPU
  cache.
* Re-extract the text for documents you display, possibly with a
  caching layer if you have a high search load (I doubt a single-user
  search tool like Recoll would need one). If you have some slow
  extractors (e.g. OCR) then you could store the text from those -
  perhaps decide based on how long extraction took at index time, with
  a threshold users can tune according to how much they care about
  extra disk usage. An added benefit is that you get to show the
  current version of the document, rather than the version which was
  last indexed. This seems like a good option for Recoll.
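For the first option, you can still ask 1.4 for chert explicitly when
creating the database (this only works while your Xapian build retains
chert support):

    #include <xapian.h>

    // Explicitly request the legacy chert backend (1.4.x only - it's
    // already removed in git master).
    Xapian::WritableDatabase db("/path/to/index",
                                Xapian::DB_CREATE_OR_OPEN |
                                Xapian::DB_BACKEND_CHERT);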
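The second option is close to a one-liner at index time. Assuming
extracted_text holds the converter output, and with an arbitrary cap of
50000 bytes (tune to taste):

    // Store capped extracted text in the document data - glass
    // compresses the document data with zlib for you.
    Xapian::Document doc;
    doc.set_data(extracted_text.substr(0, 50000));
    // ... then add terms and postings, and add/replace as usual.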
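For the third option, here's what the compression side of a quick
prototype might look like, priming zlib with the document's terms
concatenated in a reproducible order (so the decompressor can rebuild
the identical dictionary from the term list). Untested sketch:

    #include <zlib.h>
    #include <stdexcept>
    #include <string>

    // Deflate `text` using `dict` as a preset dictionary.  zlib uses
    // at most the last 32KB of the dictionary.
    std::string compress_with_dict(const std::string& text,
                                   const std::string& dict) {
        z_stream strm = z_stream();
        if (deflateInit(&strm, Z_BEST_COMPRESSION) != Z_OK)
            throw std::runtime_error("deflateInit failed");
        deflateSetDictionary(&strm,
                             reinterpret_cast<const Bytef*>(dict.data()),
                             static_cast<uInt>(dict.size()));
        std::string out(deflateBound(&strm, text.size()), '\0');
        strm.next_in = reinterpret_cast<Bytef*>(
            const_cast<char*>(text.data()));
        strm.avail_in = static_cast<uInt>(text.size());
        strm.next_out = reinterpret_cast<Bytef*>(&out[0]);
        strm.avail_out = static_cast<uInt>(out.size());
        if (deflate(&strm, Z_FINISH) != Z_STREAM_END) {
            deflateEnd(&strm);
            throw std::runtime_error("deflate failed");
        }
        out.resize(strm.total_out);
        deflateEnd(&strm);
        return out;
    }

On the read side, inflate() will return Z_NEED_DICT, at which point you
call inflateSetDictionary() with the same dictionary rebuilt from the
term list.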
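And for the fourth option, a sketch of the lazy merge (untested; the
snippet selection itself is left as a stub):

    #include <xapian.h>
    #include <queue>
    #include <string>
    #include <vector>

    // One heap entry per term: the term's next unconsumed position.
    struct PosEntry {
        Xapian::termpos pos;
        std::string term;
        Xapian::PositionIterator it, end;
    };

    struct PosEntryGreater {
        bool operator()(const PosEntry& a, const PosEntry& b) const {
            return a.pos > b.pos; // min-heap ordered by position
        }
    };

    void merge_positions(Xapian::Database& db, Xapian::docid did) {
        std::priority_queue<PosEntry, std::vector<PosEntry>,
                            PosEntryGreater> heap;
        // Seeding the heap still reads the first position of every
        // term, so the locality issue isn't avoided here.
        for (Xapian::TermIterator t = db.termlist_begin(did);
             t != db.termlist_end(did); ++t) {
            Xapian::PositionIterator p = db.positionlist_begin(did, *t);
            Xapian::PositionIterator e = db.positionlist_end(did, *t);
            if (p != e) heap.push(PosEntry{*p, *t, p, e});
        }
        while (!heap.empty()) {
            PosEntry top = heap.top();
            heap.pop();
            // Feed (top.pos, top.term) to the snippet selector here,
            // and break out early once it has a decent snippet.
            ++top.it;
            if (top.it != top.end) {
                top.pos = *top.it;
                heap.push(top);
            }
        }
    }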
Cheers,
Olly