Using multiple temporary indexes during updates
Olly Betts
olly at survex.com
Fri Mar 15 21:50:01 GMT 2024
On Fri, Mar 15, 2024 at 08:15:55PM +0100, Jean-Francois Dockes wrote:
> I have been playing at converting the index update stage of the Recoll indexer to use
> multiple temporary indexes and a final merge.
>
> This yields an improvement factor of almost 3 (on my quad-core CPU), for the total
> indexing time for "easy" files like HTML pages. This is nice (!) and I wanted to share my
> admiration for the "compact()" method.
>
> If someone is interested in a bit more detail:
> https://www.recoll.org/pages/idxthreads/threadingRecoll.html#_the_xapian_bottleneck_and_how_it_was_resolved_thanks_to_xapian
Nice write-up!

It'd be helpful to note the Xapian version you're using for such
benchmarking, as the results are likely to evolve over time.

Also, are you using Xapian::DBCOMPACT_MULTIPASS? The linked page doesn't
seem to say.

In theory multipass compaction should be faster when merging many
databases, but Tom Mortimer reported he found it slower. That was a long
time ago, but I've never managed to get around to profiling to see what
was going on, or whether it's even still the case (it probably makes most
sense to do that at the same time as implementing
https://trac.xapian.org/ticket/444 ).

Incidentally, for the "fork() on a large process is slow" bit at the
end, posix_spawn() may help, assuming it's flexible enough to do what
you want. The glibc implementation calls "clone(2) with CLONE_VM and
CLONE_VFORK flags", so the child shares the parent's address space
instead of getting its own copy of the page tables.
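
A minimal sketch of what that could look like, assuming the command only
needs an argument vector and the inherited environment. The run_helper()
name is hypothetical; file actions and spawn attributes are left as null
here, which is where posix_spawn() may or may not turn out to be flexible
enough:

```cpp
#include <spawn.h>
#include <sys/wait.h>

#include <cerrno>
#include <string>
#include <vector>

extern "C" char** environ;

// Hypothetical helper: run a command with posix_spawnp() and wait for it.
// Returns the child's exit status, or -1 if it could not be spawned or
// did not exit normally.
int run_helper(const std::vector<std::string>& args) {
    if (args.empty()) return -1;

    // Build the null-terminated argv array that posix_spawnp() expects.
    std::vector<char*> argv;
    for (const std::string& a : args) {
        argv.push_back(const_cast<char*>(a.c_str()));
    }
    argv.push_back(nullptr);

    // No file actions or attributes: the child inherits descriptors and
    // signal settings as it would after a plain fork()/exec().
    pid_t pid;
    int err = posix_spawnp(&pid, argv[0], nullptr, nullptr,
                           argv.data(), environ);
    if (err != 0) return -1;

    // Wait for the child, retrying if interrupted by a signal.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_helper({"sh", "-c", "exit 3"}) runs the shell and hands
back its exit status.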
Cheers,
Olly