Using multiple temporary indexes during updates
Jean-Francois Dockes
jf at dockes.org
Sat Mar 16 07:52:18 GMT 2024
Olly Betts writes:
> On Fri, Mar 15, 2024 at 08:15:55PM +0100, Jean-Francois Dockes wrote:
> > I have been playing at converting the index update stage of the Recoll indexer to use
> > multiple temporary indexes and a final merge.
> >
> > This yields an improvement factor of almost 3 (on my quad-core CPU), for the total
> > indexing time for "easy" files like HTML pages. This is nice (!) and I wanted to share my
> > admiration for the "compact()" method.
> >
> > If someone is interested in a bit more detail:
> > https://www.recoll.org/pages/idxthreads/threadingRecoll.html#_the_xapian_bottleneck_and_how_it_was_resolved_thanks_to_xapian
>
> Nice write-up!
>
> It'd be helpful to note the Xapian version you're using for such
> benchmarking as the results are likely to evolve over time.
It's the default Ubuntu Jammy Xapian package: 1.4.18. I updated the page.
> Also are you using Xapian::DBCOMPACT_MULTIPASS? The linked page doesn't
> seem to say.
I am using the default WritableDatabase::compact(targetdir)
https://framagit.org/medoc92/recoll/-/blob/multiwritedbs/src/rcldb/rcldb.cpp?ref_type=heads#L1114
> In theory it should be faster when merging many databases, but Tom
> Mortimer reported he found it slower. That was a long time ago, but
> I've never managed to get around to profiling to see what was going
> on or if it was even still the case (probably makes most sense to do at
> the same time as implementing https://trac.xapian.org/ticket/444 ).
I could give DBCOMPACT_MULTIPASS a try, but the merge time is so small compared to the
creation/update time that it should not make much difference anyway. Also, regarding
ticket/444, with my modest amounts of data, on SSDs, I did not feel that the process was
I/O bound: it was using very close to 100% CPU during compact().
> Incidentally, for the "fork() on a large process is slow" bit at the
> end, posix_spawn() may help assuming it's flexible enough to do what
> you want. The glibc implementation calls "clone(2) with CLONE_VM and
> CLONE_VFORK flags".
Oops, the last paragraph was actually completely obsolete. There is a whole other document
on the subject: https://www.recoll.org/pages/idxthreads/forkingRecoll.html
The TL;DR being: carefully use vfork() + permanent external processes.
Last time I checked, posix_spawn() was exactly equivalent to vfork/exec and had an issue
concerning descriptor management. More detail in the page above if you feel like wasting
a little more time :)
Cheers,
jf
More information about the Xapian-discuss mailing list