Using multiple temporary indexes during updates

Jean-Francois Dockes jf at dockes.org
Sat Mar 16 07:52:18 GMT 2024


Olly Betts writes:
 > On Fri, Mar 15, 2024 at 08:15:55PM +0100, Jean-Francois Dockes wrote:
 > > I have been playing at converting the index update stage of the Recoll indexer to use
 > > multiple temporary indexes and a final merge.
 > > 
 > > This yields an improvement factor of almost 3 (on my quad-core CPU), for the total
 > > indexing time for "easy" files like HTML pages. This is nice (!) and I wanted to share my
 > > admiration for the "compact()" method.
 > > 
 > > If someone is interested in a bit more detail:
 > > https://www.recoll.org/pages/idxthreads/threadingRecoll.html#_the_xapian_bottleneck_and_how_it_was_resolved_thanks_to_xapian
 > 
 > Nice write-up!
 > 
 > It'd be helpful to note the Xapian version you're using for such
 > benchmarking as the results are likely to evolve over time.

It's the default Ubuntu Jammy Xapian package: 1.4.18. I updated the page.

 > Also are you using Xapian::DBCOMPACT_MULTIPASS?  The linked page doesn't
 > seem to say.

I am using WritableDatabase::compact(targetdir) with the default flags, so no DBCOMPACT_MULTIPASS:
https://framagit.org/medoc92/recoll/-/blob/multiwritedbs/src/rcldb/rcldb.cpp?ref_type=heads#L1114
 
 > In theory it should be faster when merging many databases, but Tom
 > Mortimer reported he found it slower.  That was a long time ago, but
 > I've never managed to get around to profiling to see what was going
 > on or if it was even still the case (probably makes most sense to do at
 > the same time as implementing https://trac.xapian.org/ticket/444 ).

I could give DBCOMPACT_MULTIPASS a try, but the total time is so heavily dominated by the
creation/update stage that a faster merge should not make much difference anyway. Also,
regarding ticket/444, with my modest amounts of data, on SSDs, I did not feel that the
process was I/O bound: it was using very close to 100% CPU during compact().

 > Incidentally, for the "fork() on a large process is slow" bit at the
 > end, posix_spawn() may help assuming it's flexible enough to do what
 > you want.  The glibc implementation calls "clone(2) with CLONE_VM and
 > CLONE_VFORK flags".

Oops, the last paragraph was actually completely obsolete; there is a whole other document
on the subject: https://www.recoll.org/pages/idxthreads/forkingRecoll.html

The TL;DR being: carefully use vfork() + permanent external processes.

Last time I checked, posix_spawn() was exactly equivalent to vfork/exec and had an issue
concerning descriptor management. More detail in the page above if you feel like wasting
a little more time :)
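
For readers who have not used posix_spawn(), here is a minimal sketch of what the
descriptor handling looks like in practice. It is not the Recoll code: the helper name
run_and_capture() is hypothetical, and the assumption is simply that the parent wants to
read the child's standard output through a pipe. The file actions are where descriptors
have to be managed explicitly: only the descriptors named there are touched, so anything
else the process has open is inherited by the child unless it is marked close-on-exec.

```cpp
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

#include <string>
#include <vector>

extern char **environ;

// Hypothetical helper (not from Recoll): run args[0] with the given
// arguments, append its standard output to `out`, and return the child's
// exit status, or -1 if it could not be started or did not exit normally.
int run_and_capture(const std::vector<std::string>& args, std::string& out)
{
    if (args.empty())
        return -1;

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    // File actions are executed in the child before the exec:
    // make the pipe's write end the child's stdout, then close both
    // original pipe descriptors so that they do not leak into the command.
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_adddup2(&fa, fds[1], STDOUT_FILENO);
    posix_spawn_file_actions_addclose(&fa, fds[0]);
    posix_spawn_file_actions_addclose(&fa, fds[1]);

    // Build a NULL-terminated argv array pointing into `args`.
    std::vector<char*> argv;
    for (const auto& a : args)
        argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    // posix_spawnp() searches PATH for argv[0], like execvp().
    pid_t pid;
    int rc = posix_spawnp(&pid, argv[0], &fa, nullptr, argv.data(), environ);
    posix_spawn_file_actions_destroy(&fa);

    // The parent must close its copy of the write end, or read() below
    // would never see end-of-file.
    close(fds[1]);
    if (rc != 0) {
        close(fds[0]);
        return -1;
    }

    // Drain the child's output until it closes its end of the pipe.
    char buf[4096];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        out.append(buf, static_cast<size_t>(n));
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling run_and_capture({"echo", "hello"}, out) should leave the command's output in
out and return its exit status. Descriptors opened elsewhere in a large process are not
covered by this sketch; they would each need FD_CLOEXEC or their own addclose action.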

Cheers,

jf



More information about the Xapian-discuss mailing list