[Xapian-discuss] Optimal usage of xapian-compact for merging

Olly Betts olly at survex.com
Wed Feb 3 05:40:00 GMT 2010


On Tue, Feb 02, 2010 at 02:49:46PM +0200, Henry C. wrote:
> I've been wondering, what's the sane/optimal use of xapian-compact when
> merging many indexes with a view to maximum merging performance?
> 
> The obvious:
> - only use -F on the final db.

That's not totally obvious, but is unlikely to make much difference either way.

> - use -m since I'm merging more than 3 dbs.

Someone reported that -m was slower for them, but it was certainly a win for me.
It does do more work, but without it the postlist table is built by a single
N-way merge, which scatters reads a lot.  So -m is essentially an attempt to
avoid being so I/O bound.

> Best strategy?
> a)  loop:  merge batches (of say 50, where the individual db's are small)
> into a temp index, then merge the (larger) temp into the final product...
> end-loop
> 
> b)  loop:  merge batches (of say 50, where the individual db's are small)
> into many temp indexes... end-loop
> Then merge those (larger) temps into the final product.

Or just merge all the databases in a single invocation.

I don't have figures to compare these, and it may vary according to your
data, OS, FS, and/or hardware, so all I can really suggest is to try the
different approaches and see.  Do report if you find anything interesting.
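
For anyone wanting to run that comparison, here is a minimal Python sketch
that only builds the xapian-compact command lines for the single-invocation
approach and for strategy (b).  The helper names, the batch size of 50 and the
"tmp-merge-N" directory names are made up for illustration; the only options
used are the -m and -F discussed in this thread.

```python
from typing import List


def compact_cmd(sources: List[str], dest: str, multipass: bool = True,
                fuller: bool = False) -> List[str]:
    """Build an argv of the form: xapian-compact [options] SOURCE... DEST."""
    cmd = ["xapian-compact"]
    if multipass:
        cmd.append("-m")
    if fuller:
        cmd.append("-F")
    return cmd + list(sources) + [dest]


def single_pass(sources: List[str], dest: str) -> List[List[str]]:
    """Everything merged in one invocation."""
    return [compact_cmd(sources, dest, fuller=True)]


def batched(sources: List[str], dest: str, batch_size: int = 50,
            tmp_prefix: str = "tmp-merge-") -> List[List[str]]:
    """Strategy (b): merge batches into temps, then merge the temps.

    -F is only passed on the final merge; tmp_prefix is a hypothetical name.
    """
    cmds, temps = [], []
    for i in range(0, len(sources), batch_size):
        tmp = f"{tmp_prefix}{len(temps)}"
        cmds.append(compact_cmd(sources[i:i + batch_size], tmp))
        temps.append(tmp)
    cmds.append(compact_cmd(temps, dest, fuller=True))
    return cmds
```

Each returned list could then be passed to subprocess.run(cmd, check=True)
with time.perf_counter() calls around it to get wall-clock figures for your
own data and hardware.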

Currently the grouping under -m is fairly crude: postlists are just merged
in pairs (plus one group of three if there is an odd number), and then the
merged lists are remerged in the same way until we have just one.  That may
be reasonable even for mismatched sizes.
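
As a toy illustration of that grouping (not the actual Xapian code), the
sketch below treats each postlist as a sorted list of integers and merges them
in passes.  The function names are hypothetical, and real postlists are keyed
by term and involve docid renumbering, which is ignored here.

```python
import heapq
from typing import List


def group_pairs(items: list) -> List[list]:
    """Split items into pairs, with one group of three if the count is odd."""
    groups = [items[i:i + 2] for i in range(0, len(items) - 1, 2)]
    if len(items) % 2 == 1 and groups:
        # Odd count: fold the leftover item into the last pair.
        groups[-1].append(items[-1])
    return groups or [items]


def multipass_merge(postlists: List[List[int]]) -> List[int]:
    """Merge sorted lists in passes of pairwise merges until one remains."""
    current = [list(p) for p in postlists]
    while len(current) > 1:
        # Each pass roughly halves the number of lists.
        current = [list(heapq.merge(*group)) for group in group_pairs(current)]
    return current[0] if current else []
```

With five inputs the first pass forms one pair and one group of three, and
the second pass merges the two results.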

It would probably be significantly faster not to use a Btree for the
intermediate stages, but just to serialise them to a flat file, since we
only ever reread them in order.  That would only make a difference when
merging more than 3 databases though.

I should file a ticket for it - it would make a fairly self-contained project
for someone wanting to hack on Xapian without needing to understand much of the
internals.

> Finally, presumably it's best to use the same blocksize (-b) as the
> underlying filesystem?  I see the default is 8K, but the default blocksize
> on (eg) ext3 is 4k...  or am I way off here?

It should certainly not be smaller than the hardware blocksize (or else you
need to read the existing disk-block in order to write a Xapian-block).  A
multiple is fine though, and larger blocks are a bit more efficient.  I did
some tests a year or so ago which suggested 16KB might be slightly better than
8KB, but it is sufficiently close that it didn't seem to justify changing the
default.
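
To check this on a given machine, a rough Python sketch along these lines
could be used.  It is Unix-only (os.statvfs), and it takes f_bsize as the
filesystem blocksize, which is an assumption that may not hold everywhere.
The 2K to 64K power-of-two range is my understanding of what -b accepts, so
check xapian-compact --help for your version.  The function names are
hypothetical.

```python
import os


def fs_blocksize(path: str) -> int:
    """Blocksize reported by the filesystem holding path (Unix only)."""
    return os.statvfs(path).f_bsize


def valid_xapian_blocksize(size: int) -> bool:
    """Assumed -b constraint: a power of two between 2K and 64K."""
    return 2048 <= size <= 65536 and size & (size - 1) == 0


def pick_blocksize(fs_block: int, preferred: int = 8192) -> int:
    """Return preferred if it is a valid multiple of fs_block,
    otherwise the smallest valid size that is a multiple of fs_block."""
    if valid_xapian_blocksize(preferred) and preferred % fs_block == 0:
        return preferred
    size = 2048
    while size <= 65536:
        if size % fs_block == 0:
            return size
        size *= 2
    raise ValueError(f"no valid blocksize is a multiple of {fs_block}")
```

For a 4K filesystem this keeps the 8K default, and pick_blocksize(4096, 16384)
would select the 16KB size mentioned above.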

Cheers,
    Olly
