[Xapian-discuss] Optimal usage of xapian-compact for merging

Henry C. henka at cityweb.co.za
Wed Feb 3 09:12:44 GMT 2010


On Wed, February 3, 2010 07:40, Olly Betts wrote:
> Or just merge all the databases in a single invocation.

Merging several hundred thousand dbs in a single invocation presents a
spot of bother :)

> I don't have figures to compare these, and it may vary according to your
> data, OS, FS, and/or hardware, so all I can really suggest is to try the
> different approaches and see.  Do report if you find anything
> interesting.

Looks like I've found a sweet spot with merging batches of 50 - but will
try more.
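
For anyone wanting to try the same thing, here is a minimal sketch of the batching, assuming each source database is a directory and that xapian-compact takes the source paths followed by the destination path. The helper names, the staging directory and the batch_NNNN naming are hypothetical. The code only builds the command lines; it does not run them.

```python
from pathlib import Path


def chunked(paths, size):
    """Yield consecutive slices of at most `size` paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def build_compact_commands(sources, staging_dir, batch_size=50, multipass=True):
    """Return one xapian-compact argv list per batch of source databases.

    Each batch is merged into staging_dir/batch_NNNN (hypothetical naming).
    """
    commands = []
    for index, batch in enumerate(chunked(list(sources), batch_size)):
        dest = str(Path(staging_dir) / f"batch_{index:04d}")
        argv = ["xapian-compact"]
        if multipass:
            argv.append("-m")  # multipass merge, as discussed in this thread
        argv.extend(str(p) for p in batch)
        argv.append(dest)  # destination database goes last
        commands.append(argv)
    return commands
```

Each argv list could then be passed to subprocess.run(argv, check=True), and the same function applied again to the batch outputs until a single database is left. The batch size of 50 is just the value that worked here, not a recommendation.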

> Currently the grouping under -m is fairly crude - postlists are just
> merged in pairs (plus a three if there are an odd number), and then the
> merged lists are remerged in the same way until we have just one, but that
> may be reasonable even for mismatched sizes.
>
> It would probably be significantly faster not to use a Btree for the
> intermediate stages, but just serialise it to a flat file - we will end up
> rereading it in order.  That would only make a difference when merging
> more than 3 databases though.
>
> I should file a ticket for it - it would make a fairly self-contained
> project for someone wanting to hack on Xapian without needing to
> understand much of the internals.

What kind of improvement do you think we'll see?
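
To check my understanding of the grouping described for -m, here is a rough sketch in Python. It is not Xapian's code: postlists are modelled as plain sorted lists, and the function names are made up for illustration. Lists are merged in pairs (the last group takes three when the count is odd), and the results are remerged the same way until one list remains.

```python
import heapq


def group_pairs(items):
    """Split items into pairs; with an odd count the last group has three."""
    if len(items) < 2:
        return [list(items)] if items else []
    even = len(items) - len(items) % 2
    groups = [list(items[i:i + 2]) for i in range(0, even, 2)]
    if len(items) % 2:
        groups[-1].append(items[-1])  # the "plus a three" case
    return groups


def multipass_merge(postlists):
    """Merge sorted lists group by group; return (merged list, pass count)."""
    current = [list(p) for p in postlists]
    passes = 0
    while len(current) > 1:
        # Each group's merged output is an intermediate result that the
        # next pass reads back in order.
        current = [list(heapq.merge(*group)) for group in group_pairs(current)]
        passes += 1
    return (current[0] if current else []), passes
```

With 2 or 3 inputs this finishes in a single pass with no intermediate results, which matches the remark that a flat-file intermediate would only matter when merging more than 3 databases.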

>
>> Finally, presumably it's best to use the same blocksize (-b) as the
>> underlying filesystem?  I see the default is 8K, but the default
>> blocksize on (e.g.) ext3 is 4K...  or am I way off here?
>
> It should certainly not be smaller than the hardware blocksize (or else
> you need to read the existing disk-block in order to write a
> Xapian-block).  A multiple is fine though, and larger blocks are a bit
> more efficient.  I did some tests a year or so ago which suggested 16KB
> might be slightly better than 8KB, but it is sufficiently close that it
> didn't seem to justify changing the default.
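
A small sketch of that rule, for checking a candidate -b value against the filesystem. Two assumptions: the 2K to 64K power-of-two range is what I believe xapian-compact documents for -b, and os.statvfs reports the filesystem block size (Unix only), which is not necessarily the hardware block size mentioned above. The function names are hypothetical.

```python
import os


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def is_suitable_blocksize(xapian_bs, fs_bs, minimum=2048, maximum=65536):
    """Check a candidate -b value against a filesystem block size.

    The minimum/maximum/power-of-two limits are an assumption about
    what xapian-compact accepts; adjust them if your version differs.
    """
    return (minimum <= xapian_bs <= maximum
            and is_power_of_two(xapian_bs)
            and xapian_bs >= fs_bs        # not smaller than the disk block
            and xapian_bs % fs_bs == 0)   # a whole multiple of it


def filesystem_blocksize(path="."):
    """Block size reported by statvfs for the filesystem holding `path`."""
    return os.statvfs(path).f_bsize


if __name__ == "__main__":
    fs_bs = filesystem_blocksize(".")
    for candidate in (4096, 8192, 16384):
        print(candidate, is_suitable_blocksize(candidate, fs_bs))
```

On a filesystem with 4K blocks, both the 8K default and 16K pass this check; which is faster in practice still has to be measured.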

Thanks for the considered response.

Regards
Henry



