[Xapian-discuss] Optimal usage of xapian-compact for merging

Kevin Duraj kevin.softdev at gmail.com
Tue Mar 23 17:46:15 GMT 2010


Henry, Olly,

I am merging 300 indexes at once. The merge takes less than a day for
100 million documents, and I notice very heavy I/O while it runs.
Tomorrow I am planning to install a new Seagate Barracuda XT hard drive
(2TB, 7200 RPM, SATA 6Gb/s, 64MB cache) on the find1friend.com server to
replace my old 1TB Barracuda, which is running out of space. My old
system runs CentOS 5 with a 1KB disk block size and hosts two Xapian
indexes of around 150 million documents; it runs fairly fast, as you can
see: http://find1friend.com/  I might not be able to use SATA 6Gb/s
without an additional interface, but let's see what happens. I don't
want to set my datacenter on fire; my co-location providers are very
nice to me. :-)

tune2fs -l /dev/sda1
Block size:               1024
Fragment size:            1024
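
Olly's advice quoted below is that the Xapian block size should not be
smaller than the underlying block size, and that a multiple of it is
fine. A minimal Python sketch of that check against tune2fs output
follows. The function names are made up for illustration, and using the
filesystem block size as a stand-in for the "hardware blocksize" Olly
mentions is an assumption.

```python
import re


def filesystem_block_size(tune2fs_output: str) -> int:
    """Extract the 'Block size:' value (in bytes) from `tune2fs -l` text."""
    match = re.search(r"^Block size:\s*(\d+)\s*$", tune2fs_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Block size:' line found")
    return int(match.group(1))


def xapian_block_ok(xapian_block: int, fs_block: int) -> bool:
    """True if xapian_block is at least fs_block and a whole multiple of it."""
    return xapian_block >= fs_block and xapian_block % fs_block == 0


# The two lines shown above, pasted in as sample input.
sample = """Block size:               1024
Fragment size:            1024
"""

fs_block = filesystem_block_size(sample)
for candidate in (8192, 16384):
    print(candidate, xapian_block_ok(candidate, fs_block))
```

With a 1KB filesystem block, both the 8K default and the 16KB value
mentioned in this thread pass the check.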

Performance is excellent, but I will try Ubuntu Server 9.10 with a 16KB
disk block size to see whether the search engine gets better. I hope to
index at least 200-300 million Facebook profiles using C++/Perl on
Xapian.
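
For anyone wanting to script a merge like this, here is a rough Python
sketch that builds the argument list for one xapian-compact run over all
the source databases, using the -m and -b options discussed in the
quoted thread below. The paths are hypothetical and compact_command is
an illustrative helper, not part of Xapian. It assumes the usual
"sources first, destination last" argument order and that -b takes a
size in bytes, so check xapian-compact --help on your version before
relying on it.

```python
import shlex


def compact_command(sources, destination, multipass=True, block_size=None):
    """Build an argument list for a single xapian-compact invocation."""
    sources = list(sources)
    if not sources:
        raise ValueError("need at least one source database")
    args = ["xapian-compact"]
    if multipass:
        args.append("-m")  # multipass merge, as discussed in the thread
    if block_size is not None:
        args += ["-b", str(block_size)]  # Xapian block size in bytes
    args += sources
    args.append(destination)  # destination database goes last
    return args


# Hypothetical layout: 300 small indexes merged into one database.
sources = [f"/data/index/part{i:03d}" for i in range(300)]
cmd = compact_command(sources, "/data/index/merged", block_size=16384)

# Show the start and end of the command rather than all 300 paths.
print(shlex.join(cmd[:6]), "...", cmd[-1])
```

The list can be handed to subprocess.run(cmd, check=True) as it is,
which avoids shell quoting problems with 300 paths.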

PS: This searches 150 million documents from one hard drive using Xapian.
       Imagine what Xapian could do with two hard drives! :-)

Kevin Duraj
http://find1friend.com/
http://myhealthcare.com/



On Tue, Feb 2, 2010 at 10:40 PM, Olly Betts <olly at survex.com> wrote:
> On Tue, Feb 02, 2010 at 02:49:46PM +0200, Henry C. wrote:
>> I've been wondering, what's the sane/optimal use of xapian-compact when
>> merging many indexes with a view to maximum merging performance?
>>
>> The obvious:
>> - only use -F on the final db.
>
> That's not totally obvious, but is unlikely to make much difference either way.
>
>> - use -m since I'm merging more than 3 dbs.
>
> Someone reported -m was slower for them, but it was certainly a win for me.
> It does do more work, but without it, the postlist table is an N-way merge,
> which scatters reads a lot.  So it's essentially an attempt to avoid being
> so I/O bound.
>
>> Best strategy?
>> a)  loop:  merge batches (of say 50, where the individual db's are small)
>> into a temp index, then merge the (larger) temp into the final product...
>> end-loop
>>
>> b)  loop:  merge batches (of say 50, where the individual db's are small)
>> into many temp indexes... end-loop
>> Then merge those (larger) temps into the final product.
>
> Or just merge all the databases in a single invocation.
>
> I don't have figures to compare these, and it may vary according to your
> data, OS, FS, and/or hardware, so all I can really suggest is to try the
> different approaches and see.  Do report if you find anything interesting.
>
> Currently the grouping under -m is fairly crude - postlists are just merged
> in pairs (plus a three if there are an odd number), and then the merged
> lists are remerged in the same way until we have just one, but that may be
> reasonable even for mismatched sizes.
>
> It would probably be significantly faster not to use a Btree for the
> intermediate stages, but just serialise it to a flat file - we will end up
> rereading it in order.  That would only make a difference when merging more
> than 3 databases though.
>
> I should file a ticket for it - it would make a fairly self-contained project
> for someone wanting to hack on Xapian without needing to understand much of the
> internals.
>
>> Finally, presumably it's best to use the same blocksize (-b) as the
>> underlying filesystem?  I see the default is 8K, but the default blocksize
>> on (eg) ext3 is 4k...  or am I way off here?
>
> It should certainly not be smaller than the hardware blocksize (or else you
> need to read the existing disk-block in order to write a Xapian-block).  A
> multiple is fine though, and larger blocks are a bit more efficient.  I did
> some tests a year or so ago which suggested 16KB might be slightly better than
> 8KB, but it is sufficiently close that it didn't seem to justify changing the
> default.
>
> Cheers,
>    Olly
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>
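
The grouping Olly describes for -m (postlists merged in pairs, plus one
group of three when the count is odd, repeated until a single list
remains) can be sketched in Python as below. This only models the
grouping, not the merge itself, and putting the group of three at the
end is an assumption; the description above does not say where it goes.

```python
def group_for_pass(items):
    """Split items into pairs, with one group of three if the count is odd."""
    items = list(items)
    n = len(items)
    if n == 0:
        raise ValueError("nothing to group")
    if n <= 3:
        return [items]  # a single item, pair, or triple is one group
    end_of_pairs = n if n % 2 == 0 else n - 3
    groups = [items[i:i + 2] for i in range(0, end_of_pairs, 2)]
    if end_of_pairs < n:
        groups.append(items[end_of_pairs:])  # the odd one out makes a three
    return groups


def count_passes(n_inputs):
    """Count how many rounds of grouping it takes to reach one list."""
    lists = list(range(n_inputs))
    passes = 0
    while len(lists) > 1:
        # Each group stands for one merged list in the next round.
        lists = [tuple(group) for group in group_for_pass(lists)]
        passes += 1
    return passes


print(group_for_pass(["a", "b", "c", "d", "e"]))
print(count_passes(300))
```

Under this model each round roughly halves the number of lists, so 300
inputs go 300, 150, 75, 37, 18, 9, 4, 2, 1.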


