[Xapian-discuss] Optimal usage of xapian-compact for merging
Kevin Duraj
kevin.softdev at gmail.com
Tue Mar 23 17:46:15 GMT 2010
Henry, Olly,

I am merging 300 indexes at once. The merge takes less than a day for
100 million documents, and I notice very heavy I/O while it runs.
Tomorrow I plan to install a new Seagate Barracuda XT hard drive
(2TB, 7200 RPM, SATA 6Gb/s, 64MB cache) on the find1friend.com server
to replace my old 1TB Barracuda, which is running out of space. My old
system runs CentOS 5 with a 1KB disk-block size and hosts two Xapian
indexes of around 150 million documents, running fairly fast as you
can see: http://find1friend.com/ I might not be able to use SATA
6Gb/s without an additional interface, but let's see what happens. I
don't want to set my datacenter on fire; my co-location providers are
very nice to me. :-)

    $ tune2fs -l /dev/sda1
    Block size:               1024
    Fragment size:            1024

Performance is excellent, but I will try Ubuntu Server 9.10 with a
16KB disk-block size to see whether the search engine improves. I
also want to index at least 200-300 million Facebook profiles using
C++/Perl on Xapian.

PS: That is searching 150 million documents from one hard drive using
Xapian. You can imagine what Xapian would do with two hard drives! :-)

Kevin Duraj
http://find1friend.com/
http://myhealthcare.com/
On Tue, Feb 2, 2010 at 10:40 PM, Olly Betts <olly at survex.com> wrote:
> On Tue, Feb 02, 2010 at 02:49:46PM +0200, Henry C. wrote:
>> I've been wondering, what's the sane/optimal use of xapian-compact when
>> merging many indexes with a view to maximum merging performance?
>>
>> The obvious:
>> - only use -F on the final db.
>
> That's not totally obvious, but is unlikely to make much difference either way.
>
>> - use -m since I'm merging more than 3 dbs.
>
> Someone reported -m was slower for them, but it was certainly a win for me.
> It does do more work, but without it, the postlist table is an N-way merge,
> which scatters reads a lot. So it's essentially an attempt to avoid being
> so I/O bound.
>
>> Best strategy?
>> a) loop: merge batches (of say 50, where the individual db's are small)
>> into a temp index, then merge the (larger) temp into the final product...
>> end-loop
>>
>> b) loop: merge batches (of say 50, where the individual db's are small)
>> into many temp indexes... end-loop
>> Then merge those (larger) temps into the final product.
>
> Or just merge all the databases in a single invocation.
>
> I don't have figures to compare these, and it may vary according to your
> data, OS, FS, and/or hardware, so all I can really suggest is to try the
> different approaches and see. Do report if you find anything interesting.
>
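For anyone who wants to time strategy (b) against a single invocation, here is a rough Python sketch that only builds the xapian-compact command lines; it does not run them. The function name, the "tmp-merge-" prefix, and the database names are made up for illustration. The only flags used are -m and -F, as discussed above, with -F applied to the final database only.

```python
def plan_batched_compact(sources, final_db, batch_size=50,
                         tmp_prefix="tmp-merge-"):
    """Return xapian-compact argv lists for strategy (b).

    Hypothetical helper: batches of small databases are merged into
    temporary databases, then the temporaries are merged into final_db.
    """
    commands = []
    temps = []
    for n, start in enumerate(range(0, len(sources), batch_size)):
        batch = sources[start:start + batch_size]
        temp = f"{tmp_prefix}{n}"
        # -m: multipass merge. No -F here, the temporaries get re-merged.
        commands.append(["xapian-compact", "-m", *batch, temp])
        temps.append(temp)
    # Final merge of the larger temporaries; -F only on the final product.
    commands.append(["xapian-compact", "-m", "-F", *temps, final_db])
    return commands


if __name__ == "__main__":
    dbs = [f"db{i}" for i in range(300)]  # placeholder names
    for cmd in plan_batched_compact(dbs, "final"):
        print(" ".join(cmd[:3]), "...", cmd[-1])
```

Each list could be passed to subprocess.run(cmd, check=True) and timed. The single-invocation alternative is just one command with all 300 sources followed by the destination.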
> Currently the grouping under -m is fairly crude - postlists are just merged
> in pairs (plus a three if there are an odd number), and then the merged
> lists are remerged in the same way until we have just one, but that may be
> reasonable even for mismatched sizes.
>
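To get a feel for what that grouping means with many inputs, here is a toy Python model of the scheme as described: merge in pairs, with one group of three when the count is odd, and repeat until one list remains. It is not Xapian's code. Where the group of three falls (here, at the end) is my assumption, and "size" is just an abstract unit of postlist data.

```python
def multipass_groups(items):
    """Split items into pairs, plus one group of three if the count is odd.

    Assumption: the group of three is taken from the end of the list.
    """
    groups = []
    i, n = 0, len(items)
    while i < n:
        step = 3 if n - i == 3 else 2
        groups.append(items[i:i + step])
        i += step
    return groups


def simulate_multipass(sizes):
    """Return (passes, total units written) for repeated pairwise merging."""
    passes = 0
    written = 0
    while len(sizes) > 1:
        # Each group is merged into one list whose size is the group's sum.
        sizes = [sum(group) for group in multipass_groups(sizes)]
        written += sum(sizes)
        passes += 1
    return passes, written


if __name__ == "__main__":
    print(multipass_groups([1, 2, 3, 4, 5]))
    print(simulate_multipass([1] * 300))
```

In this model, 300 equal-sized inputs shrink as 300, 150, 75, 37, 18, 9, 4, 2, 1, so the postlist data is written eight times in exchange for each merge reading from only two or three inputs at once.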
> It would probably be significantly faster not to use a Btree for the
> intermediate stages, but just serialise it to a flat file - we will end up
> rereading it in order. That would only make a difference when merging more
> than 3 databases though.
>
> I should file a ticket for it - it would make a fairly self-contained project
> for someone wanting to hack on Xapian without needing to understand much of the
> internals.
>
>> Finally, presumably it's best to use the same blocksize (-b) as the
>> underlying filesystem? I see the default is 8K, but the default blocksize
>> on (eg) ext3 is 4k... or am I way off here?
>
> It should certainly not be smaller than the hardware blocksize (or else you
> need to read the existing disk-block in order to write a Xapian-block). A
> multiple is fine though, and larger blocks are a bit more efficient. I did
> some tests a year or so ago which suggested 16KB might be slightly better than
> 8KB, but it is sufficiently close that it didn't seem to justify changing the
> default.
>
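The rule of thumb above (not smaller than the disk block, a multiple is fine) can be written as a small check to run before choosing a -b value. This is a sketch with a made-up function name. The power-of-two and 2K-64K limits are what I recall from xapian-compact's help text, so treat them as an assumption and check them against your version.

```python
def blocksize_problems(xapian_block, fs_block):
    """Return a list of reasons a Xapian block size looks like a bad fit.

    Both sizes are in bytes. An empty list means no problem was found.
    """
    problems = []
    # A power of two has exactly one bit set.
    if xapian_block <= 0 or xapian_block & (xapian_block - 1):
        problems.append("not a power of two")
    # Assumed limits, recalled from the xapian-compact help text.
    if not 2048 <= xapian_block <= 65536:
        problems.append("outside 2K-64K")
    if xapian_block < fs_block:
        # Writing one Xapian block would need a read of the disk block first.
        problems.append("smaller than the filesystem block")
    elif xapian_block % fs_block:
        problems.append("not a multiple of the filesystem block")
    return problems


if __name__ == "__main__":
    print(blocksize_problems(8192, 4096))   # default 8K on a 4K filesystem
    print(blocksize_problems(2048, 4096))   # smaller than the disk block
```

On Linux the filesystem value can be taken from the "Block size" line of tune2fs -l, or approximated with os.statvfs(path).f_bsize.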
> Cheers,
> Olly
>