[Xapian-discuss] How to update DB concurrently?
Olly Betts
olly at survex.com
Thu May 18 09:41:22 BST 2006
On Wed, May 17, 2006 at 08:52:58PM -0800, oscaruser at programmer.net wrote:
> How can I increase or improve the rate of the indexer to the level the
> spiders are processing the URLs?
Hmm, I'd imagine 150 spiders are probably netting you several hundred
documents per second, maybe thousands.
Some ideas:
* Read http://www.xapian.org/docs/scalability.html if you haven't
already.
* Make sure the indexer is running continuously and don't call flush()
explicitly.
* Batch up updates by setting XAPIAN_FLUSH_THRESHOLD in the
environment (don't forget to export it!). It defaults to 10000 - if
you've plenty of RAM, you can raise this substantially. Gmane uses
100000 (100 thousand) currently.
* Use the flint backend instead of quartz:
http://wiki.xapian.org/FlintBackend
Don't be put off by the warning - the current state is very stable
(sufficiently good that I'm contemplating forking off a copy as
the default backend for Xapian 1.0.)
* Make sure the machine has plenty of RAM and fast disks.
* Run several indexers into separate databases and merge these later
with xapian-compact (for flint) or quartzcompact (for quartz). The
indexing rate drops off gradually as database size grows, so the
fastest way to build a large database is to build a number of
databases and merge - gmane builds databases containing 1 million
documents each and then merges them together. I chose this threshold
after doing a bit of profiling so it's a good starting value, but you
may be able to tune it further and it'll depend on your hardware too.
* If you aren't trying to read from the databases while building
them, you could try enabling "dangerous mode" - for flint you
just need to uncomment the obvious #define in
backends/flint/flint_table.cc (search for DANGEROUS) and recompile.
"Dangerous" mode updates blocks in place rather than ensuring the
old version is preserved, so reading while writing won't work, and
(this is the "danger" bit) if the power fails or the system crashes
your database may not be in a consistent state. But it reduces the
amount of I/O and buys you a little speed. I use this mode to build
gmane's database.
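To make the batching and build-then-merge suggestions above concrete, here is a minimal Python sketch of the bookkeeping involved. It is only an illustration under stated assumptions: the helper names and the "shard-NNNN" directory naming are hypothetical, the numbers are simply the gmane values quoted above, and the merge step assumes xapian-compact is invoked as "xapian-compact SOURCE... DESTINATION". It does not call Xapian itself.

```python
import os

# Values quoted in the post for gmane; tune them for your own hardware.
DOCS_PER_SHARD = 1_000_000   # documents per separately built database
FLUSH_THRESHOLD = 100_000    # value for XAPIAN_FLUSH_THRESHOLD


def shard_for(doc_number, docs_per_shard=DOCS_PER_SHARD):
    """Return the 0-based shard index for the doc_number-th document (0-based)."""
    return doc_number // docs_per_shard


def shard_path(base_dir, shard_index):
    """Hypothetical naming scheme: base_dir/shard-0000, base_dir/shard-0001, ..."""
    return f"{base_dir}/shard-{shard_index:04d}"


def indexer_environment(flush_threshold=FLUSH_THRESHOLD):
    """Return a copy of the current environment with XAPIAN_FLUSH_THRESHOLD set.

    Passing this as the environment of a child indexer process has the same
    effect as exporting the variable in the shell before starting it.
    """
    env = dict(os.environ)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env


def compact_command(shard_paths, destination):
    """Build the argument list for merging the shards into one database.

    Assumes the form: xapian-compact SOURCE... DESTINATION
    """
    return ["xapian-compact"] + list(shard_paths) + [destination]
```

To use it, you would start your own indexer with something like subprocess.run(your_indexer_argv, env=indexer_environment()) for each shard, and once they have all finished, run subprocess.run(compact_command(paths, "merged"), check=True). The indexer command itself is left out because it depends entirely on your setup.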
I also have plans for a number of improvements, which I'm working on
in an on-going fashion. If you're in a hurry and have a budget for
your project, then funding is always welcome and would enable me to
devote more time to this work!
Cheers,
Olly