[Xapian-discuss] Multiple databases vs Single large database

Olly Betts olly at survex.com
Fri Nov 21 13:42:34 GMT 2008


On Fri, Nov 21, 2008 at 05:51:30AM -0500, Jim wrote:
> Consider
> 1.  Are the searches fast enough (of multiple DBs)?

There's not been much profiling of searching multiple databases, and I
don't have figures for how performance compares.  The split databases
will generally be bigger in total size than a merged database would be,
so you'll need a bit more disk space, but then if you need to rebuild
while searching the old database, you won't need as much scratch space
if you can rebuild one database at a time.

There's clearly an overhead for opening each one - we try to minimise
this, so it's not much for each one, but if we're talking hundreds or
thousands of databases, it might start to add up.  In some applications
you can keep them open between searches, but that's not always viable.
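
As a rough illustration of keeping databases open between searches, a
long-running search process could cache handles by path.  This is only a
sketch: the class name HandleCache and the opener argument are made up
for the example.  With the Xapian Python bindings the opener would
presumably be xapian.Database, and a cached handle would need reopen()
called on it to see later updates.

```python
class HandleCache:
    """Keep database handles open between searches, keyed by path."""

    def __init__(self, opener):
        # opener is any callable taking a path and returning a handle,
        # e.g. xapian.Database (an assumption; any callable works here).
        self._opener = opener
        self._handles = {}

    def get(self, path):
        # Only pay the cost of opening the first time a path is requested.
        if path not in self._handles:
            self._handles[path] = self._opener(path)
        return self._handles[path]
```

This only helps where the search process survives between queries; a
CGI-style process started per request gets no benefit from it.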

If you find slow cases, profiling them often reveals a bottleneck that
can be addressed fairly easily.  There are some tips on profiling on
the wiki.

> 2.  How often are multiple DBs searched?

Also, the term statistics will be different for a search over a single
user's database and a search over a combined database filtered to show
a single user's data.  If users are very different (e.g. different
languages) that might lead to worse results from a merged database.
If they're broadly similar, the averaging of statistics might actually
lead to better results from a merged database.
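
To make the difference concrete, here's a small calculation using a
textbook BM25-style inverse document frequency.  The formula is the
commonly quoted one and not necessarily exactly what Xapian computes
internally, and the document counts are invented for the example.

```python
import math


def idf(total_docs, docs_with_term):
    """Textbook BM25-style idf; the 1 + keeps the result positive."""
    ratio = (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)
    return math.log(1 + ratio)


# Invented numbers: one user's database holds 1,000 documents and the
# term appears in 400 of them, so it is common for this user.
per_user = idf(1000, 400)

# In a merged database of 100,000 documents the same term appears in
# only 500, so it looks much rarer and is weighted more heavily.
merged = idf(100000, 500)

print("per-user idf: %.2f, merged idf: %.2f" % (per_user, merged))
```

The same term in the same document is weighted differently depending on
which collection the statistics come from, so the ranking of a single
user's results can change even though the filter returns the same set of
documents.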

> 2.  Consider ping ponging two Xapian DBs when updating.  I use the 
> following logic.
> I have two directories with Xapian DBs.  A  and B.
> If A is older than B
>   copy contents of B into A
> else
>   copy contents of A into B
> add new entries to the copy
> if the copy is A
>     rm C
>     ln -s A C
> if the copy is B
>     rm C
>     ln -s B C
> 
> where C is the database that I am using to search.

This leaves a time interval where there's no valid database at C though,
which is problematic if search processes could be trying to open the
database while you're switching the new database live.

A better approach is to use a stub database file for C.  You can write a
new file as "C.tmp" and then atomically switch with "mv C.tmp C" (at
least on POSIX platforms).
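
The stub switch described above could be sketched in Python roughly as
follows.  It assumes the stub file holds one "auto <path>" line per
database; the function name switch_stub and any paths passed to it are
hypothetical.  os.replace() does the same job as "mv C.tmp C", and is
atomic on POSIX provided both names are on the same filesystem.

```python
import os


def switch_stub(stub_path, db_path):
    """Point the stub file at db_path with no window where it is missing."""
    tmp_path = stub_path + ".tmp"
    with open(tmp_path, "w") as f:
        # Assumed stub format: "<backend> <path>", with "auto" asking
        # Xapian to detect the backend type.
        f.write("auto %s\n" % db_path)
        f.flush()
        os.fsync(f.fileno())
    # Readers opening stub_path see either the old or the new contents,
    # never a missing file.
    os.replace(tmp_path, stub_path)
```

After updating the copy, you'd call something like
switch_stub("C", "/path/to/A") or switch_stub("C", "/path/to/B")
depending on which directory was just rebuilt.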

Cheers,
    Olly


