[Xapian-discuss] database stubs: practical limitations, rules of thumb?

Olly Betts olly at survex.com
Tue Dec 2 06:40:42 GMT 2008


On Mon, Dec 01, 2008 at 08:11:12PM +0900, Josef Novak wrote:
>   Is there any standing recommendation on the use of database stubs
> with xapian?  Is there a rule of thumb in terms of size+number_of_dbs
> limit for a stub?  Aside from disk I/O, how does having the individual
> dbs located on a remote machine factor into stub usage?
>   I've been searching the lists a bit, looking for posts on the usage
> of stubs, but I only found one highly-relevant-looking thread,
> http://lists.tartarus.org/pipermail/xapian-discuss/2006-August/002533.html

Well, what's there isn't specific to stubs - it's a generic point about
searching over a large number of databases.
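
(For context, a stub database is just a small text file listing the
databases to open, one per line.  A made-up example - the paths and
host:port here are placeholders:

    auto /var/spool/xapian/db1
    auto /var/spool/xapian/db2
    remote search2.example.com:33000

You then open the stub file just as you would any other database.)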

I'm not aware of anyone who has benchmarked opening a large number of
local or remote databases.  If you want to try, I'd certainly be
interested to hear.

I just did a very quick timing test - a loop which just opens and closes
the same database 5000 times takes about 0.85 seconds with flint (and
about 0.7 seconds with chert).  That should be a lower bound on how long
a search over that many different databases would take.
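
The sort of loop I mean looks roughly like this (just a sketch, not the
exact test code; the database path is a placeholder):

    #include <xapian.h>
    #include <ctime>
    #include <iostream>

    int main() {
        std::clock_t start = std::clock();
        for (int i = 0; i < 5000; ++i) {
            // Opening constructs the Database object; it's closed again
            // when the object goes out of scope at the end of the loop body.
            Xapian::Database db("/path/to/db");
        }
        std::cout << double(std::clock() - start) / CLOCKS_PER_SEC
                  << " seconds\n";
        return 0;
    }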

You really want searches to take under a second or they'll "feel slow",
so if you try to search over 5000 databases together you'll probably
have frustrated users.
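
(For reference, "searching over N databases together" just means
combining them into one virtual database and running a single search
across it - a minimal sketch, with placeholder paths and query term:

    #include <xapian.h>
    #include <iostream>

    int main() {
        // Combine several databases into one virtual database.
        Xapian::Database db;
        db.add_database(Xapian::Database("/path/to/db1"));
        db.add_database(Xapian::Database("/path/to/db2"));

        // Run one search across the lot.
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("example"));
        Xapian::MSet matches = enquire.get_mset(0, 10);
        std::cout << matches.get_matches_estimated()
                  << " matches estimated\n";
        return 0;
    }

A stub file listing all the databases gives you the same combined
database without writing any code.)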

There's probably scope for reducing this overhead by profiling to find
ways to speed up opening a database, but I suspect it's still going to
be a bad idea to try to search thousands of databases together.

>   and it seems, if the rather old thread is still relevant, that there
> is a fairly low limit to the number of dbs one can corral into a
> single stub, without incurring a fairly stiff performance hit.

I think you're reading a meaning I didn't intend then.  I'm really just
saying that it's pointless benchmarking a few thousand databases against
one big one, as the big one is clearly going to be significantly faster.

>    This appears to be considerably faster, and given the thread above,
> would appear to be the preferred way to proceed.  However this means
> that my larger dbs are each 'all in one place', and are effectively
> less robust.  My intuition is that it would make the most sense to
> shard each larger city, county, etc. db, based on overall size (and
> perhaps access statistics), and distribute the shards over a group of
> different machines, but I wonder if there is a rule of thumb in terms
> of shard size, and number of shards per stub.  If not I guess I'll
> just have to experiment!

I don't know of any previous experiments in this area I'm afraid.  Do
let us know how you get on...

Cheers,
    Olly


