[Xapian-discuss] database stubs: practical limitations, rules of thumb?

Tue Dec 2 14:18:04 GMT 2008

Hi,
  Thanks for the feedback, I'll try and get back with some numbers if
I make some headway.
Cheers

2008/12/2 Olly Betts <olly at survex.com>:
> On Mon, Dec 01, 2008 at 08:11:12PM +0900, Josef Novak wrote:
>>   Is there any standing recommendation on the use of database stubs
>> with xapian?  Is there a rule of thumb in terms of size+number_of_dbs
>> limit for a stub?  Aside from disk I/O, how does having the individual
>> dbs located on a remote machine factor into stub usage?
>>   I've been searching the lists a bit, looking for posts on the usage
>> of stubs, but I only found one highly-relevant-looking thread,
>> http://lists.tartarus.org/pipermail/xapian-discuss/2006-August/002533.html
>
> Well, what's there isn't specific to stubs, but a generic point about
> searching over a large number of databases.
>
> I'm not aware of anyone who has benchmarked opening a large number of
> local or remote databases.  If you want to try, I'd certainly be
> interested to hear.
>
> I just did a very quick time test - a loop which just opens and closes
> the same database 5000 times takes about 0.85 seconds with flint (and
> 0.7 seconds with chert).  And that should be a lower bound on how long a
> search over that many different databases would take.
>
> You really want searches to take under a second or they'll "feel slow",
> so if you try to search over 5000 databases together you'll probably
> have frustrated users.
>
> There's probably scope for reducing this overhead by profiling to find
> ways to speed up opening a database, but I suspect it's still going to
> be a bad idea to try to search thousands of databases together.
>
>>   and it seems, if the rather old thread is still relevant, that there
>> is a fairly low limit to the number of dbs one can corral into a
>> single stub, without incurring a fairly stiff performance hit.
>
> I think you're reading a meaning I didn't intend then.  I'm really just
> saying it is pointless benchmarking a few thousand databases
> versus one big one as the big one is clearly going to be significantly
> faster.
>
>>    This appears to be considerably faster, and given the thread above,
>> would appear to be the preferred way to proceed.  However this means
>> that my larger dbs are each 'all in one place', and are effectively
>> less robust.  My intuition is that it would make the most sense to
>> shard each larger city, county, etc. db, based on overall size (and
>> perhaps access statistics), and distribute the shards over a group of
>> different machines, but I wonder if there is a rule of thumb in terms
>> of shard size, and number of shards per stub.  If not I guess I'll
>> just have to experiment!
>
> I don't know of any previous experiments in this area I'm afraid.  Do
> let us know how you get on...
>
> Cheers,
>    Olly
>
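
For anyone wanting to repeat the quick timing test Olly describes above (a loop that opens and closes the same database 5000 times), a minimal sketch in Python follows. It is not the code Olly ran. The helper name time_open_close and the DummyDatabase stand-in are hypothetical, added so the sketch runs without Xapian installed. The commented-out line assumes the Xapian Python bindings and a placeholder path.

```python
import time


def time_open_close(open_db, repeats=5000):
    """Open and close a database `repeats` times; return elapsed seconds.

    `open_db` is any zero-argument callable returning an object with a
    close() method (hypothetical interface chosen for this sketch).
    """
    start = time.perf_counter()
    for _ in range(repeats):
        db = open_db()
        db.close()
    return time.perf_counter() - start


class DummyDatabase:
    """Stand-in so the sketch runs without Xapian; it only counts calls."""
    opened = 0
    closed = 0

    def __init__(self):
        DummyDatabase.opened += 1

    def close(self):
        DummyDatabase.closed += 1


if __name__ == "__main__":
    elapsed = time_open_close(DummyDatabase, repeats=5000)
    print("5000 open/close cycles took %.3f seconds" % elapsed)

    # With the Xapian Python bindings installed, something like the
    # following should time a real database (the path is a placeholder,
    # and close() assumes a reasonably recent Xapian release):
    #
    #   import xapian
    #   time_open_close(lambda: xapian.Database("/path/to/db"))
```

As the thread notes, the figure this produces is only a lower bound: it measures the cost of opening the databases, not of running a search across them.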