MultiDatabase shard count limitations

Sun Aug 23 22:10:24 BST 2020

On Fri, Aug 21, 2020 at 09:06:59AM +0000, Eric Wong wrote:
> Going back to the "prioritizing aggregated DBs" thread from
> February 2020, I've got 390 Xapian shards for 130 public inboxes
> I want to search against(*).  There's more on the horizon (we're
> expecting tens of thousands of public inboxes).

Was that "(*)" meant to have a matching footnote?

> After bumping RLIMIT_NOFILE and running ->add_database a bunch,
> the actual queries seem to be taking ~30s (not good :x).
> 
> Now I'm thinking, MultiDatabase isn't the right way to go about
> this...

I'm not aware of anyone who's tried to use that many shards before, so
it might be you're just hitting something easy to address.  Anything to
do with shards should be at worst O(n) in the number of shards (and it's
often O(1)), but perhaps there's something silly happening which doesn't
matter with a more modest number of shards.

If you run the search command under "time", how does the CPU time
(user+sys) compare to real?  If CPU time is much less than real, then
it's spending a lot of time waiting for I/O, which in this case means
loading files from disk.
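
If it's easier to script than to eyeball, here's a minimal sketch of the
same comparison in Python.  It assumes a Unix-like system (os.times()
only reports child CPU time there), and the function names are made up
for illustration.  Time not spent on the CPU isn't necessarily all I/O
wait, so treat the fraction as a rough hint only.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv and return (real, cpu) in seconds.

    cpu is the child's user+sys time, i.e. what "time" reports as
    user and sys added together.
    """
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, check=False, stdout=subprocess.DEVNULL)
    real = time.monotonic() - start
    after = os.times()
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return real, cpu


def off_cpu_fraction(real, cpu):
    """Rough share of wall-clock time not spent on the CPU, clamped to 0..1.

    A multi-threaded command can use more CPU time than real time,
    hence the clamp.
    """
    if real <= 0:
        return 0.0
    return min(1.0, max(0.0, (real - cpu) / real))


# Example use (the command shown is a placeholder, not a real tool):
#   real, cpu = time_command(["your-search-command", "query"])
#   print(real, cpu, off_cpu_fraction(real, cpu))
```

A fraction near 1 suggests the search is mostly waiting (most likely on
disk here); a fraction near 0 suggests it's CPU bound and worth
profiling.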

If most (or at least a significant amount) of the time is CPU time then
it would be useful to profile to see if there is any low-hanging fruit.
I've been mostly using the profiler in gperftools lately if you want to
try this and don't know what to use.

It would also be interesting to compare with xapian git master (if
you're not using that already - you don't seem to mention a version).
The handling of shards has changed in some possibly significant ways.

> Perhaps creating a new, all-encompassing Xapian index with a
> reasonable shard count would be wise, at least for the normal
> WWW frontend?

There are some inherent overheads to dealing with lots of shards.
If you open them all for every search, there's the overhead of that.
There's also going to be more space overhead from the table structure
on average, which means disk cache will be under more pressure.

If you have a persistent search process and ample RAM, then it may be
feasible to scale up to tens of thousands of shards.

> Managing removals of entire inboxes from an all-encompassing
> Xapian DB would get much trickier.

If each inbox is indexed by its own boolean term you can delete all
the documents indexed by a specified term with one API call
(Xapian::WritableDatabase::delete_document(term)).  It may take a
while for a large inbox, but it's slow rather than tricky.
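
As a rough sketch of what that could look like via the Python bindings:
the "XINBOX" prefix and the helper names below are assumptions for
illustration, not anything your code or Xapian defines.  Here "doc" is
expected to be a xapian.Document and "db" a xapian.WritableDatabase.

```python
def inbox_term(inbox_id, prefix="XINBOX"):
    """Build the boolean term tagging every document from one inbox.

    The "XINBOX" prefix is a hypothetical choice; any user-defined
    prefix that is used consistently would do.
    """
    return prefix + inbox_id


def tag_document(doc, inbox_id):
    """At index time, mark the document as belonging to inbox_id."""
    doc.add_boolean_term(inbox_term(inbox_id))


def remove_inbox(db, inbox_id):
    """Delete all documents indexed by the inbox's term, then commit."""
    db.delete_document(inbox_term(inbox_id))
    db.commit()


# Example use (the path and inbox name are placeholders):
#   import xapian
#   db = xapian.WritableDatabase("/path/to/combined", xapian.DB_OPEN)
#   remove_inbox(db, "meta")
```

This only works if every document gets tagged with its inbox's term at
index time, so the tagging step has to be in place before the combined
database is built.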

Cheers,
    Olly