MultiDatabase shard count limitations

Olly Betts olly at survex.com
Mon Aug 9 02:42:22 BST 2021


On Sun, Aug 08, 2021 at 07:40:04PM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > Here's a patch against xapian git master to do that:
> >
> > https://oligarchy.co.uk/xapian/patches/get-wdf-upper-bound-from-postlist.patch
> >
> > I think it'll need a bit more work for 1.4.x, but if you're able to
> > test this that'd be useful.
> 
> Thanks, I tested against master
> (74df7696f8603add68c6fda27e44dda2b7090093) with SWIG Xapian.pm.
> 
> With 537 shards the time to search a particular query went from
> ~2 minutes to ~3s; so this patch is a huge improvement for me.

OK, that's even better than the 40%+ improvement expected.

I think this definitely justifies the extra effort to backport this
change to 1.4.x.

> master roughly matches Debian buster 1.4.11 and buster-backports
> 1.4.18 speeds.  It's a busy system, so it's hard to notice small
> improvements, but major improvements are easily noticeable.

You should find git master to be a bit faster in general, but the
difference is probably small enough to miss if the timings are noisy.

> Btw, I noticed some "make check" failures so far:
> 
>   Running tests with backend "multi_glass"...
>   Running test: eliteset1... FAILED
>   Running test: eliteset2... FAILED
>   Running test: eliteset4... FAILED
>   ...
>   Running tests with backend "multi_glass_remoteprog_glass"...
>   Running test: eliteset1... FAILED
>   Running test: eliteset2... FAILED
> 
> I think it's from the patch (it takes a while to run "make check").
> The search results are as expected, but I only tested one
> particular query and I'm not using OP_ELITE_SET (haven't really
> digested many of the options :x)

I see these too (I'd lazily only run `./runtest ./apitest -b glass`
before).

I think this is essentially an existing bug which previously affected
these testcases when searching two remote shards together, but now
triggers for any sharded combination (both remote or both local or
mixed).

The reason it now triggers is that each local shard now uses the wdf
upper bound for just that shard, whereas previously we used the bound
for the combined database.  This means that with this patch the weight
bounds are tighter and the matcher has more scope to optimise, which
is really a good thing (and likely at least part of the speed-up you
see).
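A toy sketch (illustrative Python, not Xapian's actual code) of why
per-shard bounds are tighter than the bound for the combined database:

```python
# Hypothetical wdf (within-document frequency) values for one term,
# split across three shards of a sharded database.
shards = [
    [3, 1, 2],   # shard 0
    [1, 1],      # shard 1
    [9, 2, 4],   # shard 2
]

# Old behaviour: every shard uses the bound from the combined database.
combined_bound = max(wdf for shard in shards for wdf in shard)

# New behaviour: each shard uses the bound for just that shard.
per_shard_bounds = [max(shard) for shard in shards]

print(combined_bound)    # 9
print(per_shard_bounds)  # [3, 1, 9]
```

Shards 0 and 1 now report much tighter bounds, so the matcher can skip
more documents that can't make the result set.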

The issue is that OP_ELITE_SET selects terms based on the weight upper
bounds the weighting scheme reports for them, but it makes that
selection per shard, so different shards can end up running the query
with different terms.  As a result, searching the same collection of
documents split differently into shards can give different search
results (but only when using OP_ELITE_SET).
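A toy sketch (illustrative Python, not the real matcher) of how
per-shard selection can diverge; the term names and bound values are
made up:

```python
def elite_set(term_bounds, k):
    """Pick the k terms with the highest weight upper bound."""
    return set(sorted(term_bounds, key=term_bounds.get, reverse=True)[:k])

# Hypothetical per-shard weight upper bounds for the same three terms.
shard_a = {"apple": 5.0, "banana": 2.0, "cherry": 1.0}
shard_b = {"apple": 1.0, "banana": 2.0, "cherry": 5.0}

print(sorted(elite_set(shard_a, 2)))  # ['apple', 'banana']
print(sorted(elite_set(shard_b, 2)))  # ['banana', 'cherry']
```

Each shard runs the query with a different term subset, so the same
documents can rank differently depending on how they were sharded.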

I think OP_ELITE_SET should probably use its own calculations based on
the combined database, and not be tied to the weighting scheme.
(Currently it isn't useful with schemes like BoolWeight or CoordWeight,
which return a fixed weight upper bound, and more generally it kind of
assumes the known upper bound on the weight is a good indicator of the
typical value of the weight, which is likely often true but isn't
inherently so.)

If anyone knows of a good way of selecting a specified number of terms
from a larger set to run as a query, suggestions are most welcome.
The idea is it should be useful for things like query-by-example, so you
could feed it all the words from a document and have it pick the best
(say) 50 to actually query for.  For example, it might work by
discarding the terms from the set which are common or very rare in the
dataset being searched.
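One possible sketch of that idea (the thresholds, the crude rarity
ordering, and the helper name are all illustrative assumptions, not a
worked-out design):

```python
def pick_query_terms(doc_terms, termfreqs, doccount, n,
                     min_freq=2, max_fraction=0.5):
    """Keep the n "best" terms from a document for query-by-example.

    Drop terms which are very rare (termfreq below min_freq) or very
    common (in more than max_fraction of the documents), then prefer
    the rarer survivors - a crude idf-like ordering.
    """
    candidates = [
        t for t in set(doc_terms)
        if min_freq <= termfreqs.get(t, 0) <= max_fraction * doccount
    ]
    candidates.sort(key=lambda t: termfreqs[t])
    return candidates[:n]

# Made-up term frequencies over a 1000-document collection.
termfreqs = {"the": 900, "xapian": 40, "shard": 12, "wdf": 3, "zzz": 1}
doc = ["the", "the", "xapian", "shard", "wdf", "zzz"]
print(pick_query_terms(doc, termfreqs, doccount=1000, n=3))
# ['wdf', 'shard', 'xapian']  ("the" is too common, "zzz" too rare)
```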

Such a reworking is probably too radical for 1.4.x though, so maybe
there we just adjust these testcases to extend the XFAIL_FOR_BACKEND to
cover more backends.  It's not like this bug makes OP_ELITE_SET
unusable.

Cheers,
     Olly


