MultiDatabase shard count limitations
Olly Betts
olly at survex.com
Wed Aug 4 08:39:05 BST 2021
On Wed, Aug 26, 2020 at 12:56:53AM +0100, Olly Betts wrote:
> On Tue, Aug 25, 2020 at 10:15:42PM +0000, Eric Wong wrote:
> > So I managed to get current xapian.git (commit 61724d477edb)
> > built with CXXFLAGS=-ggdb3, and it's closer to 100%:
> >
> > https://80x24.org/spew/20200825215517.GA3936@dcvr/2-perf-report-20200825-214820.gz
> >
> > The machine I'm working on is also significantly busier at the
> > moment trying to reproduce an unrelated problem.
>
> Oh and perf samples the whole machine, not just this process.
>
> This seems to suggest a significant part of the problem is getting the
> wdf upper bound for each term (which is used to bound the weight each
> term can return). This seems to account for 36.43% of a process total
> of 86.62% - I think this really shouldn't take any significant time,
> which would probably mean a 40%+ speed up for this case by itself.
>
> For glass, we store a global wdf upper bound for the database, and then
> return the smaller of this and the term's collection frequency - the
> implementation of this looks up the collection frequency when called,
> which is stored in the first chunk of the postlist for that term. That
> means during a search we end up re-fetching this separately to finding
> it to read the postlist, with an extra cursor seek.
>
> It would be better to just look up each term once. I think to achieve
> that cleanly will need a bit of refactoring.
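
To make the extra seek concrete, here is a minimal toy sketch of the two
lookup patterns described above. It is not Xapian's actual code and not the
patch: ToyPostlistTable, FirstChunk, seek() and the two seeks_with_*
functions are hypothetical names invented for illustration, and the "table"
is just a std::map which counts how often it is consulted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the first postlist chunk of a term, which is
// where the collection frequency is stored.
struct FirstChunk {
    std::uint64_t collection_freq;  // sum of wdf over all documents
};

// Toy postlist table (an assumption, not the glass backend) which counts
// cursor seeks so the two strategies can be compared.
struct ToyPostlistTable {
    std::map<std::string, FirstChunk> chunks;
    std::uint64_t global_wdf_upper_bound = 0;
    mutable unsigned seeks = 0;

    // Each call models one cursor seek to the term's first chunk.
    const FirstChunk& seek(const std::string& term) const {
        ++seeks;
        auto it = chunks.find(term);
        assert(it != chunks.end());  // the toy only handles indexed terms
        return it->second;
    }
};

// The bound as described: the smaller of the database-wide wdf upper bound
// and the term's collection frequency.
std::uint64_t wdf_bound(const ToyPostlistTable& t, const FirstChunk& c) {
    return std::min(t.global_wdf_upper_bound, c.collection_freq);
}

// Pattern described as the problem: one seek to compute the bound, then a
// second seek to the same chunk to open the postlist.
unsigned seeks_with_separate_lookup(const ToyPostlistTable& t,
                                    const std::string& term) {
    unsigned before = t.seeks;
    (void)wdf_bound(t, t.seek(term));  // seek 1: bound only
    (void)t.seek(term);                // seek 2: open the postlist
    return t.seeks - before;
}

// Pattern suggested as the fix: seek once, and take the bound from the
// chunk that was already fetched to read the postlist.
unsigned seeks_with_shared_lookup(const ToyPostlistTable& t,
                                  const std::string& term) {
    unsigned before = t.seeks;
    const FirstChunk& c = t.seek(term);  // single seek
    (void)wdf_bound(t, c);               // bound reuses the fetched chunk
    return t.seeks - before;
}
```

In this toy model the separate lookup costs two seeks per term and the
shared lookup costs one. In a real database the second seek only hurts
when the block is no longer cached, which is why any saving should show up
mainly in I/O-bound cases.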

Here's a patch against xapian git master to do that:

https://oligarchy.co.uk/xapian/patches/get-wdf-upper-bound-from-postlist.patch

I think it'll need a bit more work for 1.4.x, but if you're able to
test this that'd be useful.

My own tests don't show a significant improvement, but they're small
scale and CPU-bound; I would expect the big savings to come if we're I/O
bound as it could then avoid rereading data from disk. Your profile
seemed to show that's where you were.

Cheers,
Olly