System requirements to boost xapian's performance?

Mon Jan 17 09:14:29 GMT 2022

On Mon, 17 Jan 2022 at 01:07, Olly Betts <olly at survex.com> wrote:

> On Sat, Jan 15, 2022 at 11:52:54AM -0400, David Bremner wrote:
> > Philip Colmer <philip.colmer at linaro.org> writes:
> > >
> > > What changes can I make to the specification of a server to best
> > > improve the performance of the indexing? For example, if I throw
> > > more cores at this, will the indexing go faster?
> >
> > I think it will depend a great deal on the indexer, and I don't know
> > anything about how mailman uses Xapian. Based on my experience with
> > Notmuch I would say strive for fast IO (definitely SSD, perhaps ramdisk)
> > and fast single threaded performance. Memory use is usually moderate by
> > 2021 standards.
>
> I should also include the caveat that I also know nothing specific about
> how mailman uses Xapian.
>
> If you're contemplating using a RAM disk, I'd expect (though haven't
> benchmarked) that you'd get equivalent gains by letting that RAM instead
> be used by the OS to cache all of a disk-based database and disable
> syncing of the database for the initial run.  That has the added
> benefits that you don't need to create a RAM disk (and so don't need
> to decide how big to make it), and don't need to copy the database from
> RAM disk to disk once indexing completes.
>
> You can disable syncing by opening the Xapian::WritableDatabase with
> the DB_NO_SYNC flag (which may need a tweak to the indexer code if
> it doesn't already support doing so), or running the indexing command
> under eatmydata.
>
> You can also speed up an initial index run by using DB_DANGEROUS which
> updates database blocks in place rather than doing copy-on-write which
> reduces the amount of I/O required.  This is even less crash resilient.
>
> Increasing the batch size between automatic flushes can improve
> throughput if there's plenty of RAM - set (and export!)
> environment variable XAPIAN_FLUSH_THRESHOLD which is a threshold for a
> counter of number of documents changed.  It defaults to 10000, which
> is fairly conservative.
>
> Using the newest Xapian version available may also help.  E.g. 1.4.19
> added an optimisation which helps indexing if the indexer runs queries
> during indexing.
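
For reference, a minimal sketch of what opening the database with these
flags might look like from the Python bindings. It assumes Xapian 1.4.x
bindings are installed; the helper name open_for_initial_index and its
"dangerous" parameter are made up for illustration and are not part of
xapian-haystack or Haystack.

```python
def open_for_initial_index(path, dangerous=False):
    """Open (or create) a writable database tuned for a first bulk index run.

    Hypothetical helper for illustration only. Both flags trade crash
    resilience for speed, so this only makes sense when the index can be
    rebuilt from scratch after a crash.
    """
    # Imported here so the module still loads where the bindings are absent.
    import xapian

    # DB_NO_SYNC: do not force changes to disk on each commit.
    flags = xapian.DB_CREATE_OR_OPEN | xapian.DB_NO_SYNC
    if dangerous:
        # DB_DANGEROUS: update blocks in place instead of copy-on-write.
        flags |= xapian.DB_DANGEROUS
    return xapian.WritableDatabase(path, flags)
```

Something like db = open_for_initial_index("/path/to/index") would then be
used in place of the indexer's usual open call, for the initial run only.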

Thank you for the replies.

All I know about HyperKitty is that it uses Haystack (django-haystack) as
an agnostic intermediary to the underlying search/indexer. On my
installation, I'm using https://github.com/notanumber/xapian-haystack as
the mechanism to then go from Haystack to Xapian.

When indexing an individual Mailman3 list, HyperKitty uses Haystack's
"update_index" command:
https://django-haystack.readthedocs.io/en/master/management_commands.html#update-index

As a result, I'm not sure if there is much (any?) fine-tuning that can be
done over Xapian itself as I fear that Haystack might just abstract things
too much, but I can certainly pass some of this back to the Mailman 3
developers to see if there is anything that can be done.
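
That said, two of the suggestions above (XAPIAN_FLUSH_THRESHOLD and
eatmydata) work from outside the process, so they should not need any
changes to Haystack or xapian-haystack. A rough sketch of a wrapper,
under these assumptions: the indexer is started via a Django manage.py
(the path here is a placeholder; a real HyperKitty install may use a
different entry point), eatmydata is installed if requested, and the
value 100000 is an arbitrary example rather than a recommendation.

```python
import os
import subprocess
import sys


def build_index_env(flush_threshold=100000, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set.

    Passing this dict as the child's environment has the same effect as
    setting and exporting the variable in a shell.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env


def build_index_command(manage_py="manage.py", use_eatmydata=False):
    """Build the argv for Haystack's update_index management command."""
    cmd = [sys.executable, manage_py, "update_index"]
    if use_eatmydata:
        # eatmydata turns fsync() and friends into no-ops for the child.
        cmd = ["eatmydata"] + cmd
    return cmd


def run_index(manage_py="manage.py", flush_threshold=100000,
              use_eatmydata=False):
    """Run the indexer with the tuned environment; raises on failure."""
    return subprocess.run(
        build_index_command(manage_py, use_eatmydata),
        env=build_index_env(flush_threshold),
        check=True,
    )
```

Calling run_index("/path/to/manage.py", use_eatmydata=True) would then
start the index run. A larger threshold means more documents are held in
memory between flushes, so it is worth watching memory use when raising it.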

It does seem clear, though, that Xapian is single-threaded when indexing so
I'll see what my options are for fast CPU and more memory within the EC2
configurations.

Thanks.

Philip