Is there a large variance in xapian searching?

Olly Betts olly at survex.com
Tue Jul 3 07:21:02 BST 2018


On Mon, Jul 02, 2018 at 06:08:40PM +0800, morefreeze wrote:
> I found that the first query (for example after booting the computer),
> and occasionally a later one, against this database (using QueryParse)
> takes several seconds (around 5 seconds), although it takes 0.8 seconds
> on average. What is the reason for this?

If you've just rebooted, none of the database will be cached, so
everything has to be fetched from disk and that takes more time.

The second query will be faster even if it's for entirely different
terms, because at least the root blocks will be read from cache.
And pretty quickly the cache ends up with all the frequently read
blocks.

This can also happen without a reboot if another process reads a lot
of data which ends up in cache instead of the database blocks.  If
the machine has cronjobs making backups, updating the db used by the
"locate" tool, or doing other things which read a lot of files, you
might want to consider carefully when they run, or run them under
something which minimises cache effects such as "nocache".
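
As a rough illustration of what "minimising cache effects" means for a job
that reads a lot of files, here is a minimal Python sketch.  The function
name read_without_caching is made up for this example, and it assumes a
platform where os.posix_fadvise() exists (otherwise it just reads the
file).  It is not how "nocache" itself is implemented; it only shows the
kind of hint involved.

```python
import os


def read_without_caching(path, chunk_size=1 << 20):
    """Read a file sequentially, then hint that its pages can be dropped.

    Returns the number of bytes read.  Hypothetical helper for illustration.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # Advise the kernel that the cached pages of this file are no
        # longer needed (offset 0, length 0 means the whole file).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

Note that this crude version may also evict pages of that file which were
already cached before the read, so a dedicated tool is likely to be the
better choice for real cron jobs.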

> If I want to shorten this query time, what should I do or try? BTW, I
> think splitting into more databases and querying them in parallel is not
> a good idea, unless Xapian ensures each query takes less than an expected
> time (actually this 13M database is 'small', :P).

I'd think searching more databases would if anything make this "cold
cache" effect worse.

You don't say what version you're using, but make sure it's a recent
Xapian 1.4.x and that you're using the glass backend.  If you're still
using 1.2.x, or 1.4.x with chert databases then switching to 1.4.x+glass
is likely to help.

You can warm the cache usefully just by running a few queries (if
you make them for commonly searched terms that will be more effective).
So if you have a cluster of search machines and want to add a new
member to it, you can automate running a few "warm up" queries after
spinning up the new instance but before actually adding it to the
cluster.
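
A warm-up step along those lines could look something like the following
Python sketch.  The helper names (make_xapian_search, warm_up), the
database path and the query terms are all placeholders for illustration;
the Xapian calls assume the standard Python bindings are installed.

```python
import time


def make_xapian_search(db_path):
    """Return a search(text) callable backed by a Xapian database."""
    # Imported lazily so warm_up() can be used without the bindings.
    import xapian

    db = xapian.Database(db_path)
    parser = xapian.QueryParser()
    parser.set_database(db)
    enquire = xapian.Enquire(db)

    def search(text):
        enquire.set_query(parser.parse_query(text))
        # Fetching the top 10 matches is enough to pull the relevant
        # blocks into the cache.
        return enquire.get_mset(0, 10)

    return search


def warm_up(search, queries):
    """Run each query once and return (query, seconds) pairs."""
    timings = []
    for text in queries:
        start = time.perf_counter()
        search(text)
        timings.append((text, time.perf_counter() - start))
    return timings


# Example use (hypothetical path and terms):
# timings = warm_up(make_xapian_search("/path/to/db"), ["linux", "kernel"])
```

The timings are returned so a deployment script can log them, or check
that later queries have become faster before adding the machine to the
cluster.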

1.4.x will issue prefetch hints if posix_fadvise() is available, which
helps when the cache is cold.  These are done automatically for
postlists, but you can call MSet::fetch() to issue prefetch hints for
fetching document data.  This ticket is about the prefetching changes:

https://trac.xapian.org/ticket/671
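
For anyone unfamiliar with posix_fadvise(), this Python sketch shows the
sort of hint involved.  It is only an illustration of the system call,
not Xapian's code: the function name prefetch_hint is made up, and with
Xapian you would not normally issue these hints yourself.

```python
import os


def prefetch_hint(path, offset=0, length=0):
    """Ask the kernel to start reading a byte range into the page cache.

    Returns True if a hint was issued, False if posix_fadvise() is not
    available on this platform.  A length of 0 means "to end of file".
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # The call returns immediately; the read happens in the background,
        # so a later read of this range is less likely to block on disk.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True
```

The benefit comes from issuing hints for several blocks before reading
any of them, so the disk reads can overlap rather than happen one at a
time.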

If you want to profile what database blocks are being read, then the
strace-analyse script may be useful:

https://trac.xapian.org/browser/git/xapian-maintainer-tools/profiling/strace-analyse

See the comments in the script for how to use it.

Cheers,
    Olly


