[Xapian-discuss] Incremental indexing limitations

Olly Betts olly at survex.com
Thu Oct 18 00:37:07 BST 2007


On Sun, Oct 14, 2007 at 03:12:19PM +0200, Ron Kass wrote:
> Our main concern right now is search speed. Reading some exchanges here 
> about people who deal with >minute searches, that would be an issue. 

I guess you must mean the "xapian performance" thread from last
November:

http://thread.gmane.org/gmane.comp.search.xapian.general/3542

The result of that thread was that we now handle OP_PHRASE/OP_NEAR with
three or more operands more efficiently, and that we now pack and unpack
positional data much more efficiently (138 times faster, though that's
a microbenchmark figure).

A search taking more than a minute is truly exceptional, and the likely
explanation is either that the hardware is being overwhelmed, or that
there's a bottleneck somewhere in the code.

If you think you've found a bottleneck in the code and you're on Linux,
running it under oprofile and producing a callgraph will usually show
where the problem is, and generally it's easy to make dramatic
improvements in such cases.

Any decent profiling tool will give some idea, but oprofile looks at
the whole system, so we can see time spent in the kernel and on I/O.
It also has low overhead.

I ought to write up an "oprofile how to" for the wiki, but meanwhile I
believe a 2.6 kernel is needed for best results - just run your indexer
under oprofile with the callgraph enabled (call opcontrol with
--callgraph=12 or some suitable stack depth, then run opreport with
--callgraph).  The reports are big, so please don't post them to the
list as attachments.  The best thing to do is open a bug report so they
don't get lost.
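
For the record, a typical session looks roughly like this (assuming
oprofile is already initialised, e.g. with opcontrol --no-vmlinux; the
indexer name is just a placeholder for however you run yours):

    opcontrol --callgraph=12    # or some other suitable stack depth
    opcontrol --start
    ./your-indexer              # placeholder for your indexing run
    opcontrol --stop
    opreport --callgraph > profile-report.txt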

Incidentally, the next release will speed up three or more operand
OP_AND, OP_PHRASE, and OP_NEAR.  Tests on Wikipedia data with real
queries show a 16-17% reduction in time for OP_AND.  Searches from
tweakers.net's "slow search" log were sped up by about 8% on average.
Also, queries with a ValueRangeProcessor filter are now about 3 times
faster.  And I'm currently testing another patch which should improve
OP_FILTER performance when used on an OP_AND subexpression.

> Now, of course you can throw more RAM on the thing and get better 
> results, but thats not really the right criteria for speed. What is 
> important here is to get better speed than alternative with the SAME 
> hardware. This is the winning card here.

There is an absolute criterion too - whether the performance of any
alternative is good enough on the hardware that you can afford for a
particular application.

> Now, for this we have two ideas we are happy to share and hear feedback on..
> 1. Auto warm-up: In many discussions it is stated that a warmed up 
> database is faster. Of course it is. One nice thing would be to have a 
> warmup (or auto-warmup) mechanism that will automatically load 
> critical/popular parts of the BTREEs into memory even before first 
> searches are executed. It is better to let the machine do that than 
> letting the first users do that and see slow searches for a while.

It's pretty simple to do this anyway - just pick out some of the top
queries from your search logs and run them.
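
As a sketch in C++ (the database path and query strings are made-up
placeholders - substitute your own database and the most popular
queries from your search logs):

    #include <xapian.h>
    #include <iostream>

    int main() {
        try {
            // Placeholder path - use your real database directory.
            Xapian::Database db("/srv/xapian/mydb");

            // Placeholder queries - use the top entries from your logs.
            const char * popular[] = { "foo bar", "example query", "baz" };

            Xapian::QueryParser parser;
            parser.set_database(db);

            for (size_t i = 0; i < sizeof(popular) / sizeof(popular[0]); ++i) {
                Xapian::Enquire enquire(db);
                enquire.set_query(parser.parse_query(popular[i]));
                // Fetching an MSet reads the relevant Btree blocks,
                // leaving them in the OS cache for real users to hit.
                Xapian::MSet mset = enquire.get_mset(0, 10);
                std::cout << popular[i] << ": " << mset.size() << " hits\n";
            }
        } catch (const Xapian::Error & e) {
            std::cerr << e.get_description() << std::endl;
            return 1;
        }
    }

Build with:

    g++ warmup.cc -o warmup `xapian-config --cxxflags --libs`

Run it from cron after a reboot (or after bringing a new database
online) and the first real searches should hit a warm cache.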

I'm not sure we could do better than that.  Having to track the popular
parts of the Btrees would incur overhead on every search, and would
require searches to be able to write to the database directory.  The
other approach would be to just read the skeleton of each Btree, but for
a large database that may not all fit in the cache, so we'd just end up
flushing the blocks read at the start of the "warm up" out of the cache
again.  To make it work I think you'd need to look at how much cache
space there was and only read as much of each Btree as would fit.

> 2. Graceful Time-out on searches: To allow a search to run a MAXIMUM of 
> X seconds, and then return whatever results it has EVEN if not 
> complete/perfect. In many applications, it is better to show a partial 
> result after 5 seconds than a complete/accurate one after 50. Users 
> don't wait 50 seconds for a result. (most don't wait even 5, so maybe a 
> timeout of 0.5 should be used by some. Anyway, it should be a parameter per 
> query).

The problem is that this messes up paging through results, because the
next query will probably time out differently (or not at all, given
we'll probably have most of the blocks we need cached).
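
To see why, here's roughly how paging works with the C++ API (a
sketch; the per-query timeout proposed above doesn't exist, so imagine
each call could stop matching early at a different point - db and
query as in the warm-up sketch earlier):

    // Page 1: results 0-9.
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    Xapian::MSet page1 = enquire.get_mset(0, 10);

    // Page 2: results 10-19.  This reruns the match.  If a timeout
    // had cut the first run short but not this one (the blocks it
    // needs are now cached), the two runs would rank documents
    // differently, so some results could repeat on page 2 and others
    // never be shown at all.
    Xapian::MSet page2 = enquire.get_mset(10, 10);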

> 3. I know that Xapian's model is to trust the OS's cache. But, was this 
> assumption tested to prove that it really is logical? I am sure it saves 
> effort, but I am just wondering at what expense. If indeed the speed 
> difference is marginal, then it's a smart choice. But it's a question still.

All the evidence I've seen suggests it's actually faster than trying to
manage caching yourself (and under Linux at least, you may find that the
VM system sometimes pages out little-used parts of your cache in favour
of its own!)

I've not tried to implement such a cache for Xapian though, so I can
only go on indirect evidence (e.g. comparing with Xapian's proprietary
precursor, which did have a block cache).

It's not just saved effort either - if you maintain your own cache, you
need something persistent (a daemon, say) to manage it, and some way for
searchers to get blocks from it.  The lack of a "xapian daemon" makes
Xapian easier to use in many situations.

But if you (or anyone else) thinks a block cache would be faster, feel
free to hack up a prototype.

Cheers,
    Olly


