[Xapian-discuss] Incremental indexing limitations

Olly Betts olly at survex.com
Fri Oct 12 02:43:34 BST 2007


On Thu, Oct 11, 2007 at 06:51:50PM +0200, Ron Kass wrote:
> 1. Size of such a database will actually be about 3K per document. That 
> is bigger than the text of the documents themselves. This is quite 
> surprising really, as what Xapian is supposed to be storing is just the 
> term IDs and the positions. Any ideas why it's bigger than the original 
> size, as opposed to, let's say, 1/3 of the size?

It's usually smaller, but it does depend on the nature of the source
data.  If you're indexing with positional information, I wouldn't expect
to achieve 1/3 the size.  It might be interesting to see the output of
"ls -l" on the database directory.

Part of the database space will actually be unused (what xapian-compact
does is to minimise this unused space).

The Btree implementation has two update modes: linear and non-linear.
In linear mode, each block is filled as full as it will go before the
next block is started, so blocks end up nearly 100% used (apart from the
last block updated in a run of linear updates).  In non-linear mode, a
block fills until an update won't fit in the remaining space in the
block where it needs to go, at which point that block is split into two
blocks, each ~50% full, so on average non-linear update leaves blocks
~75% used.  This is actually how a B-tree achieves its amortised update
cost.
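The non-linear case can be seen in a toy simulation (this is only an
illustration of the split-in-half behaviour, not Xapian's actual Btree
code, and it counts items per block rather than bytes):

```python
import bisect
import random

BLOCK_CAPACITY = 100  # max items per block (real blocks are sized in bytes)

def simulate_random_inserts(n_items, seed=1):
    """Toy model of non-linear B-tree update: keys arrive in random
    order, and a block that overflows is split into two ~half-full blocks."""
    random.seed(seed)
    blocks = [[]]                      # leaf blocks, kept in key order
    for _ in range(n_items):
        key = random.random()
        # find the block whose key range should hold this key
        i = len(blocks) - 1
        for j in range(1, len(blocks)):
            if key < blocks[j][0]:
                i = j - 1
                break
        bisect.insort(blocks[i], key)
        if len(blocks[i]) > BLOCK_CAPACITY:    # overflow: split ~50/50
            half = len(blocks[i]) // 2
            blocks[i:i + 1] = [blocks[i][:half], blocks[i][half:]]
    used = sum(len(b) for b in blocks)
    return used / (len(blocks) * BLOCK_CAPACITY)

fill = simulate_random_inserts(20000)
print(f"average block fill after non-linear update: {fill:.0%}")
```

With random keys the average fill settles somewhere between the 50%
floor just after a split and the 100% ceiling just before one.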

If you're just appending documents with add_document(), then for most
tables you'll only get linear update, and they end up pretty compact.
The exception is the postlist table - the first flush will be entirely
linear, and you may get some linear update during subsequent flushes
but there's inevitably going to be non-linear update.  If the flush
threshold is larger, you'll increase the scope for (and size of) spells
of linear updating.
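For contrast, linear update in the same toy model (again just an
illustration, not Xapian's code): keys arrive in sorted order, and a new
block is started only when the current one is completely full, so no
splitting happens and blocks end up packed:

```python
def simulate_linear_inserts(n_items, capacity=100):
    """Toy model of linear B-tree update: keys arrive in sorted order,
    so each block is filled completely before the next is started."""
    blocks = [[]]
    for key in range(n_items):         # keys arrive already sorted
        if len(blocks[-1]) == capacity:
            blocks.append([])          # start a fresh block; no split needed
        blocks[-1].append(key)
    return sum(len(b) for b in blocks) / (len(blocks) * capacity)

fill_linear = simulate_linear_inserts(20000)
print(f"average block fill after linear update: {fill_linear:.0%}")
```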

There's certainly scope for improvement here though - for example, we
currently store the document lengths in every posting list entry, but it
would save quite a bit of space to store them just once, and it's likely
to be faster overall when you consider the extra I/O and caching
overhead.
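A back-of-envelope calculation shows why this matters.  The numbers
below are hypothetical (they aren't measurements of Xapian's actual
encoding), but they illustrate the scale of the duplication:

```python
# Hypothetical figures, for illustration only.
unique_terms_per_doc = 500   # assumed average number of distinct terms
length_entry_bytes = 2       # assumed encoded size of one document length

per_doc_now  = unique_terms_per_doc * length_entry_bytes  # length in every posting
per_doc_once = length_entry_bytes                          # length stored one time
saving = per_doc_now - per_doc_once
print(f"hypothetical saving per document: {saving} bytes")
```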

I'm also working on a replacement Btree manager which reduces the
amount of disk space which the "wood" of the Btree takes.  That's
particularly an issue for the positions table which contains a lot of
items, many of which are very small.

> 2. Such an indexing process will start very fast, but when reaching a 
> database size of 2M or 3M documents, each flush will take 2-4 minutes. 
> This is already very slow for a flush of 10K small documents. Changing 
> flushes to every 1K document doesn't help.

For greater efficiency, you'll want to make the flush threshold larger,
not smaller.

> It seems the time a flush 
> takes is not related so directly to the size of the flush itself but 
> does strongly relate to the size of the database itself. How come? What 
> happens during a flush?

Most of the work is updating the inverted file.  This is done by making
a pass in sorted order over the postlist table, applying changes.
Because we make a pass over the table, this is much more efficient per
document if you use a larger flush threshold (but not so large that the
buffered changes are fighting with the OS cache of blocks from the
Xapian database).
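A toy cost model makes the amortisation visible (the numbers are
hypothetical; the point is only the shape of the curve):

```python
def flush_cost(table_blocks, batch_docs, per_doc_work=1.0):
    """Toy cost model: every flush pays one pass over the postlist
    table, plus work proportional to the buffered changes."""
    return table_blocks + per_doc_work * batch_docs

TABLE_BLOCKS = 100_000               # hypothetical postlist table size
costs = {}
for batch in (1_000, 10_000, 100_000):
    costs[batch] = flush_cost(TABLE_BLOCKS, batch) / batch
    print(f"batch {batch:>7}: cost per document ~ {costs[batch]:.1f}")
```

The fixed cost of the pass over the table dominates for small batches,
which is why flushing every 1K documents makes things worse, not better.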

Ideally this would self-tune, but at present if you want to maximise
throughput you need to set XAPIAN_FLUSH_THRESHOLD, and experiment a bit
with the value you set it to.

> 3. If the time it takes to flush 10K documents when the DB is 3M 
> documents, is 2.5 minutes, does it actually mean that when the database 
> is 100M documents, each flush will take over an hour?

It's not a linear relationship.

> So.. thinking about how to maybe optimize this process, the concept of 
> using a "live" small database for updates and then merge it into a big 
> database comes to mind.

That's certainly an approach that can work for some situations.

> However, there are two issues here:
> 1. Such a merge is slow. It will take quite a lot of time (many hours) 
> to compact/merge each such live database into a main one. If this 
> process has to be done hourly, let's say, and the process takes more than 
> an hour, we are faced with a critical problem.
> 2. Such a merge process seems to take quite a lot of resources from the 
> system, limiting CPU, I/O and memory for the more urgent task.. indexing 
> and searching.

Note that if your system has a quiet time of day (as most do) you can
perform the merge then, and still allow searching of new documents
between merges by using Xapian's ability to search over multiple
databases together.
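The merging of results across databases can be sketched in miniature
(this is only an illustration of the idea - Xapian does the equivalent
internally when you open several databases together and search them as
one; the scores and document ids below are made up):

```python
import heapq

# Each "database" contributes results already sorted by descending score;
# we merge them into one ranked list.
big_db_results  = [(0.9, "doc12"), (0.5, "doc3"), (0.1, "doc7")]
live_db_results = [(0.8, "new2"), (0.4, "new5")]

merged = list(heapq.merge(big_db_results, live_db_results,
                          key=lambda r: -r[0]))   # highest score first
print(merged[:3])
```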

> 3. It also means that we can never use more than 50% of the diskspace on 
> the server. In fact less than 40% or 45% to be safe. This is because the 
> compact process is merging the big database with the small one, into a 
> new database. So the new database will be bigger than the original big 
> one. So just because of this process, the server diskspace can not be 
> utilized effectively.

Disk is pretty cheap these days though (until you run out of drive
bays!).  But you're right that this is a drawback of the approach,
although it can be useful to have that sort of amount of spare space for
other reasons - for example, you're a bit stuck without it if you want
to change how you index and need to rebuild the index from scratch while
still allowing searches of the existing database while the rebuild
happens.

Cheers,
    Olly


