[Xapian-discuss] Incremental indexing limitations

Ron Kass ron at pidgintech.com
Thu Oct 11 17:51:50 BST 2007


Assume we have a server with 2GB of memory and a 500GB disk.
We want to use it to index a constantly updating database of documents,
and let's say each document is 2KB of text.

We have a process that constantly indexes data into a Xapian database.
This process flushes updates every 10K documents (to make sure they are
searchable and successfully stored) and, after such a flush, marks the
documents as indexed.
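For reference, the batching pattern looks roughly like this (a minimal Python sketch; the `flush` callable stands in for xapian's WritableDatabase flush, and the commented-out helpers stand in for our own database layer, which isn't shown here):

```python
# Sketch of our flush-every-N indexing loop. `flush` is a stand-in for
# the Xapian database flush; index_document()/mark_indexed() would be
# our own (hypothetical) helpers in the real process.

FLUSH_EVERY = 10_000

def index_batches(documents, flush):
    """Index documents, flushing every FLUSH_EVERY documents so that
    committed documents can safely be marked as indexed afterwards."""
    pending = []
    for doc in documents:
        pending.append(doc)          # index_document(doc) in the real loop
        if len(pending) >= FLUSH_EVERY:
            flush()                  # commit: these docs are now searchable
            # mark_indexed(pending) in the real loop
            pending.clear()
    if pending:                      # commit the final partial batch
        flush()
        pending.clear()
```

The point of marking documents only after the flush is that a crash between indexing and flushing just causes those documents to be re-indexed, not lost.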

A few observations so far:
1. The size of such a database is actually about 3KB per document. That
is bigger than the text of the documents themselves, which is quite
surprising, since what Xapian is supposed to be storing is just the
term IDs and positions. Any ideas why it's bigger than the original
text rather than, say, a third of its size?
2. Such an indexing process starts very fast, but once the database
reaches 2M or 3M documents, each flush takes 2-4 minutes. That is
already very slow for a flush of 10K small documents, and flushing
every 1K documents doesn't help. It seems the time a flush takes is
not directly related to the size of the flush itself but strongly
related to the size of the database. Why is that? What happens during
a flush?
3. If it takes 2.5 minutes to flush 10K documents when the database
holds 3M documents, does that mean each flush will take over an hour
once the database reaches 100M documents? If so, that is extremely
"painful", isn't it?
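To make question 3 concrete, here is the extrapolation I'm doing. It assumes flush time grows roughly linearly with database size, which is itself only a guess based on observation 2:

```python
# Back-of-the-envelope extrapolation of flush time, ASSUMING flush time
# scales linearly with database size (unverified; see question 2).
minutes_per_flush_now = 2.5      # measured: 10K-doc flush at 3M docs
docs_now = 3_000_000
docs_target = 100_000_000

minutes_per_flush_target = minutes_per_flush_now * docs_target / docs_now
# ~83 minutes per 10K-document flush at 100M documents
```

If that scaling holds even approximately, steady incremental indexing becomes impractical long before 100M documents.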

So, thinking about how we might optimize this process, the idea of
using a small "live" database for updates and then merging it into a
big database comes to mind.
However, there are three issues here:
1. Such a merge is slow. It will take quite a lot of time (many hours)
to compact/merge each such live database into the main one. If this
process has to run hourly, say, and takes more than an hour, we have a
critical problem.
2. The merge process seems to take quite a lot of resources from the
system, starving the more urgent tasks, indexing and searching, of CPU,
I/O and memory.
3. It also means we can never use more than 50% of the disk space on
the server; in fact less than 40-45% to be safe. This is because the
compact process merges the big database and the small one into a new
database, so the new database will be bigger than the original big one.
Just because of this process, the server's disk space cannot be used
effectively.


Any thoughts and insights about the matter are greatly appreciated.
Ron


