[Xapian-discuss] The position list table and index size (compared to Lucene)

Paul Boddie paul.boddie at biotek.uio.no
Fri Nov 6 16:38:44 GMT 2009


Hello,

I have been using Xapian (and Python) in a biomedical literature search 
application for several months, having previously used PyLucene (which 
I've been moving away from because of the awkward build constraints). 
One issue that arose from the migration was an approximately four-fold 
increase in index size; this appears to be a product of the term 
position information, and I have also seen similar reports of index 
"growth" in the mailing list archives. For example:

http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html
(There are probably better examples than this, but I can't locate more 
relevant documents at the moment.)

Are there any simple explanations for the large differences in index and 
position list table size? I notice that the documentation, specifically 
this Wiki page...

http://trac.xapian.org/wiki/FlintPositionListTable

...mentions "tname" in the "key format", which I presume means "term 
name". In a B-tree, though, I imagine that storing the term name in 
each key would only produce an overhead proportional to the document 
frequency of each term, not to the actual "occurrence" frequency. That 
said, this probably wouldn't be a trivial amount for a collection of 
tens of millions of short documents, which is the kind of collection 
I've built.
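
As a rough illustration of that reasoning only: the sketch below assumes 
one key per (term, document) pair, made up of the term name plus a fixed 
number of bytes for the document id. That layout, the per_key_bytes 
figure and the function names are assumptions made for the example, not 
something taken from the Xapian source.

```python
def key_overhead_bytes(postings, per_key_bytes=4):
    """Estimate the bytes spent on keys alone.

    postings maps term -> {docid: [positions]}.
    per_key_bytes is an assumed fixed cost for the docid part of a key.
    """
    total = 0
    for term, docs in postings.items():
        # One key per (term, document) pair, however many positions
        # that document holds for the term.
        total += len(docs) * (len(term.encode("utf-8")) + per_key_bytes)
    return total


def occurrence_count(postings):
    """Total number of term occurrences (positions) in the collection."""
    return sum(len(positions)
               for docs in postings.values()
               for positions in docs.values())


postings = {
    "gene": {1: [0, 7, 19], 2: [4]},
    "protein": {1: [3], 3: [1, 2]},
}

print(key_overhead_bytes(postings), occurrence_count(postings))
```

Adding further positions to an existing (term, document) pair raises the 
occurrence count but leaves the estimated key overhead unchanged, which 
is the point being made above. With many short documents, however, the 
number of pairs approaches the number of occurrences, so the two figures 
converge.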

I must confess to not having studied the Xapian source code thoroughly 
at this point. While trying to formulate some kind of explanation, I 
have also consulted the Lucene index format documentation:

http://lucene.apache.org/java/2_9_0/fileformats.html

However, it appears to me that Lucene doesn't use a standard B-tree 
structure, anyway. I've also written a very simple indexer in Python 
using a highly restrictive index format similar to (but even simpler 
than) that employed by Lucene, and I can reproduce a significant 
reduction in index size, although I'm obviously not going to make any 
performance claims for this solution. Is there anything I can configure 
or change to affect the index size in Xapian, or is there anything I 
should be more aware of?
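
To give an idea of the kind of compact encoding meant here, the following 
is a minimal sketch, not the indexer mentioned above: positions are 
stored as differences from the previous position, each written as a 
variable-length integer, similar in spirit to the VInt encoding 
described in the Lucene file format documentation. All function names 
are made up for the example.

```python
def encode_vint(n):
    """Encode a non-negative integer using 7 bits per byte,
    low-order group first; the high bit marks a continuation."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_positions(positions):
    """Delta-encode a list of term positions into a byte string."""
    out = bytearray()
    previous = 0
    for p in sorted(positions):
        out += encode_vint(p - previous)
        previous = p
    return bytes(out)


def decode_positions(data):
    """Reverse encode_positions, returning the sorted positions."""
    positions = []
    value = shift = previous = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes follow for this integer
        else:
            previous += value   # undo the delta
            positions.append(previous)
            value = shift = 0
    return positions


encoded = encode_positions([3, 130, 131, 20000])
print(len(encoded), decode_positions(encoded))
```

Small gaps between successive positions fit in a single byte each, so 
short documents cost roughly one byte per occurrence under this scheme, 
with no per-entry key at all.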

Regards,

Paul


