[Xapian-discuss] The position list table and index size (compared to Lucene)
Paul Boddie
paul.boddie at biotek.uio.no
Fri Nov 6 16:38:44 GMT 2009
Hello,
I have been using Xapian (and Python) in a biomedical literature search
application for several months, having previously used PyLucene (which
I've been moving away from because of the awkward build constraints).
One issue that arose from the migration from Lucene to Xapian was an
approximately four-fold increase in index size; this appears to be a
product of the term position information, and I have also seen similar
reports of index "growth" in the mailing list archives. For example:
http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html
(There are probably better examples than this, but I can't locate more
relevant documents at the moment.)
Are there any simple explanations for the large differences in index and
position list table size? I notice that the documentation, specifically
this Wiki page...
http://trac.xapian.org/wiki/FlintPositionListTable
...mentions "tname" in the "key format", which I presume means "term
name". In a B-tree, though, I imagine that even storing the term name in
each key would only result in an overhead proportional to the document
frequency of each term, not to the actual "occurrence" frequency. That
said, this probably wouldn't be a trivial amount for a collection of
tens of millions of short documents, which is the kind of collection
I've built.
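As a rough illustration of the distinction between per-document and
per-occurrence overhead, here is a minimal Python sketch that counts
both for a toy collection. It assumes one key per (document, term) pair
made up of a fixed-size document id plus the term name; the function
name, the 4-byte id size and the whitespace tokenisation are all
hypothetical choices for illustration, not the actual Flint layout.

```python
from collections import Counter


def position_table_stats(docs, docid_bytes=4):
    """Return (number of keys, total key bytes, total term occurrences).

    Assumption: one key per (document, term) pair, each key holding a
    fixed-size document id followed by the UTF-8 encoded term name.
    """
    n_keys = 0
    key_bytes = 0
    occurrences = 0
    for text in docs:
        # Naive whitespace tokenisation, purely for illustration.
        counts = Counter(text.lower().split())
        # One key per distinct term in this document.
        n_keys += len(counts)
        key_bytes += sum(docid_bytes + len(term.encode("utf-8"))
                         for term in counts)
        # Every occurrence contributes one position entry.
        occurrences += sum(counts.values())
    return n_keys, key_bytes, occurrences


docs = ["protein binds protein", "gene encodes protein"]
print(position_table_stats(docs))
```

For these two toy documents there are 5 keys (50 key bytes under the
assumed layout) against 6 term occurrences. With short documents, where
most terms occur only once per document, the number of keys approaches
the number of occurrences, so the key overhead is far from negligible.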
I must confess to not having studied the Xapian source code thoroughly
at this point. While trying to formulate some kind of explanation, I
have also consulted the Lucene index format documentation:
http://lucene.apache.org/java/2_9_0/fileformats.html
However, it appears to me that Lucene doesn't use a standard B-tree
structure, anyway. I've also written a very simple indexer in Python
using a highly restrictive index format similar to (but even simpler
than) that employed by Lucene, and I can reproduce a significant
reduction in index size, although I'm obviously not going to make any
performance claims for this solution. Is there anything I can configure
or change to affect the index size in Xapian, or is there anything I
should be more aware of?
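To give an idea of the kind of compact encoding involved, here is a
minimal sketch of delta-encoded positions stored as variable-length
integers, in the spirit of the VInt type described in the Lucene file
format documentation. It is not the indexer mentioned above; the
function names are hypothetical, and positions are assumed to be
non-negative and sorted in ascending order.

```python
def encode_vint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first.

    The high bit of each byte is set when more bytes follow.
    """
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_positions(positions):
    """Encode ascending positions as variable-length deltas."""
    out = bytearray()
    prev = 0
    for p in positions:
        out += encode_vint(p - prev)  # store the gap, not the position
        prev = p
    return bytes(out)


def decode_positions(data):
    """Inverse of encode_positions."""
    positions = []
    value = shift = prev = 0
    for b in data:
        value |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7          # continuation bit set: more bytes follow
        else:
            prev += value       # gap complete: rebuild absolute position
            positions.append(prev)
            value = shift = 0
    return positions


encoded = encode_positions([3, 10, 300, 301])
print(len(encoded), decode_positions(encoded))
```

With a scheme like this, most gaps in a short document fit in a single
byte, and no per-entry key is stored alongside them.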
Regards,
Paul