[Xapian-discuss] The position list table and index size (compared to Lucene)

Olly Betts olly at survex.com
Sun Nov 8 10:40:14 GMT 2009


On Fri, Nov 06, 2009 at 05:38:44PM +0100, Paul Boddie wrote:
> One issue which arose from a migration from Lucene to Xapian was the
> approximately four-fold increase in index size when moving to Xapian;

Four-fold sounds extreme - are you sure you are comparing like with like?

For example, you need to index your data in the same way with the two systems,
store the same literal data in each, and compare optimised/compacted sizes with
each other (or unoptimised/uncompacted sizes with each other), not one of each.

> this appears to be a product of the term position information, and I
> have also seen similar reports of index "growth" mentioned in the
> mailing list archives. For example:
>
> http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html
> (There are probably better examples than this, but I can't locate more
> relevant documents at the moment.)
>
> Are there any simple explanations for the large differences in index and
> position list table size?

Well, the article you just linked to discusses one big factor - the termlist
table.  It seems silly to repeat myself - just read what I wrote before - but
to update what's said there a little, the termlist table is now optional in
trunk, though "imperfect" deletion isn't yet supported.

As it also mentions, the chert backend in 1.1.x is more compact than the flint
backend which is the default in 1.0.x.

> I notice that the documentation, specifically
> this Wiki page...
>
> http://trac.xapian.org/wiki/FlintPositionListTable
>
> ...mentions "tname" in the "key format", which I presume means "term
> name"

Yes, "tname" is "term name".

> in a B-tree I imagine that even storing the term name for each key would
> still only result in the overhead in doing so being proportional to the
> document frequency of each term

It should be roughly proportional to the number of documents each term appears
in.  The most common words in most languages tend to be the shortest, but it
would probably save a significant amount of space to have a lexicon which
mapped the term name to a unique integer which could be used in the keys to the
position table, and in a few other places.  I've not yet tried this though.
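
A minimal sketch of the lexicon idea, for illustration only.  The key layout
here (a variable-length docid followed by either the term name or a term id)
is an assumption made for the example and is not the actual flint or chert
key format; Lexicon, key_with_name and key_with_id are hypothetical names.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


class Lexicon:
    """Hypothetical lexicon: maps each term name to a small unique integer."""

    def __init__(self):
        self._ids = {}

    def term_id(self, term):
        # The first term seen gets id 0, the next id 1, and so on.
        return self._ids.setdefault(term, len(self._ids))


def key_with_name(docid, term):
    # Assumed layout: docid followed by the full term name.
    return encode_varint(docid) + term.encode("utf-8")


def key_with_id(docid, term, lexicon):
    # Assumed layout: docid followed by the term's integer id.
    return encode_varint(docid) + encode_varint(lexicon.term_id(term))


# Made-up (docid, term) pairs standing in for position table entries.
postings = [(1, "indexing"), (1, "position"), (2, "indexing"), (3, "position")]
lexicon = Lexicon()
name_bytes = sum(len(key_with_name(d, t)) for d, t in postings)
id_bytes = sum(len(key_with_id(d, t, lexicon)) for d, t in postings)
print(name_bytes, id_bytes)
```

The lexicon itself also has to be stored somewhere, so the real saving would
be smaller than the difference in key bytes alone suggests.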

If you measure how much of the position table is actually the positional data,
the percentage is surprisingly low.  I got 14% for gmane.  That will certainly
vary, but there something like 86% is a mixture of keys, Btree structure,
and unused space.  We can't eliminate all of that (we do need to somehow store
what each stream of positional data is for), but it shouldn't be hard to
make a dramatic difference here.  This is something I'm likely to be working
on later this month.
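
As a rough worked example of what those figures imply: the 14% payload figure
is the gmane measurement above, while the 50% cut in overhead is purely a
made-up illustrative number, not a measured or promised result.

```python
def shrunk_fraction(payload_fraction, overhead_reduction):
    """Return the new table size as a fraction of the old size.

    payload_fraction: share of the table that is actual positional data.
    overhead_reduction: share of the non-payload bytes that is removed.
    """
    overhead_fraction = 1.0 - payload_fraction
    return payload_fraction + overhead_fraction * (1.0 - overhead_reduction)


# 14% payload (gmane), hypothetically halving keys/Btree/unused space.
new_size = shrunk_fraction(0.14, 0.5)
print(round(new_size, 2))
```

Because the payload is such a small share, nearly all of any reduction in
overhead shows up directly as a reduction in table size.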

> Is there anything I can configure or change to affect the index size in
> Xapian, or is there anything I should be more aware of?

Use 1.1.x with chert (set XAPIAN_PREFER_CHERT=1 in the environment until we
make it the default), provided you aren't scared of potentially having to
rebuild your application and index when you upgrade to a new release.
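
A sketch of setting that variable for just the indexing process rather than
the whole shell session.  The child command here is a stand-in that only
echoes the variable back; in practice it would be whatever indexer you run,
and chert_env is a hypothetical helper name.

```python
import os
import subprocess
import sys


def chert_env(base=None):
    """Return a copy of the environment with XAPIAN_PREFER_CHERT set to 1."""
    env = dict(os.environ if base is None else base)
    env["XAPIAN_PREFER_CHERT"] = "1"
    return env


# Stand-in for the real indexer: a child process that reports the variable.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('XAPIAN_PREFER_CHERT'))",
]
result = subprocess.run(
    child, env=chert_env(), capture_output=True, text=True, check=True
)
seen = result.stdout.strip()
print(seen)
```

The variable has to be set in the environment of the process that creates the
database, which is why it is passed to the child rather than set afterwards.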

If you don't need to be able to delete or replace documents, then with trunk
(and 1.1.3 once we release it) you can "rm termlist*" in the database directory
after you create it and you'll have a termlist-less database (this is rather
new, and will gain an API flag once it's had any kinks knocked out).
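
The same "rm termlist*" step sketched in Python, run here against a throwaway
directory.  The file names created below are stand-ins for the example, and
remove_termlist_files is a hypothetical helper; only do this to a real
database if you never need to delete or replace documents.

```python
import os
import tempfile


def remove_termlist_files(db_dir):
    """Delete every file in db_dir whose name starts with "termlist"."""
    removed = []
    for name in sorted(os.listdir(db_dir)):
        if name.startswith("termlist"):
            os.remove(os.path.join(db_dir, name))
            removed.append(name)
    return removed


# Demonstrate on a temporary directory filled with empty placeholder files.
with tempfile.TemporaryDirectory() as db_dir:
    for name in ("termlist.DB", "termlist.baseA", "postlist.DB", "position.DB"):
        open(os.path.join(db_dir, name), "w").close()
    removed = remove_termlist_files(db_dir)
    remaining = sorted(os.listdir(db_dir))

print(removed, remaining)
```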

Cheers,
    Olly
