[Xapian-discuss] The position list table and index size (compared to Lucene)

Mon Nov 9 12:41:19 GMT 2009

Olly Betts wrote:
> On Fri, Nov 06, 2009 at 05:38:44PM +0100, Paul Boddie wrote:
>   
>> One issue which arose from a migration from Lucene to Xapian was the
>> approximately four-fold increase in index size when moving to Xapian;
>>     
>
> Four-fold sounds extreme - are you sure you are comparing like with like?
>
> For example, indexing your data in the same way with the two systems, storing
> the same literal data in each, and comparing optimised/compacted sizes or
> unoptimised/uncompacted sizes, not one of each.
>   

I'm fairly sure that it's a like-for-like comparison for both aggressive 
tokenisation (lots of tokens) and more conventional tokenisation. I 
always optimise/compact each index I create because this has shown a 
substantial effect on search performance, especially with Lucene. The 
typical index size is around 390M for Xapian, but only around 100M for 
Lucene. (I describe the internal table sizes below.)

The "pathological" aspect of the indexes I've created is that few tokens 
are discarded, which obviously bloats each index and is likely to 
produce lots of position data. I know that this is a "bad thing" where 
text indexing solutions are concerned, but the domain I'm working in 
doesn't easily permit me to discard tokens which look like junk. 
(Actually, there are approaches which involve identifying regions of 
"interesting" text using machine learning approaches, just indexing 
those regions, and then throwing everything else away, but we have 
chosen not to follow that approach.)

>> this appears to be a product of the term position information, and I
>> have also seen similar reports of index "growth" mentioned in the
>> mailing list archives. For example:
>>
>> http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html
>> (There are probably better examples than this, but I can't locate more
>> relevant documents at the moment.)
>>
>> Are there any simple explanations for the large differences in index and
>> position list table size?
>>     
>
> Well, the article you just linked to discusses one big factor - the termlist
> table.  It seems silly to repeat myself - just read what I wrote before - but
> to update what's said there a little, this is now optional in trunk, though
> "imperfect" deletion isn't yet supported.
>
> As it also mentions, the chert backend in 1.1.x is more compact than the flint
> backend which is the default in 1.0.x.
>   

I did try the 1.1.x branch a few weeks ago, but the savings appeared to 
be negligible for my particular case. As for the term list table, the 
possibility of deleting it would be applicable for the indexes I create 
since they are never updated - new versions of documents are 
stored/indexed in new, separate index versions - but I'm not too sure 
that this would dramatically decrease the overall index sizes, however, 
since in the typical case the termlist.DB file is around 50M, whereas 
the position.DB file is around 230M. (The postlist.DB file is around 
50M, the record.DB file is around 3M, and the value.DB file is around 40M.)

>> I notice that the documentation, specifically
>> this Wiki page...
>>
>> http://trac.xapian.org/wiki/FlintPositionListTable
>>
>> ...mentions "tname" in the "key format", which I presume means "term
>> name"
>>     
>
> Yes, "tname" is "term name".
>
>   
>> in a B-tree I imagine that even storing the term name for each key would
>> still only result in the overhead in doing so being proportional to the
>> document frequency of each term
>>     
>
> It should be roughly proportional to the number of documents each term appears
> in.  The most common words in most languages tend to be the shortest, but it
> would probably save a significant amount of space to have a lexicon which
> mapped the term name to a unique integer which could be used in the keys to the
> position table, and in a few other places.  I've not yet tried this though.
>   

I imagine that Xapian's index structure really is quite different from 
Lucene's, then, since Lucene seems to use a fairly flat index structure 
and manages to employ term name compression by encoding the length of 
the common prefix from one term to the next.

> If you measure how much of the position table is actually the positional data
> the percentage is surprisingly low.  I got 14% for gmane, though that will
> certainly vary, but something like 86% is a mixture of keys, Btree structure,
> and unused space.  We can't eliminate all of that (we do need to somehow store
> what each stream of positional data is for), but it shouldn't be hard to
> make a dramatic difference here.  This is something I'm likely to be working
> on later this month.
>   

But how much positional data does GMane index now? My problem may well 
be related to the excess of positional data which most people would try 
and eliminate simply by reducing the number of indexed tokens. I'm 
prepared to accept this explanation, but it still leaves that awkward 
issue of how Lucene manages to handle similar volumes of positional data.

>> Is there anything I can configure or change to affect the index size in
>> Xapian, or is there anything I should be more aware of?
>>     
>
> Using 1.1.x and chert (set XAPIAN_PREFER_CHERT=1 in the environment until we
> make it the default) if you aren't scared of potentially having to rebuild
> your application and index when you upgrade to a new release.
>   

I'm not scared of rebuilding the index from time to time, although it 
does require a while. That's what nights and weekends are for. :-)

> If you don't need to be able to delete or replace documents, then with trunk
> (and 1.1.3 once we release it) you can "rm termlist*" in the database directory
> after you create it and you'll have a termlist-less database (this is rather
> new, and will gain an API flag once it's had any kinks knocked out).
>   

This would be very interesting for my use-case, I think, and sounds like 
the kind of configuration setting I was thinking of.

Thank you for the guidance and advice so far!

Paul