[Xapian-devel] Index Size comparison

Jaguar Xiong xiong.jaguar at gmail.com
Sat May 5 16:16:28 BST 2012


Here is an example of a term that differs: 'v122-8'. Lucene treats the
whole string as a single term in its index, while Xapian seems to split
the string at '-' and store 'v122' as a term. So I would guess that
splitting on '-' is why Xapian ends up with fewer terms. In my
experiment there are about 196000 documents, with an average size of
about 1.5k and a total of 287M.
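
To illustrate the guess, here is a rough Python sketch. The two
tokenizers and the sample data are made up for illustration and are not
the actual Lucene or Xapian code; they only show how splitting on '-'
can shrink the set of distinct terms when the pieces repeat.

```python
import re

def whole_token_terms(text):
    # Hypothetical tokenizer: split on whitespace only,
    # so 'v122-8' stays one term.
    return text.lower().split()

def split_terms(text):
    # Hypothetical tokenizer: split on any run of non-alphanumeric
    # characters, so 'v122-8' becomes 'v122' and '8'.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def distinct_terms(tokenize, docs):
    # Collect the set of distinct terms over all documents.
    terms = set()
    for doc in docs:
        terms.update(tokenize(doc))
    return terms

# Made-up documents: every combination of three versions and three builds.
docs = [f"v{a}-{b}" for a in (120, 121, 122) for b in (7, 8, 9)]

whole = distinct_terms(whole_token_terms, docs)
split = distinct_terms(split_terms, docs)
print(len(whole), len(split))
```

With only a few documents the split form can actually produce more
terms, so the effect depends on how often the pieces are shared.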

To reduce the size of the btree, front-coding the string keys (storing
the common prefix once) seems like a good idea. I'll see what I can do.
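
Roughly what I have in mind, as a Python sketch only. The function
names and the sample keys are my own, and a real block layout would
also need length bytes and so on, so the numbers are just indicative.

```python
import os

def encode_block(keys):
    # Store the prefix shared by every key in the block once,
    # followed by each key's remaining suffix.
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

def decode_block(prefix, suffixes):
    # Rebuild the full keys by putting the prefix back.
    return [prefix + s for s in suffixes]

# Made-up sorted keys, as they might sit together in one block.
block = ["v122", "v1220", "v1221", "v123", "v124"]

prefix, suffixes = encode_block(block)
raw_bytes = sum(len(k) for k in block)
packed_bytes = len(prefix) + sum(len(s) for s in suffixes)
print(prefix, suffixes, raw_bytes, packed_bytes)
```

Since keys in a block are sorted, neighbouring keys tend to share a
long prefix, which is where the saving would come from.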

Cheers,
Jaguar

On 2012/05/02 21:28, Olly Betts wrote:
> On Mon, Apr 23, 2012 at 10:16:51PM +0800, Jaguar Xiong wrote:
>> I did a comparison based on similar steps as in the blog
>> (zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter),
>> against lucene-3.4 and xapian-1.3.0. The overall index sizes are:
>> lucene 89M, xapian 189M (chert backend and compacted).
>> Since I'm more interested in index size, I dug a little further and
>> dumped the full term list. There are about 360000 terms in the lucene
>> index, and about 285000 terms in the xapian index.
> What are the additional terms lucene has indexed?
>
>> But surprisingly, the termlist.DB of xapian index is already 122M.
> It's surprising to hear termlist.DB is ~2/3 of the total size, as it is
> usually much less - I guess if you are indexing tweets then that's a
> lot of very small documents, and the front coding used in the termlist
> entries works better for larger documents.
>
> The termlist table stores the list of terms each document contains (and
> if you are storing any document values, also the value slots used in
> each document).
>
> This information allows Xapian to delete or update a document correctly,
> and also allows query expansion.  My understanding is that Lucene
> doesn't store this information, and handles deletion by adding the
> document id to a "deleted" list, which has to be excluded from query
> results; this also means the frequency statistics will tend to be
> increasingly inaccurate as more documents are deleted or modified.
> That's the trade-off in exchange for not having to store the termlist
> data.
>
> Xapian doesn't currently support a "deleted" list, but if you don't
> want to be able to delete or modify documents, you can just delete
> this table from your database ("rm termlist.*") and pretty much
> everything else will continue to work.  The other things which rely
> on the termlist table are listed in the ticket for this issue:
>
> http://trac.xapian.org/ticket/181
>
> If you delete the termlist, then it looks like Xapian would be ~67M vs
> Lucene's 89M.
>
>> Are there any ideas/plans for reducing the index size? I'd be glad to
>> help.
> Brass should be a little smaller than chert, but it's not going to be
> dramatic.
>
> There are a few ideas we have to reduce the size - if you're wanting to
> help work on this, here are a couple:
>
> * Posting list encodings could be more compact (probably in exchange for
>    being more expensive to update, so supporting several encodings and
>    picking the appropriate one via heuristics and/or user hints would
>    probably be best):
>
>    http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:Postinglistencodingimprovements
>
> * The Btree keys are currently stored in full each time, but within
>    almost all blocks, the keys will share a common prefix, so it would
>    reduce the space used and allow us to fit more in a block if we just
>    stored that prefix once.  This would help tables with a lot of small
>    entries especially (like the position table).
>
> Cheers,
>      Olly
>



