[Xapian-devel] Index Size comparison
Jaguar Xiong
xiong.jaguar at gmail.com
Sat May 5 16:16:28 BST 2012
Here is an example of a term that differs: 'v122-8'. Lucene treats the
whole string as a single term in its index, while Xapian seems to split
the string at '-' and store 'v122' as a term. So I would guess that
splitting at '-' leaves Xapian with fewer terms. In my experiment there
are about 196000 documents with an average size of about 1.5k, for a
total of 287M.
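To illustrate the guess above, here is a rough sketch of how splitting at
punctuation can shrink the number of distinct terms. Neither function is the
real Lucene or Xapian analyzer; both are hypothetical stand-ins, the sample
strings are made up, and keeping the part after the '-' as its own term is an
assumption.

```python
import re

def whole_token_terms(text):
    # Hypothetical analyzer: split on whitespace only, so 'v122-8' stays one term.
    return text.lower().split()

def split_on_punct_terms(text):
    # Hypothetical analyzer: any run of non-alphanumeric characters separates terms.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

# Made-up documents containing version-like strings.
docs = [
    "v122-8 v122-9 v122-10",
    "v123-8 v123-9 v123-10",
    "v124-8 v124-9 v124-10",
]

whole = {t for d in docs for t in whole_token_terms(d)}
split = {t for d in docs for t in split_on_punct_terms(d)}

print(len(whole), "distinct terms when kept whole")
print(len(split), "distinct terms when split at '-'")
```

Each whole string is unique, but the pieces repeat across strings, so the
split vocabulary ends up smaller.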
To reduce the size of the btree, front-coding of the string keys (storing
the common prefix once) seems like a good idea. I'll see what I can do.
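A minimal sketch of the idea, assuming the keys of one block are held as plain
strings: store the shared prefix once and keep only the suffixes. This is not
how chert or brass lay out blocks; the function names and sample keys are
hypothetical, and per-key length bytes are ignored.

```python
import os

def compress_block(keys):
    # Longest prefix shared by every key in the block (character-wise).
    prefix = os.path.commonprefix(keys)
    # Keep only the part of each key that follows the shared prefix.
    return prefix, [k[len(prefix):] for k in keys]

def expand_block(prefix, suffixes):
    # Rebuild the full keys by putting the shared prefix back.
    return [prefix + s for s in suffixes]

# Made-up sorted keys as they might appear in one block.
block = ["search", "searched", "searcher", "searching"]

prefix, suffixes = compress_block(block)
raw_size = sum(len(k) for k in block)
packed_size = len(prefix) + sum(len(s) for s in suffixes)

print("prefix:", prefix, "suffixes:", suffixes)
print("key bytes:", raw_size, "->", packed_size)
```

The saving depends on how long the shared prefix is relative to the keys, so
tables with many short entries per block should benefit most.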
Cheers,
Jaguar
On 2012/05/02 21:28, Olly Betts wrote:
> On Mon, Apr 23, 2012 at 10:16:51PM +0800, Jaguar Xiong wrote:
>> I did a comparison based on similar steps as in the blog
>> (zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter),
>> against lucene-3.4 and xapian-1.3.0. The overall index sizes are:
>> lucene 89M, xapian 189M (chert backend and compacted).
>> Since I'm more interested in index size, I dug a little further to dump
>> the full term list. There are about 360000 terms from lucene index, and
>> about 285000 terms from xapian index.
> What are the additional terms lucene has indexed?
>
>> But surprisingly, the termlist.DB of xapian index is already 122M.
> It's surprising to hear termlist.DB is ~2/3 of the total size, as it is
> usually much less - I guess if you are indexing tweets then that's a
> lot of very small documents, and the front coding used in the termlist
> entries works better for larger documents.
>
> The termlist table stores the list of terms each document contains (and
> if you are storing any document values, also the value slots used in
> each document).
>
> This information allows Xapian to delete or update a document correctly,
> and also allows query expansion. My understanding is that Lucene
> doesn't store this information, and handles deletion by adding the
> document id to a "deleted" list, which has to be excluded from query
> results; this also means the frequency statistics will tend to be
> increasingly inaccurate as more documents are deleted or modified.
> That's the trade-off in exchange for not having to store the termlist
> data.
>
> Xapian doesn't currently support a "deleted" list, but if you don't
> want to be able to delete or modify documents, you can just delete
> this table from your database ("rm termlist.*") and pretty much
> everything else will continue to work. The other things which rely
> on the termlist table are listed in the ticket for this issue:
>
> http://trac.xapian.org/ticket/181
>
> If you delete the termlist, then it looks like Xapian would be ~67M vs
> Lucene's 89M.
>
>> Is there some idea/plan on reducing the index size? I'd be glad if I
>> could help.
> Brass should be a little smaller than chert, but it's not going to be
> dramatic.
>
> There are a few ideas we have to reduce the size - if you're wanting to
> help work on this, here are a couple:
>
> * Posting list encodings could be more compact (probably in exchange for
> being more expensive to update, so supporting several encodings and
> picking the appropriate one via heuristics and/or user hints would
> probably be best):
>
> http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:Postinglistencodingimprovements
>
> * The Btree keys are currently stored in full each time, but within
> almost all blocks, the keys will share a common prefix, so it would
> reduce the space used and allow us to fit more in a block if we just
> stored that prefix once. This would help tables with a lot of small
> entries especially (like the position table).
>
> Cheers,
> Olly
>