[Xapian-discuss] xapian indexing size?

Thu May 5 19:14:19 BST 2005

On Thu, May 05, 2005 at 01:39:20PM -0400, John Paige wrote:
>    I am evaluating Xapian for use in our product. I just downloaded the
> core and examples code from the website.
> I'm puzzled about one thing though: when I used the test program
> "simpleIndexer", I found that the index size is four times the
> size of the corpus.

I guess you mean "simpleindex" - that splits the input file into
paragraphs, and indexes each paragraph by the terms in it, storing
the whole paragraph as the document data.  Currently document data
is stored uncompressed (I have patches to use zlib which I'll be
integrating soon), so the size of an index built by simpleindex will
inevitably be bigger than the text indexed, because it *contains* the
entire text indexed in uncompressed form.
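
To illustrate the point, here is a rough Python sketch.  It is not
Xapian code: the names split_paragraphs and data_sizes are made up for
this example, and splitting on blank lines is an assumption about the
input.  It compares the raw size of the stored paragraphs with their
size after zlib compression:

```python
import zlib


def split_paragraphs(text):
    """Split text into paragraphs on blank lines (assumed input convention)."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


def data_sizes(paragraphs):
    """Return (raw_bytes, compressed_bytes) summed over all paragraphs."""
    raw = 0
    compressed = 0
    for para in paragraphs:
        blob = para.encode("utf-8")
        raw += len(blob)
        # Each paragraph is compressed on its own, as a per-document
        # store would have to do.
        compressed += len(zlib.compress(blob))
    return raw, compressed
```

The raw total is a lower bound on what storing every paragraph
uncompressed adds to an index, before any term data is counted.  Very
short paragraphs may not shrink at all under zlib, since each
compressed blob carries a few bytes of overhead.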

Typically the document data is used to store a URL or UID for a
database, a document title, and a sample of text from the document.
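
A minimal sketch of that kind of compact record, in Python.  The
function name make_document_data, the "url=/title=/sample=" layout and
the 200 character default are all assumptions made for illustration,
not anything Xapian requires; document data is just an opaque string
to the library:

```python
def make_document_data(url, title, text, sample_chars=200):
    """Build a small document-data string instead of storing the full text."""
    # Collapse runs of whitespace, then keep only the first few characters.
    sample = " ".join(text.split())[:sample_chars]
    return "url={}\ntitle={}\nsample={}".format(url, title, sample)
```

With a record like this the stored data stays at a few hundred bytes
per document however long the original text is.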

> I indexed 4MB worth of text files, and the index was 16MB,
> and even after compaction, it still consumed 10MB.  When I added
> an additional 4MB of text files, the original index went to 32MB.

It does seem larger than I'd expect.  There's scope for reducing the
size of Xapian databases (this will improve in the coming months), but
even so that sounds excessively large.

The output of "ls -l" on the index directory before and after compaction
might be interesting.  Can you post that?
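
If "ls -l" isn't handy, something like this Python sketch gives the
same per-file sizes.  The function name file_sizes is made up for this
example, and it assumes the index is a flat directory of files:

```python
import os


def file_sizes(directory):
    """Map each regular file in `directory` to its size in bytes."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes
```

Running it on the index directory before and after compaction, and
comparing the two results, shows which files account for the space.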

> The index size is four times the size of the corpus, which doesn't
> seem right. Am I doing something wrong?

Using simpleindex, perhaps.  It's really meant to show what the code for
a Xapian indexer looks like without too much non-Xapian-related
complication.

Are you just experimenting, or trying to build an actual system?

Cheers,
    Olly