[Xapian-discuss] xapian indexing size?

John Paige paige.john at gmail.com
Fri May 6 02:05:39 BST 2005


On 5/5/05, Olly Betts <olly at survex.com> wrote:
> On Thu, May 05, 2005 at 01:39:20PM -0400, John Paige wrote:
> >    I am evaluating Xapian for use in our product. I just downloaded the
> > core and examples code from the website.
> > I'm puzzled about one thing, though: when I used the test program
> > "simpleIndexer", I found that the index size is four times the
> > size of the corpus.
> 
> I guess you mean "simpleindex" - that splits the input file into
> paragraphs, and indexes each paragraph by the terms in it, storing
> the whole paragraph as the document data.  Currently document data
> is stored uncompressed (I have patches to use zlib I'll be integrating
> soon), so the size of an index built by simpleindex will
> inevitably be bigger than the text indexed, because it *contains* the
> entire text indexed in uncompressed form.
> 
> Typically the document data is used to store a URL or UID for a
> database, a document title, and a sample of text from the document.
> 
> > I indexed 4MB worth of text files, and the resulting index was 16MB;
> > even after compaction, it still consumed 10MB.  When I added an
> > additional 4MB of text files, the original index grew to 32MB.
> 
> It does seem larger than I'd expect.  There's scope for reducing the
> size of Xapian databases (this will improve in the coming months), but
> even so that sounds excessively large.
> 
> The output of "ls -l" on the index directory before and after compaction
> might be interesting.  Can you post that?

Here are the snapshots.
I indexed the files in the directory below:
:~/text_files> du -sk .
4248    .
Here is the index directory after running "simpleindex":
~/xapian/NEW> ll
total 37536
drwxr-x---   2 ja  code        4096 May  5 20:50 ./
drwxr-x---   7 ja  code        4096 May  5 20:50 ../
-rw-r-----   1 ja  code          10 May  5 20:50 meta
-rw-r-----   1 ja  code     6258688 May  5 20:50 position_DB
-rw-r-----   1 ja  code         113 May  5 20:50 position_baseA
-rw-r-----   1 ja  code         112 May  5 20:50 position_baseB
-rw-r-----   1 ja  code     2269184 May  5 20:50 postlist_DB
-rw-r-----   1 ja  code          50 May  5 20:50 postlist_baseA
-rw-r-----   1 ja  code          50 May  5 20:50 postlist_baseB
-rw-r-----   1 ja  code     7675904 May  5 20:50 record_DB
-rw-r-----   1 ja  code         133 May  5 20:50 record_baseA
-rw-r-----   1 ja  code         132 May  5 20:50 record_baseB
-rw-r-----   1 ja  code     2965504 May  5 20:50 termlist_DB
-rw-r-----   1 ja  code          61 May  5 20:50 termlist_baseA
-rw-r-----   1 ja  code          61 May  5 20:50 termlist_baseB
-rw-r-----   1 ja  code           0 May  5 20:50 value_DB
-rw-r-----   1 ja  code          14 May  5 20:50 value_baseA
-rw-r-----   1 ja  code          14 May  5 20:50 value_baseB
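
As a rough cross-check of the listing above, the *_DB files can be summed and compared with the 4248K corpus. This is only a sketch: the sizes are copied by hand from the listing, the small base and meta files are ignored, and the names `table_sizes` and `corpus_kb` are purely illustrative.

```python
# Sizes in bytes, copied from the "ll" listing above (the *_DB files only;
# the small *_baseA/*_baseB and meta files are left out).
table_sizes = {
    "position": 6258688,
    "postlist": 2269184,
    "record": 7675904,
    "termlist": 2965504,
    "value": 0,
}
corpus_kb = 4248  # from "du -sk ." on the text files

total_bytes = sum(table_sizes.values())
total_kb = total_bytes / 1024
ratio = total_kb / corpus_kb

# Print the tables from largest to smallest with their share of the total.
for name, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
    share = 100 * size / total_bytes
    print(f"{name:9s} {size / 1024:8.0f}K  {share:5.1f}%")
print(f"total     {total_kb:8.0f}K  ({ratio:.1f}x the corpus)")
```

On these numbers the record table (the stored paragraph text) and the position table together account for most of the space.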

Here it is after applying "quartzcompact":
~/xapian/NEW> ../bin/quartzcompact . /users/ja/xapian_compact
record: Reduced by 43.8634% 3288K (7496K -> 4208K)
postlist: Reduced by 63.5379% 1408K (2216K -> 808K)
termlist: Reduced by 54.6961% 1584K (2896K -> 1312K)
position: Reduced by 34.8168% 2128K (6112K -> 3984K)
value: Done

The size of the compact directory is:
~/xapian_compact> du -sk .
10344   .
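
The figures reported by quartzcompact can be checked the same way. The sketch below recomputes each reduction and compares the compacted size with the corpus; the numbers are copied by hand from the output above, and the variable names are hypothetical.

```python
# (before_kb, after_kb) per table, copied from the quartzcompact output above.
compaction = {
    "record": (7496, 4208),
    "postlist": (2216, 808),
    "termlist": (2896, 1312),
    "position": (6112, 3984),
}
corpus_kb = 4248     # "du -sk ." on the text files
compact_kb = 10344   # "du -sk ." on the compacted directory

# Percentage reduction for each table: (before - after) / before.
reductions = {
    name: 100 * (before - after) / before
    for name, (before, after) in compaction.items()
}
for name, pct in reductions.items():
    print(f"{name:9s} reduced by {pct:.4f}%")

tables_after_kb = sum(after for _, after in compaction.values())
compact_ratio = compact_kb / corpus_kb
print(f"tables after compaction: {tables_after_kb}K")
print(f"compacted index / corpus: {compact_ratio:.2f}")
```

So even after compaction the index is still roughly 2.4 times the size of the text.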




> 
> > The index size is four times the size of the corpus; that doesn't seem
> > right. Am I doing something wrong?
> 
> Using simpleindex, perhaps.  It's really meant to show what the code for
> a Xapian indexer looks like without too much non-Xapian related
> complication.
> 
> Are you just experimenting, or trying to build an actual system?

At this time, I am experimenting with the different indexing/searching
tools out there.
You mentioned in your mail that an index smaller than the corpus is
achievable. Could you please provide some data on how much smaller
it can get?

Also, what is the timeline for the release of the "flint" database backend?

Thanks,
John

> 
> Cheers,
>     Olly
>


