[Xapian-discuss] xapian indexing size?
John Paige
paige.john at gmail.com
Fri May 6 02:05:39 BST 2005
On 5/5/05, Olly Betts <olly at survex.com> wrote:
> On Thu, May 05, 2005 at 01:39:20PM -0400, John Paige wrote:
> > I am evaluating xapian for use in our product. I just downloaded the
> > core and examples code from the website.
> > I'm puzzled about one thing though, when I used the test program
> > "simpleIndexer", I found out that the index size is four times the
> > size of the corpus.
>
> I guess you mean "simpleindex" - that splits the input file into
> paragraphs, and indexes each paragraph by the terms in it, storing
> the whole paragraph as the document data. Currently document data
> is stored uncompressed (I have patches to use zlib I'll be integrating
> soon) so the size of an index built by simpleindex will
> inevitably be bigger than the text indexed, because it *contains* the
> entire text indexed in uncompressed form.
>
> Typically the document data is used to store a URL or UID for a
> database, a document title, and a sample of text from the document,
>
> > I indexed 4MB worth of text files, and the index came to 16MB;
> > even after compaction, it still consumed 10MB. When I added an
> > additional 4MB of text files, the original index went to 32MB.
>
> It does seem larger than I'd expect. There's scope for reducing the
> size of Xapian databases (this will improve in the coming months), but
> even so that sounds excessively large.
>
> The output of "ls -l" on the index directory before and after compaction
> might be interesting. Can you post that?
Here is the snapshot.
I indexed the files in the directory below:
:~/text_files> du -sk .
4248 .
Here is the snapshot after using "simpleindex":
~/xapian/NEW> ll
total 37536
drwxr-x--- 2 ja code 4096 May 5 20:50 ./
drwxr-x--- 7 ja code 4096 May 5 20:50 ../
-rw-r----- 1 ja code 10 May 5 20:50 meta
-rw-r----- 1 ja code 6258688 May 5 20:50 position_DB
-rw-r----- 1 ja code 113 May 5 20:50 position_baseA
-rw-r----- 1 ja code 112 May 5 20:50 position_baseB
-rw-r----- 1 ja code 2269184 May 5 20:50 postlist_DB
-rw-r----- 1 ja code 50 May 5 20:50 postlist_baseA
-rw-r----- 1 ja code 50 May 5 20:50 postlist_baseB
-rw-r----- 1 ja code 7675904 May 5 20:50 record_DB
-rw-r----- 1 ja code 133 May 5 20:50 record_baseA
-rw-r----- 1 ja code 132 May 5 20:50 record_baseB
-rw-r----- 1 ja code 2965504 May 5 20:50 termlist_DB
-rw-r----- 1 ja code 61 May 5 20:50 termlist_baseA
-rw-r----- 1 ja code 61 May 5 20:50 termlist_baseB
-rw-r----- 1 ja code 0 May 5 20:50 value_DB
-rw-r----- 1 ja code 14 May 5 20:50 value_baseA
-rw-r----- 1 ja code 14 May 5 20:50 value_baseB
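To put a number on the listing above, here is a minimal Python sketch that sums the *_DB file sizes and compares them with the corpus size from "du -sk". The function name index_to_corpus_ratio is made up for illustration. The comparison is only approximate, since du reports disk blocks used rather than exact byte counts, and the small base files are ignored.

```python
# Sizes in bytes of the Quartz table files, copied from the listing above.
table_bytes = {
    "position_DB": 6258688,
    "postlist_DB": 2269184,
    "record_DB": 7675904,
    "termlist_DB": 2965504,
    "value_DB": 0,
}
corpus_kb = 4248  # from "du -sk ." on the text_files directory


def index_to_corpus_ratio(tables, corpus_kb):
    """Return (total table size in K, ratio of that size to the corpus)."""
    total_kb = sum(tables.values()) / 1024
    return total_kb, total_kb / corpus_kb


total_kb, ratio = index_to_corpus_ratio(table_bytes, corpus_kb)
print(f"index tables: {total_kb:.0f}K, ratio to corpus: {ratio:.2f}x")
```

By this measure the uncompacted tables come to 18720K, roughly 4.4 times the corpus.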
And here it is after applying "quartzcompact":
~/xapian/NEW> ../bin/quartzcompact . /users/ja/xapian_compact
record: Reduced by 43.8634% 3288K (7496K -> 4208K)
postlist: Reduced by 63.5379% 1408K (2216K -> 808K)
termlist: Reduced by 54.6961% 1584K (2896K -> 1312K)
position: Reduced by 34.8168% 2128K (6112K -> 3984K)
value: Done
The size of the compact directory is:
~/xapian_compact> du -sk .
10344 .
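To see which tables account for the remaining size, here is a rough Python sketch that parses the quartzcompact report above and prints each table's share of the compacted total. The regular expression is written against the four lines shown here and is not taken from any documented output format. The helper name parse_report is made up for illustration.

```python
import re

# Output of quartzcompact, copied from the run above ("value: Done" omitted,
# since that table is empty).
report = """\
record: Reduced by 43.8634% 3288K (7496K -> 4208K)
postlist: Reduced by 63.5379% 1408K (2216K -> 808K)
termlist: Reduced by 54.6961% 1584K (2896K -> 1312K)
position: Reduced by 34.8168% 2128K (6112K -> 3984K)
"""

# Pattern guessed from the lines above: name, then before and after sizes in K.
LINE = re.compile(r"^(\w+): Reduced by [\d.]+% \d+K \((\d+)K -> (\d+)K\)$")


def parse_report(text):
    """Map table name -> (size before, size after), both in K."""
    sizes = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            sizes[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return sizes


sizes = parse_report(report)
after_total = sum(after for _, after in sizes.values())
for name, (before, after) in sizes.items():
    share = 100 * after / after_total
    print(f"{name:9s} {after:5d}K  {share:4.1f}% of compacted index")
print(f"total {after_total}K, {after_total / 4248:.2f}x the 4248K corpus")
```

The four tables add up to 10312K after compaction, about 2.4 times the corpus. The record table alone is 4208K, nearly the size of the corpus itself, which would fit the explanation that simpleindex stores the whole text as document data (assuming the record table is where that data lives).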
>
> > The index size is four times the size of the corpus, it doesn't seem
> > right. Am I doing something wrong?
>
> Using simpleindex, perhaps. It's really meant to show what the code for
> a Xapian indexer looks like without too much non-Xapian related
> complication.
>
> Are you just experimenting, or trying to build an actual system?
At this time, I am experimenting with the different indexing/searching
tools out there.
You mentioned in your mail that an index smaller than the corpus is
achievable. Could you please provide some data on how much smaller
it can get?
Also, what is the timeline for the release of the "flint" database backend?
Thanks,
John
>
> Cheers,
> Olly
>