[Xapian-discuss] Sanity check on database size
Jean-Francois Dockes
jean-francois.dockes at wanadoo.fr
Thu Apr 6 10:32:24 BST 2006
First, I apologize for the time you spent answering my previous message. I
was looking for something stupid that I may be doing, and indeed, I was not
disappointed.
I had 'just' overlooked the fact that the document set was made of
compressed mailboxes, which get squeezed very well. The actual size of the
uncompressed document data is 1.15 GB. So the compacted xapian db is just a
little bigger than the uncompressed document set, nothing to be alarmed
about.
I am nevertheless posting the database sizes below, in case they may be of
interest.
By the way, in the course of my 'investigation', I looked for a document
with at least a rough description of the contents and organisation of the
database tables, and how they are used during a query, but, if it does
exist, there doesn't seem to be an obvious pointer to it. Such a document
would be extremely useful to understand what one is doing while using the
API.
For example, I had assumed that the length of the unique file-path terms
I use to identify documents did not matter much, because prefix
compression would be extremely efficient on them. After a look at the
termlist_DB file, this does not appear to be the case (they seem to be
stored separately for each document, or am I mistaken again?), and for
very small documents with long paths this may become significant for the
termlist_DB size.
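To illustrate why I expected prefix compression to help, and why it may
not, here is a toy sketch. It is only a model: the 2-byte per-entry cost,
the 'P' path prefix and the sample paths are assumptions of mine, and this
is not Xapian's actual on-disk format. It shows that sorted paths stored
side by side share most of their bytes, while a single path sitting among
the ordinary words of one document's term list shares nothing with its
neighbours.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_coded_size(terms):
    """Toy front coding: each sorted term costs 2 bytes of bookkeeping
    (hypothetical) plus the part not shared with the previous term."""
    total = 0
    prev = ""
    for term in sorted(terms):
        total += 2 + len(term) - shared_prefix_len(prev, term)
        prev = term
    return total


# Hypothetical unique path terms, all stored together in one sorted list.
paths = ["P/home/me/mail/2006/inbox/msg%04d" % i for i in range(1000)]
raw_size = sum(len(p) for p in paths)
together = front_coded_size(paths)

# One document's term list: a single path among unrelated words.
doc_terms = ["hello", "index", "xapian", paths[0]]
per_doc = front_coded_size(doc_terms)

print("paths stored together:", together, "bytes for", raw_size, "raw")
print("one document's term list:", per_doc, "bytes")
```

If the per-document case is what really happens, the cost is roughly the
full path length times the number of documents.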
Some answers to the questions in your reply:
(database size ~ document set size)
Olly Betts writes:
> Roughly - generally I'd expect the database to be somewhat smaller than
> the document set if you're indexing positional information.
Recoll being a personal tool, its assumption is that space does not
really matter (disks are cheap and mostly full of multimedia, which
doesn't get into xapian). It has no stoplist and basically indexes any
term it can extract, however crappy it may look, with no stemming, as you
mentioned. This probably explains why a typical recoll db has a size
close to the doc set's.
> Also, do you put a limit on term size? Omega's indexers ignore
> probabilistic terms longer than 64 characters, since they're usually
> junk like uuencoded or base64 data.
Yes, the term size limit is 40 characters. This is probably a bit low,
but I just can't imagine a user typing a search term longer than 40
characters :)
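As a minimal sketch of such a limit (not recoll's or Omega's actual code;
the function names are made up, and measuring the length in UTF-8 bytes
rather than characters is an assumption):

```python
MAX_TERM_LEN = 40  # limit mentioned above; Omega's indexers are said to use 64


def keep_term(term, max_len=MAX_TERM_LEN):
    """True if the term is non-empty and its UTF-8 encoding fits the limit."""
    return 0 < len(term.encode("utf-8")) <= max_len


def filter_terms(terms, max_len=MAX_TERM_LEN):
    """Drop over-long terms, which are usually junk such as base64 data."""
    return [t for t in terms if keep_term(t, max_len)]


sample = ["hello", "x" * 41, "y" * 40, ""]
kept = filter_terms(sample)
print(kept)
```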
> [...]
>
> Given the setting of XAPIAN_FLUSH_THRESHOLD, the memory used depends
> mostly on the size of the documents being handled (we buffer the posting
> lists as we generate them - essentially we build the inverted file in
> XAPIAN_FLUSH_THRESHOLD document chunks).
After searching for XAPIAN_FLUSH_THRESHOLD, I saw that this question was
repeatedly answered on the mailing list. I hadn't used the right search
terms before :)
Actually, from a user's point of view, I think the relevant parameter to
set is the amount of memory used, not a flush threshold expressed as a
number of documents. Wouldn't it be possible for xapian to maintain a very
rough estimate of the memory used during indexing, and flush when it
exceeds a set threshold, independently of the number of documents indexed?
The threshold might be overshot because of big documents, etc., but this
would come closer to the relevant operational parameter.
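To make the suggestion concrete, here is a rough sketch of what I mean,
done on the application side. Everything in it is hypothetical: the class
name, the per-posting overhead constant, and the idea that term bytes plus
a fixed overhead approximate the memory buffered. The flush callback
stands in for whatever actually writes the pending changes to the
database.

```python
POSTING_OVERHEAD = 16  # hypothetical bookkeeping bytes per buffered posting


class MemoryBoundedIndexer:
    """Flush when a rough estimate of buffered bytes crosses a limit,
    regardless of how many documents have been added."""

    def __init__(self, flush, max_bytes):
        self.flush = flush          # callback that writes buffered data out
        self.max_bytes = max_bytes
        self.pending_bytes = 0
        self.pending_docs = 0

    def add_document(self, terms):
        # Very rough estimate: term bytes plus a fixed cost per posting.
        self.pending_bytes += sum(len(t) + POSTING_OVERHEAD for t in terms)
        self.pending_docs += 1
        if self.pending_bytes >= self.max_bytes:
            self.flush_now()

    def flush_now(self):
        if self.pending_docs:
            self.flush()
        self.pending_bytes = 0
        self.pending_docs = 0


flushes = []
indexer = MemoryBoundedIndexer(lambda: flushes.append(1), max_bytes=100)
indexer.add_document(["a" * 10] * 5)   # estimated 130 bytes: flushes at once
indexer.add_document(["ab"])           # estimated 18 bytes: stays buffered
print(len(flushes), indexer.pending_bytes)
```

A single big document can still overshoot the limit before the check
runs, which is the imprecision mentioned above.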
The stats follow.
Regards, and apologies again,
J.F. Dockes
Document set size: 232,580 KB compressed, 1,161,119 KB uncompressed
ndocs 244852 lastdocid 244852 avglength 539.113
Total number of terms: 1,141,729
Size of term dump: 26,182,561 bytes (Avg term size 22)
Max term length 40 bytes, except for unique terms identifying documents
(paths) which are longer.
corbieres$ ls -s xapiandb/
total 2477756
      4 meta               524332 postlist_DB         8 termlist_baseA
     20 position_baseA          4 record_baseA        8 termlist_baseB
     20 position_baseB          4 record_baseB   506156 termlist_DB
1257624 position_DB        189536 record_DB           4 value_baseA
     12 postlist_baseA          4 stem_english        4 value_baseB
     12 postlist_baseB          4 stem_french         0 value_DB
corbieres$ quartzcompact xapiandb/ compacted
postlist: Reduced by 55.4271% 290336K (523816K -> 233480K)
record: Reduced by 35.1783% 66608K (189344K -> 122736K)
termlist: Reduced by 40.6442% 205520K (505656K -> 300136K)
position: Reduced by 20.7496% 260696K (1256392K -> 995696K)
value: Size unchanged (0K)
corbieres$ ls -s compacted/
total 1653744
     4 meta               233712 postlist_DB      300436 termlist_DB
     4 position_baseA          4 record_baseA          4 value_baseA
    16 position_baseB          4 record_baseB          4 value_baseB
996676 position_DB        122860 record_DB             0 value_DB
     4 postlist_baseA          4 termlist_baseA
     4 postlist_baseB          8 termlist_baseB
corbieres$ XAPIAN_PREFER_FLINT=yes copydatabase xapiandb flint
corbieres$ ls -s flint/
total 1967908
     0 flicklock              12 postlist.baseB        8 termlist.baseB
     4 iamflint           524332 postlist.DB      319252 termlist.DB
    16 position.baseA          4 record.baseA          4 value.baseA
    16 position.baseB          4 record.baseB          4 value.baseB
997604 position.DB        126632 record.DB             0 value.DB
     8 postlist.baseA          8 termlist.baseA
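
For convenience, the totals above can be put side by side with a few
lines of Python. The numbers are copied from the listings; the variable
and function names are mine, and I am assuming the "ls -s" totals are in
the same KB unit as the document set sizes.

```python
DOCS_COMPRESSED_KB = 232_580
DOCS_UNCOMPRESSED_KB = 1_161_119

# "total" lines from the three "ls -s" listings above.
DB_TOTAL_KB = {
    "quartz": 2_477_756,
    "quartz, compacted": 1_653_744,
    "flint": 1_967_908,
}


def ratio_to_documents(db_kb, docs_kb=DOCS_UNCOMPRESSED_KB):
    """Database size as a multiple of the document set size."""
    return db_kb / docs_kb


for name, kb in DB_TOTAL_KB.items():
    print("%-18s %.2f x uncompressed, %.2f x compressed" % (
        name,
        ratio_to_documents(kb),
        ratio_to_documents(kb, DOCS_COMPRESSED_KB),
    ))
```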