[Xapian-discuss] weak populated b-trees?
Markus Wörle
mrks at mrks.de
Fri Sep 5 19:14:21 BST 2008
Hi,
I just ran xapian-compact on an index which consumes about 12 GB of
disk space, containing 858,383 documents with an average doclength of
169.018, and was surprised by a much larger compaction factor than I
had expected. After compaction, the index needs only 3.8 GB on disk.
My expectation was that it would shrink by only about 25% or so, because
I expected the average fill of the b-tree blocks to be about 75%.
This is what xapian-compact said:
postlist: Reduced by 76.888% 2444640K (3179480K -> 734840K)
record: Reduced by 65.4923% 1446352K (2208432K -> 762080K)
termlist: Reduced by 67.2607% 1110312K (1650760K -> 540448K)
position: Reduced by 56.6145% 2342160K (4137032K -> 1794872K)
value: Reduced by 81.2667% 397264K (488840K -> 91576K)
spelling: Size unchanged (0K)
synonym: Size unchanged (0K)
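As a rough cross-check, the per-table figures above can be summed to get the overall reduction. The sketch below is only illustrative: the function names are made up for this example, and the "implied fill" figure rests on the assumption that the compacted tables are close to 100% full, which is not something xapian-compact reports.

```python
import re

# Output lines copied from the xapian-compact run above.
COMPACT_OUTPUT = """\
postlist: Reduced by 76.888% 2444640K (3179480K -> 734840K)
record: Reduced by 65.4923% 1446352K (2208432K -> 762080K)
termlist: Reduced by 67.2607% 1110312K (1650760K -> 540448K)
position: Reduced by 56.6145% 2342160K (4137032K -> 1794872K)
value: Reduced by 81.2667% 397264K (488840K -> 91576K)
spelling: Size unchanged (0K)
synonym: Size unchanged (0K)
"""

# Matches only the "Reduced by" lines; unchanged tables are skipped.
LINE_RE = re.compile(r"^(\w+): Reduced by [\d.]+% \d+K \((\d+)K -> (\d+)K\)$")


def parse_sizes(text):
    """Return {table: (before_kb, after_kb)} for every table that shrank."""
    sizes = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            sizes[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return sizes


def overall(sizes):
    """Return (total_before_kb, total_after_kb, reduction_percent)."""
    before = sum(b for b, _ in sizes.values())
    after = sum(a for _, a in sizes.values())
    return before, after, 100.0 * (before - after) / before


sizes = parse_sizes(COMPACT_OUTPUT)
before_kb, after_kb, reduction_pct = overall(sizes)
print(f"total: {before_kb}K -> {after_kb}K, reduced by {reduction_pct:.1f}%")

# ASSUMPTION: compacted blocks are ~100% full, so after/before
# approximates the block fill before compaction.
for name, (b, a) in sizes.items():
    print(f"{name}: implied fill before compaction ~{100.0 * a / b:.0f}%")
```

Summed this way, the tables shrink by roughly two thirds overall, which is far from the 25% that a 75% average block fill would predict.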
My index's brief history:
The index was originally built from scratch with add_document(), and was
then updated by a large number of replace_document_by_term(),
add_document(), and delete_document_by_term() calls over a longer period
(about 2 months or so). Some numbers: about 1 million modifications per
day, of which about 4000 are document adds and 3000 are removes.
Additionally, in this 2-month period, all documents were rebuilt about
5 times by using replace_document_by_term() on a unique term for each
document.
So my question is: Is this reasonable? Or rather, do you have any
idea why my b-trees are so empty? Does Xapian ever merge sparsely
populated blocks back together?
I am currently planning to stop indexing once a day to run
xapian-compact, but I am uncertain whether this would "denature" the
system. I have many modifications, and although "best indexing
performance" is not really a concern in my use case, I feel somewhat
uneasy about manipulating a naturally balanced b-tree in an unchanging
environment.
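A minimal sketch of what such a daily step could look like, assuming indexing has already been stopped by the caller. The helper names and the ".compact" / ".old" directory suffixes are hypothetical; only the "xapian-compact SOURCE DESTINATION" invocation is the tool's real usage. What happens to searchers that have the database open during the swap is not addressed here.

```python
import os
import subprocess


def build_compact_command(src_dir, dst_dir):
    # xapian-compact takes the source database followed by the destination.
    return ["xapian-compact", src_dir, dst_dir]


def compact_and_swap(live_dir, run=subprocess.run):
    """Compact live_dir into a sibling directory, then swap it into place.

    ASSUMPTIONS: no writer is active, live_dir has no trailing slash,
    and the hypothetical ".compact" / ".old" siblings do not exist yet.
    Returns the path of the old, uncompacted copy so the caller can
    decide when to delete it.
    """
    tmp_dir = live_dir + ".compact"
    old_dir = live_dir + ".old"
    # check=True raises if xapian-compact fails, so no swap happens then.
    run(build_compact_command(live_dir, tmp_dir), check=True)
    os.rename(live_dir, old_dir)
    os.rename(tmp_dir, live_dir)
    return old_dir
```

Keeping the old copy around until the compacted index has been verified makes it possible to roll back by renaming the directories again.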
What do you suggest?
Thanks,
mrks