[Xapian-discuss] Compressed Btrees
Arjen van der Meijden
arjen at glas.its.tudelft.nl
Mon Dec 13 14:26:28 GMT 2004
Olly Betts wrote:
> On Mon, Dec 13, 2004 at 02:23:00PM +0100, Arjen van der Meijden wrote:
>
>>This is on the non-compacted database (currently I don't have a
>>compacted one):
>
>
> The results would be the same anyway.
>
>
>>entries: 293400883
>>Totals:
>>Before: 1680133099
>>After: 1189099066
>>Compressed by: 29.3%
>>Theoretical limit (assuming uniform): 1188233055
>>
>>If I understand it correctly this will be the compression on top of the
>>compaction (which only yields 8% reduction) of the position-table ?
>
>
> It's not totally obvious how to translate it - this figure is just for
> the change in size of the tag values. There's also storage for the keys
> and general overhead from the tree structure. But if the tags are
> shorter then they'll generally be split into fewer items inside the
> Btree, which means fewer keys need to be stored. And the less there is
> in the Btree, the less overhead there is.
>
> So you should expect the size of position_DB to decrease by somewhat
> more than (1680133099 - 1189099066) bytes. Is this the 6.3G
> position_DB? If so, I'm surprised it only has 1.6G of tags.
>
> But assuming it is, you'd expect the filesize to go down by at least
> 29.3*1.6/6.3 or around 7.5%. It will probably be substantially better
> than that though.
Yes, it's the 6.3G (or 6.9G non-compacted) table. Does that mean the rest
of the data is mostly structural (keys to access the tags plus Btree
overhead)?
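To make that arithmetic concrete, here is a rough sketch using the figures from the report quoted above. The exact size of position_DB isn't given in bytes, so the script assumes "6.3G" means 6.3e9 bytes; the variable names are only for illustration. It suggests the tags are roughly a quarter of the file and average a bit under 6 bytes per entry before compression.

```python
# Figures taken from the compression report quoted above.
entries = 293_400_883
tag_bytes_before = 1_680_133_099
tag_bytes_after = 1_189_099_066

# Assumption: "6.3G" is read as 6.3e9 bytes; the exact file size is not given.
position_db_bytes = 6.3e9

# Bytes saved on tag values alone (keys and tree overhead not counted).
saved_bytes = tag_bytes_before - tag_bytes_after

# How much of the file the tag values account for.
tag_share = tag_bytes_before / position_db_bytes

# Lower bound on the file size reduction: only the tag savings are counted.
min_file_reduction = saved_bytes / position_db_bytes

# Average tag size per (term, document) entry, before and after.
avg_tag_before = tag_bytes_before / entries
avg_tag_after = tag_bytes_after / entries

print(f"tag bytes saved:          {saved_bytes}")
print(f"tags as share of file:    {tag_share:.1%}")
print(f"minimum file reduction:   {min_file_reduction:.1%}")
print(f"average tag size before:  {avg_tag_before:.2f} bytes")
print(f"average tag size after:   {avg_tag_after:.2f} bytes")
```

If "6.3G" is really 6.3 GiB, the shares come out slightly lower, but the picture is the same: most of the file is something other than tag values.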
Reading this small piece of information from the Xapian website:
"PositionList. For each (term, document) pair, this stores the list of
positions in the document at which the term occurs.
Key: pack_uint(did) + tname"
I'm actually not sure whether I should be surprised by that or not. A
lot of terms occur only rarely within a document and/or are relatively
long, so it isn't very strange if a key (docid + term) is actually longer
than its tag (list of positions), or am I missing something important here? :)
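A small sketch of why a key can outweigh its tag. Everything here is an assumption for illustration: `varint` is a generic 7-bit variable-length integer encoding and is not claimed to match Xapian's actual pack_uint, the tag is modelled as simple gaps between positions rather than whatever the position table really stores, and the docid, term and positions are made up.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low group first.

    Illustrative only; not claimed to match Xapian's pack_uint.
    """
    if n < 0:
        raise ValueError("varint expects a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group
            return bytes(out)


def key_for(did: int, tname: str) -> bytes:
    """Hypothetical key layout: encoded docid followed by the term bytes."""
    return varint(did) + tname.encode("utf-8")


def tag_for(positions: list[int]) -> bytes:
    """Hypothetical tag layout: gaps between consecutive sorted positions."""
    out = bytearray()
    prev = 0
    for pos in sorted(positions):
        out += varint(pos - prev)
        prev = pos
    return bytes(out)


# A made-up entry: a longish term that occurs once in one document.
key = key_for(123456, "compression")
tag = tag_for([17])
print(f"key: {len(key)} bytes, tag: {len(tag)} bytes")
```

Under these assumptions a term that occurs once costs many more bytes in its key than in its tag, which would fit with the tags being only a small part of the table.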
Best regards,
Arjen van der Meijden