The 4-byte fixed size is not the best choice:<div>1) In the test, the number of docs in a posting list is 20000, while in brass the number is 2000, so the test lists are 10 times larger.</div><div>2) The fixed size can be reduced further by applying the PFD optimization, which chooses the smallest size that can fit 90% of the entries.</div>
<div>3) In the test I generate doc lengths with a uniform distribution over [0, 2^20]. In reality a normal distribution is more reasonable, so there will be fewer long docs.</div><div><br></div><div>Overall, I think the index size with PFD will be even smaller than with variable length encoding, which agrees with the literature.</div>
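<div><br></div><div>To make points 2) and 3) concrete, here is a minimal sketch of the width selection, not the code used in the test. The function names, the 32-bit cost per exception, and the normal distribution parameters (mean 300, standard deviation 100) are all assumptions for illustration; a real PFD layout stores exceptions differently.</div>

```python
import math
import random


def pfd_width_bits(values, coverage=0.9):
    """Smallest bit width b such that at least `coverage` of values fit in b bits."""
    if not values:
        return 0
    ordered = sorted(values)
    # Index of the value sitting at the coverage percentile.
    k = max(0, math.ceil(coverage * len(ordered)) - 1)
    return ordered[k].bit_length()


def pfd_size_bits(values, coverage=0.9, exception_bits=32):
    """Rough size estimate: b bits per slot plus a full-width entry per exception.

    The 32-bit exception cost is an assumed simplification.
    """
    b = pfd_width_bits(values, coverage)
    exceptions = sum(1 for v in values if v.bit_length() > b)
    return b * len(values) + exceptions * exception_bits


if __name__ == "__main__":
    random.seed(0)
    n = 2000  # docs per posting list, as in brass
    # Uniform over [0, 2^20], as in the test.
    uniform = [random.randint(0, 2**20) for _ in range(n)]
    # Hypothetical normal distribution of doc lengths, clamped at 0.
    normal = [max(0, int(random.gauss(300, 100))) for _ in range(n)]
    for name, lengths in (("uniform", uniform), ("normal", normal)):
        print(name, pfd_width_bits(lengths), "bits,",
              pfd_size_bits(lengths) // 8, "bytes vs", 4 * n, "bytes fixed")
```

<div>With uniformly distributed lengths the chosen width stays close to the full 20 bits, so most of the possible saving only shows up when the lengths are concentrated around small values.</div>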
<div><div><br><div class="gmail_quote">On Thu, Apr 19, 2012 at 11:58 PM, Olly Betts <span dir="ltr">&lt;<a href="mailto:olly@survex.com">olly@survex.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On Thu, Apr 19, 2012 at 11:12:32PM -0400, Weixian Zhou wrote:<br>
> 3. The implemented fixed length encoding uses 4 bytes as fixed length.<br>
> This is not optimal and can be further optimized in PFD.<br>
</div>[...]<br>
<div class="im">> Fix a typo in the attachment: The search time of 100000 searches of<br>
> variable length encoding and fixed length encoding are reversed.<br>
<br>
</div>That's promising I think - the fixed length is quite a bit faster, but<br>
almost exactly twice the size with a 4 byte fixed size.<br>
<br>
But document lengths will probably fit in 2 bytes in many situations<br>
(and should almost never need more than 3) so even a simple per-chunk<br>
choice of the number of bytes to use will often put this about the same<br>
in size terms it seems.<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Weixian Zhou<br>Department of Computer Science and Engineering<br>University at Buffalo, SUNY<br><br>
</div></div>