[Xapian-devel] Posting list encoding improvements - pfd encoding & var len encoding comparison program
Weixian Zhou
ideazwx at gmail.com
Fri Apr 20 05:19:46 BST 2012
The 4 byte fixed size is not the best size:
1) In the test the docs number in a posting list in 20000, while in brass
the number is 2000. It's 10 times larger.
2) The fixed size can be further reduced when applying PFD optimization:
which choose the smallest size that can fit 90% entries.
3) In the test I generate doc length with uniform distribution within [0,
2^20]. While in reality a normal distribution is more reasonable, so the
number of long length docs will be less.
Above all, I think the index size of PFD will be even smaller than variable
length encoding, according with the literature.
On Thu, Apr 19, 2012 at 11:58 PM, Olly Betts <olly at survex.com> wrote:
> On Thu, Apr 19, 2012 at 11:12:32PM -0400, Weixian Zhou wrote:
> > 3. The implemented fixed length encoding uses 4 bytes as fixed length.
> > This is not optimal and can be further optimized in PFD.
> [...]
> > Fix a typo in the attachment: The search time of 100000 searches of
> > variable length encoding and fixed length encoding are reversed.
>
> That's promising I think - the fixed length is quite a bit faster, but
> almost exactly twice the size with a 4 byte fixed size.
>
> But document lengths will probably fit in 2 bytes in many situations
> (and should almost never need more than 3) so even a simple per-chunk
> choice of the number of byes to use will often put this about the same
> in size terms it seems.
>
> Cheers,
> Olly
>
--
Weixian Zhou
Department of Computer Science and Engineering
University at Buffalo, SUNY
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120420/ee0b91fa/attachment.htm>
More information about the Xapian-devel
mailing list