<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
Hi Olly,</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
Thanks for pointing out my misunderstanding of posting list encoding. I have some further questions.</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
<strong style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;vertical-align:baseline;background-image:initial;background-color:transparent;font-weight:bold">1) How many encoding method are we required to implement? </strong></p>
<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
I digged into more literature and found some interesting ideas. I think there are 3 encoding methods are worth a try:</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
a) optPFD. The fastest decompression speed with large index size. Although the speed looks good but we should also aware that the experiments in most literature put whole index in memory, so the fetching from disk time is simply ingored in them. </p>
<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
b). Simple 16. This is an improved version on simple 9. It's slower than optPFD while with smaller index size.</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
c). VSEnocding [1]. This is the new discovery. It reports in [1] that VSEncoding outperforms PFD and Simple 16 in both index size and decompression speed. The general idea that differs from previous methods is that VSEncoding use dynamic programming to optimally partition the lists in blocks. </p>
<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
<strong style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;vertical-align:baseline;background-image:initial;background-color:transparent;font-weight:bold">2) What's your opinion of implementing docID reordering?</strong></p>
<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
I think it's a good option because its significant reduction in index size (14% and grows as the size increase). Also the query processing speed is more valuable than index compression speed. </p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
<strong style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;vertical-align:baseline;background-image:initial;background-color:transparent;font-weight:bold">3) Are we going to compress term positions?</strong></p>
<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
Although the size of term positions in posting lists is much larger than docID and frequencies, there are few literature give pretty thoughts on compressing them. The method proposed in [2] involves much page-wise and context information and makes it complex and less feasible in real engineering. </p>
<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
<strong style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;vertical-align:baseline;background-image:initial;background-color:transparent;font-weight:bold">Reference:</strong></p>
<p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
[1] VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming, 2010</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
[2] Compressing term positions in web indexes, 2009</p><p style="margin-top:0px;margin-right:0px;margin-bottom:8px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-right-width:0px;border-bottom-width:0px;border-left-width:0px;border-style:initial;border-color:initial;outline-width:0px;outline-style:initial;outline-color:initial;font-size:12px;vertical-align:baseline;background-image:initial;background-color:rgb(240,247,252);line-height:20px;font-family:Arial,'Helvetica Neue',Helvetica,sans-serif">
<br></p><br><div class="gmail_quote">On Thu, Apr 5, 2012 at 5:56 PM, Weixian Zhou <span dir="ltr"><<a href="mailto:ideazwx@gmail.com">ideazwx@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I've submitted my proposal for posting list improvement. Please check and reply suggestions. <div>Sorry for the late submission. I'm reviewing papers on posting list compression in recently years.<div><div class="h5">
<br><br><div class="gmail_quote">
On Tue, Apr 3, 2012 at 12:47 AM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>On Sat, Mar 31, 2012 at 01:25:31AM -0400, Weixian Zhou wrote:<br>
> My name is Weixian Zhou, Computer Science student of University at Buffalo,<br>
> State University of New York. I am interested in the project of posting<br>
> list encoding improvements and weighting schemes. I have some questions<br>
> toward them.<br>
<br>
</div>If you're equally interested in both, I'd probably steer you towards the<br>
posting list encoding project - we've had quite a lot of interest in the<br>
weighting schemes project already.<br>
<div><br>
> 1) After read the comments in brass_postlist.cc, I am still not very clear<br>
> about the detailed structure of postings list. If you can provide some<br>
> simple examples/graphs will be very straightforward.<br>
<br>
</div>Each term's posting list is stored in one or more chunks. Each chunk is<br>
up to about 2KB of encoded data, and is stored in a B-tree, with a key<br>
built from the term name and the first docid in the chunk. This allows<br>
us to skip ahead in a posting list efficiently - we just look up the<br>
key for the docid we want which puts us on the right chunk (or the one<br>
before it if that docid is in a gap between chunks).<br>
<br>
Within each chunk, there's a simple header and then the docids and wdfs<br>
are encoding in a fairly simple way using variable length integers, with<br>
7 bits of each byte used for data and the most significant bit used to<br>
indicate if there are more bytes for this integer, so 0 to 127 are<br>
encoded in one byte, 128 to 16383 in two bytes, etc. We store the<br>
(docid delta - 1) and the wdf in this way.<br>
<br>
This isn't the most compact encoding by any means, but it is reasonably<br>
compact and easy to update - you can splice in new entries without<br>
having to re-encode a whole chunk, or do a lot of bit shifts.<br>
<br>
However, for users not worried about efficiency of updating existing<br>
documents, a more compact encoding would be good.<br>
<div><br>
> 2) My instant idea to make list smaller: use gamma codes to encode the gap<br>
> between docids instead of docids.<br>
<br>
</div>Why gamma codes?<br>
<br>
I've not done comparisons myself, but Managing Gigabytes suggests<br>
interpolative coding is the best choice for encoding posting lists (of<br>
those it considers, which includes gamma). Interpolative coding comes<br>
up as the most compact and similar time to encode and decode. If you<br>
have access to the second edition, the table is on page 129. Or you<br>
may be able to see it in the preview link here:<br>
<br>
<a href="https://www.google.com/search?tbm=bks&q=%22either+the+appropriate+model+must+be+read+off+disk%22" target="_blank">https://www.google.com/search?tbm=bks&q=%22either+the+appropriate+model+must+be+read+off+disk%22</a><br>
<br>
I've not done experiments, but we use interpolative coding for<br>
positional data so we already have a well tested implementation.<br>
<div><br>
> Last question towards the project of weighting schemes: Do we need only to<br>
> implement existing weighting scheme instead of coming up with new ideas?<br>
<br>
</div>Yes, we're more interested there in implementing proven approaches than<br>
trying to do original research.<br>
<div><br>
> And our mission is to find a weighting scheme that could replace the<br>
> default BM25 in Xapian?<br>
<br>
</div>I'm not sure that's the mission exactly. The idea is to implement more<br>
schemes which are likely to be useful to users. They might be a better<br>
in some situations, or give faster results, or be useful a baseline for<br>
researchers, or be useful for some other reason.<br>
<br>
If there's a good argument for changing the default from BM25, that's<br>
definitely an option. We wouldn't remove BM25 though.<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br><br clear="all"><div><br></div></div></div><div class="im">-- <br>Weixian Zhou<br>Department of Computer Science and Engineering<br>University at Buffalo, SUNY<br><br>
</div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Weixian Zhou<br>Department of Computer Science and Engineering<br>University at Buffalo, SUNY<br><br>