I&#39;ve submitted my proposal for posting list improvement. Please check and reply suggestions. <div>Sorry for the late submission. I&#39;m reviewing papers on posting list compression in recently years.<br><br><div class="gmail_quote">


On Tue, Apr 3, 2012 at 12:47 AM, Olly Betts <span dir="ltr">&lt;<a href="mailto:olly@survex.com">olly@survex.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div class="im">On Sat, Mar 31, 2012 at 01:25:31AM -0400, Weixian Zhou wrote:<br>

&gt; My name is Weixian Zhou, Computer Science student of University at Buffalo,<br>

&gt; State University of New York. I am interested in the project of posting<br>

&gt; list encoding improvements and weighting schemes. I have some questions<br>

&gt; toward them.<br>

<br>

</div>If you&#39;re equally interested in both, I&#39;d probably steer you towards the<br>

posting list encoding project - we&#39;ve had quite a lot of interest in the<br>

weighting schemes project already.<br>

<div class="im"><br>

&gt; 1) After read the comments in brass_postlist.cc, I am still not very clear<br>

&gt; about the detailed structure of postings list. If you can provide some<br>

&gt; simple examples/graphs will be very straightforward.<br>

<br>

</div>Each term&#39;s posting list is stored in one or more chunks.  Each chunk is<br>

up to about 2KB of encoded data, and is stored in a B-tree, with a key<br>

built from the term name and the first docid in the chunk.  This allows<br>

us to skip ahead in a posting list efficiently - we just look up the<br>

key for the docid we want which puts us on the right chunk (or the one<br>

before it if that docid is in a gap between chunks).<br>

<br>

Within each chunk, there&#39;s a simple header and then the docids and wdfs<br>

are encoding in a fairly simple way using variable length integers, with<br>

7 bits of each byte used for data and the most significant bit used to<br>

indicate if there are more bytes for this integer, so 0 to 127 are<br>

encoded in one byte, 128 to 16383 in two bytes, etc.  We store the<br>

(docid delta - 1) and the wdf in this way.<br>

<br>

This isn&#39;t the most compact encoding by any means, but it is reasonably<br>

compact and easy to update - you can splice in new entries without<br>

having to re-encode a whole chunk, or do a lot of bit shifts.<br>

<br>

However, for users not worried about efficiency of updating existing<br>

documents, a more compact encoding would be good.<br>

<div class="im"><br>

&gt; 2) My instant idea to make list smaller: use gamma codes to encode the gap<br>

&gt; between docids instead of docids.<br>

<br>

</div>Why gamma codes?<br>

<br>

I&#39;ve not done comparisons myself, but Managing Gigabytes suggests<br>

interpolative coding is the best choice for encoding posting lists (of<br>

those it considers, which includes gamma).  Interpolative coding comes<br>

up as the most compact and similar time to encode and decode.  If you<br>

have access to the second edition, the table is on page 129.  Or you<br>

may be able to see it in the preview link here:<br>

<br>

<a href="https://www.google.com/search?tbm=bks&amp;q=%22either+the+appropriate+model+must+be+read+off+disk%22" target="_blank">https://www.google.com/search?tbm=bks&amp;q=%22either+the+appropriate+model+must+be+read+off+disk%22</a><br>


<br>

I&#39;ve not done experiments, but we use interpolative coding for<br>

positional data so we already have a well tested implementation.<br>

<div class="im"><br>

&gt; Last question towards the project of weighting schemes: Do we need only to<br>

&gt; implement existing weighting scheme instead of coming up with new ideas?<br>

<br>

</div>Yes, we&#39;re more interested there in implementing proven approaches than<br>

trying to do original research.<br>

<div class="im"><br>

&gt; And our mission is to find a weighting scheme that could replace the<br>

&gt; default BM25 in Xapian?<br>

<br>

</div>I&#39;m not sure that&#39;s the mission exactly.  The idea is to implement more<br>

schemes which are likely to be useful to users.  They might be a better<br>

in some situations, or give faster results, or be useful a baseline for<br>

researchers, or be useful for some other reason.<br>

<br>

If there&#39;s a good argument for changing the default from BM25, that&#39;s<br>

definitely an option.  We wouldn&#39;t remove BM25 though.<br>

<br>

Cheers,<br>

    Olly<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>Weixian Zhou<br>Department of Computer Science and Engineering<br>University at Buffalo, SUNY<br><br>

</div>