I've submitted my proposal for posting list improvement. Please check and reply suggestions. <div>Sorry for the late submission. I'm reviewing papers on posting list compression in recently years.<br><br><div class="gmail_quote">
On Tue, Apr 3, 2012 at 12:47 AM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com">olly@survex.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On Sat, Mar 31, 2012 at 01:25:31AM -0400, Weixian Zhou wrote:<br>
> My name is Weixian Zhou, Computer Science student of University at Buffalo,<br>
> State University of New York. I am interested in the project of posting<br>
> list encoding improvements and weighting schemes. I have some questions<br>
> toward them.<br>
<br>
</div>If you're equally interested in both, I'd probably steer you towards the<br>
posting list encoding project - we've had quite a lot of interest in the<br>
weighting schemes project already.<br>
<div class="im"><br>
> 1) After read the comments in brass_postlist.cc, I am still not very clear<br>
> about the detailed structure of postings list. If you can provide some<br>
> simple examples/graphs will be very straightforward.<br>
<br>
</div>Each term's posting list is stored in one or more chunks. Each chunk is<br>
up to about 2KB of encoded data, and is stored in a B-tree, with a key<br>
built from the term name and the first docid in the chunk. This allows<br>
us to skip ahead in a posting list efficiently - we just look up the<br>
key for the docid we want which puts us on the right chunk (or the one<br>
before it if that docid is in a gap between chunks).<br>
<br>
Within each chunk, there's a simple header and then the docids and wdfs<br>
are encoding in a fairly simple way using variable length integers, with<br>
7 bits of each byte used for data and the most significant bit used to<br>
indicate if there are more bytes for this integer, so 0 to 127 are<br>
encoded in one byte, 128 to 16383 in two bytes, etc. We store the<br>
(docid delta - 1) and the wdf in this way.<br>
<br>
This isn't the most compact encoding by any means, but it is reasonably<br>
compact and easy to update - you can splice in new entries without<br>
having to re-encode a whole chunk, or do a lot of bit shifts.<br>
<br>
However, for users not worried about efficiency of updating existing<br>
documents, a more compact encoding would be good.<br>
<div class="im"><br>
> 2) My instant idea to make list smaller: use gamma codes to encode the gap<br>
> between docids instead of docids.<br>
<br>
</div>Why gamma codes?<br>
<br>
I've not done comparisons myself, but Managing Gigabytes suggests<br>
interpolative coding is the best choice for encoding posting lists (of<br>
those it considers, which includes gamma). Interpolative coding comes<br>
up as the most compact and similar time to encode and decode. If you<br>
have access to the second edition, the table is on page 129. Or you<br>
may be able to see it in the preview link here:<br>
<br>
<a href="https://www.google.com/search?tbm=bks&q=%22either+the+appropriate+model+must+be+read+off+disk%22" target="_blank">https://www.google.com/search?tbm=bks&q=%22either+the+appropriate+model+must+be+read+off+disk%22</a><br>
<br>
I've not done experiments, but we use interpolative coding for<br>
positional data so we already have a well tested implementation.<br>
<div class="im"><br>
> Last question towards the project of weighting schemes: Do we need only to<br>
> implement existing weighting scheme instead of coming up with new ideas?<br>
<br>
</div>Yes, we're more interested there in implementing proven approaches than<br>
trying to do original research.<br>
<div class="im"><br>
> And our mission is to find a weighting scheme that could replace the<br>
> default BM25 in Xapian?<br>
<br>
</div>I'm not sure that's the mission exactly. The idea is to implement more<br>
schemes which are likely to be useful to users. They might be a better<br>
in some situations, or give faster results, or be useful a baseline for<br>
researchers, or be useful for some other reason.<br>
<br>
If there's a good argument for changing the default from BM25, that's<br>
definitely an option. We wouldn't remove BM25 though.<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Weixian Zhou<br>Department of Computer Science and Engineering<br>University at Buffalo, SUNY<br><br>
</div>