<div><br></div><div>Hi,</div><div><br></div><div>The size of the string generated by VSEncoder is 12592387,</div><div>while that by InterpolativeEncoder is 8554817.</div><div><br></div><div>When only encoding the first 1000lines, </div><div>both cost 0ms to decode and VS cost 1ms to encode while Interpolative cost 0ms,</div><div>1000lines is just too little to catch the difference in my test.</div><div><br></div><div>I upload the source code to https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder</div><div>The bitstream.cpp and bitstream.h is about VSEncoder.</div><div>The Interpolative.cpp and Interpolative is about InterpolativeEncoder from the Xapian's code.</div><div>I haven't write comment on the code, but in main.cpp, </div><div><span style="line-height: 1.5;">double CalTimeVS( const vector<unsigned int>& src ) gives an example about how to use VSEncoder and VSDecoder.</span></div><div><span style="line-height: 1.5;"><br></span></div><div><br></div><div>Best </div><div>Regards</div><div><br></div><div><div style="color:#909090;font-family:Arial Narrow;font-size:12px">------------------</div><div style="font-size:14px;font-family:Verdana;color:#000;">Shangtong Zhang,<div>Second Year Undergraduate,</div><div>School of Computer Science,</div><div>Fudan University, China.</div></div></div><div> </div><div><div><br></div><div><br></div><div style="font-size: 12px;font-family: Arial Narrow;padding:2px 0 2px 0;">------------------ Original ------------------</div><div style="font-size: 12px;background:#efefef;padding:8px;"><div><b>From: </b> "Olly Betts";<olly@survex.com>;</div><div><b>Date: </b> Wed, Mar 12, 2014 04:37 PM</div><div><b>To: </b> "Hurricane Tong"<zhangshangtong.cpp@qq.com>; <wbr></div><div><b>Cc: </b> "xapian-devel"<xapian-devel@lists.xapian.org>; <wbr></div><div><b>Subject: </b> Re: [Xapian-devel] Optimized VSEncoding</div></div><div><br></div>On Wed, Mar 12, 2014 at 12:30:09PM +0800, Hurricane Tong wrote:<br>> I optimized the code of VSEncoder,<br>> and encode/decode the whole list, linux.postlist.<br>> <br>> Encoding time :  <br>> VSEncoder 103429ms<br>> InterpolativeEncoder 684ms<br>> <br>> Decoding time:<br>> VSDecoder 756ms<br>> InterpolativeDecoder 925ms<br><br>What where the sizes in each case?<br><br>Is your code online somewhere?  I'd be interested in taking a look.<br><br>> ( compile with -o2, and the Interpolative Encoder/Decoder is just from<br>> the Xapian's code, bitstream.cc )<br>> <br>> VS algorithm performs better in decoding, at the price of encoding time.<br>> <br>> I analyse the VSEncoder, most time is spent on getting the optimal<br>> split, I try to do some optimization, but it seems very difficult to<br>> make significant progress.<br>> <br>> I don't think the position list of a term will be frequently modified,<br>> so I think it is acceptable to sacrifice some encoding time for faster<br>> decoding.<br><br>You've flipped to talking about position lists here - the choice of<br>encoding is something to think about there too, though it's really<br>unlikely they'd be nearly as long as linux.postlist, and the pattern<br>will be a little different (e.g. the same word occurring at adjacent<br>positions is rare for most words).<br><br>It is true that position lists don't need unpacking and repacking,<br>but if encoding actually takes 150 times longer in practice, that's<br>probably not something we want to do by default.  It could be an<br>option for users who want to squeeze out every last bit of search<br>speed though.<br><br>For postlists, we would need to decode and reencode chunks when<br>documents are deleted or modified - there I think the extra encode<br>time would also be too much for general use.<br><br>But linux.postlist is rather large - we'll probably never encode<br>such a big chunk of data in one go.  Our postlist chunks are currently<br>about 1000 entries at most I'd guess.<br><br>What happens if you try the first 1000 lines of linux.postlist?<br><br>Cheers,<br>    Olly<br>.<br></div>