[Xapian-devel] Optimized VSEncoding

Hurricane Tong zhangshangtong.cpp at qq.com
Thu Mar 13 02:41:02 GMT 2014


The size of the string generated by VSEncoder is 12592387,
while that by InterpolativeEncoder is 8554817.

When only encoding the first 1000lines, 
both cost 0ms to decode and VS cost 1ms to encode while Interpolative cost 0ms,
1000lines is just too little to catch the difference in my test.

I upload the source code to https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder
The bitstream.cpp and bitstream.h is about VSEncoder.
The Interpolative.cpp and Interpolative is about InterpolativeEncoder from the Xapian's code.
I haven't write comment on the code, but in main.cpp, 
double CalTimeVS( const vector<unsigned int>& src ) gives an example about how to use VSEncoder and VSDecoder.


Shangtong Zhang,Second Year Undergraduate,
School of Computer Science,
Fudan University, China.


------------------ Original ------------------
From:  "Olly Betts";<olly at survex.com>;
Date:  Wed, Mar 12, 2014 04:37 PM
To:  "Hurricane Tong"<zhangshangtong.cpp at qq.com>; 
Cc:  "xapian-devel"<xapian-devel at lists.xapian.org>; 
Subject:  Re: [Xapian-devel] Optimized VSEncoding

On Wed, Mar 12, 2014 at 12:30:09PM +0800, Hurricane Tong wrote:
> I optimized the code of VSEncoder,
> and encode/decode the whole list, linux.postlist.
> Encoding time :  
> VSEncoder 103429ms
> InterpolativeEncoder 684ms
> Decoding time:
> VSDecoder 756ms
> InterpolativeDecoder 925ms

What where the sizes in each case?

Is your code online somewhere?  I'd be interested in taking a look.

> ( compile with -o2, and the Interpolative Encoder/Decoder is just from
> the Xapian's code, bitstream.cc )
> VS algorithm performs better in decoding, at the price of encoding time.
> I analyse the VSEncoder, most time is spent on getting the optimal
> split, I try to do some optimization, but it seems very difficult to
> make significant progress.
> I don't think the position list of a term will be frequently modified,
> so I think it is acceptable to sacrifice some encoding time for faster
> decoding.

You've flipped to talking about position lists here - the choice of
encoding is something to think about there too, though it's really
unlikely they'd be nearly as long as linux.postlist, and the pattern
will be a little different (e.g. the same word occurring at adjacent
positions is rare for most words).

It is true that position lists don't need unpacking and repacking,
but if encoding actually takes 150 times longer in practice, that's
probably not something we want to do by default.  It could be an
option for users who want to squeeze out every last bit of search
speed though.

For postlists, we would need to decode and reencode chunks when
documents are deleted or modified - there I think the extra encode
time would also be too much for general use.

But linux.postlist is rather large - we'll probably never encode
such a big chunk of data in one go.  Our postlist chunks are currently
about 1000 entries at most I'd guess.

What happens if you try the first 1000 lines of linux.postlist?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140313/75395ca7/attachment.html>

More information about the Xapian-devel mailing list