[Xapian-devel] Optimized VSEncoding

Fri Mar 14 01:03:00 GMT 2014

Hi,

Thanks for your patch, it works well, the output is the same with what it is before.
To do further optimization, I try to use more subtle bit operation in encoding, it does make a little progress.
The get_optimal_split cost 72% of the time for encoding, but I can hardly apply more optimization to it.
I will try to optimize the dynamic programming algorithm.

>>We also recommend using Linux for development work rather than Windows.

I have a ubuntu installed in my machine, but in ubuntu, my machine cannot awake from hiberating, sleeping....
I tried many methods but finally failed, so if not necessary, I don't switch to ubuntu.

Best
Regards

------------------
Shangtong Zhang,Second Year Undergraduate,
School of Computer Science,
Fudan University, China.

------------------ Original ------------------
From:  "Olly Betts";<olly at survex.com>;
Date:  Thu, Mar 13, 2014 06:49 PM
To:  "Hurricane Tong"<zhangshangtong.cpp at qq.com>; 
Cc:  "xapian-devel"<xapian-devel at lists.xapian.org>; 
Subject:  Re: [Xapian-devel] Optimized VSEncoding

On Thu, Mar 13, 2014 at 10:41:02AM +0800, Hurricane Tong wrote:
> When only encoding the first 1000lines, 
> both cost 0ms to decode and VS cost 1ms to encode while Interpolative cost 0ms,
> 1000lines is just too little to catch the difference in my test.

As Richard says, there are better ways to do timing tests than calling
a fairly low resolution timer like clock() before and after a single
run.

We also recommend using Linux for development work rather than Windows.
That would help here as there are better open source profiling tools
available on Linux.

> I upload the source code to https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder

Thanks.

You're inserting repeatedly at the start of vector S - each time you do
that the entire current contents of the vector have to be copied up one
place to make room, which means that inserting N items has cost O(N*N).

But vector::push_back() is O(1), so pushing back N items has cost O(N).
And we can just flip the loop which uses the items to run backwards
through S instead of forwards.

I've attached a quick patch to show what I mean.  I haven't checked the
output is the same as before, so this might not be quite right - the
patch is really just to make it clearer what I'm talking about.

Making this change reduces the encoding time by a factor of about 3.6
for me for the linux.postlist case (though if the patch isn't quite
right this reduction may not be either).  And that's still ~14 times as
long as interpolative coding takes, but I suspect there's further scope
for speeding it up.

Cheers,
    Olly
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140314/2376a902/attachment.html>