[Xapian-devel] Integrating VSEncoding in Brass

Fri Feb 28 06:01:42 GMT 2014

On Wed, Feb 26, 2014 at 11:42:42PM +0800, Hurricane Tong wrote:
> I implement a class dealing with VSEncoding and VSDecoding.
> To do further test and analysis, I am trying to integrate this class
> into BrassPositionListTable and BrassPositionList, replacing the
> BitWriter and BitReader being used now. But I get confused when I meet
> the function
>             void BrassPositionList::next_internal()
> I only find the declaration of this function and can't find the
> implementation of this function.

It seems to only be declared, and not defined or used.  In the old
flint backend, there was a next_internal() method, but we failed to
fully remove it when it became unused (back in 2008).

Thanks for spotting this, I'll remove this declaration, and it's
equivalent in the chert backend.

> As for posting list, It seems that PostlistChunkWriter directly writes
> the docid and the wdf, it doesn't do any encoding.
> As the interpolative encoding can only be applied to an array which is
> ascending or decreasing, it can't be used in posting list.

The docids in posting lists are in ascending order, but yes, we'd have
to store the wdfs in a different way if we use interpolative encoding
for the docids.  The case where I can see this might be useful is
where the wdf is constant for a chunk, as then we'd only need to store
it once (the main example is a term used as a boolean filter, where the
wdf is typically always 0).

The VSEncoding paper suggest their encoding is much faster to decode
than interpolative and nearly as compact, so it may be that
interpolative isn't a useful option.  But maybe our interpolative
decoder is better optimised than theirs is (I notice they say you have
to decode the entire list, but our decoder is actually lazy, and only
decodes as far as it has to to return the next entry).

> But VSEncoding can deal with this situation perfectly, so I think we
> can store a posting list of a term with the array < docid1, wdf1,
> docid2, wdf2, ,,,,,,, > encoded by VSEncoder. It will cost less space.
> But it also cost extra time to decode. Which is preferred?

Cost less space and extra time compared to what?

Generally reducing space reduces the amount of data that has to be read 
from disk, and increases the amount that can be cached (and so the
probability that a read can be serviced from cache), so especially if
the time penalty is small, reduced space probably wins.  But I'm not
clear what the two options here are, so it's hard to be specific.

Cheers,
    Olly