[Xapian-devel] Project: Posting list encoding improvements

Olly Betts olly at survex.com
Mon Apr 9 04:05:46 BST 2012


On Sun, Apr 08, 2012 at 07:32:34PM -0400, Weixian Zhou wrote:
> Thanks for pointing out my misunderstanding of posting list encoding. I
> have some further questions.
> 
> *1) How many encoding methods are we required to implement?*

We don't have a set number in mind.  We'd rather see a smaller number
done well and merged to trunk than a larger number implemented in a
rush and all still sitting on a branch at the end of the summer.

> I dug into more of the literature and found some interesting ideas. I
> think there are 3 encoding methods worth a try:
> 
> a) optPFD. The fastest decompression speed, at the cost of a larger
> index. Although the speed looks good, we should also be aware that the
> experiments in most of the literature put the whole index in memory, so
> the time spent fetching from disk is simply ignored.

Yes, that's one thing to watch out for when benchmarking encodings - if
you have to load some of the data from disk (or all of it, in the
cold-cache case) then a smaller size gives an additional speed-up over
having everything in memory.  The (these days often multi-layered) CPU
caches mean this effect can hold even with everything in memory, but the
speed difference between disk and memory is typically much larger than
that between memory and CPU cache.

Another thing to beware of is that many of these encodings can't easily
be updated, so deleting or inserting an entry into an encoded chunk of
postings requires it to be fully decoded and re-encoded.  Some Xapian
users only append new documents, but probably the majority do at least
some updates or deletions.  This is a reason for supporting multiple
encodings, and picking the appropriate one(s) for the situation (perhaps
with a hint from the user about the expected update pattern).
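
To make that concrete, here's a hypothetical sketch (none of these
names exist in Xapian) of an update-pattern hint driving the choice of
encoding:

    // Hypothetical sketch only - not Xapian's API.  The user describes
    // the expected update pattern and the backend uses it to choose
    // between splice-friendly and block-based encodings.
    enum class UpdatePattern {
        APPEND_ONLY,    // new documents only; chunks never need splicing
        MOSTLY_STATIC,  // rare updates; occasional re-encoding is cheap
        DYNAMIC         // frequent updates/deletes; splicing must stay cheap
    };

    enum class PostingEncoding { VARIABLE_BYTE, OPT_PFD };

    PostingEncoding choose_encoding(UpdatePattern hint) {
        // Block codes like optPFD decode quickly, but inserting or
        // deleting one entry forces a full decode and re-encode of the
        // chunk, so they suit data which won't be updated in place.
        switch (hint) {
            case UpdatePattern::APPEND_ONLY:
            case UpdatePattern::MOSTLY_STATIC:
                return PostingEncoding::OPT_PFD;
            default:
                return PostingEncoding::VARIABLE_BYTE;
        }
    }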

> b) Simple-16. This is an improved version of Simple-9. It's slower
> than optPFD, but gives a smaller index.
> 
> c) VSEncoding [1]. This is a newer discovery. [1] reports that
> VSEncoding outperforms PFD and Simple-16 in both index size and
> decompression speed. The general idea that sets it apart from previous
> methods is that VSEncoding uses dynamic programming to optimally
> partition the lists into blocks.

I heard good things about VSEncoding at ECIR 2011.
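
For anyone who hasn't met the word-aligned family, here's a minimal
sketch of the packing idea behind Simple-9 - not Xapian code, and
Simple-16 refines the selector table, but the core is that each 32-bit
word spends 4 selector bits saying how its remaining 28 bits are split:

    #include <algorithm>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // The 9 selectors: count values of width bits (count * width <= 28).
    struct Mode { unsigned count, width; };
    static const Mode modes[9] = {
        {28, 1}, {14, 2}, {9, 3}, {7, 4}, {5, 5}, {4, 7}, {3, 9}, {2, 14}, {1, 28}
    };

    // Pack small non-negative integers (e.g. docid deltas) into words.
    std::vector<uint32_t> simple9_encode(const std::vector<uint32_t>& in) {
        std::vector<uint32_t> out;
        size_t pos = 0;
        while (pos < in.size()) {
            // Greedily pick the densest mode which fits the next values.
            unsigned sel = 0, take = 0;
            for (; sel < 9; ++sel) {
                take = unsigned(std::min<size_t>(modes[sel].count,
                                                 in.size() - pos));
                bool fits = true;
                for (unsigned i = 0; i < take; ++i)
                    if (in[pos + i] >> modes[sel].width) { fits = false; break; }
                if (fits) break;
            }
            if (sel == 9)
                throw std::invalid_argument("value needs more than 28 bits");
            uint32_t word = sel << 28;           // selector in the top 4 bits
            for (unsigned i = 0; i < take; ++i)  // payload packed below it
                word |= in[pos + i] << (i * modes[sel].width);
            out.push_back(word);
            pos += take;  // a short final word is zero-padded; the decoder
                          // relies on knowing the total element count
        }
        return out;
    }

Decoding is branch-light - read the selector, then a fixed shift/mask
loop - which is where the speed of this family comes from.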

It would also be really good to have an encoding we could use for
document lengths which provides O(1) seeking within a chunk (currently
the need to decode sequentially in order to seek within a chunk can
result in slow cases - there's a ticket about this, linked from the
project idea).  Or an encoding which addresses this issue in some other
way.
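
One obvious (if naive) shape for that - purely a sketch under assumed
names, not a proposal for the actual format, and assuming densely
assigned docids within the chunk - is to store each chunk of lengths at
a fixed per-chunk byte width, trading some compression for directly
addressable entries:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Each chunk stores every entry at one fixed byte width, so entry i
    // lives at byte offset i * width and seeking is pure arithmetic.
    struct FixedWidthChunk {
        unsigned width;              // bytes per entry, chosen per chunk
        std::vector<uint8_t> data;   // little-endian packed entries

        uint32_t get(size_t i) const {    // O(1): no sequential decode
            uint32_t v = 0;
            for (unsigned b = 0; b < width; ++b)
                v |= uint32_t(data[i * width + b]) << (8 * b);
            return v;
        }
    };

    FixedWidthChunk pack_lengths(const std::vector<uint32_t>& lengths) {
        uint32_t max = 0;
        for (uint32_t len : lengths) max = std::max(max, len);
        FixedWidthChunk c;
        c.width = max >= (1u << 24) ? 4 : max >= (1u << 16) ? 3
                : max >= (1u << 8) ? 2 : 1;
        for (uint32_t len : lengths)          // pad every entry to width
            for (unsigned b = 0; b < c.width; ++b)
                c.data.push_back(uint8_t(len >> (8 * b)));
        return c;
    }

That wastes bits compared to a variable-length code (a bit-width rather
than byte-width variant would claw some back), but a seek becomes a
multiply instead of a decode loop.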

> *2) What's your opinion of implementing docID reordering?*
> 
> I think it's a good option because of its significant reduction in
> index size (14%, and growing as the collection size increases). Also,
> query processing speed is more valuable than index compression speed.

As I said before, I get more feedback from people wishing indexing was
faster than I do about index size, and definitely more than I get about
query speed.  The query speed complaints are all about some particular
case or other, not about general performance.

Reordering is also entirely unsuitable for some users - for example,
gmane arranges for docid order to match date order, so "sort by date" is
just a matter of exposing the docid order, which is fast.  Reordering
is incompatible with this trick.

Calling add_document() currently returns the document id, so reordering
batches inside Xapian is incompatible with the current API.  We could
add an add_document_without_returning_docid() method (or some saner
name).  Or perhaps instead of returning a docid, we could return a proxy
object which wraps the arguments; depending on whether the result is
used or not, the proxy object would call one add_document() variant or
the other (though that wouldn't work for the bindings, only for code
using the C++ API directly).
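
Roughly this shape - an entirely hypothetical sketch, with none of the
real API or internals - where ignoring the returned temporary takes the
reorder-friendly path when it's destroyed at the end of the statement:

    #include <string>
    #include <utility>
    #include <vector>

    typedef unsigned docid;

    class WritableDatabase {
        std::vector<std::string> batch;  // docs still free to reorder
        docid next_id = 1;

      public:
        // Immediate path: a docid must be assigned and reported now.
        docid add_now(const std::string& doc) { (void)doc; return next_id++; }

        // Deferred path: the docid was never observed, so the document
        // can be batched and reordered before it's really added.
        void add_deferred(std::string doc) { batch.push_back(std::move(doc)); }

        class AddResult {
            WritableDatabase* db;
            std::string doc;
            mutable bool converted = false;
          public:
            AddResult(WritableDatabase* db_, std::string doc_)
                : db(db_), doc(std::move(doc_)) { }
            // Caller asked for the docid: use the immediate path.
            operator docid() const { converted = true; return db->add_now(doc); }
            // Result ignored: take the reorder-friendly path instead.
            ~AddResult() { if (!converted) db->add_deferred(std::move(doc)); }
            // (Copy/move handling omitted for brevity.)
        };

        AddResult add_document(std::string doc) {
            return AddResult(this, std::move(doc));
        }
    };

    int main() {
        WritableDatabase db;
        docid id = db.add_document("doc one");  // converted: immediate
        db.add_document("doc two");             // ignored: deferred
        (void)id;
    }

It's a fragile trick, though - storing the proxy somewhere without ever
converting it still takes the deferred path - which is partly why an
explicitly-named method may be the saner option.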

Omega currently uses the returned docid, so wouldn't benefit from this
unless it could be rewritten not to.

Overall it seems like a nice trick, but I feel that currently there's
plenty of scope for speed and size gains from improvements to the actual
encodings used, which will be useful to more users.

> *3) Are we going to compress term positions?*
> 
> Although term positions take up much more space in posting lists than
> docIDs and frequencies, little of the literature gives much thought to
> compressing them. The method proposed in [2] involves a lot of
> page-wise and context information, which makes it complex and less
> feasible in real engineering.

We currently compress term positions with interpolative coding.  We
don't need to be able to splice in updates here, and there's a lot of
data, so good compression is particularly important.  There might be
a better encoding to use, but interpolative is hard to beat for size.
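
For reference, the recursive structure is simple even if the
bit-twiddling in a real implementation isn't - a minimal encoder sketch
(not our implementation; the decoder mirrors it, and needs the element
count and outer bounds from the chunk header):

    #include <cstdint>
    #include <vector>

    // Append the low `bits` bits of v, most significant first.
    static void put_bits(std::vector<bool>& out, uint32_t v, unsigned bits) {
        for (unsigned i = bits; i-- > 0; )
            out.push_back((v >> i) & 1);
    }

    // Bits needed to distinguish n possible values (0 when n == 1).
    static unsigned bits_for(uint32_t n) {
        unsigned b = 0;
        while ((1u << b) < n) ++b;
        return b;
    }

    // Binary interpolative coding of a[begin, end): strictly increasing
    // values which both encoder and decoder know lie in [lo, hi].
    // Encode the middle element within the narrowest range the decoder
    // can deduce, then recurse on each half with tightened bounds.
    static void encode(const std::vector<uint32_t>& a, size_t begin,
                       size_t end, uint32_t lo, uint32_t hi,
                       std::vector<bool>& out) {
        if (begin == end) return;
        size_t mid = begin + (end - begin) / 2;
        // Elements either side of mid squeeze its feasible range: it
        // can't be below lo + (count on the left) nor above
        // hi - (count on the right).
        uint32_t base = lo + uint32_t(mid - begin);
        uint32_t top = hi - uint32_t(end - mid - 1);
        put_bits(out, a[mid] - base, bits_for(top - base + 1));
        encode(a, begin, mid, lo, a[mid] - 1, out);
        encode(a, mid + 1, end, a[mid] + 1, hi, out);
    }

When the feasible range collapses to a single value, bits_for() returns
0 and the element costs nothing, which is why dense runs of consecutive
positions compress so well.  The same tree structure is also what a
lazier decoder could exploit to avoid decoding a whole chunk up front.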

Currently we don't even split the term positions for a given
term+document into chunks.  If you look at the sizes of the encoded
entries, most are very short, though a small number might benefit from
being split.  Probably a better first step would be to make the
interpolative decoder not have to decode the whole chunk before
returning data, as we stop at the first phrase match for a given
document.

Cheers,
    Olly


