[Xapian-devel] some trouble when devising skiplist

Tue May 13 17:06:25 BST 2014

Hi,

The extra space needed for skiplist isn't much.
I just insert a few triples (-1,x,x) into original chunk.
The height of the skiplist is no more than 4 currently, so the number of triples won't be very larger.
And by skiplist, we can apply more compact way to encoding number, such as gamma encoding, 
to replace current way in pack_uint.
But maybe compactness in doclen post list isn't of much importance, 
after all we only have one doclen post list.
But if we apply this format to oridinary post list, it may be meaningful. 

As for the case where docids are contiguous, fixed-width format is truly faster, 
but when a docid is deleted, docids aren't contiguous. 
The original contiguous docids become two parts, in each parts, they are contiguous.
So I think we should first encode the <docid,doclen> in current way, 
when we discover some contiguous id block, then we use fixed-width,
and after the contiguous block is encoded, we then turn back to current way. 

Best Regards

------------------
Shangtong Zhang,Second Year Undergraduate,
School of Computer Science,
Fudan University, China.

------------------ Original ------------------
From:  "Olly Betts";<olly at survex.com>;
Date:  Sun, May 11, 2014 03:12 PM
To:  "Hurricane Tong"<zhangshangtong.cpp at qq.com>; 
Cc:  "xapian-devel"<xapian-devel at lists.xapian.org>; 
Subject:  Re: [Xapian-devel] some trouble when devising skiplist

On Sat, May 10, 2014 at 11:12:12AM +0800, Hurricane Tong wrote:
> I was confronted with some trouble, I describe the trouble in my journal
> http://trac.xapian.org/wiki/GSoC2014/Posting%20list%20encoding%20improvements/Journal#May10
> And corresponding code is in my git.

How big is your encoded list + skip list?

I wonder if a skip list is not worth the complications it causes on
update.

If we optimise for the case of used document ids being mostly contiguous
(which is often the case), we could just store a list of the lengths
(with a dummy value for any gaps).  If we used a fixed-width format,
then looking up a document length is O(1) (it's effectively an array
subscripting operation), and that'll definitely be faster than
navigating via a skip list.

And that's also easy to update (the hardest case is when a new document
length exceeds the fixed size, and you need to reencode or split).

E.g. if all the doc lengths in a chunk were < 65536 (which is often
going to be true), you would need 2 bytes per document.  The current
docid delta + vint encoded document length will need at least 2 bytes
per document, plus whatever space the skip list takes up.

Cheers,
    Olly
.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140514/37812a68/attachment.html>