[Xapian-devel] GSOC - Posting list encoding improvements

Sat Apr 6 11:22:27 BST 2013

Implemented it on https://github.com/mtibeica/xapian

I could also find the size to which the stack will get (the order of
the highest bit in k - j) and initialize it to avoit reallocations?
I tested by using it in BrassPositionList::read_data. All the unit tests passed.

I will focus on finding a way to reduce the size of DIState.

On Tue, Apr 2, 2013 at 4:37 PM, Marius Tibeica <mtibeica at gmail.com> wrote:
>> It would be better to have this discussion on the mailing list really.
>
> Done! Tried to get as much information in here as possible.
>
>>
>> On Mon, Apr 01, 2013 at 04:22:09PM +0300, Marius Tibeica wrote:
>> > I've looked through the code of BitReader and BitWriter.
>> >
>> > From what I understood, we could keep state and allow repeated calls, but
>> > we can't get the list in order without extra O(n) memory (how it is now)
>> > unless we make that an O(log n) operation for each item, meaning O(n*log n)
>> > total.
>>
>> I don't think we want to make the decode time complexity worse.  The
>> current space complexity might be unavoidable (though actually I think
>> we only need O(log(n)) space - see below), but I think the key win to
>> be had is to avoid doing a full decode just to get the first few entries
>> (in terms of time, the first entry is O(1) to decode, the second
>> O(log(n)), but the third isn't an additive O(log(n)) on top of that).
>>
>> > To do this we must:
>> >   - initialize the interpolative decode
>> >       - keep the state information from the BitReader(idx, n_bits, acc) for
>> > future requests - probably returned as a struct?
>>
>> Yes, I think you'd want to wrap up the state in a class.
>>
>> >       - make a full decode (without keeping the termpos array) to update
>> > the BitWriter so that other future decodings are properly done (initial
>> > O(n) when creating the state).
>>
>> We're trying to avoid the work of doing the full decode though, so doing
>> it first and throwing the results away seems a step backwards.  We are
>> unlikely to be able to store this data in a persistent enough way for
>> this to be useful (for a phrase search, we potentially decode
>> #terms_in_phrase*#docs_matching_all_those_terms position lists to check
>> phrase matches, though we try hard to avoid doing that many).
>>
>> >   - for each requested item of the list, use the stored state. The code
>> > will be very similar to decode_interpolative, but only calculate the "half"
>> > where the item is each time (hence the log n).
>> >
>> > Does this sound like a good approach?
>>
>> My thinking is that the code runs much like it currently does, except
>> that the state is in an object (rather than on the stack from recursive
>> calls to decode_interpolative and in the passed in pos vector), and
>> that it pauses its work when it knows the next position to return.
>>
>> I think we may actually only need to store O(log(n)) data to keep the
>> state - the depth of recursion looks to be O(log(n)), and if you
>> adjust the decode_interpolative like so then the pos vector doesn't
>> seem to be needed:
>>
>> void
>> BitReader::decode_interpolative(int j, int k,
>>                                 Xapian::termpos pos_j,
>>                                 Xapian::termpos pos_k)
>> {
>>     while (j + 1 < k) {
>>         const size_t mid = (j + k) / 2;
>>         // Decode one out of (pos_k - pos_j + 1) values
>>         // (less some at either end because we must be able to fit
>>         // all the intervening pos in)
>>         const size_t outof = pos_k - pos_j + j - k + 1;
>>         Xapian::termpos pos_mid = decode(outof) + (pos_j + mid - j);
>>         decode_interpolative(j, mid, pos_j, pos_mid);
>>         j = mid;
>>     }
>> }
>>
>> I believe if you check mid before the recursive call to
>> decode_interpolative and k after it, then you'll see all the elements in
>> order (plus out of order repeats before and/or after the useful
>> occurrence - it looks like if run to completion, you end up considering
>> about n*2 values to find the n elements, which is OK).
>>
>
> Guess I understood the request wrong.
> Using a binary search tree in order traversal modified version we get
> this solution, which is pretty close to what you said:
>
> extra items in the BitReader (or a derived class)
> stack<DIState> di_stack; //will get to a maximum of O(logn) space size indeed
> DIState di_current;
>
> struct DIState {
>   DIState() { set(0, 0, 0, 0); }
>   DIState(int p_j, int p_k, Xapian::termpos pos_j, Xapian::termpos
> pos_k) { set(j, k, pos_j, pos_k); }
>   void set(int j, int k) { this.j = j; this.k = k;, this.pos_j =
> pos_j; this.pos_k = pos_k; }
>   bool is_nextterm() { return j + 1 < k; };
>   bool is_uninitialized() { return j == 0 && k == 0 && pos_j == 0 &&
> pos_k == 0; }
>   Xapian::termpos pos_j, pos_k;
>   int j, k;
> };
>
> // after calling this method, decode_interpolative_next will
> sequentially return the termpos from j + 1 to k - 1
> // on all the decoding methods, including decode_interpolative, we
> must also check if a decode_interpolative has been started and not
> finished, and signal an error or do all the reads from the buffer.
> Xapian::termpos
> BitReader::decode_interpolative(int j, int k, Xapian::termpos pos_j,
> Xapian::termpos pos_k)
> {
>     di_current(j, k, pos_j, pos_k);
> }
>
> Xapian::termpos
> BitReader::decode_interpolative_next()
> {
>     if (di_current.is_uninitialized()) {
>         //signal an error
>     }
>     while (!di_stack.empty() || di_current.is_nextterm()) {
>         if (di_current.is_nextterm()) {
>             di_current.push(di_current);
>             int mid = (j + k) / 2;
>             int pos_mid = decode(pos[k] - pos[j] + j - k + 1) +
> (pos[j] + mid - j);
>             di_current.set(di_current.j, mid, di_current.pos_j, pos_mid);
>         } else {
>             Xapian::termpos pos_ret = di_current.pos_k;
>             if (di_stack.empty() && !di_current.is_nextterm()) {
>                 // signal an error. we have called this method too many times.
>             }
>             di_current = di_stack.top();
>             di_stack.pop();
>             int mid = (j + k) / 2;
>             int pos_mid = pos_ret;
>             di_current.set(mid, di_current.k, pos_mid, di_current.pos_k);
>             return pos_ret;
>         }
>     }
> }
>
> I may be able to improve this even further by keeping less information
> in each DIState (if I can get it otherwise). Will focus on this if the
> approach is ok.
> Hope the code is self explanatory. Will add more comments if necessary.
>
>
>> Cheers,
>>     Olly
>
>
>> > > One thing comes to mind - we currently have an interpolative coding
>> > > encoder and decoder, which are used for positional information.  It
>> > > would probably be good to have these as one option for posting lists
>> > > - for something like a boolean filter term, interpolative coding
>> > > is very compact, especially for dense cases.
>> > >
>> > > However our current decoder can only do a full unpack, so you need
>> > > to unpack to a temporary vector and then iterate that.  It would
>> > > be better if it could keep state and allow repeated calls to iterate
>> > > through the list, returning one entry at a time.  This would likely
>> > > benefit phrase searches as well as helping when used for postlists.
>> > >
>> > > The code is in common/bitstream.{cc,h} if you want to take a look.