[Xapian-devel] GSOC - Posting list encoding improvements

Tue Apr 2 14:37:57 BST 2013

> It would be better to have this discussion on the mailing list really.

Done! Tried to get as much information in here as possible.

>
> On Mon, Apr 01, 2013 at 04:22:09PM +0300, Marius Tibeica wrote:
> > I've looked through the code of BitReader and BitWriter.
> >
> > From what I understood, we could keep state and allow repeated calls, but
> > we can't get the list in order without extra O(n) memory (how it is now)
> > unless we make that an O(log n) operation for each item, meaning O(n*log n)
> > total.
>
> I don't think we want to make the decode time complexity worse.  The
> current space complexity might be unavoidable (though actually I think
> we only need O(log(n)) space - see below), but I think the key win to
> be had is to avoid doing a full decode just to get the first few entries
> (in terms of time, the first entry is O(1) to decode, the second
> O(log(n)), but the third isn't an additive O(log(n)) on top of that).
>
> > To do this we must:
> >   - initialize the interpolative decode
> >       - keep the state information from the BitReader(idx, n_bits, acc) for
> > future requests - probably returned as a struct?
>
> Yes, I think you'd want to wrap up the state in a class.
>
> >       - make a full decode (without keeping the termpos array) to update
> > the BitWriter so that other future decodings are properly done (initial
> > O(n) when creating the state).
>
> We're trying to avoid the work of doing the full decode though, so doing
> it first and throwing the results away seems a step backwards.  We are
> unlikely to be able to store this data in a persistent enough way for
> this to be useful (for a phrase search, we potentially decode
> #terms_in_phrase*#docs_matching_all_those_terms position lists to check
> phrase matches, though we try hard to avoid doing that many).
>
> >   - for each requested item of the list, use the stored state. The code
> > will be very similar to decode_interpolative, but only calculate the "half"
> > where the item is each time (hence the log n).
> >
> > Does this sound like a good approach?
>
> My thinking is that the code runs much like it currently does, except
> that the state is in an object (rather than on the stack from recursive
> calls to decode_interpolative and in the passed in pos vector), and
> that it pauses its work when it knows the next position to return.
>
> I think we may actually only need to store O(log(n)) data to keep the
> state - the depth of recursion looks to be O(log(n)), and if you
> adjust the decode_interpolative like so then the pos vector doesn't
> seem to be needed:
>
> void
> BitReader::decode_interpolative(int j, int k,
>                                 Xapian::termpos pos_j,
>                                 Xapian::termpos pos_k)
> {
>     while (j + 1 < k) {
>         const size_t mid = (j + k) / 2;
>         // Decode one out of (pos_k - pos_j + 1) values
>         // (less some at either end because we must be able to fit
>         // all the intervening pos in)
>         const size_t outof = pos_k - pos_j + j - k + 1;
>         Xapian::termpos pos_mid = decode(outof) + (pos_j + mid - j);
>         decode_interpolative(j, mid, pos_j, pos_mid);
>         j = mid;
>     }
> }
>
> I believe if you check mid before the recursive call to
> decode_interpolative and k after it, then you'll see all the elements in
> order (plus out of order repeats before and/or after the useful
> occurrence - it looks like if run to completion, you end up considering
> about n*2 values to find the n elements, which is OK).
>

Guess I understood the request wrong.
Using a binary search tree in order traversal modified version we get
this solution, which is pretty close to what you said:

extra items in the BitReader (or a derived class)
stack<DIState> di_stack; //will get to a maximum of O(logn) space size indeed
DIState di_current;

struct DIState {
  DIState() { set(0, 0, 0, 0); }
  DIState(int p_j, int p_k, Xapian::termpos pos_j, Xapian::termpos
pos_k) { set(j, k, pos_j, pos_k); }
  void set(int j, int k) { this.j = j; this.k = k;, this.pos_j =
pos_j; this.pos_k = pos_k; }
  bool is_nextterm() { return j + 1 < k; };
  bool is_uninitialized() { return j == 0 && k == 0 && pos_j == 0 &&
pos_k == 0; }
  Xapian::termpos pos_j, pos_k;
  int j, k;
};

// after calling this method, decode_interpolative_next will
sequentially return the termpos from j + 1 to k - 1
// on all the decoding methods, including decode_interpolative, we
must also check if a decode_interpolative has been started and not
finished, and signal an error or do all the reads from the buffer.
Xapian::termpos
BitReader::decode_interpolative(int j, int k, Xapian::termpos pos_j,
Xapian::termpos pos_k)
{
    di_current(j, k, pos_j, pos_k);
}

Xapian::termpos
BitReader::decode_interpolative_next()
{
    if (di_current.is_uninitialized()) {
        //signal an error
    }
    while (!di_stack.empty() || di_current.is_nextterm()) {
        if (di_current.is_nextterm()) {
            di_current.push(di_current);
            int mid = (j + k) / 2;
            int pos_mid = decode(pos[k] - pos[j] + j - k + 1) +
(pos[j] + mid - j);
            di_current.set(di_current.j, mid, di_current.pos_j, pos_mid);
        } else {
            Xapian::termpos pos_ret = di_current.pos_k;
            if (di_stack.empty() && !di_current.is_nextterm()) {
                // signal an error. we have called this method too many times.
            }
            di_current = di_stack.top();
            di_stack.pop();
            int mid = (j + k) / 2;
            int pos_mid = pos_ret;
            di_current.set(mid, di_current.k, pos_mid, di_current.pos_k);
            return pos_ret;
        }
    }
}

I may be able to improve this even further by keeping less information
in each DIState (if I can get it otherwise). Will focus on this if the
approach is ok.
Hope the code is self explanatory. Will add more comments if necessary.

> Cheers,
>     Olly

> > > One thing comes to mind - we currently have an interpolative coding
> > > encoder and decoder, which are used for positional information.  It
> > > would probably be good to have these as one option for posting lists
> > > - for something like a boolean filter term, interpolative coding
> > > is very compact, especially for dense cases.
> > >
> > > However our current decoder can only do a full unpack, so you need
> > > to unpack to a temporary vector and then iterate that.  It would
> > > be better if it could keep state and allow repeated calls to iterate
> > > through the list, returning one entry at a time.  This would likely
> > > benefit phrase searches as well as helping when used for postlists.
> > >
> > > The code is in common/bitstream.{cc,h} if you want to take a look.