[Xapian-devel] apply hash in doclength

Olly Betts olly at survex.com
Sat Mar 15 11:01:25 GMT 2014


On Sat, Mar 15, 2014 at 02:36:11PM +0800, Hurricane Tong wrote:
> Firstly we devise a proper hash function h(). When dealing with a
> document with docid did1, we have a chunk storing did1 and the
> length of the doc. Let t = h(did1); then we use t+term as the key
> to save the chunk in the B-tree.

I don't follow "t+term" here.  What term are we using?  And why would
we want to involve a term?  We need to be able to find the doclength
independently of any particular term (and we don't want to save the
doclength once per term in the document - we used to do that, and it
wasted a lot of space!)

> When meeting did2, where h(did2) = h(did1) = t, we append did2 and
> the length of doc did2 to the chunk indexed by t+term. So when we
> want to get the length of did2, we can get exactly the chunk
> containing did2 by querying the B-tree with h(did2)+term.
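
If I'm reading that right (leaving aside the "term" part I queried
above), the proposal is roughly this - just a sketch with made-up
names, to have something concrete to discuss:

    // Hypothetical sketch of the proposed layout, not actual Xapian
    // code.
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using docid = std::uint32_t;
    using doclen = std::uint32_t;

    // One chunk holds every (docid, doclength) pair whose docid
    // hashes to the same value.
    using Chunk = std::vector<std::pair<docid, doclen>>;

    // Placeholder for the "proper hash function" h() in the proposal.
    std::uint32_t h(docid did) { return (did * 2654435761u) >> 22; }

    // std::map stands in for the Btree; the key is h(did).
    std::map<std::uint32_t, Chunk> btree;

    void add_doclength(docid did, doclen len) {
        btree[h(did)].emplace_back(did, len);  // append on collision
    }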

It's an interesting idea, but I don't think this is really going to work
well.

To implement MatchAll efficiently, we need to be able to read the used
document ids in ascending order, which we currently do by reading the
doclength list.  If the list is keyed by hash, it's no longer in docid
order, so we'd need to find another way to do this.
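
For contrast, with docid-keyed chunks the enumeration is trivial -
this is roughly the shape of what happens now (much simplified, and
reusing the toy types from the sketch above):

    // Chunks keyed by their first docid, as now.  An in-order walk
    // of the Btree then yields every used docid in ascending order.
    std::map<docid, Chunk> doclen_list;  // key = first docid in chunk

    template<typename Action>
    void for_each_used_docid(Action act) {
        for (const auto & keyed_chunk : doclen_list)      // keys ascend
            for (const auto & entry : keyed_chunk.second) // so do docids
                act(entry.first);
    }
    // Keyed by h(did) instead, neither loop is in docid order, so
    // MatchAll would need some other way to enumerate the used docids.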

And while matching, we access the doclengths in ascending docid order.
Hashing ruins the locality of reference we currently have - accesses
to the doclengths would jump around all over the Btree.
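
Here's a toy demonstration of that, with a simple multiplicative hash
standing in for whatever h() would end up being:

    // Consecutive docids land in scattered buckets, so sequential
    // doclength accesses lose all locality in the Btree.
    #include <cstdio>

    static unsigned bucket(unsigned did) {
        // Fibonacci hashing; just a stand-in for h().
        return (did * 2654435761u) >> 22;  // top 10 bits: 1024 buckets
    }

    int main() {
        for (unsigned did = 1; did <= 8; ++did)
            std::printf("docid %u -> bucket %u\n", did, bucket(did));
        return 0;
    }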

Also, when we look up a doclength under this scheme, we hash the docid,
do a Btree lookup (which is O(log(collection_size))), and then have to
scan through the chunk to find the docid we want (out of all of those
which hash the same).
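
Spelled out against the toy layout in my sketch above (still all
hypothetical names):

    #include <stdexcept>  // plus the includes from the sketch above

    // Each lookup pays a Btree search plus a scan of the whole bucket.
    doclen get_doclength(docid did) {
        const Chunk & chunk = btree.at(h(did)); // O(log collection_size)
        for (const auto & entry : chunk) {      // linear in the number
            if (entry.first == did)             // of colliding docids
                return entry.second;
        }
        throw std::out_of_range("docid not in database");
    }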

Currently, looking up the next doclength usually just involves scanning
forward through the current chunk, and when it doesn't, the next chunk
we want will be in the same area of the Btree, so it is still cheaper
to load than the essentially random block the next hashed key would
take us to.
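
In code, the common case is something like this (made-up cursor type,
reusing the toy Chunk from above):

    #include <cstddef>

    // Hypothetical cursor over the docid-ordered doclength list.
    struct DoclenCursor {
        Chunk chunk;             // the chunk we're currently reading
        std::size_t pos = 0;     // position within it
        bool load_next_chunk();  // next Btree block: nearby, often
    };                           // already cached

    bool next_doclength(DoclenCursor & c) {
        if (++c.pos < c.chunk.size())
            return true;             // common case: stay in this chunk
        return c.load_next_chunk();  // rare case: an adjacent block,
    }                                // not a random one as under hashing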

The aim is to make looking up the next doclength cheaper than it
currently is, not more expensive.

And during indexing, in the common case of adding new documents, we
can currently just accrue their document lengths until we have a
full chunk, then write that out and start the next chunk.  With the
hashed scheme, we'd have to load, update, and store each chunk as we add
a document (or else keep all the hashed blocks in memory, which is a lot
of memory for a large database).
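
Side by side, with hypothetical helpers and the toy types from above:

    // Indexing today: accrue doclengths, flush a chunk when full.
    Chunk pending;
    const unsigned CHUNK_ENTRIES = 2048;    // made-up flush threshold
    void write_out_chunk(const Chunk & c);  // one sequential write

    void add_doclength_current(docid did, doclen len) {
        pending.emplace_back(did, len);
        if (pending.size() == CHUNK_ENTRIES) {
            write_out_chunk(pending);
            pending.clear();
        }
    }

    // Indexing hashed: read-modify-write the bucket for every single
    // document added (unless every bucket is kept cached in memory).
    Chunk load_chunk(std::uint32_t key);
    void store_chunk(std::uint32_t key, const Chunk & c);

    void add_doclength_hashed(docid did, doclen len) {
        Chunk chunk = load_chunk(h(did));
        chunk.emplace_back(did, len);
        store_chunk(h(did), chunk);
    }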

Cheers,
    Olly


