[Xapian-devel] apply hash in doclength

Hurricane Tong zhangshangtong.cpp at qq.com
Sat Mar 15 06:36:11 GMT 2014


Hi,


I'm writing my proposal, besides skip list, fixed width and VSEncoding, I'd like to apply hash in storing doclength.


Firstly we devise a proper hash function h(). When dealing with a document with did1, we have a chunk storing did1 and the length of the doc. Let t = h(did1), then we use t+term as an index to save the chunk in B tree.
When  meeting did2, where h(did2) = h(did1)  = t, we append did2 and the length of doc did2 to the chunk indexed by t+term. So when we want to get the length of did2, we can exactly get the chunk containing did2 by querying B tree with h(did2)+term. 


The key is to devise such a h() that we have proper docids be hashed into same chunk. If too many docids are in same chunk, the efficiency of searching in this chunk will be too slow. If we have too many chunks, I'm afraid the efficiency of B tree will be lowered.


Besides, inserting and deleting will be easy in this way.


I think taking modulus with a prime may be a good hash function. Depending on docids, we can select different prime.


I'm trying to do further research in hashing, would you like to give some suggestions about the algorithm?


------------------
Shangtong Zhang,Second Year Undergraduate,
School of Computer Science,
Fudan University, China.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140315/b03825fa/attachment.html>


More information about the Xapian-devel mailing list