[Xapian-devel] Question about Project: Posting list encoding improvements
张旭东
zhang.xd927 at gmail.com
Fri Apr 5 14:42:28 BST 2013
Dear all:
My name is Xudong Zhang, from Peking University. I'm very interested in the
Project: Posting list encoding improvements, and I have had some experience
on encoding, and so I want to be involved in the project.
About the project,I have a question to consult:
What's the main goal of the improvement of the speed for the encoding and
decoding of posting lists? Is it OK to increase the encoding/decoding speed
greatly with the slight loss of compression?
Below is about my personal information. I'm a master student from Peking
University, Beijing, China. My research area is search engine and web
mining, and I've been focusing on compression algorithms of posting lists
in search engines since last February. I have proposed a novel decoding
algorithm called SIMD-PFORDelta, which exploit SIMD instructions in
PFORDelta. It is fast and can achieve good compression ratio. Recently, I
also did some experiments to test the state-of-the-art and proposed
encoding algorithms of posting lists. We used our own search engine called
PARADISE and the datasets are GOV2(25 million docs) and ClueWeb09B(50
million docs). Below is some experimental results of docIDs’ gap sequence
on ClueWeb09B, which is similar to Lemire's results[1] (The implementation
of SIMD_pfor_delta is somewhat different to that in Lemire's paper):
Algorithm Name
Compression Speed(M/s)
Decompression Speed(M/s)
Compression Ratio
simple9
88.2749297
500.8305639
16.72%
simple16
51.29381855
488.0998879
15.55%
var_byte
492.9379145
519.1921354
26.02%
group_var_byte
243.7189985
501.3713526
31.76%
group_var_byte_incomp_unary
124.5200775
501.8188062
28.77%
group_var_byte_comp_unary
123.1249031
461.8294703
28.70%
SIMD_group_var_byte
240.296232
793.0340184
31.76%
SIMD_group_var_byte_incomp_unary
126.3003402
1325.49703
28.77%
SIMD_group_var_byte_comp_unary
126.3347572
1128.968014
28.70%
packed_binary
252.6257354
1151.000784
27.55%
SIMD_packed_binary
254.2433557
1507.316133
27.55%
pfor_delta
26.99031483
1026.717585
19.53%
SIMD_pfor_delta
27.23024567
1580.111396
19.53%
rice
68.74742448
79.16529964
15.77%
k_gamma
172.9461803
242.8787555
15.24%
gamma_opt
61.7750133
59.30571281
14.91%
SIMD__kgamma
174.1829249
262.0594328
15.24%
Thank you very much!
P.S. In D. Lemire's paper: Decoding billions of integers per second through
vectorization(CORR'12), SIMD-PORDelta can outperforms VSEncoding w.r.t
decoding speed, while remaining nearly the same compression ratio, namely,
bits/int.
[1] D. Lemire et al. Decoding billions of integers per second through
vectorization, CORR’12, http://arxiv.org/abs/1209.2137
Best wishes,
Xudong Zhang
--
---------------------------
Xudong Zhang
+86-13488695609
Team of Search Engine and Web Mining
School of Electronic Engineering and Computer Science
Peking University, Beijing, 100871, P.R.China
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20130405/a80742fe/attachment-0001.htm>
More information about the Xapian-devel
mailing list