[Xapian-devel] Question about Project: Posting list encoding improvements

张旭东 zhang.xd927 at gmail.com
Fri Apr 5 14:42:28 BST 2013


Dear all:

My name is Xudong Zhang, from Peking University. I'm very interested in the
Project: Posting list encoding improvements, and I have had some experience
on encoding, and so I want to be involved in the project.

About the project,I have a question to consult:

What's the main goal of the improvement of the speed for the encoding and
decoding of posting lists? Is it OK to increase the encoding/decoding speed
greatly with the slight loss of compression?

Below is about my personal information. I'm a master student from Peking
University, Beijing, China. My research area is search engine and web
mining, and I've been focusing on compression algorithms of posting lists
in search engines since last February. I have proposed a novel decoding
algorithm called SIMD-PFORDelta, which exploit SIMD instructions in
PFORDelta. It is fast and can achieve good compression ratio. Recently, I
also did some experiments to test the state-of-the-art and proposed
encoding algorithms of posting lists. We used our own search engine called
PARADISE and the datasets are GOV2(25 million docs) and ClueWeb09B(50
million docs). Below is some experimental results of docIDs’ gap sequence
on ClueWeb09B, which is similar to Lemire's results[1] (The implementation
of SIMD_pfor_delta is somewhat different to that in Lemire's paper):



Algorithm Name

Compression Speed(M/s)

Decompression Speed(M/s)

Compression Ratio

simple9

88.2749297

500.8305639

16.72%

simple16

51.29381855

488.0998879

15.55%

var_byte

492.9379145

519.1921354

26.02%

group_var_byte

243.7189985

501.3713526

31.76%

group_var_byte_incomp_unary

124.5200775

501.8188062

28.77%

group_var_byte_comp_unary

123.1249031

461.8294703

28.70%

SIMD_group_var_byte

240.296232

793.0340184

31.76%

SIMD_group_var_byte_incomp_unary

126.3003402

1325.49703

28.77%

SIMD_group_var_byte_comp_unary

126.3347572

1128.968014

28.70%

packed_binary

252.6257354

1151.000784

27.55%

SIMD_packed_binary

254.2433557

1507.316133

27.55%

pfor_delta

26.99031483

1026.717585

19.53%

SIMD_pfor_delta

27.23024567

1580.111396

19.53%

rice

68.74742448

79.16529964

15.77%

k_gamma

172.9461803

242.8787555

15.24%

gamma_opt

61.7750133

59.30571281

14.91%

SIMD__kgamma

174.1829249

262.0594328

15.24%


Thank you very much!

P.S. In D. Lemire's paper: Decoding billions of integers per second through
vectorization(CORR'12), SIMD-PORDelta can outperforms VSEncoding w.r.t
decoding speed, while remaining nearly the same compression ratio, namely,
bits/int.

[1] D. Lemire et al. Decoding billions of integers per second through
vectorization, CORR’12, http://arxiv.org/abs/1209.2137

Best wishes,

Xudong Zhang

-- 
---------------------------

Xudong Zhang
+86-13488695609
Team of Search Engine and Web Mining
School of Electronic Engineering  and Computer Science
Peking University, Beijing, 100871, P.R.China
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20130405/a80742fe/attachment-0001.htm>


More information about the Xapian-devel mailing list