[Xapian-tickets] [Xapian] #782: Improve encode_length() performance
Xapian
nobody at xapian.org
Wed May 8 16:45:45 BST 2019
#782: Improve encode_length() performance
----------------------------+-------------------------
Reporter: Kronuz | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Backend-Remote | Version:
Severity: minor | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
----------------------------+-------------------------
Comment (by Kronuz):
Yes, apparently gcc makes a better job at producing faster code here. I
tweaked a new "optimized" which really only moves the returns inside the
`if()`s (for some reason it gives a little bit of gain):
{{{#!c++
template<class T>
std::string
encode_length_optimized(T len)
{
std::string result;
if (len < 255) {
result += static_cast<unsigned char>(len);
return result;
} else {
result += '\xff';
len -= 255;
while (true) {
unsigned char b = static_cast<unsigned char>(len &
0x7f);
len >>= 7;
if (!len) {
result += b | static_cast<unsigned
char>(0x80);
break;
}
result += b;
}
return result;
}
}
}}}
gcc with `(x == 0xfff)`:
{{{
Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 1.70, 2.76, 3.47
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_EncodeLength_Original 132 ns 132 ns 5234075
BM_EncodeLength_Optimized 131 ns 131 ns 5338377
BM_EncodeLength_Pack 108 ns 108 ns 6482442
}}}
clang with `(x == 0xfff)`:
{{{
Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original 220 ns 220 ns 3019206
BM_EncodeLength_Optimized 219 ns 219 ns 3132061
BM_EncodeLength_Pack 55 ns 55 ns 12631731
}}}
On the other hand, this is what I'm seeing `encode_length()` being used
for:
{{{
len(sample) = 27,712,252
min(sample) = 0
max(sample) = 254
statistics.median(sample) = 3.0
statistics.mean(sample) = 20.126449596260276
statistics.stdev(sample) = 42.358357658089815
statistics.variance(sample) = 1794.2304634906563
}}}
So benchmarks with `x == 0xff` follow.
gcc with `(x == 0xff)`:
{{{
Run on (4 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 1.24, 2.00, 2.18
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_EncodeLength_Original 28.0 ns 28.0 ns 23851060
BM_EncodeLength_Optimized 20.5 ns 20.5 ns 33096144
BM_EncodeLength_Pack 19.0 ns 19.0 ns 36725357
BM_EncodeLength_Char 12.8 ns 12.8 ns 53820486
}}}
clang with `(x == 0xff)`:
{{{
2019-05-08 10:37:56
Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original 115 ns 115 ns 5985310
BM_EncodeLength_Optimized 115 ns 114 ns 5641522
BM_EncodeLength_Pack 29 ns 29 ns 23824029
BM_EncodeLength_Char 26 ns 26 ns 26767312
}}}
Given that the statistics in a real-world example (above) show that `len`
is almost (if not always) less or equal to 254, the second set of
benchmarks is more relevant. Clearly, for clang, pack makes a big
difference, but not so much with gcc. The "old optimized" version failed
to produce any improvements in gcc no matter what, so I discarded it
altogether; the "new optimized" version, with the returns inside the
`if()`s make very little difference in clang but makes `encode_length()`
for this small numbers almost as fast as `pack_uint()` with gcc. Other
than always simply sending the length as a single char only (which is
evidently always much faster), I'd propose only moving the returns inside
the `if()`s (and using `static_cast`, for style only).
--
Ticket URL: <https://trac.xapian.org/ticket/782#comment:9>
Xapian <https://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list