[Xapian-tickets] [Xapian] #782: Improve encode_length() performance

Xapian nobody at xapian.org
Wed May 8 16:45:45 BST 2019


#782: Improve encode_length() performance
----------------------------+-------------------------
 Reporter:  Kronuz          |             Owner:  olly
     Type:  enhancement     |            Status:  new
 Priority:  normal          |         Milestone:
Component:  Backend-Remote  |           Version:
 Severity:  minor           |        Resolution:
 Keywords:                  |        Blocked By:
 Blocking:                  |  Operating System:  All
----------------------------+-------------------------

Comment (by Kronuz):

 Yes, apparently gcc makes a better job at producing faster code here. I
 tweaked a new "optimized" which really only moves the returns inside the
 `if()`s (for some reason it gives a little bit of gain):

 {{{#!c++
 template<class T>
 std::string
 encode_length_optimized(T len)
 {
         std::string result;
         if (len < 255) {
                 result += static_cast<unsigned char>(len);
                 return result;
         } else {
                 result += '\xff';
                 len -= 255;
                 while (true) {
                         unsigned char b = static_cast<unsigned char>(len &
 0x7f);
                         len >>= 7;
                         if (!len) {
                                 result += b | static_cast<unsigned
 char>(0x80);
                                 break;
                         }
                         result += b;
                 }
                 return result;
         }
 }
 }}}

 gcc with `(x == 0xfff)`:
 {{{
 Running ./benchmark-encode_length.2
 Run on (4 X 3000 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x2)
   L1 Instruction 32K (x2)
   L2 Unified 262K (x2)
   L3 Unified 4194K (x1)
 Load Average: 1.70, 2.76, 3.47
 --------------------------------------------------------------------
 Benchmark                          Time             CPU   Iterations
 --------------------------------------------------------------------
 BM_EncodeLength_Original         132 ns          132 ns      5234075
 BM_EncodeLength_Optimized        131 ns          131 ns      5338377
 BM_EncodeLength_Pack             108 ns          108 ns      6482442
 }}}

 clang with `(x == 0xfff)`:
 {{{
 Running ./benchmark-encode_length.2
 Run on (4 X 3000 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x2)
   L1 Instruction 32K (x2)
   L2 Unified 262K (x2)
   L3 Unified 4194K (x1)
 -----------------------------------------------------------------
 Benchmark                          Time           CPU Iterations
 -----------------------------------------------------------------
 BM_EncodeLength_Original         220 ns        220 ns    3019206
 BM_EncodeLength_Optimized        219 ns        219 ns    3132061
 BM_EncodeLength_Pack              55 ns         55 ns   12631731
 }}}

 On the other hand, this is what I'm seeing `encode_length()` being used
 for:

 {{{
   len(sample) = 27,712,252
   min(sample) = 0
   max(sample) = 254
   statistics.median(sample) = 3.0
   statistics.mean(sample) = 20.126449596260276
   statistics.stdev(sample) = 42.358357658089815
   statistics.variance(sample) = 1794.2304634906563
 }}}

 So benchmarks with `x == 0xff` follow.

 gcc with `(x == 0xff)`:
 {{{
 Run on (4 X 3000 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x2)
   L1 Instruction 32K (x2)
   L2 Unified 262K (x2)
   L3 Unified 4194K (x1)
 Load Average: 1.24, 2.00, 2.18
 --------------------------------------------------------------------
 Benchmark                          Time             CPU   Iterations
 --------------------------------------------------------------------
 BM_EncodeLength_Original        28.0 ns         28.0 ns     23851060
 BM_EncodeLength_Optimized       20.5 ns         20.5 ns     33096144
 BM_EncodeLength_Pack            19.0 ns         19.0 ns     36725357
 BM_EncodeLength_Char            12.8 ns         12.8 ns     53820486
 }}}

 clang with `(x == 0xff)`:
 {{{
 2019-05-08 10:37:56
 Running ./benchmark-encode_length.2
 Run on (4 X 3000 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x2)
   L1 Instruction 32K (x2)
   L2 Unified 262K (x2)
   L3 Unified 4194K (x1)
 -----------------------------------------------------------------
 Benchmark                          Time           CPU Iterations
 -----------------------------------------------------------------
 BM_EncodeLength_Original         115 ns        115 ns    5985310
 BM_EncodeLength_Optimized        115 ns        114 ns    5641522
 BM_EncodeLength_Pack              29 ns         29 ns   23824029
 BM_EncodeLength_Char              26 ns         26 ns   26767312
 }}}

 Given that the statistics in a real-world example (above) show that `len`
 is almost (if not always) less or equal to 254, the second set of
 benchmarks is more relevant. Clearly, for clang, pack makes a big
 difference, but not so much with gcc. The "old optimized" version failed
 to produce any improvements in gcc no matter what, so I discarded it
 altogether; the "new optimized" version, with the returns inside the
 `if()`s make very little difference in clang but makes `encode_length()`
 for this small numbers almost as fast as `pack_uint()` with gcc. Other
 than always simply sending the length as a single char only (which is
 evidently always much faster), I'd propose only moving the returns inside
 the `if()`s (and using `static_cast`, for style only).

--
Ticket URL: <https://trac.xapian.org/ticket/782#comment:9>
Xapian <https://xapian.org/>
Xapian



More information about the Xapian-tickets mailing list