[Xapian-tickets] [Xapian] #782: Improve encode_length() performance

Wed May 8 08:15:19 BST 2019

#782: Improve encode_length() performance
----------------------------+-------------------------
 Reporter:  Kronuz          |             Owner:  olly
     Type:  enhancement     |            Status:  new
 Priority:  normal          |         Milestone:
Component:  Backend-Remote  |           Version:
 Severity:  minor           |        Resolution:
 Keywords:                  |        Blocked By:
 Blocking:                  |  Operating System:  All
----------------------------+-------------------------

Comment (by olly):

 I've done some more tests, and also now tested on x86-64 Linux.

 The tl;dr version is that it looks to me like your code is probably better
 with clang, but probably slower with GCC.  And that's clearly problematic.

 I made some changes to the benchmark code.  I don't have actual data, but
 I think testing the full range of `unsigned long long` is not realistic.
 Most encoded values will be term lengths or wdf values, both of which I'd
 expect to mostly fit in one byte.  So I changed the start point to be
 `0xfff`.  The `unsigned long long` type itself is also not representative,
 as almost all the types encoded will be 32 bit by default.
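
 To see why the input range matters, here is a minimal sketch of a length
 encoder.  It assumes a format where values below 255 occupy a single byte
 and larger values are escaped with `0xff` followed by 7-bit groups; that
 format and the name `encode_length_sketch` are assumptions for
 illustration, not the code being benchmarked.

```cpp
#include <string>

// Hypothetical stand-in for the encoder under test (assumed format):
// values below 255 take a single byte; larger values are escaped with 0xff
// and followed by 7-bit groups, with the high bit set on the final group.
template<class T>
std::string encode_length_sketch(T len)
{
    std::string result;
    if (len < 255) {
        // Fast path: the value itself is the only byte emitted.
        result += static_cast<char>(static_cast<unsigned char>(len));
        return result;
    }
    result += '\xff';
    len -= 255;
    while (true) {
        unsigned char b = static_cast<unsigned char>(len & 0x7f);
        len >>= 7;
        if (!len) {
            // Last group: mark it by setting the high bit.
            result += static_cast<char>(b | 0x80);
            break;
        }
        result += static_cast<char>(b);
    }
    return result;
}
```

 Under those assumptions a one-byte value never enters the loop, while
 `0xfff` takes three bytes, so the chosen range decides which path the
 benchmark actually exercises.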

 I also changed the benchmark loop to the form the libbenchmark docs
 recommend unless you need C++98 compatibility (and which they especially
 recommend for benchmarked code that doesn't take very long to execute,
 like ours).

 With that, here's what I get on Linux:

 {{{
 $ clang++-7 -std=c++1z -pedantic -Wall -Wextra -O3 -lbenchmark \
     -lbenchmark_main -o benchmark-encode_length ./benchmark-encode_length.cc \
     && ./benchmark-encode_length
 2019-05-08 18:50:50
 Running ./benchmark-encode_length
 Run on (8 X 3900 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x4)
   L1 Instruction 32K (x4)
   L2 Unified 256K (x4)
   L3 Unified 8192K (x1)
 ***WARNING*** CPU scaling is enabled, the benchmark real time measurements
 may be noisy and will incur extra overhead.
 ***WARNING*** Library was built as DEBUG. Timings may be affected.
 -----------------------------------------------------------------
 Benchmark                          Time           CPU Iterations
 -----------------------------------------------------------------
 BM_EncodeLength_Original         155 ns        155 ns    4506744
 BM_EncodeLength_Optimized        113 ns        113 ns    6190602
 $ g++ -std=c++1z -pedantic -Wall -Wextra -O3 -lbenchmark -lbenchmark_main \
     -o benchmark-encode_length ./benchmark-encode_length.cc \
     && ./benchmark-encode_length
 2019-05-08 18:50:59
 Running ./benchmark-encode_length
 Run on (8 X 3900 MHz CPU s)
 CPU Caches:
   L1 Data 32K (x4)
   L1 Instruction 32K (x4)
   L2 Unified 256K (x4)
   L3 Unified 8192K (x1)
 ***WARNING*** CPU scaling is enabled, the benchmark real time measurements
 may be noisy and will incur extra overhead.
 ***WARNING*** Library was built as DEBUG. Timings may be affected.
 -----------------------------------------------------------------
 Benchmark                          Time           CPU Iterations
 -----------------------------------------------------------------
 BM_EncodeLength_Original          88 ns         88 ns    7929005
 BM_EncodeLength_Optimized        117 ns        117 ns    5926981
 }}}

 So your optimised version improves with clang but regresses with GCC.
 Interestingly your version seems to have pretty much the same performance
 with both compilers, but the current code works much better with GCC than
 clang.

 BTW, this is clang using libstdc++ - I tried with `-stdlib=libc++` but the
 built program segfaults.

 I see essentially the same picture on x86 Linux.

 So where do we go from here?

 I think gathering some representative data on the values which get encoded
 would be useful, so we can actually make sure we're benchmarking something
 realistic.  In particular if the single byte case dominates that makes a
 difference.
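
 A minimal sketch of how logged values could be summarised, assuming they
 have already been collected into a vector.  The helper names
 `significant_bytes` and `byte_width_histogram` are hypothetical and are
 not part of Xapian or xapiand.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of bytes needed to hold `value`
// (0 and 1..255 both count as one byte).
inline std::size_t significant_bytes(std::uint64_t value)
{
    std::size_t n = 1;
    while (value >>= 8) ++n;
    return n;
}

// Tally how many logged values need 1, 2, ..., 8 bytes.
// histogram[0] counts one-byte values, histogram[7] counts eight-byte values.
inline std::array<std::size_t, 8>
byte_width_histogram(const std::vector<std::uint64_t>& values)
{
    std::array<std::size_t, 8> histogram{};
    for (std::uint64_t v : values) {
        ++histogram[significant_bytes(v) - 1];
    }
    return histogram;
}
```

 If `histogram[0]` dwarfs the other buckets, that would suggest the
 single byte case is the one worth optimising for.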

 Having some real data could also inform a decision about whether the
 current encoding is actually a good choice.  Using the simpler format
 which `pack_uint()` produces is an option for example.
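
 For comparison, a sketch of the simpler format, assuming `pack_uint()`
 produces the usual base-128 variable-length encoding: 7 data bits per
 byte, least significant group first, with the high bit set on every byte
 except the last.  The name `pack_uint_sketch` is hypothetical and the
 layout is an assumption rather than a copy of the Xapian source.

```cpp
#include <string>

// Assumed base-128 format: emit 7 bits at a time, lowest group first, and
// set the high bit on every byte that is followed by another byte.
template<class U>
void pack_uint_sketch(std::string& s, U value)
{
    while (value >= 128) {
        s += static_cast<char>(static_cast<unsigned char>(value & 0x7f) | 0x80);
        value >>= 7;
    }
    // Final byte: high bit clear, which terminates the sequence.
    s += static_cast<char>(value);
}
```

 With this layout there is no escape byte and no subtraction, and values up
 to 127 still fit in a single byte.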

 Do you have a real world xapiand deployment you could log such data from?
 If not, I can see if I can find somewhere to get suitable data from.  Or
 perhaps we both should, to get some idea of how much variability there
 is.

--
Ticket URL: <https://trac.xapian.org/ticket/782#comment:7>
Xapian <https://xapian.org/>