[Xapian-discuss] Tryout paches for faster chert search: http://trac.xapian.org/ticket/326

Richard Boulton richard at tartarus.org
Thu Sep 8 10:51:34 BST 2011

On 7 September 2011 13:41, Henry C. <henka at cityweb.co.za> wrote:
> I've had a stab at applying some of the patches, here's some feedback:
> small_doclen_chunks.patch
> Tested searching on a 161GB test index (which is comprised of 40 sub-indexes):
> Avg search time goes from about 0m7.226s to about 0m7.188s.
> So a marginal speedup.

That seems likely to be within the range of random variation...  Did
you rebuild the index with this patch applied?  The patch has no
effect unless the index is rebuilt, so if you didn't, the difference
you're seeing is just noise.

> avoid_string_appends_1.patch fails - chert_utils.h no longer exists.  I
> searched around in the commit emails, but things have clearly changed a bit.

I suspect the relevant bit has moved to common/pack.h, but haven't looked.

> avoid_string_appends_and_loop_unroll_2.patch - ditto
> avoid_string_operations.patch - ditto

I think there's a lot of potential in these patches for avoiding
memcpy-related work caused by string resizing.

> optimise_unpack_uint.patch - ditto

(See more detailed notes later in this email)

> chunktypes.patch
> Code in trunk has changed quite a bit where this patch touches - I started
> massaging things based on the *.cc.rej files, but things have evolved quite a
> bit, and my lack of intimate knowledge of xapian/backend/chert internals has
> me stumped.
> I'm keen to try chunktypes.patch, since its potential seems quite
> significant.  I have about 2TB total indexes to play with (segmented into
> various chunk sizes) so it may be interesting to see how it performs.
> optimise_unpack_uint.patch also seems to promise a speed-up worth pursuing.
> I'd love to try all the patches out to see what the total gains are (and will
> report them here), but need help with massaging the patches a bit.  Richard,
> since you are the original author of ticket #326 and the patches, could I beg
> your assistance?

I'll try and answer questions, but I don't have time to work much on
this at present.  I suspect the code has changed sufficiently that the
patches will need a lot of work to apply now (and for those which
change the format, it's more appropriate to try applying them against
brass rather than chert now, since the chert format is fully frozen).

A few thoughts:

Firstly, to really work on this usefully, we need some performance
tests which can easily be run on the same data by multiple developers
(ie, using shared data, and a script to setup and run the tests).
I've made stabs at this in the past, but have perhaps failed due to
being overambitious.  Ideally, we'd have a range of performance
scenarios; ie, various combinations of document sizes, database sizes,
and query complexity.  I'd settle very happily for a single scenario
and a simple framework to run it, though.  Currently, we have a
performance test framework in xapian-core/tests/perftest/ which may be
worth building on; the tests in there currently work on randomly
generated data, though, which isn't actually very informative.  It
also doesn't feel like quite the right way to approach it; the
framework is based on xapian's main unittest framework, which rather
gets in the way.

If someone wants to start a standalone project to run performance
tests, I'd support that effort!  Having the performance tests be in
the same repository as the library they're used to test makes it
considerably more awkward to run them on older versions of the
library; I'd suggest just making a new git project, these days. I
think it's important that the actual code which runs performance tests
is written in C++, since otherwise the overheads of the language
bindings get in the way of measurements of Xapian core, and it's much
harder to use profiling tools such as callgrind.  However, code for
downloading sample data, and pre-processing it into an easier form for
a test run can be in whatever language makes that easiest (I'd use
python).  I'd keep such a project as simple as possible; each test
case could be a separate binary, with a script to run them all in
whatever sequence is appropriate.  A good source of public sample
document data is Wikipedia. I believe the Stack Overflow site also
offers dumps of their complete data, which has interesting
characteristics (facet terms, for example).  Sources of realistic
query data are harder to come across - anyone got any good ideas for
that?  Randomly generated queries are better than nothing, but tend to
match loads generated by real queries very poorly.

Coming back to the patches, I think the easiest, least invasive, and
most promising of them for improving chert is the
optimise_unpack_uint patch.  This patch should be quite easy to modify
so that it applies - it's basically just adding a specific overloaded
form of the unpack_uint function, which used to be in chert_utils.h,
but has now moved to common/pack.h.  I'm not sure if the patch will
have an effect on 64 bit architectures, because the overload is for a
specific 32 bit type - an additional overload for a 64 bit type may be
needed there. This patch doesn't change the stored data format
at all, so could be applied to the 1.2.x branch if it works well.  If
this overload works well, adding an overload for pack_uint might help
speed up indexing, too, though I expect that to be a much less
significant effect.

The chunktypes patch was always a bit of a hack to try out an idea;
it's probably not too hard to get it applying again, but I'm not going
to have time in the near future.  There are many approaches I'd like
to investigate for improving the doclength performance, but I think
they're all dependent on getting a reliable source of performance data
to see which work in which situation.  Also, if we're making a special
storage type for doclen chunks, I'd like it to be possible to use
special storage types for value slots, too; a value slot containing
only integer values is conceptually very similar to the doclen chunks
(and indeed, could be used to store field lengths for implementing
weighting schemes like BM25F), so it would be nice to be able to reap
such benefits there.  I believe one of the features of Lucene 4 is to
allow custom codecs to be used for individual fields or terms - some
similar kind of framework for Xapian could make value and doclength
storage much more flexible, while also keeping the code reasonably
maintainable.   Before we get into any of that, though, we need a way
to measure what we're doing!

Hope that helps,

