[Xapian-devel] GSOC 2015 Performance Test Suite Project
Olly Betts
olly at survex.com
Mon Feb 23 21:49:40 GMT 2015
Hi Dulitha,
On Mon, Feb 23, 2015 at 10:56:21PM +0530, Dulitha Kularathne wrote:
> In the following path it seems like some performance tests are already
> defined.
>
> /xapian/xapian-core/tests/perftest/
>
> So can you give me some explanation regarding those tests. To what extent
> are they completed?? What more is expected ??
Quoting from:
http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:PerformanceTestSuite
| xapian-core/tests/perftest/ contains some "performance tests", but
| they use randomly generated data, so the results may not reflect what
| users will see
We don't want to test on randomly generated data, unless we can somehow
be sure that its characteristics are representative of real data in the
ways which affect what we're trying to measure (which is hard to
ensure).
As a particular example, one benchmark using randomly generated queries
I saw many years ago showed Xapian as slower than the system they were
comparing to. But if you actually looked at the queries it was slower
on, it was cases where the query didn't match anything, and their
randomly generated queries exercised that case far more than real-world
queries would, because the words used in real-world queries are far
from independent of one another.
If you excluded the non-matching queries, Xapian was dramatically
faster than the other system. While that obviously suggests there was
scope for improving Xapian's handling of cases where there are no
matches, I would say the main lesson to take away is that randomly
generating test data for performance tests can easily lead to bogus
results.
Hence:
| The tests should really use real-world data for both the documents
| being indexed and the queries being run.
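To illustrate the non-matching query effect described above, here is a
toy Python sketch. It is not Xapian code and not the benchmark I
mentioned: the corpus model (each document draws its words from one
"topic") and all the names in it are assumptions made up for the
example. It compares how often a two-term AND query matches nothing
when the terms are picked independently from the whole vocabulary
versus picked together from an existing document.

```python
import random


def build_corpus(rng, n_docs=200, n_topics=20, words_per_topic=30, doc_len=15):
    """Toy corpus: each document takes all its words from a single topic,
    so words co-occur in a correlated way (a crude stand-in for real text)."""
    topics = [[f"t{t}w{w}" for w in range(words_per_topic)]
              for t in range(n_topics)]
    docs = [set(rng.sample(rng.choice(topics), doc_len)) for _ in range(n_docs)]
    vocab = [word for topic in topics for word in topic]
    return docs, vocab


def matches(docs, query):
    """AND semantics: some document must contain every query term."""
    return any(query <= doc for doc in docs)


def zero_match_fraction(docs, queries):
    """Fraction of queries which match no document at all."""
    return sum(1 for q in queries if not matches(docs, q)) / len(queries)


def compare(seed=1, n_queries=500, terms=2):
    rng = random.Random(seed)
    docs, vocab = build_corpus(rng)
    # Terms chosen independently from the whole vocabulary.
    independent = [set(rng.sample(vocab, terms)) for _ in range(n_queries)]
    # Terms chosen together from one existing document.
    correlated = [set(rng.sample(sorted(rng.choice(docs)), terms))
                  for _ in range(n_queries)]
    return (zero_match_fraction(docs, independent),
            zero_match_fraction(docs, correlated))


if __name__ == "__main__":
    independent, correlated = compare()
    print(f"zero-match fraction, independent terms: {independent:.2f}")
    print(f"zero-match fraction, correlated terms:  {correlated:.2f}")
```

In this model most of the independently generated queries match
nothing, while the correlated ones always match by construction. Real
query logs sit somewhere in between, which is exactly why it's hard to
generate representative data.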
It's not hard to find freely licensed document sets (Wikipedia for
example). Finding one with suitable corresponding query logs is rather
harder, mostly because query logs tend to end up including sensitive
data (addresses, credit card numbers, phone numbers, etc.) and so there's
the cost of sanitising them.
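To give a feel for what sanitising involves, here is a minimal Python
sketch which masks long runs of digits (the sort of thing phone and
card numbers look like) in a query string. The function name, the
placeholder token and the threshold of seven digits are all assumptions
for illustration. A pattern like this is nowhere near sufficient on its
own (it does nothing about names or street addresses, for instance),
which is part of why the cost is real.

```python
import re

# A digit followed by at least six more digits, each optionally preceded
# by a single space or hyphen. The threshold of 7 digits is an arbitrary
# choice for this sketch.
LONG_NUMBER = re.compile(r"\d(?:[\s-]?\d){6,}")


def sanitise_query(query):
    """Replace long digit runs with a placeholder token."""
    return LONG_NUMBER.sub("<NUMBER>", query)


if __name__ == "__main__":
    for q in ["call 555-123-4567 now", "xapian 1.2 release notes"]:
        print(sanitise_query(q))
```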
> eg:- The areas already tested in categories related to each of the
> performance requirements (speed, memory, disk space & etc.)
I would suggest you study the existing code to determine that. You'll
want to be familiar with it before writing your proposal - if you're
using it, you'll need to know what it does; if you aren't using it,
you'll need to be able to clearly explain why you're not planning to
use it.
> Also please enlighten me with the aspects that are expected to test??
>
> eg :- If performance, what kind of data is expected to process and are
> there any specific processes that are performance critical.
Searching is particularly speed sensitive (as users are usually waiting
for results while we perform the search), but indexing speed is also
important.
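Since a user is waiting on each search, the time taken by individual
queries (including the slowest ones) is more interesting than just a
total. A minimal Python sketch of timing queries one at a time follows.
The `search` argument is a placeholder for whatever actually runs a
query, and the function names and the best-of-three approach are
assumptions for the example rather than anything the existing perftest
code does.

```python
import statistics
import time


def time_queries(search, queries, repeats=3):
    """Return the best-of-`repeats` wall-clock time in seconds per query.

    `search` is any callable taking one query; it stands in for the real
    search code being measured."""
    timings = []
    for query in queries:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            search(query)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings


def summarise(timings):
    """Report the typical and the worst per-query time."""
    return {"median": statistics.median(timings), "max": max(timings)}


if __name__ == "__main__":
    # Dummy workload standing in for a real search.
    results = time_queries(lambda q: sum(range(10000)), ["a", "b", "c"])
    print(summarise(results))
```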
> I hope that the data population for a performance test would take a
> significant part.
Sorry, I don't understand what you are trying to say here.
Cheers,
Olly