[Xapian-discuss] Search performance issues and profiling/debugging

Wed Oct 24 17:14:22 BST 2007

 Hi..

Alex:
Undoubtedly, XenSource is the cause of the OProfile problem. We looked 
into a patch for that, but there is one only for version 3 and we are 
using version 4. (and, yes, we are using the "official" XenSource one).
As for the disks, each of the 6 disks is defined as an LVM volume: one has 
80GB allocated for system/log, and the other 5 are allocated fully 
(465GB of the 465.5GB) to the databases. Every two databases share a 
disk. Neither Dom0 nor any other instance (none exist) is using the 5 
database disks.

Hope that clarifies the disk mapping.

James:
While anything we put on the server complicates things a bit, Xen is not 
really the issue here I believe. If the problem was Xen related, why 
would a scheduling problem affect "no recip" in such a consistent way, 
even after compacting the databases and moving them around? If Xapian 
used DMA, undocumented interrupts or something else out of the ordinary, 
I understand why it would be something to look into first, but what 
makes you think that Xen in the mix can explain the variation in 
estimates, the strange performance issues with specific queries only, 
and other strange things we see?
We will certainly try to profile things, even test without Xen if we 
can't profile on it. Again, being the only VM instance running on that 
machine, there is little scheduling to do and no competition over IO and 
other resources. But even if there were, why would it be so consistent 
for the "no recip" search? That doesn't make much sense unless we are 
missing something.

We indeed tested things well over 3 times. This is why I picked "no 
recip" as a search. It consistently performs badly even when run a 
second or third time right after the first (see debug output).
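
For anyone wanting to reproduce this kind of measurement, a minimal 
sketch of timing the 1st, 2nd and 3rd run of a query and reporting the 
mean and standard deviation per position might look like the following. 
It is not our actual test harness: time_query_positions, run_query and 
fake_query are hypothetical names, the workload is a placeholder for a 
real search call, and nothing here flushes the OS cache between rounds.

```python
import statistics
import time


def time_query_positions(run_query, runs=100, repeats=3, clock=time.perf_counter):
    """Time `repeats` back-to-back calls of run_query, repeated `runs` times.

    Returns one (mean, sd) tuple per position (1st, 2nd, 3rd, ...).
    `runs` must be at least 2 so that a sample standard deviation exists.
    """
    samples = [[] for _ in range(repeats)]
    for _ in range(runs):
        for position in range(repeats):
            start = clock()
            run_query()
            samples[position].append(clock() - start)
    return [(statistics.mean(s), statistics.stdev(s)) for s in samples]


def fake_query():
    # Placeholder workload (an assumption); replace with the real search call.
    sum(range(10000))


if __name__ == "__main__":
    results = time_query_positions(fake_query)
    for position, (mean, sd) in enumerate(results, start=1):
        print(f"query #{position}: mean={mean:.6f}s sd={sd:.6f}s")
```

Keeping the positions separate matters here, since a warm cache should 
make the 2nd and 3rd runs faster than the 1st; pooling them would hide 
exactly the effect in question.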

Below are stats from 100 runs:

Chris:
We will try to test it without Xen as well later on. Keep in mind that 
to do so we will have to move aside 10 databases of 50GB, reinstall the 
machine and move the databases back into place. We would do it first 
thing if we believed it was a Xen issue, although if we can't profile 
things we might do this anyway (or test it on a different machine).

Olly: Sorry, we removed the old Database10 after compacting it. Since 
then we haven't seen the seg fault. We will keep a close eye and contact 
you as soon as we see such an error again.

Best regards,
Ron.

Alexandre Gauthier wrote:
> Chris Good wrote:
>> Ron Kass wrote:
>>  
>>> Not sure what you mean by "other VMs could well be confusing your 
>>> results"
>>> We use XenServer on this machine, but we have only one instance 
>>> (DomU), and only this instance is running everything locally. So 
>>> there are no other VMs to confuse things, and even if there were, 
>>> they have nothing to do with the VM we run the test on or with the 
>>> test itself.
>>> (Can you clarify what you mean?)
>>>     
>>
>> If you have multiple VMs sharing the same hardware then activity on one
>> will obviously affect the performance on other VMs.  Since you're running
>> a lone DomU other DomUs aren't going to be competing for resources 
>> but it's possible that something in Dom0 is getting swapped in and 
>> running.
>>
>> How are you accessing your drives, is DomU accessing the raw devices or is
>> it mapped via virtual files from Dom0?
>>
>> Is it possible to run these tests either directly from Dom0 or even better
>> with a non-xen kernel?
>>
>> Given your current configuration of a single VM xen isn't adding 
>> anything so removing it would eliminate any side-effects of it.  I 
>> also suspect that it would cure your oprofile issue.
>>
>> Chris
>>
>>   
> Sorry to intrude, but if I may offer some insight, the Dom0 instance 
> in a Xen set-up is just as paravirtualized as a DomU -- it just has 
> control access to memory inside DomUs, and offers the drivers back-end 
> interfaces. The Dom0 and DomUs both run on top of the Xen kernel.
>
> Also, if he is running a commercial Xen from XenSource, he won't have 
> access to the Dom0, which is a custom frankenstein mix of SuSE and 
> RHEL with no other purpose but to control the DomUs, a bit like ESX.
>
> The question of the DomU's disk mapping is still valid, and I'd be 
> curious to hear the answer. I also think Xen is responsible for the 
> oprofile troubles, I get that on a Debian DomU as well.
>
> I hope this vaguely helps...
>
> Alex
>
>

James Aylett wrote:
> On Wed, Oct 24, 2007 at 04:04:22PM +0200, Ron Kass wrote:
>
>   
>> Although we should never rule out something completely without checking, 
>> I believe quite strongly that the issues we are seeing are not coming 
>> from Xen, as per this instance it is a regular dedicated Linux (centos 
>> 5) machine and the resources are fully dedicated to it.
>>     
>
> It seems to me that there are two distinct problems. You have some
> queries that are underperforming, which with some profiling will
> expose either something unusual about your database or code, or a
> bottleneck or optimisation problem in Xapian.
>
> The other is the variation. I agree with Chris that adding Xen into
> the mix is complicating matters considerably. Things like IO
> scheduling, for instance, become harder in even the best
> virtualisation systems. It's bad enough that a single instance of an
> OS can suddenly start doing things you don't expect, even with no
> other significant userspace clients :-/
>
> Out of interest, are your figures averages of multiple runs? If not,
> I'd be interested in seeing 1st, 2nd and 3rd query times (broken down
> as Olly suggests), but with mean & sd over say 100 runs.
>
> (Apologies if you have done that - I've been trying to follow this
> thread closely, but an explosion of posts has combined with a busy
> period at my end :-)
>
> J
>
>   
Chris Good wrote:
> Ron Kass wrote:
>   
>> I believe quite strongly that the issues we are seeing are not coming 
>> from Xen, as per this instance it is a regular dedicated Linux (centos 
>> 5) machine and the resources are fully dedicated to it.
>>     
>
> I'd still encourage you to give it a go if only to rule it out and let
> you run oprofile.  Running inside Xen certainly shouldn't affect your
> match sets but the diskwriter process kicking in could fully explain
> some of the timing variances that you've seen when re-running queries.
>   

Olly Betts wrote:
>> Anyway, we have actually used xapian-compact on the databases to see if 
>> it helps. It appears to have got rid of the segmentation fault error on 
>> database 10, but the slowness and the variations in estimates still exist.
>>     
>
> A seg fault is clearly a bug somewhere, and I'd really like to know
> where.  Do you still have the un-compacted database, or if not can you
> recreate it?  If so, please rerun the test on it under gdb as I
> requested in my previous mail!
>
> Cheers,
>     Olly