[Xapian-discuss] Xapian on SSD vs SATA

Arjen van der Meijden acmmailing at tweakers.net
Fri Oct 23 15:00:04 BST 2009


On 23-10-2009 15:23, Henry wrote:
> 
> I was referring to very large indexes - not small ones which might fit 
> in RAM.  I'm not sure what you know about SSDs, but the performance 
> gains can be very significant depending on the application - for RDBMS, 
> eg, the gain is typically 10x from our experience (so, instead of 
> waiting for a 10s transaction to complete, the same can be completed in 
> under a second - that's night/day when dealing with customer expectations).

I know plenty about SSDs and such :) My point is that the hottest parts 
of the database will fit in memory, even for databases more than twice 
the size of your memory.
Our "small" index is about 25GB, by the way, but it does indeed fit in 
the 24GB of memory in our server.
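
For what it's worth, an easy way to test how much the page cache is 
doing for you is to warm it explicitly before benchmarking. A rough 
sketch in Python (the index path is made up):

    import os

    # Warm the OS page cache by sequentially reading every file in the
    # Xapian database directory, so later queries hit memory, not disk.
    def warm_cache(db_dir):
        for name in os.listdir(db_dir):
            path = os.path.join(db_dir, name)
            if os.path.isfile(path):
                with open(path, 'rb') as f:
                    while f.read(8 * 1024 * 1024):  # 8MB chunks
                        pass

    warm_cache('/srv/xapian/index')  # example path

If the warm run is much faster than the cold one, you're I/O-bound and 
SSDs (or more RAM) will help; if not, you're CPU-bound.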

> You're thinking in terms of small indexes.  I'm referring to splitting a 
> very large index across many cluster nodes for performance.  I'm not 
> sure at this stage, since indexing is ongoing, but the index is already 
> ~900GB.

Well, given that you're testing with "only" 4 and 14GB... If you're 
only going to put a 14GB database on a node, you shouldn't even bother 
with an SSD; just put a SATA disk and 16GB of RAM in the machine, along 
with plenty of CPU power. With the current pricing of Nehalem hardware, 
that is actually a relatively affordable strategy for up to 72GB of RAM 
per node; beyond that, the price per GB no longer scales linearly. 
Obviously, disks and even enterprise SSDs are cheaper per GB.

My guess is that the difference between your disks will increase 
further as you move farther beyond the RAM size, i.e. a 50 or 100GB 
database will show a larger gap. At some point even the hotter parts of 
the B-tree will no longer fit in memory, and that is where I'd expect 
SSD vs disk to really pay off.

How many cluster nodes are you thinking about? Using 4 nodes, each 
holding a quarter of the database on a budget of $15k per server, will 
obviously have different characteristics than a 20-node cluster at only 
$2k per server.
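
For reference, Xapian can already search across shards like that: open 
each node's part of the index remotely (e.g. served by xapian-tcpsrv) 
and combine them into a single Database. A minimal sketch with the 
Python bindings; the hostnames and port are made up:

    import xapian

    # Hypothetical shard locations, each running xapian-tcpsrv.
    SHARDS = [('node1', 33333), ('node2', 33333),
              ('node3', 33333), ('node4', 33333)]

    db = xapian.Database()
    for host, port in SHARDS:
        db.add_database(xapian.remote_open(host, port))  # merge shards

    enquire = xapian.Enquire(db)
    enquire.set_query(xapian.Query('example'))
    for match in enquire.get_mset(0, 10):
        print('%d: %d%%' % (match.docid, match.percent))

The search fans out to all shards, so latency is roughly that of the 
slowest node - which is where per-node hardware choices start to matter.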

> 2s vs 1.2s is not insignificant (maybe for your application it is, and 
> that's fine).  You're making the assumption that *your* user expectation 
> is the same as ours.  More importantly, consider search volumes.  Yours 
> may be 1 every hour, ours might be 100-200 a minute.

We operate a large website with performance-savvy, experienced computer 
users... So we, as developers, try to keep each server-generated page 
under 0.1 seconds, and that includes searches in that 25GB database :)
Our search volume is similar to yours, although our database is 
obviously much smaller and we only use one server.

Our normal searches were already mostly below that 0.1 second 
threshold; the SSDs, oversized RAM and top-of-the-line CPUs made sure 
most phrase queries are now pretty fast as well.
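
Phrase queries have to read the positional data as well as the posting 
lists, which is why they're so much more I/O-heavy than single-term 
lookups. For example, with the Python bindings (the index path is made 
up):

    import xapian

    db = xapian.Database('/srv/xapian/index')  # example path

    # A quoted string parses to a phrase query, which consults the
    # position lists and so does far more I/O than a plain term lookup.
    qp = xapian.QueryParser()
    qp.set_database(db)
    query = qp.parse_query('"solid state drive"')

    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    print([m.docid for m in enquire.get_mset(0, 10)])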

> No, they're not.

I think you should also try to see whether a faster CPU (or one with 
more memory bandwidth) speeds up those single-term queries. As I said, 
in our benchmarks, with our RAM and SSDs, we ended up being CPU-bound.
By the way, perhaps ext4 will give you some gains as well.
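
A quick way to separate I/O wait from CPU time is to repeat the same 
query: the first run is cold (disk-bound), later runs come from the 
page cache, so whatever time remains is mostly CPU. A sketch (the term 
and path are placeholders):

    import time
    import xapian

    db = xapian.Database('/srv/xapian/index')  # example path
    enquire = xapian.Enquire(db)
    enquire.set_query(xapian.Query('example'))  # single-term query

    # Run the same query several times and watch the timings converge.
    for i in range(5):
        start = time.time()
        enquire.get_mset(0, 10)
        print('run %d: %.3fs' % (i, time.time() - start))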

> Thanks for your post.  Anyway, it's imperative in our application to 
> meet customer expectations (which is why I emphasised the 2/1.2 second 
> difference).  We cannot expect our customers to wait 2-5s, never mind 
> 30s, for a query to complete (especially when they're rapid-fire and 
> have been spoilt by google).

I totally agree; there have been reports of users losing interest if a 
page takes more than 0.2-1 seconds...

Best regards,

Arjen



