[Xapian-discuss] set_cutoff <percent_cutoff> [<weight_cutoff>]

Kevin Duraj kevin.softdev at gmail.com
Fri May 11 19:30:33 BST 2007


James,

I believe you are right from the point of view of transferring data and
dealing with large amounts of data. However, most users run a single
machine, perhaps with multiple CPUs, and a moderate data size for indexing,
searching and everything else.

These users will be hurt by having to spend CPU time compressing and
decompressing data when the size makes no difference for them, even though
it makes a huge difference for large data sets and multiple servers.

Recommendation:
Let's use the environment to let the user decide whether to use compression,
rather than forcing the user to use compression.
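
To make the idea concrete, here is a rough sketch only, not anything
Xapian does today. It assumes an environment variable, here called
XAPIAN_COMPRESS (a made-up name), that defaults to "on", and it uses
Python's zlib rather than Xapian's own code. A flag byte in front of each
value keeps old data readable whichever way the variable is set later.

```python
import os
import zlib

# Hypothetical variable name for illustration; Xapian does not define it.
ENV_VAR = "XAPIAN_COMPRESS"


def compression_enabled(env=None):
    """Return True unless the user has switched compression off."""
    env = os.environ if env is None else env
    value = env.get(ENV_VAR, "1").strip().lower()
    return value not in ("0", "no", "false", "off")


def encode(data, env=None):
    """Prefix the payload with a flag byte so a reader can tell the formats apart."""
    if compression_enabled(env):
        packed = zlib.compress(data, 9)
        # Only keep the compressed form when it is actually smaller.
        if len(packed) < len(data):
            return b"Z" + packed
    return b"R" + data


def decode(blob):
    """Reverse encode(); does not depend on the current environment."""
    flag, payload = blob[:1], blob[1:]
    if flag == b"Z":
        return zlib.decompress(payload)
    if flag == b"R":
        return payload
    raise ValueError("unknown flag byte")
```

With XAPIAN_COMPRESS=0 the data is stored raw and no CPU time goes on
zlib; with it unset or set to 1 the large-data case still gets the smaller
on-disk size.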

thanks,

Kevin Duraj
http://myhealthcare.com



On 5/11/07, James Aylett <james-xapian at tartarus.org> wrote:
>
> On Fri, May 11, 2007 at 05:23:43AM +0100, Olly Betts wrote:
>
> > > I want the top speed during indexing and searches, and I do not care about
> > > smallest database. I think most of users feel the same. If "gzip -9" makes
> > > the indexing slightly slower, remove it. *smile* :-)
> >
> > The thing is that smaller is often faster.  Once I/O becomes the
> > limiting factor, compression will speed things up.  CPU speeds have
> > increased faster than storage speeds over time, so this is likely to
> > be more true than it ever was!
>
> This is hugely important, and is something that a lot of people
> miss. It doesn't make a huge amount of difference when you're dealing
> with small data sets (say, less than half the size of core), but then
> the delta cost should be fairly minimal. Once you get into moderately
> large data sets (say two to four times core), you're going to start
> hurting very badly if you're wasting time transferring data
> suboptimally (*). Even if you can stack enough disks to get maximum
> fibre speed, you're still only managing a few gig per second; given
> your core will be a minimum of 8G these days, cutting down your
> storage size becomes really important. (And that's assuming that only
> one machine has access to the fabric, when it's more likely to be
> shared...)
>
> David Braben has an interesting graph that backs this up (admittedly
> from the point of view of consoles). It's *more* important to get
> decent compression on your data than it was in the days of Elite and
> Exile!
>
> (*) I have a tiresome anecdote about inefficient data transfer over
> NFSv3 versus NFSv4 bringing our data centre to a standstill.
>
> J
>
> --
>
> /--------------------------------------------------------------------------\
>   James Aylett                                                  xapian.org
>   james at tartarus.org                               uncertaintydivision.org
>




