[Xapian-discuss] using Xapian as backend for google

Olly Betts olly at survex.com
Wed Dec 13 04:27:19 GMT 2006


On Mon, Dec 11, 2006 at 09:56:10AM +0100, Felix Antonius Wilhelm Ostmann wrote:
> After one weekend I think RAID is the wrong way ... splitting the index
> across different drives would be faster and we don't lose the space :)

You don't lose the space until a disk fails - then you lose the space
and the data that was using it.

Data loss doesn't always matter though - for some applications, the
search can be down (or missing a segment of the documents) and the
application can still be usable.

As always when building systems, you have to balance the cost of
reliability against the probability and costs of possible failures.

> >If you just want two documents from any one domain, it wouldn't be hard
> >to extend the collapse feature to leave N documents behind instead of
> >just one.
> >
> >Only collapsing "similar" results is harder - first you need to decide
> >how to define "similar" I guess.
>  
> Hmmm ... the problem is that one domain can include 100,000 or more
> documents. When a search matches 20,000 documents from this domain, the
> MatchDecider must access 20,000 values (with the domain name) and decline
> 19,998 documents. And perhaps the next domain has another 100,000
> documents with 15,000 matches. I don't know :( Is the MatchDecider the
> right way?

If you want to collapse on a value but leave more than one document
behind, I think the best approach is to enhance the collapse feature so
that the number of documents to keep can be specified.
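
To make the idea concrete, here is a rough stand-alone sketch in Python of
the bookkeeping involved. It is not Xapian code: the function name
collapse_keep_n, the max_per_key parameter and the sample data are all
made up for illustration, and it assumes the candidates arrive already
sorted best-first.

```python
from collections import defaultdict


def collapse_keep_n(ranked_matches, max_per_key=2):
    """Keep at most max_per_key matches for each collapse key.

    ranked_matches: iterable of (doc_id, collapse_key, weight) tuples,
    assumed to be sorted best-first already (hypothetical input format).
    Returns the kept matches plus a count of dropped documents per key.
    """
    kept = []
    seen = defaultdict(int)       # collapse key -> documents kept so far
    collapsed = defaultdict(int)  # collapse key -> documents dropped
    for doc_id, key, weight in ranked_matches:
        if seen[key] < max_per_key:
            seen[key] += 1
            kept.append((doc_id, key, weight))
        else:
            collapsed[key] += 1
    return kept, dict(collapsed)


# Made-up sample data: the collapse key plays the role of the domain name.
matches = [
    (1, "example.com", 9.1),
    (2, "example.com", 8.7),
    (3, "example.org", 8.2),
    (4, "example.com", 7.9),
    (5, "example.org", 6.5),
    (6, "example.org", 6.1),
]
kept, collapsed = collapse_keep_n(matches, max_per_key=2)
print([doc_id for doc_id, _, _ in kept])  # [1, 2, 3, 5]
print(collapsed)  # {'example.com': 1, 'example.org': 1}
```

Note that this sketch still looks at the key of every candidate, so it
doesn't avoid the per-document value lookups you're worried about; it
only shows that the extra state needed is one counter per key.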

A search with collapsing is going to be more expensive than one without,
but I recommend trying this approach before deciding that it's too
expensive!

Cheers,
    Olly


