[Xapian-discuss] using Xapian as backend for google

Fri Dec 8 05:23:27 GMT 2006

On Thu, Dec 07, 2006 at 10:02:03AM +0100, Felix Antonius Wilhelm Ostmann wrote:
> now I must figure out how we can use Xapian in the best way: generating 
> many flint indexes so we can build them fast on many machines and merge 
> them. The frontend will be a webserver with apache and mod_perl ... is it 
> the best way to run xapian-tcpsrv on other machines as the backend? I think 
> so ... or is another webserver with mod_perl and perl-bindings the ideal 
> solution? My question: can someone tell me something about building the 
> backend for the next google? :) What is important? 

> Raid0 VS Raid1

RAID 1 should be faster for reading, and actually has redundancy so it
can survive a disk dying, but you get half as much storage volume from
the same disks.  In other words, it'll cost about twice as much.
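
To put numbers on that, here's a rough Python sketch of the capacity
arithmetic.  The disk count and size are made up for illustration, and
it assumes RAID 1 means simple two-way mirrored pairs:

```python
def usable_capacity_gb(level, disks, disk_size_gb):
    """Usable capacity of an array of identical disks (illustrative only)."""
    if level == 0:
        # Striping: every disk contributes its full capacity.
        return disks * disk_size_gb
    if level == 1:
        # Assumption: two-way mirroring, so half the raw capacity is usable.
        return disks * disk_size_gb // 2
    raise ValueError("only RAID 0 and RAID 1 are modelled here")

# Hypothetical setup: four 500 GB disks.
raid0_gb = usable_capacity_gb(0, 4, 500)
raid1_gb = usable_capacity_gb(1, 4, 500)
```

So for the same usable space you'd need about twice as many disks with
RAID 1 as with RAID 0.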

Incidentally, there are many more RAID configurations than just these
two.  Wikipedia has an overview:

http://en.wikipedia.org/wiki/RAID

> SCSI VS SATA

It depends on budget and how big you want to grow.  SATA is cheaper and
probably similar in speed to where SCSI was a few years ago, but iSCSI
and Fibre Channel are likely to end up faster in most cases.

> many smaller backends VS some big backends?

There are definitely downsides to having too many backend servers.  But
if you have a lot of data, splitting a search over several machines can
be a win.  You'll need to profile if you want to find the sweet spot for
your setup, but I'd think it's likely to be nearer a few than a few
hundred.

Note that there's some overhead to using the remote backend, and also
some to using multiple databases.  Another possible architecture is
to just have several servers searching replicated copies of a single
large database.
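
To illustrate what splitting a search over several machines involves,
here's a minimal Python sketch of merging the ranked results from each
backend into one list.  This isn't Xapian's API or how the remote
backend does it internally; the function name and the sample hits are
made up, and it assumes the scores from different backends are directly
comparable:

```python
import heapq
from itertools import islice

def merge_top_k(shard_results, k):
    """Merge per-shard hit lists into the overall top k.

    Each shard's list holds (score, doc_id) pairs already sorted by
    descending score.  Assumes scores are comparable across shards.
    """
    # heapq.merge wants ascending input, so order by negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))

# Hypothetical results from two backends.
shard_a = [(0.92, "a/doc1"), (0.40, "a/doc7")]
shard_b = [(0.85, "b/doc3"), (0.61, "b/doc9")]
top = merge_top_k([shard_a, shard_b], 3)
```

With replicated copies of one large database there's no merge step at
all: each query goes to a single server, which is part of why that
architecture avoids the multiple-database overhead.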

> What would be the bottleneck (I think disk I/O)?

It's likely to be.  Note that there's scope for improving matters with
enhancements to Xapian here - there are some obvious things to improve
(which I'm working my way through), and profiling should reveal more.
For a large operation, it's worth investing some time in such fine
tuning as it can seriously reduce the amount of hardware you need to buy
and house!

> Is xapian-tcpsrv the best way? Can anyone tell me something about
> such a project?

Webtop used xapian-tcpsrv to spread searches over a number of boxes
(10 or so IIRC).  The index size was around 500 million documents, but
with modern hardware that's much less of a challenge than it was more
than 6 years ago.

Also the remote backend has been completely rewritten since then, and
the local backend Webtop used was the legacy "muscat36 da" one, which
flint should outperform by some margin.

> One other question: "similar results from one domain".
> How can we achieve that goal? Should the MatchDecider watch the values with 
> the domain name and accept only two documents from one domain? Is that 
> the way?

If you just want two documents from any one domain, it wouldn't be hard
to extend the collapse feature to leave N documents behind instead of
just one.
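
As a rough sketch of what "leave N documents behind" means, here's the
logic in plain Python, applied to an already-ranked list of hits.  This
is not Xapian code; the function name, the tuple layout and the sample
data are all made up for illustration:

```python
from collections import defaultdict

def collapse_by_key(ranked_hits, key_of, max_per_key=2):
    """Keep at most max_per_key hits per collapse key, preserving rank order."""
    kept = []
    seen = defaultdict(int)  # collapse key -> number of hits kept so far
    for hit in ranked_hits:
        key = key_of(hit)
        if seen[key] < max_per_key:
            seen[key] += 1
            kept.append(hit)
    return kept

# Hypothetical hits as (score, domain, url), best first.
hits = [
    (0.9, "example.com", "/a"),
    (0.8, "example.com", "/b"),
    (0.7, "example.com", "/c"),
    (0.6, "example.org", "/x"),
]
collapsed = collapse_by_key(hits, key_of=lambda hit: hit[1], max_per_key=2)
```

Here the third example.com hit is dropped and everything else keeps its
order.  Setting max_per_key=1 gives the behaviour of collapsing to a
single document per key.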

Only collapsing "similar" results is harder - first you need to decide
how to define "similar" I guess.

Cheers,
    Olly