[Xapian-discuss] Xapian and 10M (small) documents. What to expect?

Olly Betts olly at survex.com
Thu Sep 8 14:09:33 BST 2005


On Thu, Sep 08, 2005 at 12:44:14PM +0200, Bart van Bragt wrote:
> Does anyone have an idea what kind of hardware I would need to
> search 10 million documents (approx 4GB of text) with approx 10,000 new 
> postings per day?

It rather depends on the search load, and the character of the data
matters too.  If you can get the system running on an existing box
you can get an idea how these factors apply to your situation.

As general advice, I/O speed is likely to be the limiting factor for a
database of this size, so lots of RAM and fast disks are best.
I'd probably go for SATA rather than SCSI if buying new these days.
RAID will probably help.
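
For a sense of scale, here is a back-of-envelope sketch using only the
figures from the question.  It assumes "4GB" means decimal gigabytes and
that new postings are the same average size as existing ones; both are
assumptions, not measurements.  The search load isn't given, so it isn't
modelled here.

```python
# Figures quoted in the question.
DOCUMENTS = 10_000_000
TEXT_BYTES = 4 * 10**9        # "approx 4GB", taken as decimal GB (assumption)
NEW_POSTS_PER_DAY = 10_000

# Average size of one document's text.
avg_doc_bytes = TEXT_BYTES / DOCUMENTS

# Average indexing rate if new postings arrive evenly over the day.
posts_per_second = NEW_POSTS_PER_DAY / (24 * 60 * 60)

# Daily growth in raw text, assuming new postings are of average size.
daily_growth_bytes = NEW_POSTS_PER_DAY * avg_doc_bytes

print(f"average document size: {avg_doc_bytes:.0f} bytes")
print(f"average update rate:   {posts_per_second:.3f} postings/second")
print(f"raw text growth:       {daily_growth_bytes / 10**6:.0f} MB/day")
```

On these numbers the average update rate is small, which is one reason the
search load and the data itself matter more to the sizing.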

> Do I need a dedicated machine for searching? The site isn't exactly 
> generating huge amounts of money so it would be very nice if we could 
> use a (beefy) server to do both webserving and searching. Or
> combine the database and the search but I don't think those two combine 
> really well, I'm guessing that the main bottleneck is going to be I/O?

Almost certainly.  I'd be tempted to consider two less beefy servers,
which would give you more scope for adding RAM and for sharing the I/O
load, though they may need more rack space and increase hosting costs.

> The main drawback is a (very significant?) performance 
> loss I guess... Indexing topics would result in only 500k documents 
> instead of 10M.

Though with 10M documents, each will be indexed by fewer terms, and the
raw position list data size will be much the same in both cases.
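
A toy illustration of that point, with made-up forum data and a simple
in-memory positional index (this is not how Xapian stores anything; the
names `index`, `stats` and `topics` are hypothetical).  It indexes the same
text once per posting and once per topic, then compares the totals.

```python
from collections import defaultdict


def index(docs):
    """Build a toy positional index: term -> {doc_id: [positions]}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    return postings


def stats(docs):
    """Summarise the index built from docs."""
    postings = index(docs)
    # One entry per word occurrence, regardless of how text is grouped.
    position_entries = sum(
        len(positions)
        for by_doc in postings.values()
        for positions in by_doc.values()
    )
    # One (term, document) pair per distinct term in each document.
    term_doc_pairs = sum(len(by_doc) for by_doc in postings.values())
    return {
        "documents": len(docs),
        "position_entries": position_entries,
        "avg_terms_per_doc": term_doc_pairs / len(docs),
    }


# Hypothetical data: topic id -> list of postings in that topic.
topics = {
    "t1": ["xapian index is slow", "try more ram",
           "more ram fixed the slow index"],
    "t2": ["raid or sata disks", "sata disks are fine"],
}

# Same text, grouped two ways.
per_post = {f"{t}/{i}": post
            for t, posts in topics.items()
            for i, post in enumerate(posts)}
per_topic = {t: " ".join(posts) for t, posts in topics.items()}

post_stats = stats(per_post)
topic_stats = stats(per_topic)
print("per posting:", post_stats)
print("per topic:  ", topic_stats)
```

Both groupings hold one position entry per word occurrence, so that total
is identical; only the number of documents and the number of distinct terms
per document change.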

> There seems to be a fairly large resemblance between gmane and phpBB 
> indexing (both are about indexing topics/threads and lots of small 
> postings). Is the gmane setup going to be public?

It should be going live as the main search in the next week or two.
The recent (and current) obstacles are all down to a disk crash and
the machine having to be reinstalled - I keep finding things which
aren't installed or aren't running and have to ask Lars to fix them.

> Is it already known what hardware this system is going to need?

The current hardware is:

Athlon64 3000+
3GB RAM
mixture of SCSI and SATA disks (not RAID AFAIK)

That's rather overspecified for the search load at present, but gmane is
growing fairly rapidly so it's good to have room for expansion.  It also
means a full rebuild of the database from scratch takes about 2 days.

It's running Debian woody x86, I think just because that was easy to
install.  I've not had a chance to compare x86 vs x86_64 for Xapian
yet on the same hardware.

Cheers,
    Olly


