[Xapian-discuss] Suitability of Xapian for my application?

Olly Betts olly at survex.com
Fri Oct 15 06:05:56 BST 2004


On Thu, Oct 14, 2004 at 08:43:33PM -0700, Eric Parusel wrote:
> I would want to feed Xapian just a list of keywords, no positional data 
> at this time.
> 
> How efficient would Xapian be if I converted my keyword search over to it?

Approximately infinitely better than your current scheme I suspect!

A friend had implemented a search in a similar way to yours (except with
MySQL, I think).  I built a Xapian version from a SQL dump and the
speed-up was startling.  That was searching around 150K documents.

> What's important to me, in no particular order:
> 1) Import speeds when the tables grow (avg # of keywords per document: 
> 150 approx)

So if there are about 150 keywords per document and 30 million or so rows,
then the corpus is of the order of 200K documents?
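
A quick check of that arithmetic, as a sketch using only the figures
quoted above (the variable names are illustrative, and it assumes one
table row per keyword per document):

```python
# Figures taken from the thread; both are approximate.
keyword_rows = 30_000_000      # "30 million or so rows"
keywords_per_document = 150    # "avg # of keywords per document: 150 approx"

# One row per (document, keyword) pair, so rows / keywords-per-doc ~ documents.
estimated_documents = keyword_rows // keywords_per_document

print(f"about {estimated_documents:,} documents")
```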

It's hard to say how fast a system will be without a reference point.
Indexing speed depends a lot on the hardware.  CPU speed isn't too
important.  You want lots of RAM and fast disks.

The gmane index has an average doc length of 186 terms.  It takes about
15 minutes to index 200K documents from scratch.  That machine has 3GB of
RAM and SATA disks.
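
To turn those gmane figures into a rough reference rate, here is a
back-of-the-envelope sketch.  It assumes the 15 minutes and 200K documents
apply to the same 186-term average, and the variable names are
illustrative.  Real throughput will vary with RAM and disk speed.

```python
# Approximate figures quoted for the gmane index.
documents_indexed = 200_000
indexing_minutes = 15
average_terms_per_document = 186

indexing_seconds = indexing_minutes * 60
documents_per_second = documents_indexed / indexing_seconds
terms_per_second = documents_per_second * average_terms_per_document

print(f"~{documents_per_second:.0f} documents/s, ~{terms_per_second:.0f} terms/s")
```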

> 2) Searching speed (I don't think this will be a problem from what I've 
> heard)

Searches should take a fraction of a second for an index of that size.

> 3) keywords "database" size -- any rough estimates for what I'm working 
> with?

I'd guess something like 500MB for 200K documents.

There are plans in the pipeline to improve the packing and compression
(which should improve both indexing and search speed too).

> 4) Stability -- it won't corrupt, or crap out and die on me, will it? :)

I'd hope not.  We try hard to make releases stable, and there's an
extensive automated test suite to assist this aim.  We also indicate
in the release notes when major code reworking has taken place.

But as the licence says, there's no warranty.  If that bothers you
(or your boss!), commercial support is available.

> 5) Backups -- Is there a backup dump utility of some sort?

There's dbtools in CVS which allows you to dump and reload databases as
XML.  But unless you want to process the dumped data, it's probably not
the right approach.  It's a lot slower to dump the contents of a database
than to just copy the files comprising it.

>    Can I take backups of the live system?

You can, provided you pause updates for the duration of the backup.
There's currently no support within Xapian for backing up while updates
are happening.
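
The pause-then-copy approach could be sketched roughly as follows.  This
is not a Xapian API: `pause_updates` and `resume_updates` are hypothetical
callbacks standing in for however your application stops and restarts its
indexer, and the copy is a plain file copy of the database directory.

```python
import shutil
from pathlib import Path
from typing import Callable


def backup_database(
    db_dir: Path,
    backup_dir: Path,
    pause_updates: Callable[[], None],   # hypothetical: stop your indexer
    resume_updates: Callable[[], None],  # hypothetical: restart your indexer
) -> Path:
    """Copy the files comprising the database while no updates are running."""
    pause_updates()
    try:
        # backup_dir must not already exist; copytree creates it.
        shutil.copytree(db_dir, backup_dir)
    finally:
        # Always resume updates, even if the copy fails.
        resume_updates()
    return backup_dir
```

Updates stay paused only for as long as the copy takes, and the `finally`
clause makes sure they restart even if the copy raises an error.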

>    Can I use filesystem snapshots, then back up the xapian db file 
> snapshot?

That's a good way to do it.  Make sure that there are no updates happening
and snapshot the filesystem.  Then you can restart updates and back up
from the snapshot to tape at your leisure.

Alternatively, if you keep the documents and can build your Xapian
database in ~15 minutes you might decide you can live without the Xapian
backup (especially if it takes more than 15 minutes to restore from
tape!)  Of course this decision also hinges on how critical search is to
your application.

> Anything else I should be concerned about?

Nothing comes to mind.

Cheers,
    Olly


