[Xapian-discuss] Suitability of Xapian for my application?

Olly Betts olly at survex.com
Mon Oct 18 01:50:38 BST 2004


On Fri, Oct 15, 2004 at 11:08:18AM -0700, Eric Parusel wrote:
> Olly Betts wrote:
> >It depends what you set the autoflush threshold to, but for 50000 I get
> >around 250MB process size.  You need more RAM than that as you want the
> >OS to cache database blocks.
> 
> Is "autoflush" the frequency at which it fsyncs?  Or is it more of an
> internal Xapian thing?

The quartz backend uses 5 Btree tables on disk.  When adding documents,
changes are made to each of the tables by writing back modified versions
of blocks into unused blocks, and then everything is switched "live" by
changing over the root block of a table.
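
To illustrate the "write to unused blocks, then switch the root" idea,
here is a deliberately tiny Python sketch.  It is not quartz code: the
ToyTable class and its method names are made up for this example, and a
whole table is squashed into a single block rather than a real Btree.

```python
class ToyTable:
    """Toy copy-on-write table: one 'block' holds the whole table (an assumption)."""

    def __init__(self):
        self.blocks = {0: {}}   # block number -> contents
        self.root = 0           # the block readers start from
        self.next_free = 1      # next unused block number

    def write_new_version(self, changes):
        # Copy the live contents, apply the changes, and store the result
        # in an unused block.  The live block is never modified.
        new_contents = dict(self.blocks[self.root])
        new_contents.update(changes)
        block_no = self.next_free
        self.next_free += 1
        self.blocks[block_no] = new_contents
        return block_no

    def switch_live(self, block_no):
        # A single assignment makes the new version visible to readers.
        self.root = block_no

    def read(self, key):
        return self.blocks[self.root].get(key)


table = ToyTable()
new_root = table.write_new_version({"term": [1, 2, 3]})
before = table.read("term")     # readers still see the old version
table.switch_live(new_root)
after = table.read("term")      # now the new version is live
```

Until switch_live() is called, readers only ever see the old version, so
a crash part way through writing leaves the old table intact.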

But for the posting list table (the one which maps a term to a list of
document ids) this would be very inefficient because each added document
requires modifying tens or maybe hundreds of posting lists.  It's much more
efficient to batch up changes in memory and then sweep through the
postlists in order, applying changes to each one.
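
As a rough sketch of that batching idea (a toy model in Python, not the
actual quartz implementation; PostingBuffer and its attributes are names
invented for this example):

```python
from collections import defaultdict


class PostingBuffer:
    """Toy model: buffer posting-list changes, then apply them in one sorted sweep."""

    def __init__(self):
        self.on_disk = {}                 # term -> sorted doc ids (stands in for the table)
        self.pending = defaultdict(list)  # term -> doc ids added since the last flush
        self.postlist_writes = 0          # number of posting lists rewritten so far

    def add_document(self, docid, terms):
        # Only memory is touched here; no posting list is rewritten yet.
        for term in set(terms):
            self.pending[term].append(docid)

    def flush(self):
        # Sweep through the changed terms in order, rewriting each list once.
        for term in sorted(self.pending):
            merged = set(self.on_disk.get(term, [])) | set(self.pending[term])
            self.on_disk[term] = sorted(merged)
            self.postlist_writes += 1
        self.pending.clear()


buf = PostingBuffer()
buf.add_document(1, ["xapian", "index", "btree"])
buf.add_document(2, ["xapian", "flush"])
buf.add_document(3, ["xapian", "index"])
buf.flush()
```

Here the three documents touch seven postings between them, but the
flush rewrites only four posting lists, one per distinct term.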

This actually ends up very similar to traditional approaches to
inverting files using sorting and merging - it's just approached from a
different point of view.

By default, a flush happens automatically every 10000 documents.  You
can change the threshold, and you can also explicitly flush if you wish
(if new documents arrive at random, you may wish to ensure that a flush
happens every few minutes or hours so that a lull doesn't result in
old documents sitting in the buffer for a long time).
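
One way to combine a document-count threshold with a time limit might
look like the sketch below.  AutoFlusher, its parameters, and the tick()
method are hypothetical names for this example; the flush callable
stands in for whatever performs the explicit flush in your indexer.

```python
import time


class AutoFlusher:
    """Flush after `threshold` documents, or once buffered documents are `max_age` seconds old."""

    def __init__(self, flush, threshold=10000, max_age=300.0, clock=time.monotonic):
        self.flush = flush          # callable that performs the real flush
        self.threshold = threshold
        self.max_age = max_age
        self.clock = clock          # injectable so the logic can be tested
        self.pending = 0
        self.oldest = None          # when the oldest unflushed document was added

    def document_added(self):
        if self.pending == 0:
            self.oldest = self.clock()
        self.pending += 1
        if self.pending >= self.threshold:
            self._do_flush()

    def tick(self):
        # Call this periodically; it flushes if buffered documents are too old.
        if self.pending and self.clock() - self.oldest >= self.max_age:
            self._do_flush()

    def _do_flush(self):
        self.flush()
        self.pending = 0
        self.oldest = None


# Usage with a fake clock so the example runs instantly.
now = [0.0]
flushes = []
flusher = AutoFlusher(flush=lambda: flushes.append(now[0]),
                      threshold=3, max_age=60.0, clock=lambda: now[0])
flusher.document_added()
flusher.document_added()
flusher.tick()              # too recent and too few: no flush
now[0] = 61.0
flusher.tick()              # a lull: flushed because of age
for _ in range(3):
    flusher.document_added()  # flushed because the threshold was reached
```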

> Apart from the initial import, can I set it so that it fsync's after 
> each document is inserted (since I process as they come in, rather than 
> in batches -- and I don't want the updates to be atomic *and* consistent
> with the inserts to the PostgreSQL db)?

You can flush every document, though you might find it's too slow to
keep up with the rate new documents arrive.  The alternative is to
write a bit of code to recover from a crash by checking which documents
since the last checkpoint have actually been added.
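
A minimal sketch of that recovery step, assuming you can list the
documents stored since the last checkpoint and can ask the index
whether a given document is present.  The recover() function and the
three arguments it takes are hypothetical; a plain dict stands in for
the index here.

```python
def recover(docs_since_checkpoint, is_indexed, add_to_index):
    """Re-add each document stored after the last checkpoint that the index lacks."""
    readded = []
    for doc_id, text in docs_since_checkpoint:
        if not is_indexed(doc_id):
            add_to_index(doc_id, text)
            readded.append(doc_id)
    return readded


# Documents 101 and 102 reached the index before the crash; 103 and 104 did not.
index = {101: "first", 102: "second"}
stored = [(101, "first"), (102, "second"), (103, "third"), (104, "fourth")]

readded = recover(stored, index.__contains__, index.__setitem__)
second_pass = recover(stored, index.__contains__, index.__setitem__)
```

Running it a second time re-adds nothing, so it is safe to run the
recovery again if it is itself interrupted.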

> I have another question -- it concerns the ability to access/store 
> Xapian info over our network...
> Since our "import" server and our db server are different boxes,
obviously Xapian doesn't communicate over any network...

You can search a Xapian database across a network using the remote
backend (you can even perform a combined search of several databases
on different machines).

Currently the remote backend doesn't support writing, though it wouldn't
be very hard to add.  There's already code to serialise a document for
passing across the network as that's needed for remote retrieval.

> I suppose I could possibly create a pgsql function of some sort that 
> would call Xapian and insert the keywords, as a way of calling it from a 
> remote box?

Sorry, I don't know enough about PostgreSQL to comment.

Cheers,
    Olly


