[Xapian-discuss] Read accesses through WritableDatabase are slow

Jean-Francois Dockes jean-francois.dockes at wanadoo.fr
Mon Oct 23 15:23:45 BST 2006


I had initially thought that I was doing something stupid, but the problem
lies in fact with allterms_begin() calls, so it seems I'm doomed. I'll
explain why I am using allterms_begin() in case someone has a better idea.

Recoll can index files that contain multiple documents, for example Inbox
mail files.

While indexing, if such a multi-document file is found to be up to date, I
need to set an existence flag for all the subdocs so that they are not
purged at the end of the indexing pass.

In order to do this there is one unique term in the index for each
sub-document. It is constructed as follows:
   "PREFIX/path/to/multidoc/file|internalPath"

internalPath would be a message number for an mbox file, but it could be
something else for other future types (it's basically an opaque string for
the upper layer, only the document type handler understands it).
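
A minimal sketch of that term construction, for illustration only. The
helper name make_subdoc_term and the literal "PREFIX" are placeholders, not
Recoll's actual code; the internal path is treated as an opaque string, as
described above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the unique term for one sub-document.
// The prefix, the file path and the '|' separator follow the layout
// "PREFIX/path/to/multidoc/file|internalPath" described in the text.
std::string make_subdoc_term(const std::string& prefix,
                             const std::string& file_path,
                             const std::string& internal_path)
{
    // A term without a file path could not be tied back to a file.
    assert(!file_path.empty());
    // internal_path is opaque here: only the document type handler
    // knows what it means (e.g. a message number for an mbox file).
    return prefix + file_path + "|" + internal_path;
}
```

For message 12 of /path/to/multidoc/file this would yield
"PREFIX/path/to/multidoc/file|12".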

During indexing, I use allterms_begin() and skip_to() to find the possible
sub-document sequence for a given file, so I have allterms_begin(),
skip_to() (and possibly postlist_begin()) calls for every file (not good).
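
The lookup pattern can be sketched without Xapian by modelling the sorted
term list as a std::set: lower_bound() plays the role of allterms_begin()
followed by skip_to(), which positions the iterator on the first term not
less than the one given. The function name subdoc_terms is an assumption
made for this example, and the real code would walk a Xapian::TermIterator
instead of a set.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Model only: all_terms stands in for the index's sorted term list.
// file_prefix is the common part of the sub-document terms of one
// file, i.e. "PREFIX/path/to/multidoc/file|".
std::vector<std::string> subdoc_terms(const std::set<std::string>& all_terms,
                                      const std::string& file_prefix)
{
    // An empty prefix would match every term in the index.
    assert(!file_prefix.empty());
    std::vector<std::string> found;
    // Jump to the first candidate (the skip_to() step), then walk
    // forward while the terms still start with the file's prefix.
    for (auto it = all_terms.lower_bound(file_prefix);
         it != all_terms.end() &&
         it->compare(0, file_prefix.size(), file_prefix) == 0;
         ++it) {
        found.push_back(*it);
    }
    return found;
}
```

Because the terms are sorted, all sub-document terms of one file are
contiguous, so the walk stops at the first term that does not match.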

Since each of these allterms_begin() calls forces a flush, writes happen
even when there are no modifications to the index (all files up to date).

Things I've thought of to fix this:
  - I have a bogus document for the base file. I guess I could store a list
    of sub-document docids in there, either as terms to be used with
    termlist_begin() or in the document data.

  - I could probably also separate the indexing into a discovery pass done
    with a readonly index, and an actual indexing pass.
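
As a rough illustration of the first idea, the sub-document docids could be
kept as a comma-separated string in the data of the base file's document.
The function names and the comma format below are assumptions made for this
sketch; with Xapian the string would presumably be stored and read back
through Document::set_data() and Document::get_data().

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical encoding: docids joined with commas, e.g. "12,57,103".
std::string encode_docids(const std::vector<unsigned>& ids)
{
    std::ostringstream out;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i != 0)
            out << ',';
        out << ids[i];
    }
    return out.str();
}

// Inverse of encode_docids(); an empty string gives an empty list.
std::vector<unsigned> decode_docids(const std::string& data)
{
    std::vector<unsigned> ids;
    std::istringstream in(data);
    std::string field;
    while (std::getline(in, field, ',')) {
        if (field.empty())
            continue;
        std::size_t pos = 0;
        unsigned long value = std::stoul(field, &pos);
        // Each field is expected to be a plain decimal number.
        assert(pos == field.size());
        ids.push_back(static_cast<unsigned>(value));
    }
    return ids;
}
```

With such a list, marking the sub-documents of an up-to-date file as
existing would only need a read of the base document, not a scan of the
term list.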

If somebody has already solved this problem in a different way, or has an
idea, I'd sure be glad to hear about it...  

Thanks,
J.F. Dockes

Olly Betts writes:
 > On Mon, Oct 23, 2006 at 09:45:14AM +0200, Jean-Francois Dockes wrote:
 > > Except if I'm mistaken, read accesses (like postlist_begin(),
 > > get_document()) through a Xapian::WritableDatabase seem to trigger writes
 > > on the database files, which makes them slow (because of the fsync()
 > > calls).
 > 
 > It depends on the backend, but in general this shouldn't be true for
 > most methods of Database when called on a WritableDatabase.
 >     
 > Flint should only force a flush if you call allterms_begin().  It could
 > be handled, but it's not been implemented so far since it doesn't seem
 > likely you'd call this method a lot during update.  It would be nice to
 > fix this though, as it would also remove the restriction that
 > allterms_begin() can't be called during a transaction.
 > ...


