[Xapian-discuss] Compact databases and removing stale records at the same time

Wed Jun 19 06:49:37 BST 2013

On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> The advantage of compact - it runs approximately 8 times as fast (we
> are CPU limited in each case - writing to tmpfs first, then rsyncing
> to the destination) and it takes approximately 75% of the space of a
> fresh database with maximum compaction.
> 
> The downside of compact - can't delete things (or at least I can't see
> how).

Much of the reason compact is fast is that it pretty much just treats
the contents of each posting list chunk as opaque data (if it
renumbers, it has to adjust the header of the first chunk of each
postlist, if I remember correctly).

In order to be able to delete documents as it went, it would have to
modify any postlist chunks which contained those documents.  That's
possible, but it would add complexity to the compaction code, and would
probably lose most of the speed advantage.
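
As a rough illustration of why deleting during compaction costs more
than a plain copy, here is a toy model.  The gap-encoded Chunk type and
all the function names are made up for this sketch and are not Xapian's
real on-disk format or API.  The point is only that an opaque copy never
looks inside the chunk, while removing a document means decoding,
filtering and re-encoding it.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy posting-list chunk: each entry is the gap from the previous docid.
// Illustrative only; this is NOT Xapian's actual chunk encoding.
using Chunk = std::vector<std::uint32_t>;

// Encode a sorted list of docids as gaps.
Chunk encode_chunk(const std::vector<std::uint32_t>& docids) {
    Chunk out;
    std::uint32_t prev = 0;
    for (std::uint32_t id : docids) {
        assert(id > prev);  // docids must be strictly increasing
        out.push_back(id - prev);
        prev = id;
    }
    return out;
}

// Decode the gaps back into absolute docids.
std::vector<std::uint32_t> decode_chunk(const Chunk& chunk) {
    std::vector<std::uint32_t> ids;
    std::uint32_t cur = 0;
    for (std::uint32_t gap : chunk) {
        cur += gap;
        ids.push_back(cur);
    }
    return ids;
}

// Compaction-style copy: the chunk contents are never interpreted.
Chunk copy_opaque(const Chunk& chunk) {
    return chunk;
}

// Copy while dropping stale docids: every entry has to be decoded,
// checked, and re-encoded, because removing one entry changes the
// gap stored for the entry after it.
Chunk copy_without(const Chunk& chunk, const std::set<std::uint32_t>& stale) {
    std::vector<std::uint32_t> kept;
    for (std::uint32_t id : decode_chunk(chunk)) {
        if (stale.count(id) == 0) kept.push_back(id);
    }
    return encode_chunk(kept);
}
```

In this model, dropping docid 7 from a chunk holding 3, 7, 8, 20 changes
the gap stored for 8, so the bytes after the deleted entry cannot simply
be copied across.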

The destination of a document-by-document copy should be close to
compact for most of the tables.  If changes were flushed during the
copy, the postlist table may still benefit from compaction (if there
was only one batch, then the postlist table should be compact too).

I've thought before that being able to compact tables independently
might be useful.

> Does anyone have any suggestions for a better way to do this?  I'll
> paste the code for the two different functions below (Cyrus is written
> in C - hence the C-compatible API interface).
[...]
>     catch (const Xapian::Error &err) {
> 	syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
> 		    err.get_context().c_str(), err.get_description().c_str());

If err has a context, err.get_description() will actually include it.

> 	Xapian::Enquire enquire(*srcdb);
> 	enquire.set_query(Xapian::Query::MatchAll);
> 	Xapian::MSet matches = enquire.get_mset(0, srcdb->get_doccount());
[...]
> 	/* copy all matching documents to the new DB */
> 	for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
> 	    Xapian::Document doc = i.get_document();

This requires creating an in-memory structure of size get_doccount(), so
it won't scale well to really big databases.

But there's no need to run a match just to be able to iterate all the
documents in the database - you can just iterate the postlist for the
empty term (via Xapian::Database::postlist_begin("")).  I'd expect
that would be a fair bit faster if you're CPU limited.

See the copydatabase example for code which uses this approach to do a
document-by-document copy.
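
A minimal sketch of that loop, with stale records skipped along the way,
might look like the following.  It is written as a template so that it
compiles without the Xapian headers; the assumption is that SrcDB and
DestDB behave like Xapian::Database and Xapian::WritableDatabase for the
four calls used here.  The function name copy_live_documents and the
is_stale predicate are hypothetical, and this is not the copydatabase
source itself.

```cpp
#include <cstddef>

// Copy every document from src to dest except those is_stale() rejects.
//
// Assumed interface (matching Xapian::Database / WritableDatabase):
//   src.postlist_begin(""), src.postlist_end("")  iterate all docids
//   src.get_document(docid)                       fetch one document
//   dest.add_document(doc)                        append to destination
//
// is_stale is a hypothetical caller-supplied predicate on a document.
template <class SrcDB, class DestDB, class IsStale>
std::size_t copy_live_documents(const SrcDB& src, DestDB& dest,
                                IsStale is_stale) {
    std::size_t copied = 0;
    // The postlist for the empty term lists every document in the
    // database, so no match (and no MSet) is needed.
    for (auto it = src.postlist_begin(""); it != src.postlist_end(""); ++it) {
        auto doc = src.get_document(*it);
        if (is_stale(doc)) continue;  // drop stale records during the copy
        dest.add_document(doc);
        ++copied;
    }
    return copied;
}
```

Note that add_document() assigns fresh document ids in the destination.
If the original ids need to be preserved, replace_document(*it, doc) is
the call to look at instead.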

> 		if (count % 1024 == 0) {
> 		    destdb->commit_transaction();
> 		    destdb->begin_transaction();
> 		}

There's no need to use transactions to do this - outside of
transactions, you'll get an automatic commit periodically anyway (if
you want to force a commit, you can just call destdb->commit()).

There's not currently much difference between the two approaches, but
the auto-commit is likely to get smarter with time (currently it is just
based on number of documents changed, but it should probably take memory
used to store changes as the primary factor).  Using transactions is
telling Xapian that you want those exact chunks of changes committed
atomically, which gives little room to be smarter.
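
For comparison, a sketch of the same kind of bulk add without any
transaction calls could be as simple as this.  The helper name
add_without_transactions is hypothetical, and DestDB is assumed to
provide add_document() and commit() in the way Xapian::WritableDatabase
does.  The single commit() at the end is optional and is only there to
force the final batch out at a point you choose.

```cpp
#include <cstddef>

// Add every document in docs to dest, leaving intermediate commits to
// the database's own auto-commit logic.
//
// Assumed interface (matching Xapian::WritableDatabase):
//   dest.add_document(doc)
//   dest.commit()
template <class DestDB, class Docs>
std::size_t add_without_transactions(DestDB& dest, const Docs& docs) {
    std::size_t added = 0;
    for (const auto& doc : docs) {
        // No begin_transaction()/commit_transaction() pairs here.
        dest.add_document(doc);
        ++added;
    }
    dest.commit();  // optional: force whatever is still pending to disk
    return added;
}
```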

Cheers,
    Olly