[Xapian-discuss] Compact databases and removing stale records at the same time

Bron Gondwana brong at fastmail.fm
Wed Jun 19 04:29:16 BST 2013


I'm trying to compact (or at least merge) multiple databases, while stripping search records which are no longer required. 

Backstory:

I've inherited the Cyrus IMAPd Xapian-based search code from Greg Banks when he left Opera.

One of the unfinished parts was removing expunged emails from the search database.

We moved from having a single search database to supporting multiple databases.  In our operational environment, we actually run four separate "tiers" of search database.  The active tier is stored on tmpfs, meaning we don't pay any IO cost.  If we lose that due to a server crash, we just have to check every folder for unindexed messages.

Once per day, we compact that to "meta", which is stored on SSD.

Once per week, we compact to "data" - merging with the existing "data" database.  They get a new name each time, so for example my current databases are:

temp:91 archive:3 data:54

If I were to compact all of those, I would first create a new database temp:92 and then compact the contents of those three (which are then read-only) into archive:4.  Once that's complete, I would rewrite the active file as "temp:92 archive:4".
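
To make the bookkeeping concrete, here is a minimal sketch of how the new active list could be derived. It is not the Cyrus code: the names Entry, parse_active and plan_rotation are hypothetical, and the generation numbering (target gets "highest existing generation plus one", starting at 1) and the output order are assumptions made only for illustration.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

/* One "tier:generation" item from the active file (hypothetical type). */
struct Entry {
    std::string tier;
    int gen;
};

/* Split a line such as "temp:91 archive:3 data:54" into entries. */
std::vector<Entry> parse_active(const std::string &line)
{
    std::vector<Entry> out;
    std::istringstream in(line);
    std::string item;
    while (in >> item) {
        std::string::size_type colon = item.find(':');
        if (colon == std::string::npos) continue;   /* skip malformed items */
        Entry e;
        e.tier = item.substr(0, colon);
        e.gen = std::stoi(item.substr(colon + 1));
        out.push_back(e);
    }
    return out;
}

/* Return the active list to write once "sources" have been compacted
 * into a new generation of "target".  Assumed order: new temp first,
 * then the new target, then every entry that was not compacted. */
std::string plan_rotation(const std::string &active,
                          const std::set<std::string> &sources,
                          const std::string &target)
{
    std::vector<Entry> entries = parse_active(active);
    std::vector<Entry> kept;
    int target_gen = 0;   /* highest existing generation of the target tier */
    int temp_gen = -1;    /* -1: temp is not among the sources */

    for (const Entry &e : entries) {
        if (e.tier == target && e.gen > target_gen) target_gen = e.gen;
        if (sources.count(e.tier)) {
            if (e.tier == "temp") temp_gen = e.gen;
        } else {
            kept.push_back(e);
        }
    }

    std::ostringstream out;
    const char *sep = "";
    if (temp_gen >= 0) {
        /* writers need a fresh temp database before the old one is frozen */
        out << "temp:" << (temp_gen + 1);
        sep = " ";
    }
    out << sep << target << ":" << (target_gen + 1);
    for (const Entry &e : kept) out << " " << e.tier << ":" << e.gen;
    return out.str();
}
```

With the databases listed above, compacting temp, archive and data into archive yields the string "temp:92 archive:4".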

I'd like to clean out stale records at the same time - but this doesn't seem possible via the compact API.  So I have two different functions, one that iterates, and one that uses compact.

The advantage of compact - it runs approximately 8 times as fast (we are CPU limited in each case - writing to tmpfs first, then rsyncing to the destination) and it takes approximately 75% of the space of a fresh database with maximum compaction.

The downside of compact - can't delete things (or at least I can't see how).

Does anyone have any suggestions for a better way to do this?  I'll paste the code for the two different functions below (Cyrus is written in C - hence the C-compatible interface).

I would prefer not to write to the source databases at all - the idea is that all except the "temp" database are read-only for all callers.

Thanks,

Bron.

----

int xapian_compact_dbs(const char *dest, const char **sources)
{
    int r = 0;

    try {
	/* stack-allocated, so it is destroyed even if compact() throws */
	Xapian::Compactor c;

	while (*sources) {
	    c.add_source(*sources++);
	}

	c.set_destdir(dest);

	/* we never write to compaction targets again */
	c.set_compaction_level(Xapian::Compactor::FULLER);

	c.set_multipass(true);

	c.compact();
    }
    catch (const Xapian::Error &err) {
	syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
		    err.get_context().c_str(), err.get_description().c_str());
	r = IMAP_IOERROR;
    }

    return r;
}


/* cb returns true if document should be copied, false if not */
int xapian_filter(const char *dest, const char **sources,
		  int (*cb)(const char *cyrusid, void *rock),
		  void *rock)
{
    int r = 0;
    int count = 0;

    try {
	/* set up a cursor to read from all the source databases;
	 * stack objects are cleaned up even if an exception is thrown */
	Xapian::Database srcdb;
	while (*sources) {
	    srcdb.add_database(Xapian::Database(*sources++));
	}
	Xapian::Enquire enquire(srcdb);
	enquire.set_query(Xapian::Query::MatchAll);
	Xapian::MSet matches = enquire.get_mset(0, srcdb.get_doccount());

	/* create a destination database */
	Xapian::WritableDatabase destdb(dest, Xapian::DB_CREATE_OR_OPEN);
	destdb.begin_transaction();

	/* copy all matching documents to the new DB */
	for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
	    Xapian::Document doc = i.get_document();
	    std::string cyrusid = doc.get_value(SLOT_CYRUSID);
	    if (cb(cyrusid.c_str(), rock)) {
		destdb.add_document(doc);
		count++;
		/* commit occasionally */
		if (count % 1024 == 0) {
		    destdb.commit_transaction();
		    destdb.begin_transaction();
		}
	    }
	}

	/* commit the final transaction */
	destdb.commit_transaction();
    }
    catch (const Xapian::Error &err) {
	syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
		    err.get_context().c_str(), err.get_description().c_str());
	r = IMAP_IOERROR;
    }

    return r;
}
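
For illustration, a callback for xapian_filter could look something like the sketch below. It assumes the caller has already collected the ids of the messages that still exist into a set. The live_ids struct and the keep_if_live function are hypothetical names, not part of Cyrus.

```cpp
#include <set>
#include <string>

/* Hypothetical "rock": the ids of messages that have not been expunged. */
struct live_ids {
    std::set<std::string> ids;
};

/* Matches the cb signature taken by xapian_filter: returns non-zero
 * if the document should be copied, zero if it should be dropped. */
extern "C" int keep_if_live(const char *cyrusid, void *rock)
{
    const live_ids *live = static_cast<const live_ids *>(rock);
    return live->ids.count(cyrusid) ? 1 : 0;
}
```

It would be passed as xapian_filter(dest, sources, keep_if_live, &live), where live is a live_ids filled in beforehand.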

-- 
  Bron Gondwana
  brong at fastmail.fm


