[Xapian-discuss] Compact databases and removing stale records at the same time

Bron Gondwana brong at fastmail.fm
Wed Jun 19 15:07:19 BST 2013


On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum compaction.
> > 
> > The downside of compact - can't delete things (or at least I can't see
> > how).
> 
> A lot of the reason why compact is fast is because it pretty much just
> treats the contents of each posting list chunk as opaque data (if it
> renumbers, it has to adjust the header of the first chunk from each
> postlist, if I remember correctly).

Yeah, fair enough!
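
For anyone following along at home: the compact step here is what
xapian-compact does; via the C++ API it's roughly the sketch below,
assuming the Xapian::Compactor class from 1.2.6+ (the directory names
are just our naming convention):

    /* needs <xapian.h>; sketch of a maximum-effort compact */
    Xapian::Compactor c;
    c.add_source("xapian.57");
    c.set_destdir("xapian.58");
    c.set_compaction_level(Xapian::Compactor::FULLER);
    c.compact();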

> In order to be able to delete documents as it went, it would have to
> modify any postlist chunks which contained those documents.  That's
> possible, but adds complexity to the compaction code, and will probably
> lose most of the speed advantages.

From my quick glance through the code, I figured the bigger problem was
actually garbage-collecting the terms which no longer have any
references.  I admit I don't understand how it all works quite as well
as I'd like.

> The destination of a document-by-document copy should be close to
> compact for most of the tables.  If changes were flushed during the
> copy, the postlist table may still benefit from compaction (if there
> was only one batch, then the postlist table should be compact too).

Well, I've switched to a single pass without all the transactional foo
(see the code pasted below).

Even with the new code, though, the database still shrinks a lot when
run through compact:

[brong@imap14 brong]$ du -s *
1198332	xapian.57
[brong@imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
compressing data:57 to data:58 for user.brong (active temp:92,archive:3,meta:0,data:57)
compacting databases
building cyrus.indexed.db
copying from tempdir to destination
renaming tempdir into place
finished compact of user.brong (active temp:92,archive:3,meta:0,data:58)

real	1m23.956s
user	0m32.604s
sys	0m5.948s
[brong@imap14 brong]$ du -s *
759992	xapian.58


That's about 63% of the uncompacted size.
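
If I wanted the copy itself to land compact, presumably I could force
the whole copy into a single batch (as you note above) by raising the
auto-flush threshold before opening the destination.
XAPIAN_FLUSH_THRESHOLD is the documented knob (default 10000
documents); the value here is just a guess for our data:

    /* needs <stdlib.h>; trades memory for a single commit batch */
    setenv("XAPIAN_FLUSH_THRESHOLD", "1000000", 1);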

> >     catch (const Xapian::Error &err) {
> > 	syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
> > 		    err.get_context().c_str(), err.get_description().c_str());
> 
> If err has a context, err.get_description() will actually include it.

Heh.  That's code I inherited and hadn't even looked at.  I don't think I've
ever actually seen it called.  I'll simplify it.

> > 	/* copy all matching documents to the new DB */
> > 	for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
> > 	    Xapian::Document doc = i.get_document();
> 
> This requires creating an in-memory structure of size get_doccount(), so
> won't scale well to really big databases.

My test DB is about 90k documents.  Lots of terms though, particularly from some of the emails that contain thousands of lines of syslog output.

[brong@imap14 brong]$ delve -1 -a xapian.58 | wc -l
6370721
[brong@imap14 brong]$ delve -1 -V0 xapian.58 | wc -l
89419

> But there's no need to run a match just to be able to iterate all the
> [...]
> There's no need to use transactions to do this - outside of
> [...]

v2:

    try {
	/* set up a single handle across all the source databases */
	Xapian::Database srcdb;
	while (*sources) {
	    srcdb.add_database(Xapian::Database(*sources++));
	}

	/* create a destination database */
	Xapian::WritableDatabase destdb(dest, Xapian::DB_CREATE);

	/* iterate every document (the empty-string postlist is the
	 * list of all document ids) and copy the ones the callback
	 * wants to keep */
	Xapian::PostingIterator it;
	for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); ++it) {
	    Xapian::docid did = *it;
	    Xapian::Document doc = srcdb.get_document(did);
	    std::string cyrusid = doc.get_value(SLOT_CYRUSID);
	    if (cb(cyrusid.c_str(), rock)) {
		destdb.add_document(doc);
	    }
	}

	/* commit all changes explicitly */
	destdb.commit();
    }
    catch (const Xapian::Error &err) {
	/* get_description() already includes the context, per your
	 * note above, so just log that */
	syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s",
		err.get_description().c_str());
    }

FYI: SLOT_CYRUSID is just 0.
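
On the indexing side that value gets set with Document::add_value - a
rough sketch, where 'cyrusid' stands in for whatever mailbox/uid key we
generate and 'db' is the writable index handle:

    /* index-time: stash the cyrus id in value slot 0 so the copy
     * above can fetch it with get_value(SLOT_CYRUSID) */
    Xapian::Document doc;
    doc.add_value(SLOT_CYRUSID, cyrusid);
    db.add_document(doc);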

Thanks heaps for your help on this.  Honestly, it's not a deal-breaker for us to use this much CPU.  It's a pain, but it's still heaps cheaper than re-indexing everything, and our servers are IO bound more than CPU bound, so eating a bit more CPU is survivable.

Bron.


-- 
  Bron Gondwana
  brong at fastmail.fm


