[Xapian-discuss] Compact databases and removing stale records at the same time

Olly Betts olly at survex.com
Thu Jun 20 01:24:30 BST 2013


On Thu, Jun 20, 2013 at 12:07:19AM +1000, Bron Gondwana wrote:
> On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> > In order to be able to delete documents as it went, it would have to
> > modify any postlist chunks which contained those documents.  That's
> > possible, but adds complexity to the compaction code, and will probably
> > lose most of the speed advantages.
> 
> I figured the bigger problem was actually garbage-collecting the terms
> which no longer have any references - at least from my quick glance
> through the code.  I admit I don't understand how it all works quite as
> well as I'd like.

Each term has a chunked list of postings (which are (docid, wdf) pairs)
so there's not really much to the "garbage collecting" part - if that
list is empty, the term is no longer present in the database.
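
You can see that from the API side - e.g. (untested fragment; "path" and
"term" stand in for real values):

	Xapian::Database db(path);
	/* The posting list for a term is just its remaining (docid, wdf)
	 * entries - once none are left, term_exists() reports false and
	 * the term won't show up when iterating allterms either. */
	for (Xapian::PostingIterator p = db.postlist_begin(term);
	     p != db.postlist_end(term); ++p) {
	    printf("docid %u wdf %u\n", (unsigned)*p, (unsigned)p.get_wdf());
	}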

> > The destination of a document-by-document copy should be close to
> > compact for most of the tables.  If changes were flushed during the
> > copy, the postlist table may still benefit from compaction (if there
> > was only one batch, then the postlist table should be compact too).
> 
> Well, I've switched to a single pass without all the transactional foo
> (see pasted below)
> 
> It still compacts a lot better with compact:
> 
> [brong at imap14 brong]$ du -s *
> 1198332	xapian.57
> [brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
[...]
> [brong at imap14 brong]$ du -s *
> 759992	xapian.58

How does that break down by table, though?  Looking at the sizes of the
corresponding .DB files before and after will give you most of this info
(the base files are much smaller, and essentially proportional in size).

> 	Xapian::Database srcdb = Xapian::Database();
> 	while (*sources) {
> 	    srcdb.add_database(Xapian::Database(*sources++));
> 	}
> 
> 	/* create a destination database */
> 	Xapian::WritableDatabase destdb = Xapian::WritableDatabase(dest, Xapian::DB_CREATE);
> 
> 	/* copy all matching documents to the new DB */
> 	Xapian::PostingIterator it;
> 	for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); it++) {
> 	    Xapian::docid did = *it;
> 	    Xapian::Document doc = srcdb.get_document(did);
> 	    std::string cyrusid = doc.get_value(SLOT_CYRUSID);
> 	    if (cb(cyrusid.c_str(), rock)) {
> 		destdb.add_document(doc);
> 	    }
> 	}

With multiple databases as above, the docids are interleaved, so it
might be worth trying to open each source and copy its documents to
destdb in turn for better locality of reference, and so better cache
use.

That's assuming the raw docid order doesn't matter to you.
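
Roughly (untested, reusing the names from your snippet):

	/* One pass per source database, so each source is read in its own
	 * docid order rather than interleaved across all the sources. */
	Xapian::WritableDatabase destdb(dest, Xapian::DB_CREATE);
	while (*sources) {
	    Xapian::Database srcdb(*sources++);
	    Xapian::PostingIterator it;
	    for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); ++it) {
		Xapian::Document doc = srcdb.get_document(*it);
		std::string cyrusid = doc.get_value(SLOT_CYRUSID);
		if (cb(cyrusid.c_str(), rock)) {
		    destdb.add_document(doc);
		}
	    }
	}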

Is the CYRUSID value always non-empty?  If it is, you can actually
iterate that stream of values directly - something like:

	Xapian::ValueIterator it;
	for (it = srcdb.valuestream_begin(SLOT_CYRUSID); it != srcdb.valuestream_end(SLOT_CYRUSID); ++it) {
	    if (cb((*it).c_str(), rock)) {
		Xapian::docid did = it.get_docid();
		Xapian::Document doc = srcdb.get_document(did);
		destdb.add_document(doc);
	    }
	}

This will omit any documents with an empty value in SLOT_CYRUSID though
(there's no distinction between an empty and unset value).

I suspect the document copying actually takes most of the time here,
unless you're discarding a lot of them.

Cheers,
    Olly


