[Xapian-discuss] Compact databases and removing stale records at the same time

Bron Gondwana brong at fastmail.fm
Thu Jun 20 12:57:14 BST 2013

On Thu, Jun 20, 2013, at 10:24 AM, Olly Betts wrote:
> On Thu, Jun 20, 2013 at 12:07:19AM +1000, Bron Gondwana wrote:
> > On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> > > In order to be able to delete documents as it went, it would have to
> > > modify any postlist chunks which contained those documents.  That's
> > > possible, but adds complexity to the compaction code, and will probably
> > > lose most of the speed advantages.
> > 
> > I figured the bigger problem was actually garbage collecting the terms
> > which didn't have references any more - in my quick glance through the
> > code.  I admit I don't understand how it all works quite as well as I'd
> > like.
> Each term has a chunked list of postings (which are (docid, wdf) pairs)
> so there's not really much to the "garbage collecting" part - if that
> list is empty, the term is no longer present in the database.

Sure - it's more a matter of knowing which postings matter at compact
time (since I'd filter by a callback based on value[0]).
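
To make the point above concrete, here is a toy in-memory model of what filtering at compact time would have to do to the postings. It is only a sketch: Xapian's real postlists are compressed chunks on disk, and the names PostingIndex and filter_postings are invented for illustration, not part of any Xapian API. The `keep` predicate stands in for the value[0]-based callback.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: real Xapian postlists are chunked and compressed on disk.
using DocId = unsigned;
using Posting = std::pair<DocId, unsigned>;  // (docid, wdf)
using PostingIndex = std::map<std::string, std::vector<Posting>>;

// Hypothetical helper: drop every posting whose docid fails `keep`, then
// erase any term whose posting list has become empty.
inline void filter_postings(PostingIndex& index,
                            const std::function<bool(DocId)>& keep) {
    for (auto it = index.begin(); it != index.end();) {
        std::vector<Posting> kept;
        for (const Posting& p : it->second) {
            if (keep(p.first)) kept.push_back(p);
        }
        it->second.swap(kept);
        if (it->second.empty()) {
            // An empty list means the term is no longer in the database.
            it = index.erase(it);
        } else {
            ++it;
        }
    }
}
```

Every list containing a dropped docid has to be rewritten, which is the extra work that a straight block-level compaction avoids.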

> > It still compacts a lot better with compact:
> > 
> > [brong at imap14 brong]$ du -s *
> > 1198332	xapian.57
> > [brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
> [...]
> > [brong at imap14 brong]$ du -s *
> > 759992	xapian.58
> How does that break down by table though?  Looking at the sizes of the
> corresponding .DB files before and after will give you most of this info
> (the base files are much smaller, and essentially proportional in size).

DB is slightly larger now (another week's data indexed), but it should be fine.

Compact result (ignore cyrus.indexed.db - that's our internal format for tracking which records need to be indexed):

[brong at imap14 brong]$ du -s xapian.60/*
8	xapian.60/cyrus.indexed.db
4	xapian.60/iamchert
4	xapian.60/position.baseA
8	xapian.60/position.baseB
496332	xapian.60/position.DB
4	xapian.60/postlist.baseA
4	xapian.60/postlist.baseB
214840	xapian.60/postlist.DB
4	xapian.60/record.baseA
4	xapian.60/record.baseB
1072	xapian.60/record.DB
4	xapian.60/termlist.baseA
4	xapian.60/termlist.baseB
67100	xapian.60/termlist.DB

And the result using the direct copy version, below.  It looks like most of the difference is in the postlist table.

[brong at imap14 brong]$ du -s xapian.61/*
8	xapian.61/cyrus.indexed.db
0	xapian.61/flintlock
4	xapian.61/iamchert
8	xapian.61/position.baseA
8	xapian.61/position.baseB
500224	xapian.61/position.DB
12	xapian.61/postlist.baseA
12	xapian.61/postlist.baseB
619196	xapian.61/postlist.DB
4	xapian.61/record.baseA
4	xapian.61/record.baseB
1088	xapian.61/record.DB
4	xapian.61/termlist.baseA
4	xapian.61/termlist.baseB
93680	xapian.61/termlist.DB
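
For reference, the per-table difference between the two listings can be worked out from the .DB sizes. The sketch below just does that arithmetic on the du figures quoted above (du blocks, which are 1 KiB by default with GNU du; that unit is an assumption about this machine). The struct and function names are made up for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sizes of each table's .DB file in du blocks, taken from the listings above.
struct TableSizes {
    long compacted;  // xapian.60, produced by compact
    long copied;     // xapian.61, produced by the direct copy version
};

inline std::map<std::string, TableSizes> table_sizes() {
    return {
        {"position", {496332, 500224}},
        {"postlist", {214840, 619196}},
        {"record",   {1072,   1088}},
        {"termlist", {67100,  93680}},
    };
}

// How much larger the copied table is than the compacted one.
inline long growth(const TableSizes& t) { return t.copied - t.compacted; }

// Sum of the growth over all four tables.
inline long total_growth(const std::map<std::string, TableSizes>& tables) {
    long total = 0;
    for (const auto& entry : tables) total += growth(entry.second);
    return total;
}
```

With these figures the postlist table accounts for 404356 of the 434844 extra blocks, while position, termlist and record add 3892, 26580 and 16 respectively.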

> With multiple databases as above, the docids are interleaved, so it
> might be worth trying to open each source and copy its documents to
> destdb in turn for better locality of reference, and so better cache
> use.

Sounds sane.  I'll try that.

> That's assuming the raw docid order doesn't matter to you.

Not at all.  I really don't care about docids at all.
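
As a small illustration of the interleaving mentioned above: when several databases are opened together, combined docid = (local docid - 1) * N + shard index + 1. The helpers below are a stdlib-only sketch of that mapping, with hypothetical names, not Xapian functions.

```cpp
#include <cassert>

// `shard` is the 0-based index of the source database, `n` the number of
// source databases; local and combined docids both start at 1.
inline unsigned combined_docid(unsigned local, unsigned shard, unsigned n) {
    return (local - 1) * n + shard + 1;
}

// Which source database a combined docid lives in.
inline unsigned shard_of(unsigned combined, unsigned n) {
    return (combined - 1) % n;
}

// The docid that document has inside its own source database.
inline unsigned local_docid(unsigned combined, unsigned n) {
    return (combined - 1) / n + 1;
}
```

With three sources, combined docids 1, 2, 3, 4 come from sources 0, 1, 2, 0, so walking the combined database in docid order switches source on every step, whereas copying one source at a time reads each sequentially.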

> Is the CYRUSID value always non-empty?  If it is, you can actually
> iterate that stream of values directly - something like:

It sure should be - I've had a couple of cases where a message wound up without a CyrusID... only discovered because it triggered assertion failures on read.  They should always have a CyrusID.

> 	Xapian::ValueIterator it;
> 	for (it = srcdb.valuestream_begin(SLOT_CYRUSID); it != srcdb.valuestream_end(SLOT_CYRUSID); ++it) {
> 	    // *it is the value string; the docid comes from the iterator itself.
> 	    if (cb((*it).c_str(), rock)) {
> 		Xapian::docid did = it.get_docid();
> 		Xapian::Document doc = srcdb.get_document(did);
> 		destdb.add_document(doc);
> 	    }
> 	}

Going to give that a go, with separate document reads.  Thanks.

> I suspect the document copying actually takes most of the time here,
> unless you're discarding a lot of them.

Yeah, I think so too.  Anyway - I'll keep working on this code.  We need something that does what it does.

Thanks again,


  Bron Gondwana
  brong at fastmail.fm
