[Xapian-discuss] Compact databases and removing stale records at the same time
Bron Gondwana
brong at fastmail.fm
Wed Jun 19 15:07:19 BST 2013
On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum compaction.
> >
> > The downside of compact - can't delete things (or at least I can't see
> > how).
>
> A lot of the reason why compact is fast is because it pretty much just
> treats the contents of each posting list chunk as opaque data (if it
> renumbers, it has to adjust the header of the first chunk from each
> postlist, if I remember correctly).
Yeah, fair enough!
> In order to be able to delete documents as it went, it would have to
> modify any postlist chunks which contained those documents. That's
> possible, but adds complexity to the compaction code, and will probably
> lose most of the speed advantages.
I figured the bigger problem was actually garbage-collecting the terms
which no longer have any references - at least from my quick glance
through the code. I admit I don't understand how it all works quite as
well as I'd like.
> The destination of a document-by-document copy should be close to
> compact for most of the tables. If changes were flushed during the
> copy, the postlist table may still benefit from compaction (if there
> was only one batch, then the postlist table should be compact too).
Well, I've switched to a single pass without all the transactional foo
(see pasted below).
It still compacts a lot better with compact:
[brong at imap14 brong]$ du -s *
1198332 xapian.57
[brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
compressing data:57 to data:58 for user.brong (active temp:92,archive:3,meta:0,data:57)
compacting databases
building cyrus.indexed.db
copying from tempdir to destination
renaming tempdir into place
finished compact of user.brong (active temp:92,archive:3,meta:0,data:58)
real 1m23.956s
user 0m32.604s
sys 0m5.948s
[brong at imap14 brong]$ du -s *
759992 xapian.58
That's about 63% of the uncompacted size (759992/1198332).
> > catch (const Xapian::Error &err) {
> > syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
> > err.get_context().c_str(), err.get_description().c_str());
>
> If err has a context, err.get_description() will actually include it.
Heh. That's code I inherited and hadn't even looked at. I don't think I've
ever actually seen it called. I'll simplify it.
> > /* copy all matching documents to the new DB */
> > for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
> > Xapian::Document doc = i.get_document();
>
> This requires creating an in-memory structure of size get_doccount(), so
> won't scale well to really big databases.
My test DB is about 90k documents. Lots of terms though, particularly some of the emails which contain thousands of lines of syslog output.
[brong at imap14 brong]$ delve -1 -a xapian.58 | wc -l
6370721
[brong at imap14 brong]$ delve -1 -V0 xapian.58 | wc -l
89419
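Crude arithmetic on those two counts - distinct terms in the database per
document, on average (it's only a ratio, since plenty of terms are shared
across documents):

```shell
# counts from the two delve runs above
echo $((6370721 / 89419))
# -> 71
```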
> But there's no need to run a match just to be able to iterate all the
> [...]
> There's no need to use transactions to do this - outside of
> [...]
v2 (with the catch simplified as above):

    try {
        /* set up a cursor to read from all the source databases */
        Xapian::Database srcdb = Xapian::Database();
        while (*sources) {
            srcdb.add_database(Xapian::Database(*sources++));
        }

        /* create a destination database */
        Xapian::WritableDatabase destdb =
            Xapian::WritableDatabase(dest, Xapian::DB_CREATE);

        /* copy all matching documents to the new DB */
        Xapian::PostingIterator it;
        for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); ++it) {
            Xapian::docid did = *it;
            Xapian::Document doc = srcdb.get_document(did);
            std::string cyrusid = doc.get_value(SLOT_CYRUSID);
            if (cb(cyrusid.c_str(), rock)) {
                destdb.add_document(doc);
            }
        }

        /* commit all changes explicitly */
        destdb.commit();
    }
    catch (const Xapian::Error &err) {
        /* get_description() already includes the context */
        syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s",
               err.get_description().c_str());
    }
FYI: SLOT_CYRUSID is just 0.
Thanks heaps for your help on this. Honestly, it's not a deal-breaker for us to use this much CPU. It's a pain, but it's still heaps cheaper than re-indexing everything, and our servers are IO-bound more than CPU-bound, so eating a bit more CPU is survivable.
Bron.
--
Bron Gondwana
brong at fastmail.fm