[Xapian-discuss] Compact databases and removing stale records at the same time
Bron Gondwana
brong at fastmail.fm
Thu Jun 20 12:57:14 BST 2013
On Thu, Jun 20, 2013, at 10:24 AM, Olly Betts wrote:
> On Thu, Jun 20, 2013 at 12:07:19AM +1000, Bron Gondwana wrote:
> > On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> > > In order to be able to delete documents as it went, it would have to
> > > modify any postlist chunks which contained those documents. That's
> > > possible, but adds complexity to the compaction code, and will probably
> > > lose most of the speed advantages.
> >
> > I figured the bigger problem was actually garbage collecting the terms
> > which didn't have references any more - in my quick glance through the
> > code. I admit I don't understand how it all works quite as well as I'd
> > like.
>
> Each term has a chunked list of postings (which are (docid, wdf) pairs)
> so there's not really much to the "garbage collecting" part - if that
> list is empty, the term is no longer present in the database.
Sure - the harder part is knowing which postings matter at compact time
(since I'd filter documents by a callback based on value[0]).
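To make the point above concrete, here is a toy sketch of what filtering postings would amount to. It is only an in-memory model, not Xapian's actual chunked on-disk format, and the names `PostlistTable` and `filter_postings` are made up for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the postlist table: term -> list of (docid, wdf) postings.
// Hypothetical types for illustration; this is NOT Xapian's storage format.
using Posting = std::pair<std::uint32_t, std::uint32_t>;  // (docid, wdf)
using PostlistTable = std::map<std::string, std::vector<Posting>>;

// Drop every posting whose docid is rejected by keep(). A term whose list
// ends up empty is erased, which is all the "garbage collection" amounts to.
// Returns the number of terms that disappeared.
inline std::size_t filter_postings(
    PostlistTable& table,
    const std::function<bool(std::uint32_t)>& keep) {
    std::size_t removed_terms = 0;
    for (auto it = table.begin(); it != table.end();) {
        std::vector<Posting>& postings = it->second;
        postings.erase(
            std::remove_if(postings.begin(), postings.end(),
                           [&keep](const Posting& p) { return !keep(p.first); }),
            postings.end());
        if (postings.empty()) {
            it = table.erase(it);  // empty list: the term is gone
            ++removed_terms;
        } else {
            ++it;
        }
    }
    return removed_terms;
}
```

The expensive bit in the real thing would be rewriting every chunk that contains a dropped docid, which this model glosses over entirely.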
> > It still compacts a lot better with compact:
> >
> > [brong@imap14 brong]$ du -s *
> > 1198332 xapian.57
> > [brong@imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
> [...]
> > [brong@imap14 brong]$ du -s *
> > 759992 xapian.58
>
> How does that break down by table though? Looking at the sizes of the
> corresponding .DB files before and after will give you most of this info
> (the base files are much smaller, and essentially proportional in size).
The DB is slightly larger now (another week's data indexed), but it should be fine.
Compact result (ignore cyrus.indexed.db - that's our internal format for tracking which records need to be indexed):
[brong@imap14 brong]$ du -s xapian.60/*
8 xapian.60/cyrus.indexed.db
4 xapian.60/iamchert
4 xapian.60/position.baseA
8 xapian.60/position.baseB
496332 xapian.60/position.DB
4 xapian.60/postlist.baseA
4 xapian.60/postlist.baseB
214840 xapian.60/postlist.DB
4 xapian.60/record.baseA
4 xapian.60/record.baseB
1072 xapian.60/record.DB
4 xapian.60/termlist.baseA
4 xapian.60/termlist.baseB
67100 xapian.60/termlist.DB
And using the direct copy version (it looks like most of the difference is in the postlist table):
[brong@imap14 brong]$ du -s xapian.61/*
8 xapian.61/cyrus.indexed.db
0 xapian.61/flintlock
4 xapian.61/iamchert
8 xapian.61/position.baseA
8 xapian.61/position.baseB
500224 xapian.61/position.DB
12 xapian.61/postlist.baseA
12 xapian.61/postlist.baseB
619196 xapian.61/postlist.DB
4 xapian.61/record.baseA
4 xapian.61/record.baseB
1088 xapian.61/record.DB
4 xapian.61/termlist.baseA
4 xapian.61/termlist.baseB
93680 xapian.61/termlist.DB
> With multiple databases as above, the docids are interleaved, so it
> might be worth trying to open each source and copy its documents to
> destdb in turn for better locality of reference, and so better cache
> use.
Sounds sane. I'll try that.
> That's assuming the raw docid order doesn't matter to you.
Not at all. I really don't care about docids at all.
> Is the CYRUSID value always non-empty? If it is, you can actually
> iterate that stream of values directly - something like:
It certainly should be. I've had a couple of cases where a message wound up without a CYRUSID, which I only discovered because it triggered assertion failures on read, but they should always have one.
> Xapian::ValueIterator it;
> for (it = srcdb.valuestream_begin(SLOT_CYRUSID); it != srcdb.valuestream_end(SLOT_CYRUSID); ++it) {
>     // *it is the value (a std::string); the docid comes from the iterator itself.
>     if (cb((*it).c_str(), rock)) {
>         Xapian::docid did = it.get_docid();
>         Xapian::Document doc = srcdb.get_document(did);
>         destdb.add_document(doc);
>     }
> }
Going to give that a go, with separate document reads. Thanks.
> I suspect the document copying actually takes most of the time here,
> unless you're discarding a lot of them.
Yeah, I think so too. Anyway - I'll keep working on this code. We need something that does what it does.
Thanks again,
Bron.
--
Bron Gondwana
brong at fastmail.fm