does Xapian::Enquire hold an MVCC revision?

Sat Aug 19 23:52:00 BST 2023

Olly Betts <olly at survex.com> wrote:
> On Fri, Aug 18, 2023 at 10:41:52AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > While the match is running, get_mset(2000, 1000) needs to track
> > > 3000 entries so this won't reduce your heap usage (at least not
> > > peak usage).
> > > 
> > > Is the heap usage problematic?
> > 
> > Yes, roughly ~1.3GB (in a Perl process) for ~17 million (and
> > growing) docs in the worst case of a search returning everything.
> > Those numbers appear in line with the 88 bytes w/ 64-bit libstdc++
> > you noted.
> 
> I suppose for an mbox export you may not be too bothered about order (or
> are happy to have the raw order be that in which messages were added)
> in which case we only need to track the docid, so that could be just 4
> bytes per result which is ~65MB.

Right, that's great news as we creep towards 50 or 100 million docs.

> Incidentally, if you don't mind the export order and only have single term
> queries you can just use a PostingIterator to get a stream of document
> ids matching a particular term (in the order documents were added),
> which should use at most ~80KB (per shard if you're using a sharded
> database).

Thanks for that tip on PostingIterator, I'll keep it in mind;
but I think there's usually >= 2 terms.

> > > If this structure was dynamically sized it could be as little as just
> > > 4 bytes per entry for a boolean search, or 12 for a search without
> > > collapsing or sorting on a key (though at least x86-64 wants to align
> > > a double on an 8 byte boundary which means 4 bytes of padding per
> > > entry - that could be avoided by splitting into separate arrays).
> > 
> > Yeah, it seems separate arrays would be appropriate since collapse
> > isn't commonly used AFAIK.
> 
> I think that would probably be a git master only change.

No worries on the timeline; I think I can wait for 1.6, even.