does Xapian::Enquire hold an MVCC revision?
Eric Wong
e at 80x24.org
Sat Aug 19 23:52:00 BST 2023
Olly Betts <olly at survex.com> wrote:
> On Fri, Aug 18, 2023 at 10:41:52AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > While the match is running, get_mset(2000, 1000) needs to track
> > > 3000 entries so this won't reduce your heap usage (at least not
> > > peak usage).
> > >
> > > Is the heap usage problematic?
> >
> > Yes, roughly ~1.3GB (in a Perl process) for ~17 million (and
> > growing) docs in the worst case of a search returning everything.
> > Those numbers appear in line with the 88 bytes w/ 64-bit libstdc++
> > you noted.
>
> I suppose for an mbox export you may not be too bothered about order (or
> are happy to have the raw order be that in which messages were added)
> in which case we only need to track the docid, so that could be just 4
> bytes per result which is ~65MB.

Right, that's great news as we creep towards 50 or 100 million docs.

> Incidentally, if you don't mind the export order and only have single-term
> queries, you can just use a PostingIterator to get a stream of document
> ids matching a particular term (in the order documents were added),
> which should use at most ~80KB (per shard if you're using a sharded
> database).

Thanks for that tip on PostingIterator, I'll keep it in mind;
but I think there's usually >= 2 terms.

> > > If this structure was dynamically sized it could be as little as just
> > > 4 bytes per entry for a boolean search, or 12 for a search without
> > > collapsing or sorting on a key (though at least x86-64 wants to align
> > > a double on an 8 byte boundary which means 4 bytes of padding per
> > > entry - that could be avoided by splitting into separate arrays).
> >
> > Yeah, it seems separate arrays would be appropriate since collapse
> > isn't commonly used AFAIK.
>
> I think that would probably be a git master only change.

No worries on the timeline; I think I can wait for 1.6, even.