manual flushing thresholds for deletes?

Mon Mar 27 12:22:09 BST 2023

Olly Betts <olly at survex.com> wrote:
> On Fri, Mar 24, 2023 at 10:37:41AM +0000, Eric Wong wrote:
> > Realizing I had documents of hugely varying sizes (0.5KB..20MB)
> > and little RAM, I instead tracked the number of raw bytes in the
> > text being indexed and flushed whenever I'd seen a configurable
> > byte count.  Not the most scientific way, but it seems to work
> > well enough on low-end systems.
> > 
> > Now, I'm dealing with many deletes at once and hitting OOM
> > again.  Since the raw text is no longer available and I didn't
> > store its original size anywhere, would calculating something
> > based on get_doclength be a reasonable approximation?
> > 
> > I'm wondering if something like:
> > 
> >   get_doclength * (1 + 3) * mean_term_length
> > 
> >   where 1x is for the mean term length itself,
> >   and 3x for the position overhead
> 
> If I follow, you want an approximation to the number of raw bytes in the
> text to match the non-delete case, so I think you want something like:
> 
> get_doclength() / 2 * (mean_word_length + 1)
> 
> The /2 is assuming you're indexing both stemmed and unstemmed terms
> since with the default indexing strategy one word in the document
> generates one of each.
> 
> The +1 is for the spaces between words in the text.  This is
> likely to underestimate due to punctuation and runs of whitespace,
> so perhaps +1.<something> is better (and perhaps better to overestimate
> slightly and flush a little more often rather than risk OOM).

Thanks for the response.
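
A minimal Python sketch of that estimate driving a flush threshold
during deletes. The `db` object is assumed to expose `get_doclength`,
`delete_document` and `commit` as the Xapian bindings do; the function
names, `flush_bytes` and `separator_bytes` are hypothetical names used
only for illustration:

```python
def estimate_raw_bytes(doclength, mean_word_length, separator_bytes=1.0):
    """Approximate the raw text size of a document from its length.

    The /2 assumes each word produced one stemmed and one unstemmed
    term.  separator_bytes is the "+1" for inter-word spaces; raise it
    to overestimate and flush a little more often.
    """
    return doclength / 2 * (mean_word_length + separator_bytes)


def delete_with_flush(db, docids, flush_bytes, mean_word_length):
    """Delete docids, committing whenever the estimate hits flush_bytes.

    Returns the number of commits performed.
    """
    pending = 0.0
    commits = 0
    for docid in docids:
        # The length must be read before the document is deleted.
        pending += estimate_raw_bytes(db.get_doclength(docid),
                                      mean_word_length)
        db.delete_document(docid)
        if pending >= flush_bytes:
            db.commit()
            commits += 1
            pending = 0.0
    if pending > 0:
        # Commit whatever is left below the threshold.
        db.commit()
        commits += 1
    return commits
```

The threshold is still a byte count of estimated raw text, not of
memory actually used, so it is only a proxy.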

> > And perhaps assume mean_term_length is 10 bytes, so maybe:
> > 
> >   get_doclength * 40
> 
> 10 seems too long.  You want the mean word length weighted by frequency
> of occurrence.  For English that's typically around 5 characters, which
> is 5 bytes.  If we go for +1 that's:

Actually, 10 may be too short in my case since there's a lot of
40-byte SHA-1 hex (and likely SHA-256 in the future) from git; and
also long function names, etc...

Ignoring capitalized prefixes, delve gives me a mean term length of 15.3571:

	xapian-delve -a -1 . | tr -d A-Z | \
	  awk '{ d = length - mean; mean += d/NR } END { print mean }'

(that awk bit should be overflow-free)
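
The same incremental-mean update, sketched in Python so it can be
checked on its own. The function name and any terms passed to it are
made up for illustration; stripping A-Z mirrors the `tr -d A-Z` step:

```python
def mean_term_length(terms):
    """Running mean of term lengths, ignoring ASCII uppercase letters.

    Uses mean += (x - mean) / n, so no large running sum is kept.
    """
    mean = 0.0
    for n, term in enumerate(terms, start=1):
        # Drop uppercase letters, as `tr -d A-Z` does for prefixes.
        stripped = "".join(c for c in term if not ("A" <= c <= "Z"))
        mean += (len(stripped) - mean) / n
    return mean
```

Up to floating-point rounding, this gives the same value as summing
all the lengths and dividing by the count.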

> > I'm using Search::Xapian XS since it's in Debian stable;
> > and don't think there's a standard way to show the amount
> > of memory uncommitted changes are taking up.
> 
> We don't have an easy way to calculate this in any version.  It would
> need us to take more control of the allocation of the memory used to
> store these changes.  We probably need to do that as the threshold
> here really should be in terms of memory used to store the pending
> changes, but it's not a trivial change.

Understood.