manual flushing thresholds for deletes?

Eric Wong e at 80x24.org
Fri Mar 24 10:37:41 GMT 2023


Years ago, I ran into OOM problems with the default flush
threshold of 10000 documents while indexing (add/replace).

Realizing I had documents of hugely varying sizes (0.5KB..20MB)
and little RAM, I instead tracked the number of raw bytes in the
text being indexed and flushed whenever I'd seen a configurable
byte count.  Not the most scientific way, but it seems to work
well enough on low-end systems.
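A minimal sketch of that byte-count approach, in Python for brevity rather than the Perl bindings. The class name, the `commit` callback and `max_bytes` are made up for illustration; `commit` stands in for whatever flushes the writable database.

```python
class ByteBudgetFlusher:
    """Call commit() once the raw bytes seen since the last commit reach a limit."""

    def __init__(self, commit, max_bytes):
        self.commit = commit        # hypothetical callback, e.g. the database's commit
        self.max_bytes = max_bytes  # configurable byte threshold
        self.pending = 0            # raw bytes indexed since the last commit

    def add(self, nbytes):
        """Record nbytes of indexed text; return True if this triggered a commit."""
        self.pending += nbytes
        if self.pending >= self.max_bytes:
            self.commit()
            self.pending = 0
            return True
        return False
```

The caller would invoke `add(len(raw_text))` after each add/replace, so one 20MB document can trigger a flush on its own while thousands of 0.5KB documents share one.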


Now, I'm dealing with many deletes at once and hitting OOM
again.  Since the raw text is no longer available and I didn't
store its original size anywhere, would calculating something
based on get_doclength be a reasonable approximation?

I'm wondering if something like this would do:

  get_doclength * (1 + 3) * mean_term_length

  where 1x is for the mean term length itself,
  and 3x for the position overhead

And perhaps assume mean_term_length is 10 bytes, so maybe:

  get_doclength * 40

?
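To make the idea concrete, here is a rough sketch in Python (not the Perl bindings). The function names are hypothetical, the 10-byte mean term length and 3x position factor are the guesses from above and not measured values, and each `doclength` is assumed to come from get_doclength before the delete.

```python
def estimated_delete_bytes(doclength, mean_term_length=10, position_factor=3):
    """Rough guess at what one pending delete costs; the constants are assumptions."""
    return doclength * (1 + position_factor) * mean_term_length


def delete_batches(docs, max_bytes):
    """Group (docid, doclength) pairs so each batch stays near max_bytes.

    The caller would delete every docid in a batch and then commit.
    """
    batch, pending = [], 0
    for docid, doclength in docs:
        batch.append(docid)
        pending += estimated_delete_bytes(doclength)
        if pending >= max_bytes:
            yield batch
            batch, pending = [], 0
    if batch:
        yield batch  # leftover deletes below the threshold
```

With the defaults, a document of length 100 counts as 4000 bytes, so an 8000-byte budget would commit after every two such deletes.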

I'm using Search::Xapian XS since it's in Debian stable,
and I don't think there's a standard way to see how much
memory uncommitted changes are taking up.

Thanks for any thoughts you have.


