manual flushing thresholds for deletes?
Eric Wong
e at 80x24.org
Fri Mar 24 10:37:41 GMT 2023
Years ago, I ran into OOM problems with the default flush
threshold of 10000 documents while indexing (add/replace).
Realizing I had documents of hugely varying sizes (0.5KB..20MB)
and little RAM, I instead tracked the number of raw bytes in the
text being indexed and flushed whenever I'd seen a configurable
byte count. Not the most scientific way, but it seems to work
well enough on low-end systems.
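The byte-count approach described above could be sketched roughly as follows. This is only an illustration, not the code actually used: `ByteBudgetFlusher`, `account`, and `finish` are hypothetical names, and `flush` stands in for whatever flush/commit call the database binding provides.

```python
class ByteBudgetFlusher:
    """Call `flush` once the bytes seen since the last flush reach `limit`."""

    def __init__(self, flush, limit):
        self.flush = flush   # hypothetical callback, e.g. the database's flush
        self.limit = limit   # configurable byte budget
        self.pending = 0     # bytes accounted for since the last flush

    def account(self, nbytes):
        """Record nbytes of indexed text; return True if this triggered a flush."""
        self.pending += nbytes
        if self.pending >= self.limit:
            self.flush()
            self.pending = 0
            return True
        return False

    def finish(self):
        """Flush whatever is still pending at the end of a run."""
        if self.pending:
            self.flush()
            self.pending = 0
```

The caller would invoke `account(len(raw_text))` after each add/replace and `finish()` once at the end, so a single 20MB document forces a flush on its own while thousands of 0.5KB documents share one.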
Now, I'm dealing with many deletes at once and hitting OOM
again. Since the raw text is no longer available and I didn't
store its original size anywhere, would calculating something
based on get_doclength be a reasonable approximation?
I'm wondering if something like:

    get_doclength * (1 + 3) * mean_term_length

where 1x is for the mean term length itself,
and 3x for the position overhead

And perhaps assume mean_term_length is 10 bytes, so maybe:

    get_doclength * 40

?
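To make the proposed estimate concrete, here is a minimal sketch of a delete loop driven by it. The 10-byte mean term length and the 3x position factor are the guesses from above, not measured values. The function names are hypothetical, and `get_doclength`, `delete_document`, and `flush` are passed in as callables standing in for the corresponding database methods.

```python
MEAN_TERM_LENGTH = 10  # assumed bytes per term (a guess, not measured)
POSITION_FACTOR = 3    # assumed multiplier for position overhead (a guess)


def estimate_delete_bytes(doclength, mean_term_length=MEAN_TERM_LENGTH):
    """Rough memory cost of deleting a document: doclength * (1 + 3) * mean."""
    return doclength * (1 + POSITION_FACTOR) * mean_term_length


def delete_with_budget(docids, get_doclength, delete_document, flush, limit):
    """Delete docids, flushing whenever the estimated bytes reach `limit`.

    Returns the number of flushes performed.
    """
    pending = 0
    flushes = 0
    for docid in docids:
        # Read the length before deleting, while the document still exists.
        pending += estimate_delete_bytes(get_doclength(docid))
        delete_document(docid)
        if pending >= limit:
            flush()
            flushes += 1
            pending = 0
    if pending:
        flush()  # flush the remainder at the end of the batch
        flushes += 1
    return flushes
```

With these assumptions a document whose length is 1000 terms counts as 40000 bytes toward the budget. Whether that tracks the real memory used by uncommitted deletes is exactly the open question.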
I'm using Search::Xapian XS since it's in Debian stable,
and I don't think there's a standard way to show how much
memory uncommitted changes are taking up.
Thanks for any thoughts you have.