manual flushing thresholds for deletes?

Mon Mar 27 19:31:26 BST 2023

On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > 10 seems too long.  You want the mean word length weighted by frequency
> > of occurrence.  For English that's typically around 5 characters, which
> > is 5 bytes.  If we go for +1 that's:
> 
> Actually, 10 may be too short in my case since there's a lot of
> 40-byte SHA-1 hex (and likely SHA-256 in the future) from git; and
> also long function names, etc...
> 
> Without capitalized prefixes, I get a mean length from delve as 15.3571:
> 
> 	xapian-delve -a -1 . | tr -d A-Z | \
> 	  awk '{ d = length - mean; mean += d/NR } END { print mean }'

That's not weighted by frequency though, and short words tend to be more
frequent, so you're likely skewing the answer.  Also it'll include
boolean terms which didn't come from the document text.
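To illustrate the skew, here is a minimal Python sketch comparing the two means. The term list and its frequencies are made up for illustration (a short common word and a 40-character placeholder standing in for a SHA-1 hex string), not taken from any real database:

```python
# Hypothetical (term, collection frequency) pairs, invented for illustration.
terms = [("the", 100), ("0" * 40, 1)]

def unweighted_mean(terms):
    # Every distinct term counts once, however rarely it occurs.
    return sum(len(t) for t, _ in terms) / len(terms)

def weighted_mean(terms):
    # Every occurrence counts once, so frequent short words dominate.
    return sum(len(t) * f for t, f in terms) / sum(f for _, f in terms)

print(unweighted_mean(terms), weighted_mean(terms))
```

With these invented numbers the unweighted mean is 21.5, while the weighted mean is 340/101, or roughly 3.4.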

You can take frequency into account with something like this:

xapian-delve -avv1 .|tr -d A-Z|awk '{t += length($1)*$3; n += $3} END {print t/n}'

This will also effectively ignore boolean terms, assuming you're giving
them wdf of 0 (because $3 here is the collection frequency, which is
sum(wdf(term)) over all documents).

> (that awk bit should be overflow-free)

I don't see how to do the above as a rolling mean, so to be accurate for
a large database it seems with awk you'll need to make two passes over
the data - one to calculate `n`, then use that in a second pass like you
do with NR.  Or use a language which supports arbitrary precision
numbers.
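As a rough sketch of the arbitrary precision route, something like the following Python could replace the awk stage. It assumes each input line is "term termfreq collfreq" separated by whitespace (the same columns the awk above relies on) and skips anything else; the function name is just illustrative, and lengths are counted in UTF-8 bytes:

```python
def weighted_mean_length(lines):
    """Mean term length in bytes, weighted by collection frequency."""
    total = 0  # sum of length * collfreq; Python ints are arbitrary precision
    count = 0  # sum of collfreq
    for line in lines:
        fields = line.split()
        # Assumed layout: term, termfreq, collfreq.  Skip anything else,
        # such as a header line.
        if len(fields) < 3 or not fields[2].isdecimal():
            continue
        # Drop ASCII capitals from the term, mirroring `tr -d A-Z`.
        term = "".join(c for c in fields[0] if not "A" <= c <= "Z")
        freq = int(fields[2])
        total += len(term.encode("utf-8")) * freq
        count += freq
    if count == 0:
        raise ValueError("no terms with a non-zero collection frequency")
    # Only the final division is done in floating point.
    return total / count
```

Saved as a script that ends with `import sys; print(weighted_mean_length(sys.stdin))`, it can be fed from `xapian-delve -avv1 .` in a single pass, since both sums are exact integers.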

Cheers,
    Olly