manual flushing thresholds for deletes?

Olly Betts olly at survex.com
Wed May 3 22:02:11 BST 2023


On Wed, May 03, 2023 at 12:38:15PM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > This will also effectively ignore boolean terms, assuming you're giving
> > them wdf of 0 (because $3 here is the collection frequency, which is
> > sum(wdf(term)) over all documents).
> 
> Should boolean terms be ignored when estimating flushing
> thresholds?  They do have a wdf of 0 in my case.  I'm indexing
> git commit SHA-1 hex (and soon SHA-256), so that's a lot of
> 40-64 char terms.  Every commit has the commit OID itself, and
> the parent OID(s); commit OIDs are unique to each document,
> but parents are not always unique, though many are...
> 
> Surely boolean terms are not free when accounting for memory use
> on deletes, right?  I account for them when indexing since
> I extract boolean terms from raw text (and rely on the length of
> the raw text (including whitespace) to account for flushing).

They take space to store, but it's probably not helpful to try to
include them when estimating the average term length as a multiplier for
the count of non-boolean terms (which is what the document length is).
If that's not clear, keep reading...

> Anyways, your above awk snippet gave me 5.82497.  Though, if I
> wanted to account for boolean terms, I'd use ($3 + 1) instead, e.g.:
> 
> 	awk 'NR > 1 {t += length($1)*($3+1); n += ($3+1)} END {print t/n}'
> 	# (also added "NR > 1" to ignore the delve header line)
> 
> Which gives me 6.00067, so rounding to 6 seems fine either way.

Adding one to each collection frequency makes little sense to me.

I'm guessing the idea is to count each boolean term once for each
document it's in?  If so, you want to use the collection frequency
for non-boolean terms and the term frequency for boolean terms,
so that's:

xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'
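
To sanity-check that formula with fabricated numbers (the three terms
and their frequencies below are invented, not from a real database),
assuming delve's term lines are "term termfreq collfreq":

```shell
# Toy stand-in for the `xapian-delve -avv1 .` term lines.
# "deadbeef" plays the role of a boolean term: wdf 0, so collfreq 0,
# and its termfreq (2) is used instead.
printf '%s\n' \
  'hello 2 5' \
  'world 3 4' \
  'deadbeef 2 0' |
awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'
# "hello" contributes 5*5 bytes, "world" 5*4, "deadbeef" 8*2,
# giving 61/11 = 5.54545
```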

> My Perl deletion code is something like:
> 
> 	my $EST_LEN = 6;
> 	...
> 	for my $docid (@docids) {
> 		$TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;

However, you're using that estimate here, and the document length
doesn't include boolean terms (it's sum(wdf) over the terms in the
document), so including them in $EST_LEN seems wrong.  In your case
doing so increases $EST_LEN, so you'll tend to overestimate for long
documents and underestimate for short ones.
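
A toy calculation (all numbers invented) makes the skew concrete: if
boolean terms cost a roughly fixed number of bytes per document,
folding them into a per-term multiplier over-charges long documents
and under-charges short ones:

```shell
# Invented numbers: ordinary terms average 5 bytes each, and every
# document carries a fixed ~80 bytes of boolean terms.  Inflating
# EST_LEN from 5 to 6 to "cover" the boolean terms skews both ways.
awk 'BEGIN {
  for (doclen = 20; doclen <= 500; doclen *= 25) {
    actual = doclen * 5 + 80   # true cost: term bytes plus fixed boolean cost
    inflated = doclen * 6      # estimate with boolean terms folded into EST_LEN
    printf "doclen %d: actual %d, inflated estimate %d\n", doclen, actual, inflated
  }
}'
# doclen 20:  actual 180,  estimate 120 (under)
# doclen 500: actual 2580, estimate 3000 (over)
```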

You'd probably do better to add a separate fixed per-document
contribution to allow for boolean terms.  You could come up with an
average size of boolean terms per document to use by poking at a sample
database.
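
As a sketch of that "poking at a sample database" step (the input below
again fakes delve's "term termfreq collfreq" lines, and the document
count of 2 is invented; the real figure would come from something like
Xapian::Database's get_doccount()):

```shell
# Sum the bytes of boolean terms (collfreq 0), counting each once per
# document it appears in (termfreq), then divide by the document count.
DOCCOUNT=2
printf '%s\n' \
  'hello 2 5' \
  'deadbeef 2 0' \
  'cafef00d 1 0' |
awk -v docs="$DOCCOUNT" '$3 == 0 {b += length($1)*$2} END {print b/docs}'
# deadbeef: 8*2 bytes, cafef00d: 8*1 bytes => 24/2 = 12 bytes/document
```

That per-document figure could then be added on top of
get_doclength($docid) * $EST_LEN when estimating how much a delete
frees up.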

Cheers,
    Olly



More information about the Xapian-discuss mailing list