manual flushing thresholds for deletes?

Eric Wong e at 80x24.org
Thu May 4 09:45:59 BST 2023


Olly Betts <olly at survex.com> wrote:
> On Wed, May 03, 2023 at 12:38:15PM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> Adding one to each collection frequency makes little sense to me.
> 
> I'm guessing the idea is to count each boolean term once for each
> document it's in?

Yes.

> If so, you want to use the collection frequency
> for non-boolean terms and the term frequency for boolean terms,
> so that's:
>
> xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'

OK, that's still roughly 6 with my dataset.  I'll keep the
original and ignore $2 since that's irrelevant to get_doclength.

> > My Perl deletion code is something like:
> > 
> > 	my $EST_LEN = 6;
> > 	...
> > 	for my $docid (@docids) {
> > 		$TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
> 
> However you're using that estimate here, and the document length
> doesn't include boolean terms (it's sum(wdf) over the terms in the
> document), so including them in $EST_LEN seems wrong.  For you doing
> so increases $EST_LEN, so you'll tend to overestimate for long documents
> and underestimate for short ones.
> 
> You'd probably do better to add a separate fixed per-document
> contribution to allow for boolean terms.  You could come up with an
> average size of boolean terms per document to use by poking at a sample
> database.

OK, every document has one commit OID (40 bytes for SHA-1 in hex),
and git.git has ~1.26 parents per commit[1].  Thus I should be
subtracting an extra ~90 bytes (40 + (40 * 1.26) = 90.4)?

	my $EST_LEN = 6;
	my $EST_BOOL_LEN = 90; # length(commit) + mean_length(parents)
	for my $docid (@docids) {
		$TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
		$TXN_BYTES -= $EST_BOOL_LEN;
		...
	}
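
To double-check the arithmetic, here is a rough shell/awk sketch of
the per-document estimate.  est_doc_bytes is a made-up helper name,
and the constants (6, 40, 1.26) are only the estimates from this
thread, not anything Xapian provides:

```shell
# Hypothetical helper: estimated bytes for one document.
# $1 = document length as returned by get_doclength (sum of wdf).
# est_len, oid and parents are the estimates discussed in this thread.
est_doc_bytes() {
	awk -v len="$1" -v est_len=6 -v oid=40 -v parents=1.26 'BEGIN {
		bool = oid + oid * parents	# commit OID + mean parent OIDs
		printf "%d\n", len * est_len + bool
	}'
}

est_doc_bytes 100	# prints 690
```

So a document with get_doclength of 100 would count for roughly
690 bytes against the flush threshold under these assumptions.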

Thanks.


[1] Not sure if there's a quick way to dump all terms of a
    single prefix for each doc via delve or quest; but
    `git log | awk' on individual repos seems OK for getting
    parents-per-commit in a single repo:

    git log --pretty=%P | awk '{ n += NF } END { print n/NR }'

git.git has fairly high merge use, and the above prints ~1.26
parents-per-commit.  torvalds/linux.git only has ~1.08
parents-per-commit, but I'd rather err on the side of flushing
too frequently to limit memory use; esp. since git.git also has
the `seen' (formerly `pu') branch which gets pruned often.
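
For anyone wanting to sanity-check the awk part without a repo,
a small sketch on made-up input (the parents_per_commit name and
the sample OIDs are hypothetical; real input would come from
`git log --pretty=%P'):

```shell
# Mean number of parents per commit, reading `git log --pretty=%P`
# style output on stdin.  Root commits appear as empty lines
# (NF == 0) and still count towards NR.
parents_per_commit() {
	awk '{ n += NF } END { if (NR) print n / NR }'
}

# Made-up sample: one root commit, one ordinary commit, two merges
printf '%s\n' '' 'a' 'b c' 'd e' | parents_per_commit	# prints 1.25
```

Multiplying that mean by the OID length (40 for SHA-1 hex) and
adding one OID for the commit itself gives the per-document
boolean term estimate.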


