choosing between probabilistic and boolean prefixes for terms

Eric Wong e at 80x24.org
Thu Jul 19 21:32:23 BST 2018


Hi all,

public-inbox allows searching for git blob names (e.g. "badc0ffee")
in patches.  Initially, I chose add_prefix (probabilistic terms),
since I assumed it would offer a superset of what boolean
searching offers.  Unfortunately, that doesn't seem to be the case,
because stemming interferes.
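
To illustrate the difference, here is a minimal sketch that registers one
field each way and prints how the parser interprets the same blob name.
The field names ("blobp", "blobb") and term prefixes ("XBLOBP", "XBLOBB")
are hypothetical and chosen only for this example; they are not the ones
public-inbox uses.  No database is attached, so this only shows query
parsing, not matching.

```perl
use strict;
use warnings;
use Search::Xapian qw(:all);

my $qp = Search::Xapian::QueryParser->new;
$qp->set_stemmer(Search::Xapian::Stem->new('english'));
$qp->set_stemming_strategy(STEM_SOME);

# Hypothetical field names and term prefixes, for illustration only.
# Probabilistic prefix: terms in this field are subject to stemming.
$qp->add_prefix('blobp', 'XBLOBP');
# Boolean prefix: the term is used as-is, as a filter.
$qp->add_boolean_prefix('blobb', 'XBLOBB');

# Print the parsed form of each query so the two can be compared.
for my $q ('blobp:badc0ffee', 'blobb:badc0ffee') {
	print "$q => ", $qp->parse_query($q)->get_description, "\n";
}
```

Comparing the two descriptions should show whether the stemmer altered
the probabilistic term while the boolean term was left untouched.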

Switching to boolean filtering seems to work, and it is
fine for the mechanical searches I plan on doing:

  https://public-inbox.org/meta/20180716040734.30104-1-e@80x24.org/

Now I wonder: is there a way to get the best of both worlds, so
a human can still use wildcards?

public-inbox also allows searches on pathnames, and maybe that
should use boolean filtering, too...

My setup for the query parser isn't anything special:

our $LANG = 'english';
sub stemmer { Search::Xapian::Stem->new($LANG) }

sub qp {
	my ($self) = @_;

	my $qp = $self->{query_parser};
	return $qp if $qp;

	# new parser
	$qp = Search::Xapian::QueryParser->new;
	$qp->set_default_op(OP_AND);
	$qp->set_database($self->{xdb});
	$qp->set_stemmer($self->stemmer);
	$qp->set_stemming_strategy(STEM_SOME);
	$qp->set_max_wildcard_expansion(100);
	$qp->add_valuerangeprocessor(
		Search::Xapian::NumberValueRangeProcessor->new(YYYYMMDD, 'd:'));
	$qp->add_valuerangeprocessor(
		Search::Xapian::NumberValueRangeProcessor->new(DT, 'dt:'));
	# ... (rest of sub omitted)

In any case, all the code is available via:

	git clone https://public-inbox.org/public-inbox


