choosing between probabilistic and boolean prefixes for terms
Eric Wong
e at 80x24.org
Thu Jul 19 21:32:23 BST 2018
Hi all,
public-inbox allows searching for git blob names (e.g. "badc0ffee")
in patches. Initially, I chose to use add_prefix for probabilistic
terms, since I assumed it could be a superset of what boolean
searching offered. Unfortunately, that doesn't seem to be the case,
because stemming is interfering.
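A minimal sketch of the difference, not taken from the public-inbox code: it maps one term prefix to both a probabilistic and a boolean field and prints how the QueryParser describes each parsed query. The field names "prob" and "bool" and the term prefix 'XDFPOST' are assumptions picked for illustration; no database is attached, so wildcards are not exercised here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Search::Xapian qw(:ops :qpstem);

# Hypothetical term prefix, chosen only for this sketch.
my $TERM_PREFIX = 'XDFPOST';

# Parse $query_string with a parser configured like the one in this post
# (English stemmer, STEM_SOME, OP_AND) and return the query description.
sub describe {
    my ($query_string) = @_;
    my $qp = Search::Xapian::QueryParser->new;
    $qp->set_default_op(OP_AND);
    $qp->set_stemmer(Search::Xapian::Stem->new('english'));
    $qp->set_stemming_strategy(STEM_SOME);

    # Same term prefix, exposed once as a probabilistic field and once
    # as a boolean filter, so the two parses can be compared directly.
    $qp->add_prefix('prob', $TERM_PREFIX);
    $qp->add_boolean_prefix('bool', $TERM_PREFIX);

    return $qp->parse_query($query_string)->get_description;
}

print describe('prob:badc0ffee'), "\n";
print describe('bool:badc0ffee'), "\n";
```

Comparing the two printed descriptions shows whether the probabilistic form was rewritten by the stemmer while the boolean form kept the blob name exactly as typed.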
Switching to boolean filtering seems to work, and it is fine
for the mechanical searches I plan on doing:
https://public-inbox.org/meta/20180716040734.30104-1-e@80x24.org/
Now I wonder: is there a way to get the best of both worlds, so
that a human can still use wildcards?
public-inbox also allows searches on pathnames, and maybe that
should use boolean filtering, too...
My setup for the query parser isn't anything special:
our $LANG = 'english';
sub stemmer { Search::Xapian::Stem->new($LANG) }

sub qp {
    my ($self) = @_;

    # reuse the cached parser if one was already built
    my $qp = $self->{query_parser};
    return $qp if $qp;

    # new parser
    $qp = Search::Xapian::QueryParser->new;
    $qp->set_default_op(OP_AND);
    $qp->set_database($self->{xdb});
    $qp->set_stemmer($self->stemmer);
    $qp->set_stemming_strategy(STEM_SOME);
    $qp->set_max_wildcard_expansion(100);

    # YYYYMMDD and DT are value slot constants defined elsewhere
    $qp->add_valuerangeprocessor(
        Search::Xapian::NumberValueRangeProcessor->new(YYYYMMDD, 'd:'));
    $qp->add_valuerangeprocessor(
        Search::Xapian::NumberValueRangeProcessor->new(DT, 'dt:'));

    # the excerpt ends here; the closing lines below are assumed
    $self->{query_parser} = $qp; # cache for later calls
}
In any case, all the code is available via:
git clone https://public-inbox.org/public-inbox