choosing between probabilistic and boolean prefixes for terms
olly at survex.com
Wed Jul 25 05:45:25 BST 2018
On Thu, Jul 19, 2018 at 08:32:23PM +0000, Eric Wong wrote:
> public-inbox allows searching for git blob names (e.g. "badc0ffee")
> in patches. Initially, I chose to use add_prefix for probabilistic
> terms, since I assumed it could be a superset of what boolean
> searching offered. Unfortunately, it doesn't seem to be the case
> because stemming is interfering.
> So switching to boolean filtering seems to work; and it is
> fine for mechanical searches I plan on doing:
> Now I wonder, is there a way to get the best-of-both-worlds so
> a human can still use wildcards?
I struggle to think of a situation in which one would want to
wildcard search for a git sha...
> public-inbox also allows searches on pathnames, and maybe that
> should use boolean filtering, too...
...but for a pathname that's more believable.
Currently you can't specify a different stemmer (or stemming mode)
per prefix. Perhaps that should be supported - there are common
cases such as "author" fields where the stemming can be harmful,
but currently you'd have to have a separate text entry field for the
author search to support that directly.
I think you could use add_prefix() with a FieldProcessor object,
since that gets passed the term without stemming. However,
FieldProcessor isn't wrapped by Search::Xapian. The SWIG-based Perl
bindings do wrap it, but their API isn't 100% the same as
Search::Xapian's, so you'd need to test and probably adjust some of
your code to port to them. They are the future for using Xapian from
Perl, but I've been hoping to sort out the incompatibilities before
pushing them more.
There isn't currently a flag to enable wildcards for boolean terms,
but that could be supported I think. It mostly isn't supported
because it seems less useful, and because it's assumed you could
have any character in a boolean term and "*" being special works
against that. Some of the options to limit expansion don't really
make sense for a boolean filter, but I guess that's a case of "well
don't do that then".
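
To make the trade-off concrete, here is a rough sketch in plain Python
of what wildcard expansion over boolean terms amounts to. It is only a
model, not Xapian's API: the "XBLOB" and "XPATH" prefixes, the term
list and the expand_wildcard() helper are all made up for
illustration, and a real implementation would walk the database's
term list rather than a Python list.

```python
from bisect import bisect_left


def expand_wildcard(sorted_terms, prefix, pattern, max_expansion=0):
    """Return the terms matching prefix + pattern.

    A trailing "*" in pattern is treated as a wildcard. max_expansion
    of 0 means "no limit"; otherwise ValueError is raised if more than
    max_expansion terms match. (Hypothetical helper, not a Xapian call.)
    """
    if not pattern.endswith("*"):
        # No wildcard: a boolean term either matches exactly or not at all.
        term = prefix + pattern
        i = bisect_left(sorted_terms, term)
        found = i < len(sorted_terms) and sorted_terms[i] == term
        return [term] if found else []

    # Making "*" special means a boolean term which really ends in "*"
    # can no longer be searched for literally.
    stem = prefix + pattern[:-1]
    matches = []
    # Terms sharing a common prefix are contiguous in a sorted term list.
    for term in sorted_terms[bisect_left(sorted_terms, stem):]:
        if not term.startswith(stem):
            break
        matches.append(term)
        if max_expansion and len(matches) > max_expansion:
            raise ValueError(
                "wildcard expands to more than %d terms" % max_expansion)
    return matches


# Made-up index contents: three blob names and one pathname term.
terms = sorted([
    "XBLOBbadc0ffee",
    "XBLOBbadc0de42",
    "XBLOBdeadbeef",
    "XPATHREADME",
])

print(expand_wildcard(terms, "XBLOB", "badc0*"))
# ['XBLOBbadc0de42', 'XBLOBbadc0ffee']
print(expand_wildcard(terms, "XBLOB", "badc0ffee"))
# ['XBLOBbadc0ffee']
```

The matching terms would then be OR-ed together and applied as a
filter. The sketch raises an error when the limit is exceeded, since
for a filter there is no obvious way to pick which terms to keep.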