Formulating Advanced Queries with Xapian-Omega
Olly Betts
olly at survex.com
Wed Jan 4 02:08:29 GMT 2017
On Thu, Dec 29, 2016 at 05:44:50PM +0100, Giulio Teslano wrote:
> a. What other types of extended wild card(s) options are there ?
>
> or is this still currently limited to these two characters '*?' ?
As I said, the branch "adds support for arbitrary glob-style wildcard
patterns (where * matches 0 or more characters and ? a single
character)".
> b. Apart from 0 or more and single char options are there any other
> options ?
Not that are currently implemented on that branch.
> Were you suggesting that one possibility would be trying something
> similar to :
>
> isbn:?-???-?????-? as a very loose general query for ISBNs ?
>
> (so long as the option is enabled).
It seems you must be talking about the query the user would write here,
but then I'm not sure what the "isbn:" prefix would map to.
But yes, that's the sort of pattern you'd have to use.
One wrinkle with this is that (assuming you use the Xapian::TermGenerator
class) "-" is a word separator character at index time - i.e. you'll get
four separate terms (1, 234, 56789 and 0) from 1-234-56789-0, and
OP_WILDCARD only matches within a term. So you need different word
splitting behaviour for this to work, which currently means you'll need
to do it yourself instead of using TermGenerator, as its word splitting
isn't currently configurable.
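
To make the word-splitting issue concrete, here is a minimal Python sketch. It does not use Xapian at all: split_on_non_alnum is only a rough stand-in for the default splitting described above, split_keeping_hyphens is a hypothetical custom splitter, and the standard fnmatch module stands in for glob matching against individual terms.

```python
import re
from fnmatch import fnmatchcase


def split_on_non_alnum(text):
    """Rough stand-in for default word splitting: '-' separates terms."""
    return re.findall(r"[0-9A-Za-z]+", text.lower())


def split_keeping_hyphens(text):
    """Hypothetical custom splitter: '-' between alphanumerics stays in the term."""
    return re.findall(r"[0-9A-Za-z]+(?:-[0-9A-Za-z]+)*", text.lower())


def terms_matching(pattern, terms):
    """A glob is tested against each term on its own, never across terms."""
    return [t for t in terms if fnmatchcase(t, pattern)]


text = "See ISBN 1-234-56789-0 for details"
pattern = "?-???-?????-?"

default_terms = split_on_non_alnum(text)
custom_terms = split_keeping_hyphens(text)

print(default_terms)                           # the ISBN is broken into 4 terms
print(terms_matching(pattern, default_terms))  # nothing matches
print(terms_matching(pattern, custom_terms))   # the whole ISBN matches
```

With the default-style splitting no single term contains a hyphen, so the pattern can never match; only the custom splitter produces a term the pattern can apply to.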
> 1 Could you mention how one enables and can take advantage of your
> extended option in Omega and/or Xapian ? (working example ?)
Currently you need to use one of the WILDCARD_PATTERN_* constants
when constructing an OP_WILDCARD Query object, e.g.:
    Xapian::Query wild(Xapian::Query::OP_WILDCARD, "?-???-?????-?", 0,
                       Xapian::Query::WILDCARD_PATTERN_GLOB);
There isn't yet any integration into omega (or even into
Xapian::QueryParser). Such are the pitfalls of using code from unmerged
branches, I'm afraid.
> 2 The ? Wild Char is for general characters, is it not ?
>
> ie. It cannot distinguish between digits and letters and thus cannot
> act as a RE \d or [0-9] ?
"?" matches any single character. The project this branch is for only
required allowing "*" anywhere in a term (rather than it only being
supported at the end) and adding support for "?", so there's not
currently a plan to support pattern styles other than globbing, or
additional glob-style patterns. The flags to control this were picked
such that either could be done in the future.
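
The glob semantics described here can be illustrated with a small Python sketch. This is not how Xapian implements OP_WILDCARD; glob_to_regex and glob_match are hypothetical helper names, and the translation to a regular expression is just a convenient way to show that "*" means zero or more characters, "?" means exactly one character of any kind, and everything else is literal.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob supporting only '*' and '?' into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character, of any kind
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def glob_match(pattern, term):
    """True if the whole term matches the glob pattern."""
    return glob_to_regex(pattern).fullmatch(term) is not None


print(glob_match("?-???-?????-?", "1-234-56789-0"))  # digits match
print(glob_match("?-???-?????-?", "a-bcd-efghi-j"))  # so do letters
print(glob_match("wi*rd", "wildcard"))               # '*' in the middle of a term
```

Because "?" accepts any character, the ISBN-shaped pattern also accepts letters in every position, which is why it cannot stand in for \d or [0-9].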
> > If you have particular "code" patterns which are important in your domain,
> > I'd consider pulling them out at index time and adding them as a filter
> > term
>
> It is no doubt due to my lack of understanding but how would this
> interesting option 'pulling them out at index time ...' be implemented
> ?
For example in Perl, at index time:
    while ($text =~ /(\b\d-\d{3}-\d{5}-[\dX]\b)/g) {
        $doc->add_boolean_term("XISBN$1");
    }
With this approach, you could also easily do additional validation (such
as checking the check digit for codes which have one, as ISBNs do).
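
As a sketch of that extra validation (in Python rather than Perl, and without any Xapian calls), the ISBN-10 check digit can be verified before a filter term is produced. The names isbn10_is_valid and isbn_terms are hypothetical; the regex and the XISBN prefix mirror the indexing example, and the returned strings are what you would then pass to add_boolean_term.

```python
import re

# Same shape as the Perl pattern: d-ddd-ddddd-c, where c is a digit or X.
ISBN10_RE = re.compile(r"\b\d-\d{3}-\d{5}-[\dX]\b", re.ASCII)


def isbn10_is_valid(isbn):
    """Check the ISBN-10 check digit: weights 10..1, sum divisible by 11."""
    chars = isbn.replace("-", "")
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X":
            if position != 9:  # X (value 10) is only allowed as the check digit
                return False
            value = 10
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def isbn_terms(text):
    """Return prefixed filter terms for every valid ISBN-10 found in text."""
    return ["XISBN" + m.group(0)
            for m in ISBN10_RE.finditer(text)
            if isbn10_is_valid(m.group(0))]


print(isbn_terms("Compare 0-306-40615-2 with 1-234-56789-0 and 1-234-56789-X."))
```

Note that the placeholder 1-234-56789-0 used in this thread has the right shape but fails the check digit test, so a validating indexer would skip it.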
Then at query time:
    $queryparser->add_boolean_prefix("isbn", "XISBN");
Then the user can use isbn:1-234-56789-0 to filter only documents
mentioning that ISBN. Or if you want to be able to find documents
which mention any ISBN (or anything which looks like one) then:
    if ($text =~ /(\b\d-\d{3}-\d{5}-[\dX]\b)/) {
        $doc->add_boolean_term("XHASisbn");
    }
Then at query time:
    $queryparser->add_boolean_prefix("has", "XHAS");
And then the user can filter a search by: has:isbn
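
To show how the two prefix registrations relate user-facing filters to the terms added at index time, here is a toy Python model. It is not the QueryParser API: BOOLEAN_PREFIXES and filter_term are hypothetical names, and the real parser does considerably more than this string concatenation.

```python
# User-visible field name -> term prefix, as registered in the examples above.
BOOLEAN_PREFIXES = {"isbn": "XISBN", "has": "XHAS"}


def filter_term(user_filter):
    """Map 'field:value' to the index term it filters on, or None if unknown."""
    field, sep, value = user_filter.partition(":")
    if not sep or field not in BOOLEAN_PREFIXES:
        return None
    return BOOLEAN_PREFIXES[field] + value


print(filter_term("isbn:1-234-56789-0"))
print(filter_term("has:isbn"))
```

The point is only that the term the user's filter maps to must be byte-for-byte the term the indexer added, so the index-time and query-time prefixes have to agree.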
> It would be very useful if there were some working examples in
> relation to these themes, (at least for those less expert than the
> xapian developer level). Xapian-Omega appears to be a very interesting
> solution and with an RE option it would be one of the most flexible
> and versatile SEs currently available on the net
I suspect that most end users wanting "regexp search" don't just want to
search for terms matching a specified regexp (which is how OP_WILDCARD
inherently works), but rather to perform regexp matches over the whole
document (like https://codesearch.debian.net/ does for source code).
To do that efficiently you need a different index structure.
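
One index structure commonly used for this kind of whole-document regexp search is a trigram index; the thread itself doesn't say which structure is meant, so treat that as an assumption. A minimal Python sketch follows: the index narrows the candidate documents using trigrams of a literal substring, and the regexp is then run only over those candidates. The required_literal parameter is a hypothetical simplification; real systems derive the required trigrams from the regexp itself.

```python
import re
from collections import defaultdict


def trigrams(s):
    """All overlapping 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(docs, index, regex, required_literal):
    """Filter candidates by trigram, then confirm with the real regexp."""
    candidates = set(docs)
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    compiled = re.compile(regex)
    return sorted(d for d in candidates if compiled.search(docs[d]))


docs = {
    1: "int main(void) { return 0; }",
    2: "def main(): return 0",
    3: "print('hello')",
}
index = build_index(docs)
print(search(docs, index, r"main\(\w*\)", "main("))
print(search(docs, index, r"main\(void\)", "main("))
```

The trigram lookup is only a pre-filter: it can return documents that don't match, which is why the regexp is still applied to each candidate.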
Cheers,
Olly