Formulating Advanced Queries with Xapian-Omega

Olly Betts olly at survex.com
Wed Jan 4 02:08:29 GMT 2017


On Thu, Dec 29, 2016 at 05:44:50PM +0100, Giulio Teslano wrote:
> a. What other types of extended wild card(s) options are there ?
> 
>    or is this still currently limited to these two characters '*?' ?

As I said, the branch "adds support for arbitrary glob-style wildcard
patterns (where * matches 0 or more characters and ? a single
character)".
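
To make those semantics concrete, here's a minimal standalone sketch of
glob matching against a single term.  It is not the code from the branch,
just an illustration; the name glob_match is made up for this example and
it compares bytes rather than Unicode characters.

```cpp
#include <cstddef>
#include <string>

// Returns true if the whole of `text` matches the glob `pattern`, where
// '*' matches zero or more characters and '?' matches exactly one.
// Hypothetical helper for illustration; works on bytes, so a multi-byte
// UTF-8 character would count as several characters here.
bool glob_match(const std::string& pattern, const std::string& text) {
    const std::size_t npos = std::string::npos;
    std::size_t p = 0, t = 0;
    std::size_t star = npos;  // index in pattern of the most recent '*'
    std::size_t mark = 0;     // index in text where that '*' started matching
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            // Remember the '*' and first try matching it against nothing.
            star = p++;
            mark = t;
        } else if (p < pattern.size() &&
                   (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (star != npos) {
            // Mismatch: let the last '*' swallow one more character.
            p = star + 1;
            t = ++mark;
        } else {
            return false;
        }
    }
    // Any trailing '*' in the pattern can match the empty string.
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}
```

So glob_match("?-???-?????-?", "1-234-56789-0") is true, but the same
pattern does not match the shorter term "234".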

> b. Apart from 0 or more and single char options are there any other
> options ?

Not that are currently implemented on that branch.

> Were you suggesting that one possibility would be trying something
> similar to : 
> 
> isbn:?-???-?????-? as a very loose general query for ISBNs ?
> 
> (so long as the option is enabled).

It seems you must be talking about the query the user would write here,
but then I'm not sure what the "isbn:" prefix would map to.

But yes, that's the sort of pattern you'd have to use.

One wrinkle with this is that (assuming you use the Xapian::TermGenerator
class) "-" is a word separator character at index time - i.e. you'll get
four terms from 1-234-56789-0 and OP_WILDCARD only matches within a term.  So
you need different word splitting behaviour for this to work, which
currently means you'll need to do it yourself instead of using
TermGenerator as that isn't currently configurable.
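
As a rough idea of what "doing it yourself" might look like, here's a
sketch of a splitter which keeps "-" inside a term.  The function name
split_keeping_hyphens is invented for this example, it only understands
ASCII, and it does no case folding or stemming, so treat it as a starting
point rather than a replacement for TermGenerator.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split `text` into terms, treating ASCII letters, digits and '-' as term
// characters.  Hyphens at either end of a term are trimmed, so "1-234" is
// kept whole but a free-standing "--" produces nothing.
// Hypothetical helper for illustration; ASCII only.
std::vector<std::string> split_keeping_hyphens(const std::string& text) {
    std::vector<std::string> terms;
    std::string current;
    auto flush = [&]() {
        std::size_t b = current.find_first_not_of('-');
        if (b != std::string::npos) {
            std::size_t e = current.find_last_not_of('-');
            terms.push_back(current.substr(b, e - b + 1));
        }
        current.clear();
    };
    for (unsigned char ch : text) {
        if (std::isalnum(ch) || ch == '-') {
            current += static_cast<char>(ch);
        } else {
            flush();  // any other byte ends the current term
        }
    }
    flush();
    return terms;
}
```

You'd then add each returned term to the Xapian::Document yourself (for
example with add_term()), and 1-234-56789-0 ends up as a single term which
a wildcard pattern can match against.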

> 1 Could you mention how one enables and can take advantage of your
> extended option in Omega and/or Xapian ? (working example ?)

Currently you need to use one of the WILDCARD_PATTERN_* constants
when constructing an OP_WILDCARD Query object, e.g.:

	Xapian::Query wild(Xapian::Query::OP_WILDCARD, "?-???-?????-?", 0,
			   Xapian::Query::WILDCARD_PATTERN_GLOB);

There isn't yet any integration into omega (or even into
Xapian::QueryParser).  Such are the pitfalls of using code from unmerged
branches I'm afraid.

> 2 The ? Wild Char is for general characters, is it not ?
> 
>   ie. It cannot distinguish between digits and letters and thus cannot
> act as a RE \d or [0-9] ?

"?" matches any single character.  The project this branch is for only
required allowing "*" anywhere in a term (rather than it only being
supported at the end) and adding support for "?", so there's not
currently a plan to support pattern styles other than globbing, or
additional glob-style patterns.  The flags to control this were picked
such that either could be done in the future.

> > If you have particular "code" patterns which are important in your domain,
> > I'd consider pulling them out at index time and adding them as a filter
> > term
> 
> It is no doubt due to my lack of understanding but how would this
> interesting option 'pulling them out at index time ...' be implemented
> ?

For example in Perl, at index time:

	while ($text =~ /(\b\d-\d{3}-\d{5}-[\dX]\b)/g) {
	    $doc->add_boolean_term("XISBN$1");
	}

With this approach, you could also easily do additional validation (such
as checking the check digit for codes which have one, as ISBNs do).

Then at query time:

        $queryparser->add_boolean_prefix("isbn", "XISBN");

Then the user can use isbn:1-234-56789-0 to filter only documents
mentioning that ISBN.  Or if you want to be able to find documents
which mention any ISBN (or anything which looks like one) then:

	if ($text =~ /(\b\d-\d{3}-\d{5}-[\dX]\b)/) {
	    $doc->add_boolean_term("XHASisbn");
	}

Then at query time:

        $queryparser->add_boolean_prefix("has", "XHAS");

And then the user can filter a search by: has:isbn
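
If your indexer is in C++ rather than Perl, the same idea with the check
digit validation added might look roughly like this.  It's a sketch using
only the standard library: the function names are invented for this
example, it assumes the same 1-3-5-1 hyphenation as the Perl regexp above,
and unlike the Perl version it only emits XHASisbn when at least one
candidate passes the ISBN-10 check.

```cpp
#include <regex>
#include <string>
#include <vector>

// True if `isbn` (e.g. "0-306-40615-2") has a valid ISBN-10 check digit:
// the digits weighted 10 down to 1 must sum to a multiple of 11, with 'X'
// standing for 10 in the last position only.  Hypothetical helper.
bool isbn10_check_digit_ok(const std::string& isbn) {
    int sum = 0;
    int weight = 10;
    for (char ch : isbn) {
        if (ch == '-') continue;
        if (weight == 0) return false;  // more than ten digits
        int value;
        if (ch >= '0' && ch <= '9') {
            value = ch - '0';
        } else if (ch == 'X' && weight == 1) {
            value = 10;
        } else {
            return false;
        }
        sum += weight * value;
        --weight;
    }
    return weight == 0 && sum % 11 == 0;
}

// Return the boolean filter terms to add for `text`: one "XISBN..." term
// per valid-looking ISBN, plus "XHASisbn" if there was at least one.
// Hypothetical helper; pass each result to Document::add_boolean_term().
std::vector<std::string> isbn_filter_terms(const std::string& text) {
    static const std::regex pattern(R"(\b\d-\d{3}-\d{5}-[\dX]\b)");
    std::vector<std::string> terms;
    std::sregex_iterator it(text.begin(), text.end(), pattern), end;
    for (; it != end; ++it) {
        const std::string isbn = it->str();
        if (isbn10_check_digit_ok(isbn)) terms.push_back("XISBN" + isbn);
    }
    if (!terms.empty()) terms.push_back("XHASisbn");
    return terms;
}
```

Note that 1-234-56789-0 as used in the examples here doesn't actually
have a valid check digit, so this version would skip it.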

> It would be very useful if there were some working examples in
> relation to these themes, (at least for those less expert than the
> xapian developer level). Xapian-Omega appears to be a very interesting
> solution and with an RE option it would be one of the most flexible
> and versatile SEs currently available on the net

I suspect that most end users wanting "regexp search" don't just want to
search for terms matching a specified regexp (which is how OP_WILDCARD
inherently works), but rather to perform regexp matches over the whole
document (like https://codesearch.debian.net/ does for source code).
To do that efficiently you need a different index structure.

Cheers,
    Olly



More information about the Xapian-discuss mailing list