[Xapian-discuss] Search queries with wildcards

Olly Betts olly at survex.com
Wed Dec 15 16:31:11 GMT 2004


On Wed, Dec 15, 2004 at 11:53:08AM +0100, rm at fabula.de wrote:
> On Wed, Dec 15, 2004 at 11:21:57AM +0100, Timo Haberkern wrote:
> > A1590-789
> > A1590-555
> > A6719-9911
> > 
> > Where the first 5 characters are an article-group identifier. The user 
> > should be capable to search for all documents with articles of an 
> > arcticle group. Therefore he should be able to use for exmpample the 
> > search query: "A1590*" or "A1590-*"
> 
> That's what i feared :-) Such a technique is _often_ used in the database
> world to "fake" a semantic query - and a lot of customers/users are so 
> used to "prefix" searching that they'll expect all IR systems to work like 
> this.

This is sadly true.  Wildcarding has been so often used as a crude
one-size-fits-all solution to the problems of word inflection, compound
words, and misspellings that now often people expect it as a feature in
its own right.

> IMHO there's a major design flaw here: A1590-789  encapsulates 
> _two_ TERMS (Article-group is 'A1590' AND Article ID is '555').

Omega will index is as two terms - "a1590" and "555", so a search for
"A1590*" will just search for "a1590" pull up the documents wanted.
Searching for "A1590-789" produces a phrase search, and again works
as expected.

> Hard to do. That kind of query is hard to handle for all sorts of
> indexing (even in a RDBMS where such stuff is handled better). As long
> as you can restrict your queries to right _or_ left6  truncation (i.e.
> either "Blah*" _or_ "*Blah") one can use a (b)tree/suffix-array etc. index
> scheme, but if you need "*blah*" or "bl*h" you pretty much end up with a
> full scan of all terms.

"bl*h" can be done by pulling up all terms starting "bl" and filtering
for those matching the pattern.  But wildcards at both ends is harder.
You end up needing a second index of of all terms in the document by
their n-grams or something like that.

> One thing you _could_ do (but it's b*t-uggly): parse your query and
> expand all wildcards to alternative terms from the index.
> So a query for "par*" would trigger a scan over the index that expands the
> query into "parent OR part OR partner OR ...".

This is what I'd recommend doing.  Using skip_to("par") means you only
iterate over the keys of the postlist tree you are going to be pulling
postlists from anyway, and the multi-way OR should be handled fairly
efficiently by the matcher.

Cheers,
    Olly



More information about the Xapian-discuss mailing list