[Xapian-discuss] Search queries with wildcards

Olly Betts olly at survex.com
Wed Dec 15 11:13:27 GMT 2004


On Wed, Dec 15, 2004 at 11:06:45AM +0100, Timo Haberkern wrote:
> You are right about that, but the problem is how to detect the acceptable 
> fragments of a compound word. The application has to index technical 
> documents and there are many, many, many (...) compound words that never 
> occur in any dictionary. At the moment I don't see a practical way to 
> solve this problem as you described. Or do you have an idea how to do so?

The compound words don't need to appear in any dictionary - just the
words that they consist of.  In fact you can probably use the
non-appearance of the compound word as a criterion for treating it
as a compound word!  For example, "Fußball" has evolved past
being "Fuß" + "Ball", and it probably isn't useful to treat it as a
compound word.  There are no doubt better examples, but my German is rather
limited.

Compound word splitting algorithms exist, though I don't know of any
open source ones.
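
As a rough illustration of the idea above (a word that is in the
vocabulary is left alone, a word that is not is split into known
parts), here is a minimal sketch.  The function name, the `min_part`
limit and the tiny vocabulary are assumptions made up for the example.
It ignores linking letters between parts and is not a real compound
splitter.

```python
def split_compound(word, vocabulary, min_part=3):
    """Return the known parts of `word`, or [word] if it should stay whole."""
    word = word.lower()
    # A word that appears in the vocabulary itself is not treated as a compound.
    if word in vocabulary:
        return [word]

    def parts_from(start):
        # Try to cover word[start:] entirely with vocabulary entries.
        if start == len(word):
            return []
        # Try the longest candidate part first, never shorter than min_part.
        for end in range(len(word), start + min_part - 1, -1):
            if word[start:end] in vocabulary:
                rest = parts_from(end)
                if rest is not None:
                    return [word[start:end]] + rest
        return None  # no complete split from this position

    parts = parts_from(0)
    return parts if parts else [word]


# Hypothetical vocabulary, for illustration only.
vocab = {"daten", "bank", "datenbank", "server", "fuß", "ball", "fußball"}
print(split_compound("Datenbankserver", vocab))  # ['datenbank', 'server']
print(split_compound("Fußball", vocab))          # ['fußball']
```

Each part could then be indexed as an extra term alongside the full
word.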

> I'm not sure if I understood that right. Is the only possible way to 
> implement wildcards to store all possible substrings in the 
> index database? So if I have the word "car" I need to store in the 
> database:
> 
> - "c"
> - "ca"
> - "car"
> 
> for doing a simple "c*" wildcard search?

No.  Create a TermIterator and call skip_to("c") on it.  Then read terms
until you get to one which doesn't start with "c".  Add them all to the
query, combined with "OR".

At least one person has implemented this - see the list archives for
code.
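
The logic of that loop can be sketched without Xapian at all.  In the
example below a sorted Python list stands in for the database's term
list, and `bisect_left` plays the role of skip_to(); the function name
and the sample terms are assumptions for illustration, not Xapian API.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix):
    """Return every term in the sorted list that starts with `prefix`."""
    matches = []
    # Stand-in for skip_to(prefix): jump to the first term >= prefix.
    i = bisect_left(sorted_terms, prefix)
    # Terms sharing a prefix are contiguous in sorted order, so stop at
    # the first term that no longer starts with it.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


terms = sorted(["bus", "cab", "car", "cart", "dog"])
print(expand_prefix(terms, "c"))  # ['cab', 'car', 'cart']
```

The returned terms are what would be combined with "OR" in the final
query.  Only the terms that match are read, so nothing extra has to be
stored in the index.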

Wildcarding at the front (e.g. "*c") can also be done, but unless your
database is small, you'll want to index all terms in reversed form as
well.
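
A minimal sketch of the reversed-term trick, again using a sorted
Python list as a stand-in for the term list (the function names and
sample terms are made up for the example): a search for "*ar" becomes a
prefix search for "ra" over the reversed terms.

```python
from bisect import bisect_left


def build_reversed_terms(terms):
    """Return the terms reversed character by character, in sorted order."""
    return sorted(term[::-1] for term in terms)


def expand_suffix(reversed_terms, suffix):
    """Return every original term that ends with `suffix`."""
    key = suffix[::-1]  # "*ar" turns into a prefix search for "ra"
    matches = []
    i = bisect_left(reversed_terms, key)
    while i < len(reversed_terms) and reversed_terms[i].startswith(key):
        # Reverse again to recover the term as it was originally indexed.
        matches.append(reversed_terms[i][::-1])
        i += 1
    return matches


reversed_terms = build_reversed_terms(["car", "bar", "cat", "music"])
print(expand_suffix(reversed_terms, "ar"))  # ['bar', 'car']
```

Without the reversed copies, every term in the database would have to
be examined to find those ending in "ar", which is why this only
matters once the database is not small.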

Note that you probably want to limit how short the stem can be.  A
search for "a*" "e*" "t*" "o*" would provide an easy way for someone
malicious to overload your search otherwise.
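
One simple way to enforce such a limit is to check the wildcard term
before expanding it.  The sketch below is only an illustration: the
function name and the minimum length of 3 are assumptions, and the
right threshold depends on your data.

```python
MIN_PREFIX_LENGTH = 3  # assumed value; tune for your own database


def checked_prefix(wildcard_term, min_length=MIN_PREFIX_LENGTH):
    """Return the prefix of a 'prefix*' term, rejecting ones that are too short."""
    if not wildcard_term.endswith("*"):
        raise ValueError("not a trailing-wildcard term: %r" % wildcard_term)
    prefix = wildcard_term[:-1]
    if len(prefix) < min_length:
        # Refuse to expand: a very short prefix could match a huge number of terms.
        raise ValueError("wildcard prefix %r is shorter than %d characters"
                         % (prefix, min_length))
    return prefix


print(checked_prefix("car*"))  # car
```

With this check a query such as "a*" is rejected before any terms are
read.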

Cheers,
    Olly


