[Xapian-discuss] *wildcard* support?

Eric Parusel eparusel at creativens.com
Sun Oct 9 21:46:07 BST 2005


Olly Betts wrote:
> So the reason we don't have it is that it's harder to implement
> efficiently, and not commonly requested.

OK, thanks.  I figured it would hurt efficiency, especially on larger
bodies of text.  Fortunately I'm looking for something that works on
relatively small bodies of text (some metadata).


>>I would generally only use it for shorter strings, like email headers.
>>I *could* potentially use right-truncating in sequence as below, but I'm
>>not sure if this is too insane.
>>terms indexed: Axapian Aapian Apian Aian Aan
>>search for: A<user_inputted_partial_term>*
>>Which would obviously use a lot of terms!
> 
> Doing the work at index time is the right approach, but the trick to use
> is to index your terms reversed and use right truncation!
> 
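
A minimal sketch of the reversed-term trick described above, in plain
Python rather than against the Xapian API.  The function names and the
sorted-list "index" are assumptions made for illustration; the point is
only that a left-truncated query ("*ian") becomes a prefix lookup once
every term is stored reversed.

```python
import bisect


def build_reversed_index(terms):
    """Return the distinct terms reversed and sorted, so suffixes become prefixes."""
    return sorted(term[::-1] for term in set(terms))


def left_truncated_search(reversed_index, suffix):
    """Find all terms ending in `suffix`, i.e. the query "*suffix"."""
    prefix = suffix[::-1]
    # Jump to the first reversed term that could start with the prefix.
    start = bisect.bisect_left(reversed_index, prefix)
    matches = []
    for rev in reversed_index[start:]:
        if not rev.startswith(prefix):
            break  # sorted order: no later entry can match either
        matches.append(rev[::-1])
    return sorted(matches)


index = build_reversed_index(["xapian", "debian", "index", "martian", "ian"])
print(left_truncated_search(index, "ian"))
```

This only covers left truncation ("*ian").  A pattern with wildcards on
both sides ("*pia*") needs something else, such as the n-gram approach.
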
> The best approach is probably n-grams.  You create a second index where
> the "documents" are terms from the main index, and these are indexed by
> n-grams (substrings of length n).  So "xapian" might be indexed by "^x",
> "^xa", "xap", "api", "pia", "ian", "an$", and "n$" (and perhaps all the
> bigrams and monograms too if you're doing this for real rather than
> typing in an example!)
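
To make the n-gram idea concrete, here is a rough Python sketch, not
Xapian code.  It assumes a simpler scheme than the example above: each
term is padded with ^ and $ and split into trigrams only, and the helper
names (ngrams, build_ngram_index, wildcard_search) are made up for this
illustration.  Candidates found through the n-grams are checked against
the fragment afterwards, because sharing the n-grams does not guarantee
that they occur contiguously.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """n-grams of `term`, with ^ and $ marking its start and end."""
    padded = "^" + term + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_ngram_index(terms, n=3):
    """Map each n-gram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def wildcard_search(index, all_terms, fragment, n=3):
    """Terms containing `fragment` anywhere, i.e. the query "*fragment*"."""
    grams = [fragment[i:i + n] for i in range(len(fragment) - n + 1)]
    if not grams:
        # Fragment shorter than n: no n-gram to look up, so scan every term.
        candidates = set(all_terms)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify, since shared n-grams need not be adjacent in the term.
    return sorted(t for t in candidates if fragment in t)


terms = ["xapian", "apiary", "debian", "piano"]
index = build_ngram_index(terms)
print(wildcard_search(index, terms, "apia"))
```

The fallback scan for short fragments is the reason the bigrams and
monograms mentioned above would be worth indexing in a real setup.
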

I'm basically just interested in "is this string in the metadata?".
Monograms might be preferable, but I think they would have the following
pros and cons (e.g. for "From: bob@market.com" the terms are
Fb Fo Fb F@ Fm Fa Fr Fk Fe Ft F. Fc Fo Fm):

pro) Can search for the instance of a single character, if desired

pro) Doesn't need wildcards, but the search becomes a phrase search, so
positional information has to be indexed (is that also needed in your
example above?)

con) No indication of the spaces between words, which you handle above
with ^ and $

con) Takes up a few more terms than bigrams or trigrams, if that's the word

con) The search might be slow: a search for "market" in the From field
becomes the six-term phrase "Fm Fa Fr Fk Fe Ft".  I'm unsure, so I'll
have to do some testing.
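
The monogram scheme above can be sketched in plain Python as follows.
This is a toy positional index, not Xapian: the nested-dict layout and
the names index_field and phrase_search are assumptions for
illustration.  It shows why positions are needed: the second address
contains every letter of "market", but not consecutively.

```python
from collections import defaultdict


def index_field(index, doc_id, prefix, text):
    """Post one prefixed single-character term per character, with its position."""
    for pos, ch in enumerate(text.lower()):
        index[prefix + ch][doc_id].append(pos)


def phrase_search(index, prefix, fragment):
    """Doc ids whose field contains `fragment` as consecutive characters."""
    terms = [prefix + ch for ch in fragment.lower()]
    if not terms:
        return []
    hits = []
    # Try every position of the first character as a possible phrase start.
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + k in index.get(t, {}).get(doc_id, ())
                   for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


# term -> doc id -> list of character positions
index = defaultdict(lambda: defaultdict(list))
index_field(index, 1, "F", "bob@market.com")
index_field(index, 2, "F", "kate@remark.org")
print(phrase_search(index, "F", "market"))
```

Each query character costs one positional-list lookup per candidate
start, which is the part that would need the testing mentioned above.
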


> The same n-gram index can actually be used for spelling correction too -
> search for all the n-grams from a possibly misspelled term and the
> results are a ranked set of terms they might want.
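
A small Python sketch of that spelling-correction use, again
independent of Xapian.  It assumes padded trigrams and ranks candidates
by the raw count of shared n-grams, which is a deliberately crude
stand-in for whatever weighting a real search engine would apply; the
names ngrams, build_ngram_index and suggest are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(term, n=3):
    """n-grams of `term`, with ^ and $ marking its start and end."""
    padded = "^" + term + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_ngram_index(terms, n=3):
    """Map each n-gram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def suggest(index, word, n=3, limit=3):
    """Rank indexed terms by how many n-grams they share with `word`."""
    scores = Counter()
    for gram in ngrams(word, n):
        for term in index.get(gram, ()):
            scores[term] += 1
    # Most shared n-grams first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:limit]]


index = build_ngram_index(["xapian", "debian", "martian", "index"])
print(suggest(index, "xapain"))
```
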

Definitely interesting.  It's not the direction I need to go initially,
but it could have some value in the future.  I'll keep it in mind.


> As for roadmaps, I'm afraid left truncation is low on my list of things
> to work on.  Spelling correction overlaps somewhat and is higher but I'm
> trying to concentrate on flint right now.  But if you want to work on it
> yourself, I'm happy to give pointers on where to hook in to Xapian.  And
> if done cleanly, I think it's something that's worth including in
> Xapian.

If/when I am able to work on a spelling correction feature, I'll get in
touch.

Thanks again for Xapian!
Eric


