[Xapian-discuss] Stopping wildcard expansion at some point

Olly Betts olly at survex.com
Thu Mar 19 23:31:42 GMT 2009


On Fri, Mar 06, 2009 at 10:43:15AM +0100, Adam Sjøgren wrote:
> On Fri, 6 Mar 2009 01:24:17 +0000, Olly wrote:
> 
> > Or perhaps we should allow a lower limit on the number of characters
> > before the wildcard rather than a limit on the number of expansions
> > (so if this limit were 3, a* and ab* wouldn't be allowed, but abc*
> > would).
> 
> This would be a setback in my case where some short wildcards expand
> immensely (and are useless), while other short ones do not - and those
> searches are still valuable.
> 
> I may have a hundred thousand terms starting with MM, but only 20
> starting with A, and it would be sad for me if the user couldn't search
> for A*.

That's probably an extreme case, but for English text it's likely that
e* would expand undesirably far while z* would be fine.  I'm not sure
either is actually useful for English, though.
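
To make the trade-off between the two kinds of limit concrete, here is
a minimal sketch in Python.  It does not use the Xapian API: the index
is modelled as a plain sorted list, and the term data, the function
names and the limits (min_chars, max_terms) are all assumptions made up
for illustration.

```python
from bisect import bisect_left


def expansions(sorted_terms, prefix):
    """Return every term in a sorted list that starts with prefix."""
    matched = []
    # Terms sharing a prefix are contiguous in a sorted list, so start
    # at the first candidate and stop at the first non-match.
    for i in range(bisect_left(sorted_terms, prefix), len(sorted_terms)):
        if not sorted_terms[i].startswith(prefix):
            break
        matched.append(sorted_terms[i])
    return matched


def allowed_by_min_prefix(prefix, min_chars=3):
    """Policy 1: require a minimum number of characters before the '*'."""
    return len(prefix) >= min_chars


def allowed_by_max_expansions(sorted_terms, prefix, max_terms):
    """Policy 2: cap the number of terms the wildcard expands to."""
    return len(expansions(sorted_terms, prefix)) <= max_terms


# Hypothetical index: many terms starting with "MM", only two with "A".
terms = sorted(["A1", "A2"] + ["MM%03d" % i for i in range(500)])
```

With this made-up data, a minimum prefix length of 3 rejects A* even
though it only expands to two terms, whereas a cap of 100 expansions
accepts A* and rejects MM*.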

> > (Hmm, or would the limit be better per parsed query than per wildcard
> > expansion?)
> 
> Yeah, that would probably make more sense - but I don't know how
> significant it is, I get total meltdown just expanding the wildcard for
> a single "bad one", i.e. for the problematic one I never get to actually
> searching.
> 
> Maybe I am biased by not having encountered queries with a lot of
> wildcards yet, though.

I guess it partly depends on whether malicious users are a concern.

On Fri, Mar 06, 2009 at 12:06:52PM +0100, Adam Sjøgren wrote:
> Attached is a patch updated from the feedback (Xapian::termcount,
> QueryParserError, error message) for further consideration.

I'm still wondering how to handle this without ruling out pushing the
wildcard expansion down into the database backends later.  We could
perhaps push this check down with it, but then the rejection
potentially happens rather late on.  Or the check stays where it is and
we end up counting the matches up front whenever this option is on.

Can you attach the patch to a ticket in trac for now, so that it doesn't
get forgotten about?

> I wasn't quite sure how, in the error message, to display the term
> exactly as the user entered it, the closest I found was "unstemmed",
> which hasn't got the '*'.

Yeah, that's probably the best choice (and just append a "*" to it).
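
As a rough illustration of keeping the check in the query parser and
counting the matches up front, the sketch below stops walking the term
list as soon as the limit is exceeded, and builds the error message from
the unstemmed form with a "*" appended.  This is Python over a plain
sorted list, not the Xapian API or the attached patch:
WildcardLimitError, expand_wildcard, max_expansion and the wording of
the message are all hypothetical.

```python
from bisect import bisect_left


class WildcardLimitError(Exception):
    """Hypothetical stand-in for a query parser error."""


def expand_wildcard(sorted_terms, unstemmed, max_expansion=0):
    """Expand unstemmed* against a sorted term list.

    max_expansion == 0 means "no limit" (an assumption of this sketch).
    """
    matched = []
    # Terms sharing a prefix are contiguous in a sorted list.
    for i in range(bisect_left(sorted_terms, unstemmed), len(sorted_terms)):
        term = sorted_terms[i]
        if not term.startswith(unstemmed):
            break
        if max_expansion and len(matched) >= max_expansion:
            # One term too many: give up without walking the rest.
            raise WildcardLimitError(
                "Wildcard %s* expands to more than %d terms"
                % (unstemmed, max_expansion))
        matched.append(term)
    return matched
```

Raising on the first term past the limit means a "bad" wildcard costs
at most max_expansion + 1 term lookups rather than a full expansion.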

Cheers,
    Olly


