[Xapian-discuss] Stopping wildcard expansion at some point

Adam Sjøgren asjo at koldfront.dk
Fri Mar 6 09:43:15 GMT 2009


On Fri, 6 Mar 2009 01:24:17 +0000, Olly wrote:

> One potential problem - it might be good to push the wildcard expansion
> into the backend, as it may be able to optimise it better that way (it
> should at least be able to avoid generating a Query object per expanded
> term):

> http://trac.xapian.org/ticket/48

Ah, yes, that sounds like a nice improvement.

[...]

> Perhaps this is fixing the symptom rather than the problem - if these are
> actually useful searches we're rejecting, rather than for example
> accidental invocation of the wildcard facility, or malicious users, it
> would be better to make these cases work more efficiently than impose a
> somewhat arbitrary limit.

In my (perhaps special) case, the search isn't really useful (it returns
pretty much all the data), but it is nevertheless an obvious "mistake"
for a user who doesn't know the underpinnings to make.

I agree that it is quite arbitrary to stop at a certain maximum number,
but I don't think I'm really up to tackling the larger problem of prefix
terms:

> For example, by storing prefix terms:

> http://trac.xapian.org/ticket/207

That looks cool too.

> Or perhaps we should allow a lower limit on the number of characters
> before the wildcard rather than a limit on the number of expansions
> (so if this limit were 3, a* and ab* wouldn't be allowed, but abc*
> would).

This would be a setback in my case where some short wildcards expand
immensely (and are useless), while other short ones do not - and those
searches are still valuable.

I may have a hundred thousand terms starting with MM, but only 20
starting with A, and it would be sad for me if the user couldn't search
for A*.
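
To make the trade-off concrete, here is a minimal C++ sketch comparing
the two policies. It is not the patch under discussion and does not use
the Xapian API: TermList, allowed_by_min_prefix and
allowed_by_max_expansion are hypothetical names, a std::set stands in
for the index's sorted term list, and the term counts are made up (far
smaller than the hundred thousand mentioned above).

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the sorted list of terms in an index.
using TermList = std::set<std::string>;

// Policy A: require at least min_chars characters before the '*'.
bool allowed_by_min_prefix(const std::string& prefix, std::size_t min_chars)
{
    return prefix.size() >= min_chars;
}

// Policy B: allow the wildcard only if it expands to at most
// max_expansion terms. The walk stops as soon as the limit is exceeded,
// so a huge expansion is never fully enumerated.
bool allowed_by_max_expansion(const TermList& terms,
                              const std::string& prefix,
                              std::size_t max_expansion)
{
    std::size_t seen = 0;
    for (auto it = terms.lower_bound(prefix);
         it != terms.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        if (++seen > max_expansion) return false;
    }
    return true;
}

// Usage example with made-up data mirroring the "MM" versus "A" case.
void demo()
{
    TermList terms;
    for (int i = 0; i < 1000; ++i) terms.insert("MM" + std::to_string(i));
    for (int i = 0; i < 20; ++i) terms.insert("A" + std::to_string(i));

    // A minimum of 3 characters rejects the cheap, useful A*.
    assert(!allowed_by_min_prefix("A", 3));
    // A cap of 100 expansions keeps A* and rejects MM*.
    assert(allowed_by_max_expansion(terms, "A", 100));
    assert(!allowed_by_max_expansion(terms, "MM", 100));
}
```

With a length-based rule both A* and MM* are refused regardless of how
many terms they match; with a count-based rule only the expensive one
is.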

>> Do I need to adjust something?

> It should use Xapian::termcount rather than long.

Ah, I should have looked around more. (I started by choosing unsigned
int, but that failed for some reason I didn't understand, so I switched
to long and got the package to build.)

> The error class should be QueryParserError - InvalidOperationError
> "indicates the API was used in an invalid way", which isn't the case
> here.

Too much copy/paste, too little reading/thinking. Thanks!

> The error message is likely to be shown to the user, so should really
> mention the wildcard expansion which was the problem, in case there's
> more than one in the query, or the user doesn't know about the wildcard
> syntax and accidentally invoked the feature.

Oh. Ah. Yes. I simply report the entire query in the application and the
user has to guess which one is the problem if there is more than one
wildcard in the query. Which isn't that friendly.
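
A rough sketch of what a friendlier message could look like, assuming a
simplified query syntax where wildcards are whitespace-separated words
ending in '*'. Everything here is illustrative: WildcardLimitError,
count_up_to and check_wildcards are hypothetical names, and a std::set
again stands in for the term list rather than a real Xapian database.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

using TermList = std::set<std::string>;  // hypothetical stand-in for the index

// Hypothetical error type. In Xapian itself this role would be played by
// Xapian::QueryParserError, as suggested in the thread.
struct WildcardLimitError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Count terms starting with prefix, stopping once the count exceeds limit.
std::size_t count_up_to(const TermList& terms, const std::string& prefix,
                        std::size_t limit)
{
    assert(!prefix.empty());  // a bare "*" would match every term
    std::size_t n = 0;
    for (auto it = terms.lower_bound(prefix);
         it != terms.end() && n <= limit &&
         it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        ++n;
    }
    return n;
}

// Check each "word*" in the query and name the offending one in the error.
void check_wildcards(const TermList& terms, const std::string& query,
                     std::size_t max_expansion)
{
    std::istringstream words(query);
    std::string word;
    while (words >> word) {
        if (word.size() < 2 || word.back() != '*') continue;
        const std::string prefix = word.substr(0, word.size() - 1);
        if (count_up_to(terms, prefix, max_expansion) > max_expansion) {
            throw WildcardLimitError("Wildcard " + word +
                                     " expands to more than " +
                                     std::to_string(max_expansion) + " terms");
        }
    }
}

// Usage example with made-up terms: the message for "a* mm* foo".
std::string demo_message()
{
    TermList terms = {"a1", "a2", "foo"};
    for (int i = 0; i < 50; ++i) terms.insert("mm" + std::to_string(i));
    try {
        check_wildcards(terms, "a* mm* foo", 10);
    } catch (const WildcardLimitError& e) {
        return e.what();
    }
    return "";
}
```

The point is only that the message carries the wildcard itself (here
mm*), so the user does not have to guess which part of the query was
rejected.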

> (Hmm, or would the limit be better per parsed query than per wildcard
> expansion?)

Yeah, that would probably make more sense - but I don't know how
significant it is: I get a total meltdown just expanding the wildcard
for a single "bad one", i.e. for the problematic one I never even get to
the actual search.

Maybe I am biased by not having encountered queries with a lot of
wildcards yet, though.
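
For comparison, a per-query limit could be sketched as a single budget
shared by all wildcards in the query. This is only an illustration
under the same assumptions as before: expand_all is a hypothetical
helper, the wildcard prefixes are assumed to have been extracted from
the query already, and a std::set stands in for the term list.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

using TermList = std::set<std::string>;  // hypothetical stand-in for the index

// Expand every wildcard prefix of one query against one shared budget.
// Overlapping prefixes (e.g. "a" and "ab") are not de-duplicated here.
std::vector<std::string> expand_all(const TermList& terms,
                                    const std::vector<std::string>& prefixes,
                                    std::size_t max_total)
{
    std::vector<std::string> expanded;
    for (const std::string& prefix : prefixes) {
        assert(!prefix.empty());  // a bare "*" would match every term
        for (auto it = terms.lower_bound(prefix);
             it != terms.end() && it->compare(0, prefix.size(), prefix) == 0;
             ++it) {
            if (expanded.size() == max_total) {
                throw std::length_error(
                    "Query expands to more than " + std::to_string(max_total) +
                    " terms (while expanding " + prefix + "*)");
            }
            expanded.push_back(*it);
        }
    }
    return expanded;
}
```

With six terms under a* and six under b*, a budget of 10 would accept
either wildcard alone but reject a query containing both, which is the
difference from a per-wildcard limit.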


Thanks for the feedback!


  Best regards,

    Adam

-- 
 "Soon we'll have spent a whole month at sea,                 Adam Sjøgren
  splitting atoms for no apparent reason"                asjo at koldfront.dk



