[Xapian-discuss] Spaces in bool prefixes

Rick Olson rick at napalmriot.com
Fri Feb 22 11:44:30 GMT 2008


[apologies in advance if the formatting on this email comes out really bad]

Olly Betts wrote:
> On Thu, Feb 21, 2008 at 11:01:22PM -0800, Rick Olson wrote:
>   
>> Thanks for the suggestions.  The idea of removing the spaces from my 
>> terms has occurred to me; I was just hoping there was another way I 
>> could accomplish filtering while also allowing spaces.
>>     
>
> Not currently, but see bug#128:
>
> http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=128
>   

Ah, it seems my issue is mentioned in that bug, and the ideas proposed 
there for solving it would fit what I'm trying to accomplish.

> I think what makes sense is that country:"united states" would map to
> a single boolean term (something like `XCOUNTRY:united states').
>   

Exactly what I'm looking to accomplish, which is what I hacked out in 
queryparser.cc (which corresponds directly to the lemon file).
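
To make the intended mapping concrete, here is a toy sketch in Python. It 
is not the queryparser.cc change itself and does not use the Xapian API; 
the BOOLEAN_PREFIXES table and extract_boolean_terms() are hypothetical 
names invented for this illustration, and the "XCOUNTRY:" prefix simply 
follows the example quoted above.

```python
import re

# Hypothetical table: user-facing field name -> term prefix (assumed names).
BOOLEAN_PREFIXES = {"country": "XCOUNTRY:"}

# Matches  field:"quoted value"  or  field:bareword
_FILTER_RE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')


def extract_boolean_terms(query):
    """Split a query into (boolean filter terms, remaining free text).

    The quotes only mark the bounds of the text that goes into the term,
    so a quoted value containing spaces yields exactly one term.
    """
    terms = []

    def _replace(match):
        prefix = BOOLEAN_PREFIXES.get(match.group(1))
        if prefix is None:
            # Not a registered boolean prefix: leave the text untouched.
            return match.group(0)
        quoted, bare = match.group(2), match.group(3)
        value = quoted if quoted is not None else bare
        terms.append(prefix + value)
        return ""

    remaining = _FILTER_RE.sub(_replace, query)
    # Collapse the whitespace left behind by removed filters.
    return terms, " ".join(remaining.split())
```

With this sketch, the query hotels country:"united states" cheap would 
give one filter term containing a space, plus the free text for the 
probabilistic part of the query.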

>> I wrote some modifications in the Xapian queryparser (while I was 
>> waiting for a response) which would not restrict spaces from boolean 
>> prefixes, and I'm trying to figure out how badly allowing the extra 
>> space[s] would affect performance.  Using delve, the exact term is: 
>> 'X_country:united states' so that's what's causing me some confusion in 
>> understanding the performance impact entirely. 
>>     
>
> Terms are essentially opaque blobs of data, so a space is no different
> to any other character (actually, this isn't quite true - there's
> currently some special handling for embedded zero bytes, but other
> characters are handled opaquely, and the special handling for zero bytes
> should be eliminated in the next major backend revision).
>   

This also matches what I thought I was seeing (and what seems naturally 
correct to me), so, if I understand correctly, I now feel slightly less 
lost in the internals of it all.
>   
>> If the amount of engine overhead required to allow for such a thing 
>> isn't horribly awful, would there be some chance of allowing a 
>> FLAG_BOOLEAN_PHRASE flag which would enable such behavior?
>>     
>
> I don't think this should be thought of as a "boolean phrase" - although
> quotes can indicate a probabilistic phrase, here they are indicating the
> bounds for the text to put in the term.
>   

It was a toss-up between "boolean phrase" and "spaces are significant 
in my data" at the time of writing.  I couldn't think of any proper way 
to describe what I was requesting besides using comments from the C 
code and inventing new things on my own :)  I refer specifically to 
add_boolean_prefix() when I talk about spaces being significant.

> But anyway, see bug#128 for previous discussion of this issue.
>   
I took a brief look at the initial patch provided with the issue, but 
haven't had the chance to look at it more closely; it seemed from the 
comments that the scope of the change was beyond the submitted patch 
anyway (and even beyond my requirements, I think?).

Are there any near-term plans on the roadmap for one of the primary 
Xapian developers to resolve that feature request (perhaps in the 
1.1[.x] branch), or is it on the back burner until necessity demands 
it?  If it's not feasible for any of the usual developers to get to it 
in the near future, and if no work has been done on it yet, would a 
preliminary conceptual patch be accepted for consideration for further 
development?

Either way, I think I will have to implement a solution to handle this, 
because of our requirements around FILTER specifically.  Since we rely 
on the queryparser's functionality, and it is seemingly fundamental to 
Xapian, branching it off internally is not a good idea and would cause 
me many headaches.  Any solution that could make it into the official 
core would therefore be very much appreciated, whether it comes from 
your end via a normal release schedule or from ours via a functional 
patch.

Thanks,

Rick


