[Xapian-discuss] Spaces in bool prefixes
Rick Olson
rick at napalmriot.com
Fri Feb 22 11:44:30 GMT 2008
[apologies in advance if the formatting on this email comes out really bad]
Olly Betts wrote:
> On Thu, Feb 21, 2008 at 11:01:22PM -0800, Rick Olson wrote:
>
>> Thanks for the suggestions. The idea of removing the spaces from my
>> terms has occurred to me; I was just hoping there was another way I
>> could accomplish filtering while also allowing spaces.
>>
>
> Not currently, but see bug#128:
>
> http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=128
>
Ah, it seems my issue is mentioned in that bug, and the ideas proposed
there for solving it would fit with what I'm trying to accomplish.
> I think what makes sense is that country:"united states" would map to
> a single boolean term (something like `XCOUNTRY:united states').
>
Exactly what I'm looking to accomplish, which is what I hacked out in
queryparser.cc (which corresponds directly to the lemon file).
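To make the mapping concrete, here is a rough sketch in plain Python. It is
not the queryparser.cc change and does not call Xapian at all; it only
illustrates pulling field:"quoted value" out of a query string and turning
it into a single term that keeps its spaces. The function name
extract_quoted_filters, the PREFIXES table, and the "XCOUNTRY:" prefix
string are assumptions made for illustration (the prefix is chosen only to
match the example term quoted above).

```python
import re

# Hypothetical mapping from user-facing field names to term prefixes.
# "XCOUNTRY:" is an assumption chosen to match the example term above.
PREFIXES = {"country": "XCOUNTRY:"}

# Matches field:"quoted value"; escaped quotes are not handled in this sketch.
_FILTER_RE = re.compile(r'(\w+):"([^"]*)"')


def extract_quoted_filters(query, prefixes=PREFIXES):
    """Return (remaining_query_text, list_of_boolean_terms).

    Each known field:"value" becomes one term with the value kept
    verbatim, spaces included. Unknown fields are left in the text.
    """
    terms = []

    def _replace(match):
        field, value = match.group(1), match.group(2)
        if field not in prefixes:
            return match.group(0)  # not a known filter field, leave as-is
        terms.append(prefixes[field] + value)
        return ""  # strip the filter from the free-text part

    remaining = _FILTER_RE.sub(_replace, query)
    # Collapse the whitespace left behind by removed filters.
    remaining = " ".join(remaining.split())
    return remaining, terms
```

With this, 'cheap flights country:"united states"' splits into the free
text 'cheap flights' and the single term 'XCOUNTRY:united states', which
could then be applied as a filter alongside the parsed free-text query.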
>> I wrote some modifications in the Xapian queryparser (while I was
>> waiting for a response) which would not restrict spaces from boolean
>> prefixes, and I'm trying to figure out how badly allowing the extra
>> space[s] would affect performance. Using delve, the exact term is:
>> 'X_country:united states' so that's what's causing me some confusion in
>> understanding the performance impact entirely.
>>
>
> Terms are essentially opaque blobs of data, so a space is no different
> to any other character (actually, this isn't quite true - there's
> currently some special handling for embedded zero bytes, but other
> characters are handled opaquely, and the special handling for zero bytes
> should be eliminated in the next major backend revision).
>
This also corresponds to what I thought I was seeing (and what seems
naturally correct to me), so now I feel slightly less lost in the
internals of it all if I comprehend correctly.
>
>> If the amount of engine overhead required to allow for such a thing
>> isn't horribly awful, would there be some chance of allowing a
>> FLAG_BOOLEAN_PHRASE flag which would enable such behavior?
>>
>
> I don't think this should be thought of as a "boolean phrase" - although
> quotes can indicate a probabilistic phrase, here they are indicating the
> bounds for the text to put in the term.
>
It was a toss-up between "boolean phrase" and "spaces are significant
in my data" at the time of writing. I couldn't think of a proper way to
describe what I was requesting besides borrowing comments from the C
code and inventing new terms of my own :) I refer specifically to
add_boolean_prefix() when I talk about spaces being significant.
> But anyway, see bug#128 for previous discussion of this issue.
>
I took a brief look at the initial patch provided with the issue, but
haven't had the chance to look at it closer; it seemed from the comments
that the scope of the change was beyond the submitted patch anyway (and
beyond my requirements even, I think?).
Are there any near-term plans on the roadmap for one of the primary
Xapian developers to resolve that feature request (perhaps in the
1.1[.x] branch), or is it on the back burner until necessity demands it?
If it's not feasible for any of the usual developers to get to it in the
near future, and if no work has been done on it yet, would a preliminary
conceptual patch be accepted for consideration for further development?
Either way, I think I will have to implement a solution to handle this
because of our requirements, specifically with the FILTER stuff. Since
the queryparser's functionality is something we rely on, and is
seemingly fundamental to Xapian, branching it internally is not a very
good idea and would cause me many headaches. Any solution that could
make it into the official core would be very much appreciated, whether
on your end via a normal release schedule or on ours via a functional
patch.
Thanks,
Rick