[Xapian-discuss] add_prefix() versus add_boolean_prefix()

Daniel Ménard Daniel.Menard at ehesp.fr
Tue Nov 18 16:37:54 GMT 2008


Thanks a lot, Olly, for answering my previous question, and forgive me
for replying so late: I didn't get a chance to answer sooner.

I'm not completely sure that I fully understood your explanations, but
now I'm pretty sure that I have a wrong idea of what boolean
prefixes are. I saw that Torsten Foertsch had a quite similar question
(http://thread.gmane.org/gmane.comp.search.xapian.general/6711) and I have
read the answers Jim Lynch and you gave him. I also searched the site
and the list archives for a definition, but only found
examples of what they can be used for and details about how to use them
(API), which is good, but I'm still not completely clear about what
boolean prefixes can do and what they can't...

I will try to explain what I intended to do so that, hopefully, someone
will see where I'm wrong (sorry, it's a bit long...)

Our use case is not the general one: we're using Xapian to search
bibliographical records containing things like titles, abstracts, notes
and other fields which can be used to restrict a search: typical ones
(type of document, year of publication, language, country...) but also
fields containing "controlled vocabulary" (keywords, publisher,
collection, organization, periodical...)

What we want to do is to define "views" on these records. In our mind, a
view is a set of rules which restrict the corpus of records against
which the query will be run. Some of these rules are "hardcoded" in
our application (they are combined with the user query by using an
OP_FILTER operator), but some of these rules can also be defined by the
user (e.g. health promotion date:2008 publisher:"Editions Masson"). In
our mind, such a request really means: find records about health
promotion, but only those published in 2008 by Masson.

Assuming that the default operator used by the query parser is OR, the
above query will not give "good" results: from a user's point of view, any
document which is not from Editions Masson or was not published in 2008
is just "noise". Using AND as the default operator would help but is a
bit too strict: "health promotion" really is a free text query and a
document having only the term "promotion" is probably still a good
answer for our database (of course it will work fine if the user
manually adds the AND operators at the right places in her query).

I also suppose that these filters will affect the score obtained by the
free text part of the query if I use OR or AND (I'm not completely clear
about what it means with AND: the doc says that OP_AND sums the scores
from both branches, but how does that affect the final MSet and its order?).
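
To make the scoring question concrete, here is a toy model in plain
Python. It is not Xapian code: the documents, the per-term weights and
the helper names (rank_and, rank_filter) are all invented for the
illustration, and real weights would come from the weighting scheme. It
only contrasts adding the weight of a filter term to the score with
using that term purely to restrict the set of matches.

```python
# Toy model, not the Xapian API: each document maps term -> made-up weight.
DOCS = {
    "doc1": {"health": 1.0, "promotion": 1.5, "date:2008": 0.2},
    "doc2": {"health": 1.0, "date:2008": 2.0},
    "doc3": {"health": 1.0, "promotion": 1.5},  # not from 2008
}


def score(doc, terms):
    """Sum the weights of the given terms that occur in the document."""
    return sum(doc.get(t, 0.0) for t in terms)


def matches(doc, free_terms, filter_term):
    """True if the filter term and at least one free-text term are present."""
    return filter_term in doc and any(t in doc for t in free_terms)


def rank_and(free_terms, filter_term):
    """(free terms OR'ed) AND filter: both branches contribute weight."""
    hits = [(score(doc, free_terms) + doc[filter_term], name)
            for name, doc in DOCS.items()
            if matches(doc, free_terms, filter_term)]
    return [name for _, name in sorted(hits, reverse=True)]


def rank_filter(free_terms, filter_term):
    """(free terms OR'ed) FILTER filter: only the left branch is scored."""
    hits = [(score(doc, free_terms), name)
            for name, doc in DOCS.items()
            if matches(doc, free_terms, filter_term)]
    return [name for _, name in sorted(hits, reverse=True)]


free = ["health", "promotion"]
and_order = rank_and(free, "date:2008")
filter_order = rank_filter(free, "date:2008")
print(and_order)
print(filter_order)
```

In this toy data both variants return the same two documents, but the
order differs because the AND variant lets the (made-up) weight of the
date term count towards the score.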

So I thought that I had to use OP_FILTER (or perhaps OP_SCALE_WEIGHT
with a factor of 0? Is it the same?) and that boolean prefixes were
the right way to do that... I define "publisher", "date" and so on as
boolean prefixes and the query parser "magically" does what I want: it
extracts the filters from the user query and combines them in an OP_FILTER
clause which will have no impact on the ranking...
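
The behaviour expected here can be sketched in plain Python. This is a
toy, not Xapian's QueryParser: split_query, describe, the
BOOLEAN_PREFIXES table and the term prefixes XDATE and XPUB are
hypothetical names chosen for the illustration, and it deliberately
handles only simple prefix:value tokens (no phrases, brackets or
wildcards).

```python
# Toy sketch, not Xapian's QueryParser. The prefix table is hypothetical.
BOOLEAN_PREFIXES = {"date": "XDATE", "publisher": "XPUB"}


def split_query(text):
    """Separate free-text words from simple prefix:value filter tokens."""
    free_terms, filters = [], []
    for token in text.split():
        field, sep, value = token.partition(":")
        if sep and field in BOOLEAN_PREFIXES and value:
            # A recognised boolean prefix: keep it out of the scored part.
            filters.append(BOOLEAN_PREFIXES[field] + value.lower())
        else:
            free_terms.append(token.lower())
    return free_terms, filters


def describe(free_terms, filters):
    """Render the intended structure: scored part FILTER boolean part."""
    scored = " OR ".join(free_terms)
    if not filters:
        return scored
    return "(" + scored + ") FILTER (" + " AND ".join(filters) + ")"


free_terms, filters = split_query("health promotion date:2008 publisher:masson")
description = describe(free_terms, filters)
print(description)
```

The free-text words end up in the scored branch and the prefixed tokens
in a separate branch that is only meant to restrict the matches.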

It works fine if my filters are simple terms (e.g. date:2008) but not if
I use something more complex: phrases, brackets or even
wildcards... hence my previous mail, to which Olly replied.

I understood from Olly's answer that this behavior was indeed expected
(boolean prefixes are not intended to do what I'm trying to do) but I
fail to understand why...
I'm pretty sure I'm missing something which is obvious to others...
Perhaps I'm just lacking some theoretical background... (and English is
not my native language, which does not help!)

Thanks a lot for your patience,

Daniel

PS: below, some clarifications interleaved with Olly's replies.


Olly Betts wrote:
>> [test author:(john doe)]
>>     
> It's a bad example to use "author:" here, since that would naturally
> be a free-text search, and it means that examples which looks reasonable
> don't necessarily make much sense in the actual boolean prefix case.
>   
I still don't get it... In my mind, the author clause is a filter:
either the document is written by this author or it is not, which
looks like a boolean clause to me...
And ideally, the scoring would only take "test" into account, ignoring
any weight contributed by this filter clause.
> [...] you can't apply a boolean prefix to a subexpression [...]
Is this a current limitation of the query parser, or is there a fundamental
reason why it can't be possible?
> In this case the subexpression isn't boolean, so as a better
> example, it's like this where "type:" is a boolean prefix:
>
> type:(html pdf)
>
> I'm not really sure that makes a lot of sense
I read it as a bracketed expression containing two terms which would be
combined using the QueryParser's default operator, giving a pure boolean
query like:
Xapian::Query(0 * (XTYPEhtml OR XTYPEpdf))
> I can see that there's a natural meaning for this case, which I don't
> think we currently handle:
>
> type:(html OR pdf)
>   
I confirm: currently, the query parser gives me
pdf:(pos=2) FILTER type:(html
for this query.
>> A similar problem appear if I try a phrase search: [test author:"john 
>> doe"] gives
>> Xapian::Query(((test:(pos=1) OR doe:(pos=2)) FILTER A"john))
>>     
> I'm not really sure what you expect this to mean - a phrase isn't a
> boolean sub-expression, and I wouldn't expect boolean filter terms to
> have positional information.
>   
As above, I don't get it... (I feel really sorry...)
Using the API, I can create the following query:
Xapian::Query((test:(pos=1) FILTER (XAUTHORjohn:(pos=1) PHRASE 2 
XAUTHORdoe:(pos=2))))

but I can't generate it using the query parser.
I'm sure there is a very good reason for the query parser to parse so
differently depending on whether a prefix is declared as boolean
or normal but, once again, I'm missing it...
> Looking at a better example, what would you expect this to mean?
>
> type:"html pdf"
>   
For me, it means a pure boolean query (only a filter clause) containing
a phrase search... something like this:
Xapian::Query(FILTER (XTYPEhtml:(pos=1) PHRASE 2 XTYPEpdf:(pos=2)))
or perhaps like this:
Xapian::Query(0 * (XTYPEhtml:(pos=1) PHRASE 2 XTYPEpdf:(pos=2)))

> Incidentally, http://trac.xapian.org/ticket/128 suggests it should be a
> single filter term with a space in, which seems a reasonable way to
> allow that to be specified.  So in this case, the term would be:
>
> XTYPEhtml pdf
>   
I'm not sure I understand how that relates to boolean prefixes...
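
For reference, the idea quoted from the ticket (a quoted value becoming
one filter term with a space in it) can be illustrated with a small toy
in plain Python. This is not how Xapian's QueryParser is implemented:
filter_terms, FILTER_RE and the BOOLEAN_PREFIXES table are hypothetical
names, and only the "XTYPEhtml pdf" term itself comes from Olly's reply.

```python
import re

# Toy sketch of the idea in ticket 128, not Xapian's QueryParser.
# Hypothetical mapping from a user-visible prefix to a term prefix.
BOOLEAN_PREFIXES = {"type": "XTYPE"}

# Matches prefix:"quoted value" or prefix:bareword.
FILTER_RE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')


def filter_terms(text):
    """Return one filter term per prefix:value, keeping quoted values whole."""
    terms = []
    for field, quoted, bare in FILTER_RE.findall(text):
        if field in BOOLEAN_PREFIXES:
            # The quoted value is not split into words or given positions:
            # it becomes a single term, space included.
            value = quoted if quoted else bare
            terms.append(BOOLEAN_PREFIXES[field] + value)
    return terms


terms = filter_terms('type:"html pdf"')
print(terms)
```

Under this reading the quotes do not start a phrase search at all: they
only delimit the value of a single boolean term.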

Again, thank you for your patience,

-- 

Daniel Ménard



