[Xapian-devel] QueryParser : some remarks

Daniel Ménard Daniel.Menard at bdsp.tm.fr
Thu Nov 8 17:26:53 GMT 2007


Hi to all,

First, I would like to say a big thank you for the work which was done 
on my 'wish bug' to allow mapping one field to multiple prefixes 
(http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=93).
That's great!

I have upgraded to 1.0.4 and I am revisiting my code, replacing the PHP
query parser I wrote with Xapian's.

Everything works well, but I have some remarks:


1. Adding a stopper to the query parser can make Apache hang under
Windows (using the PHP bindings)
I already reported this problem in the past, see thread:
http://thread.gmane.org/gmane.comp.search.xapian.general/4599/focus=1198
but I did not file a bug report and it was never addressed.

It is not critical for me, as I have a workaround (store the stopper in
a global variable or property so it is not destroyed too early, see the
thread above for details), but it would be nice if we could finally
address it...


2. Wildcards: no limits?
It seems that there is no limit on the number of terms a wildcard will
generate: the query "a*" will generate a huge query OR'ing all the terms
which start with an 'a', and it will take a lot of resources and time to
execute (this is a problem: a malicious user can exploit this to deny
access to others).

In my old parser, I had two independent limits:
- minimum number of chars before the '*' (e.g. 3 would allow abs* but
not ab*)
- maximum number of terms a wildcard can expand to (e.g. with 100, abs*
is allowed if it expands to fewer than 100 terms, otherwise an exception
is raised)

Perhaps it would be useful to add something like this to Xapian, with an
API to let the user change these limits, such as
qp->set_wildcard_limits(3,100) ?
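
To make the two limits concrete, here is a minimal standalone sketch in
Python (not Xapian code). The names expand_wildcard and
WildcardLimitError are hypothetical, and the list of terms stands in for
whatever the index would return:

```python
class WildcardLimitError(Exception):
    """Raised when a wildcard is too short or expands to too many terms."""


def expand_wildcard(pattern, all_terms, min_prefix_len=3, max_terms=100):
    """Expand a trailing-'*' pattern against an iterable of index terms.

    Hypothetical helper: both limits are checked independently, and the
    expansion stops as soon as the second limit is reached.
    """
    if not pattern.endswith("*"):
        return [pattern]  # not a wildcard, nothing to expand

    prefix = pattern[:-1]
    if len(prefix) < min_prefix_len:
        raise WildcardLimitError(
            f"{pattern!r}: at least {min_prefix_len} characters are "
            "required before '*'")

    matches = []
    for term in all_terms:
        if term.startswith(prefix):
            matches.append(term)
            # Allowed only while there are fewer than max_terms matches.
            if len(matches) >= max_terms:
                raise WildcardLimitError(
                    f"{pattern!r}: expands to {max_terms} terms or more")
    return matches
```

Stopping the loop at the limit matters as much as the limit itself:
the cost of a query like "a*" is then bounded before the query is built.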


3. Spelling
The new spelling stuff is fantastic!

From the doc (by the way, spelling.rst is not linked from
xapian.org/docs), only non-prefixed terms are corrected: is there a plan
to also support spelling of prefixed terms in the future, or is it
something which is not likely to happen? Being able to give the correct
spelling for an author's name, for example, would be great...

Also, I wonder how to manage the spellings in the long run:
- if I add a document, new spellings are added to the database via
add_spelling().
- if I remove a document, the spellings for that document won't go away
(I mean decrease the frequency, delete if 0 or less), unless I call
remove_spelling() myself. However, there is no API to get the list of
spellings for that document.
- if I modify a document (correcting bad spellings, for example!), new
spellings will be added, but the old ones (corresponding to words
deleted from the document) won't go away.
So (if my assumptions are correct), on a frequently updated database, I
can end up with spellings which no longer appear in any document, and
the query parser can even suggest a bad spelling if there is no better
suggestion?

An answer would be to periodically clean the spellings table (hmm... can
we iterate over them?) and to re-index all the documents, but it is not
very convenient...
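
An alternative to periodic clean-up would be for the application to
remember which words each document contributed, so that it can undo them
itself. The following is only a sketch in Python: SpellingTracker is a
hypothetical class, and the two callables are assumed to take a word and
a frequency change, in the spirit of add_spelling() and
remove_spelling(). The per-document word lists live in memory here; a
real application would have to persist them somewhere.

```python
from collections import Counter


class SpellingTracker:
    """Keeps spelling frequencies in step with document changes."""

    def __init__(self, add_spelling, remove_spelling):
        # Hypothetical callables taking (word, frequency_change).
        self._add = add_spelling
        self._remove = remove_spelling
        self._words_by_doc = {}  # docid -> Counter of words

    def replace_document(self, docid, words):
        """Add or update a document, adjusting only what changed."""
        old = self._words_by_doc.get(docid, Counter())
        new = Counter(words)
        # Counter subtraction keeps only the positive differences.
        for word, count in (new - old).items():
            self._add(word, count)
        for word, count in (old - new).items():
            self._remove(word, count)
        if new:
            self._words_by_doc[docid] = new
        else:
            self._words_by_doc.pop(docid, None)

    def delete_document(self, docid):
        """Remove a document and everything it contributed."""
        self.replace_document(docid, [])
```

With this bookkeeping, correcting a misspelling in a document decrements
the old word and increments the new one, so stale spellings disappear
once no document contains them.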

Any thoughts?


4. QueryParser tolerance, reporting query errors
It seems that XapianQueryParser is very tolerant: if I parse a 'bad'
query (e.g. unmatched brackets, unmatched quotes, nonexistent field
name...), it will ignore the error and produce a query.
I imagine that this is 'by design' and this is probably the best
approach for most users, but I have many cases where it does not work
very well for me
(on the left, the query given to qp.parse_query -> on the right, a
cleaned-up version of query->get_description):

- xapian NOT (lucene OR zebra -> xapian OR not OR lucene OR or OR zebra
Brackets are unmatched, which (in my opinion) should result in a 'bad
query' exception.
The resulting query is really bad for two reasons: a) the 'not' and 'or'
operators are no longer recognized, and b) the mset contains lucene
documents ;-)

- tit:xapian not lucene) -> Tit=xapian OR not OR lucene
Another variant of unmatched brackets, same results.

- tit:(xapian -> title OR xapian
(assuming that 'tit' is an existing prefix added via 
qp->add_prefix('tit', 'xxx'))
Unmatched brackets again; this time it is the field name that is no
longer recognized, resulting in an empty mset.
Once again, I would prefer a 'bad query' exception.

- something:xapian -> something PHRASE 2 xapian
(assuming that 'something' is not a prefix, just a word followed by ':' 
typed in by the user)
I would really prefer an 'unknown field' exception.

- "hospit*" -> hospit
Wildcard does not seem to be allowed in a phrase query and the wildcard 
is removed, resulting in a single word query 'hospit' which matches no 
documents (whereas I have a lot of documents about hospitals, 
hospitalization and so on!).
In that case, I would prefer a "not implemented" exception rather than 
letting the user think we have nothing about that subject...


Would it be possible to add options to the query_parser so the user can 
choose if she wants a tolerant parser or not?
Perhaps it could be a bitfield, something like
qp.report_errors(UNMATCHED_BRACKETS | UNMATCHED_QUOTES | UNKNOWN_FIELDS 
| ...)
with the default being no error reporting at all, to keep the current
behavior intact (although it currently does report some errors, such as
bad operator usage: a and and b).
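
Until something like that exists, the checks can be approximated by
validating the query string before handing it to the parser. This is a
rough Python sketch, not Xapian code: check_query and BadQueryError are
hypothetical names, and the field detection is a simple regular
expression that assumes fields look like name:value.

```python
import re


class BadQueryError(ValueError):
    """Raised when the query string fails a strict pre-check."""


# A word followed by ':' and a non-space character, e.g. tit:xapian or tit:(
FIELD_RE = re.compile(r"(?<![\w:])([A-Za-z_]\w*):(?=\S)")
PHRASE_RE = re.compile(r'"[^"]*"')


def check_query(query, known_fields):
    """Raise BadQueryError for the cases a tolerant parser would accept."""
    depth = 0
    in_quotes = False
    for ch in query:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    raise BadQueryError("unmatched ')'")
    if in_quotes:
        raise BadQueryError("unmatched quote")
    if depth > 0:
        raise BadQueryError("unmatched '('")

    # Wildcards inside a phrase are silently dropped, so reject them.
    for phrase in PHRASE_RE.findall(query):
        if "*" in phrase:
            raise BadQueryError("wildcard inside a phrase")

    # Look for field names outside phrases only.
    for match in FIELD_RE.finditer(PHRASE_RE.sub(" ", query)):
        if match.group(1) not in known_fields:
            raise BadQueryError(f"unknown field: {match.group(1)}")
```

A pre-check like this duplicates part of the parser's work and cannot
know its exact grammar, which is why options on the parser itself would
be preferable.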

Your thoughts?


5. Names of operators, prefix names case sensitivity

We use boolean operators a lot... but in French ('et' for 'and', 'ou'
for 'or', 'sauf' for 'not' and so on).
Currently, I pre-process the query, replacing \b(et|ou|sauf)\b with the
correct operator, but it does not work in some cases (phrases...)
Would it be possible to have something like qp->add_operator(OP_OR, 
'ou') in the API?
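
For comparison, the pre-processing approach can be made to leave phrases
alone by splitting the query on quoted segments first. This is only a
sketch in Python with hypothetical names (FRENCH_OPERATORS,
translate_operators); it assumes phrases are delimited by plain double
quotes and does not handle every case a real parser would.

```python
import re

# Hypothetical mapping from localized operator names to the parser's own.
FRENCH_OPERATORS = {"et": "AND", "ou": "OR", "sauf": "NOT"}

_OPERATOR_RE = re.compile(
    r"\b(" + "|".join(FRENCH_OPERATORS) + r")\b", re.IGNORECASE)
_PHRASE_SPLIT_RE = re.compile(r'("[^"]*")')


def translate_operators(query):
    """Replace French operators outside quoted phrases."""
    parts = _PHRASE_SPLIT_RE.split(query)
    result = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd positions are the captured phrases: keep them verbatim.
            result.append(part)
        else:
            result.append(_OPERATOR_RE.sub(
                lambda m: FRENCH_OPERATORS[m.group(1).lower()], part))
    return "".join(result)
```

Even so, a string-level rewrite cannot tell an operator from an ordinary
word in every context, so support in the API would still be the cleaner
solution.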

Also: QueryParser has a flag, FLAG_BOOLEAN_ANY_CASE, to allow boolean
operators in any case.
Would it be possible to have something similar for the prefix names 
(title:xx, Title:xx, TITLE:xx would all be recognized)?


That's all!
Re-reading my mail, it's more of a big wish list than 'some remarks' ;-)
Please forgive me for writing so much and for asking so much!

Cheers,

-- 

Daniel Ménard





