[Xapian-devel] QueryParser : some remarks
Daniel Ménard
Daniel.Menard at bdsp.tm.fr
Thu Nov 8 17:26:53 GMT 2007
Hi to all,
First, I would like to say a big thank you for the work which was done
on my 'wish bug' to allow mapping one field to multiple prefixes
(http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=93).
That's great!
I have upgraded to 1.0.4 and I am revisiting my code, replacing the PHP
query parser I wrote with Xapian's.
Everything works well, but I have some remarks:
1. Adding a stopper to the query parser can make Apache hang under
Windows (using the PHP bindings)
I already reported this problem in the past, see thread:
http://thread.gmane.org/gmane.comp.search.xapian.general/4599/focus=1198
but I did not file a bug report and it was never addressed.
It is not critical for me, as I have a workaround (store the stopper in
a global variable or property so it is not destroyed too early, see
above thread for details), but it would be nice if we could finally
address it...
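
To illustrate the workaround (keeping the stopper referenced so it is
not destroyed too early), here is a minimal sketch in Python. It does
not use Xapian at all: Stopper, BorrowingParser and OwningParser are
hypothetical stand-ins, and the weak reference only mimics a binding
that does not take ownership of the object it is given.

```python
import gc
import weakref


class Stopper:
    """Stand-in for a stopper object (hypothetical, not the Xapian class)."""

    def __init__(self, words):
        self.words = set(words)


class BorrowingParser:
    """Stand-in parser that only borrows the stopper, without owning it."""

    def __init__(self):
        self._stopper_ref = None

    def set_stopper(self, stopper):
        # Only a weak reference is kept, mimicking a binding that does not
        # take ownership of the object it is given.
        self._stopper_ref = weakref.ref(stopper)

    def stopper_alive(self):
        return self._stopper_ref is not None and self._stopper_ref() is not None


class OwningParser(BorrowingParser):
    """Workaround: keep the stopper in a property so it outlives the call."""

    def set_stopper(self, stopper):
        self._stopper = stopper  # strong reference keeps the stopper alive
        super().set_stopper(stopper)


def configure(parser):
    # The stopper is a local variable: once this function returns, nothing
    # else references it unless the parser stored it.
    stopper = Stopper(["le", "la", "les"])
    parser.set_stopper(stopper)
    return parser


unsafe = configure(BorrowingParser())
safe = configure(OwningParser())
gc.collect()
print(unsafe.stopper_alive(), safe.stopper_alive())
```

In the stand-in, the borrowed stopper is gone once configure() returns,
while the one stored in a property survives as long as the parser does.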
2. Wildcards: no limits?
It seems that there is no limit on the number of terms a wildcard will
generate: the query "a*" will generate a huge query OR'ing all the terms
which start with an 'a', which will take a lot of resources and time to
execute (this is a problem: a malicious user can exploit this to deny
access to others).
In my old parser, I had two independent limits:
- minimum number of chars before the '*' (e.g. 3 would allow abs* but
not ab*)
- maximum number of terms a wildcard can expand to (e.g. 100 = abs* is
allowed if there are fewer than 100 terms, otherwise an exception is
raised)
Perhaps it would be useful to add something like this to Xapian, with an
API to let the user change these limits, such as
qp->set_wildcard_limits(3,100) ?
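
As a rough illustration of the two limits, here is a minimal sketch in
Python. It is not Xapian code: expand_wildcard, WildcardError and the
in-memory sorted term list are assumptions made for the example, and it
accepts an expansion of at most max_terms terms.

```python
from bisect import bisect_left


class WildcardError(ValueError):
    """Raised when a wildcard violates one of the two limits."""


def expand_wildcard(sorted_terms, pattern, min_prefix_len=3, max_terms=100):
    """Expand 'abs*' against a sorted term list, enforcing both limits."""
    if not pattern.endswith("*"):
        raise WildcardError("pattern must end with '*'")
    prefix = pattern[:-1]
    if len(prefix) < min_prefix_len:
        raise WildcardError(
            f"need at least {min_prefix_len} characters before '*'"
        )
    matches = []
    # Terms sharing a prefix are contiguous in a sorted list, so we can
    # jump to the first candidate and stop at the first non-match.
    start = bisect_left(sorted_terms, prefix)
    for i in range(start, len(sorted_terms)):
        term = sorted_terms[i]
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) > max_terms:
            # Stop early instead of building a huge OR query.
            raise WildcardError(
                f"'{pattern}' expands to more than {max_terms} terms"
            )
    return matches


terms = sorted(["ab", "abandon", "abs", "absolute", "abstract", "zebra"])
print(expand_wildcard(terms, "abs*"))
```

Stopping as soon as the limit is exceeded means that "a*" costs at most
max_terms + 1 term lookups instead of a walk over every term starting
with 'a'.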
3. Spelling
The new spelling stuff is fantastic!
From the doc (by the way, spelling.rst is not linked from
xapian.org/docs), only non-prefixed terms are corrected: is there a plan
to also support spelling of prefixed terms in the future or is it
something which is not likely to happen? Being able to give the correct
spelling for an author's name, for example, would be great...
Also, I wonder how to manage the spellings in the long run:
- if I add a document, new spellings are added in the database via
add_spelling().
- if I remove a document, the spellings for that document won't go away
(I mean decrease frequency, delete if 0 or less), unless I call
remove_spelling() myself.
However, there's no API way to get the list of spellings for that document.
- if I modify a document (correcting bad spellings, for example!), new
spellings will be added, but the old ones (corresponding to words
deleted from the document) won't go away.
So (if my assumptions are correct), on a frequently updated database, I
can get into a situation where I have spellings which no longer
appear in any document, and the query parser can even suggest a bad
spelling if there is no better suggestion?
An answer would be to periodically clean the spellings table (hum... can
we iterate over them ?) and to re-index all the documents, but it is not
very convenient...
Any thoughts?
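
One way to picture the bookkeeping described above is a frequency table
that is updated with only the difference between the old and the new
text of a document. The sketch below is plain Python, not Xapian:
SpellingTable is an in-memory stand-in whose add_spelling and
remove_spelling methods merely borrow the names used above, and it
assumes the application still has the previous text of the document at
hand (for example, stored alongside it).

```python
from collections import Counter


class SpellingTable:
    """In-memory stand-in for a spelling dictionary with frequencies."""

    def __init__(self):
        self.freq = Counter()

    def add_spelling(self, word, inc=1):
        self.freq[word] += inc

    def remove_spelling(self, word, dec=1):
        remaining = self.freq[word] - dec
        if remaining > 0:
            self.freq[word] = remaining
        else:
            self.freq.pop(word, None)  # drop entries at 0 or less


def words_of(text):
    """Very naive tokenizer: lower-case, split on whitespace."""
    return Counter(text.lower().split())


def replace_document(table, old_text, new_text):
    """Apply only the difference between the old and new word counts."""
    old, new = words_of(old_text), words_of(new_text)
    for word, n in (new - old).items():
        table.add_spelling(word, n)
    for word, n in (old - new).items():
        table.remove_spelling(word, n)


table = SpellingTable()
replace_document(table, "", "hopital public")                # add
replace_document(table, "hopital public", "hospital public") # correct
print(dict(table.freq))
```

Adding a document is the case where old_text is empty, and deleting one
is the case where new_text is empty, so a single function covers the
three situations listed above.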
4. QueryParser tolerance, reporting query errors
It seems that XapianQueryParser is very tolerant: if I parse a 'bad'
query (e.g. unmatched brackets, unmatched quotes, nonexistent field
name...), it will ignore the error and produce a query.
I imagine that this is 'by design' and this is probably the best
approach for most users, but I have many cases where it does not work
very well for me
(on the left, the query given to qp.parse_query -> on the right, a
cleaned-up version of query->get_description) :
- xapian NOT (lucene OR zebra -> xapian OR not OR lucene OR or OR zebra
Brackets are unmatched, which (in my opinion) should result in a 'bad
query' exception.
The resulting query is really bad for two reasons: a) the 'not' and 'or'
operators are not recognized anymore b) the mset contains lucene
documents ;-)
- tit:xapian not lucene) -> Tit=xapian OR not OR lucene
another variant of unmatched brackets, same results
- tit:(xapian -> title OR xapian
(assuming that 'tit' is an existing prefix added via
qp->add_prefix('tit', 'xxx'))
Unmatched brackets again; this time it is the field name which is no
longer recognized, resulting in an empty mset.
Once again, I would prefer a 'bad query' exception.
- something:xapian -> something PHRASE 2 xapian
(assuming that 'something' is not a prefix, just a word followed by ':'
typed in by the user)
I would really prefer an 'unknown field' exception.
- "hospit*" -> hospit
Wildcard does not seem to be allowed in a phrase query and the wildcard
is removed, resulting in a single word query 'hospit' which matches no
documents (whereas I have a lot of documents about hospitals,
hospitalization and so on!).
In that case, I would prefer a "not implemented" exception rather than
letting the user think we have nothing about that subject...
Would it be possible to add options to the query_parser so the user can
choose if she wants a tolerant parser or not?
Perhaps it could be a bitfield, something like
qp.report_errors(UNMATCHED_BRACKETS | UNMATCHED_QUOTES | UNKNOWN_FIELDS
| ...)
with the default being no error report at all, to keep the current
behavior intact (although it currently does report some errors, such as
bad operator usage: a and and b).
Your thoughts?
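
To make the bitfield idea concrete, here is a minimal sketch of a strict
pre-check that could run before the query string is handed to the
parser. It is plain Python and not part of Xapian: Report, BadQuery and
check_query are hypothetical names, the checks are deliberately
simplistic, and the default of reporting nothing mirrors the proposal
above.

```python
import re
from enum import IntFlag


class Report(IntFlag):
    """Hypothetical flags, named after the ones suggested above."""
    NONE = 0
    UNMATCHED_BRACKETS = 1
    UNMATCHED_QUOTES = 2
    UNKNOWN_FIELDS = 4
    WILDCARD_IN_PHRASE = 8


class BadQuery(ValueError):
    """Raised instead of silently producing a query."""


def check_query(query, known_fields, report=Report.NONE):
    """Raise BadQuery for each class of error enabled in `report`."""
    if report & Report.UNMATCHED_QUOTES and query.count('"') % 2:
        raise BadQuery("unmatched quote")
    if report & Report.WILDCARD_IN_PHRASE and re.search(r'"[^"]*\*[^"]*"', query):
        raise BadQuery("wildcard inside a phrase is not implemented")
    # Blank out phrase contents so they do not affect the other checks.
    outside = re.sub(r'"[^"]*"', '""', query)
    if report & Report.UNMATCHED_BRACKETS:
        depth = 0
        for ch in outside:
            depth += (ch == "(") - (ch == ")")
            if depth < 0:
                raise BadQuery("unmatched ')'")
        if depth:
            raise BadQuery("unmatched '('")
    if report & Report.UNKNOWN_FIELDS:
        # A word immediately followed by ':' and a non-space is a field.
        for name in re.findall(r'(?<![\w:])(\w+):(?=\S)', outside):
            if name not in known_fields:
                raise BadQuery(f"unknown field '{name}'")


STRICT = (Report.UNMATCHED_BRACKETS | Report.UNMATCHED_QUOTES
          | Report.UNKNOWN_FIELDS | Report.WILDCARD_IN_PHRASE)
check_query("tit:(xapian OR lucene)", {"tit"}, STRICT)   # passes silently
check_query("xapian NOT (lucene OR zebra", {"tit"})      # tolerant default
```

With STRICT, each of the example queries listed above raises BadQuery
instead of being passed on to the parser.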
5. Names of operators, prefix names case sensitivity
We use boolean operators a lot... but in French ('et' for 'and', 'ou'
for 'or', 'sauf' for 'not' and so on).
Currently, I pre-process the query, replacing \b(et|ou|sauf)\b with the
correct operator, but it does not work in some cases (phrases...)
Would it be possible to have something like qp->add_operator(OP_OR,
'ou') in the API?
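
Until something like that exists, the pre-processing step can at least
be made to skip quoted phrases. The following is a minimal sketch in
Python, not Xapian code: OPERATORS and translate_operators are
hypothetical names, only lower-case operator words are handled, and an
unterminated quote is simply not treated as a phrase.

```python
import re

# Hypothetical mapping from localized operator names to the English ones.
OPERATORS = {"et": "AND", "ou": "OR", "sauf": "NOT"}

# A quoted phrase is tried first, so operator words inside it are consumed
# as part of the phrase and never matched on their own.
_TOKEN = re.compile(r'"[^"]*"|\b(?:%s)\b' % "|".join(OPERATORS))


def translate_operators(query):
    """Replace localized operators, leaving quoted phrases untouched."""

    def swap(match):
        text = match.group(0)
        if text.startswith('"'):
            return text  # a phrase: keep it verbatim
        return OPERATORS[text]

    return _TOKEN.sub(swap, query)


print(translate_operators('chat et "pain ou vin" sauf chien'))
```

Matching the phrase and the operators in one alternation avoids having
to split the query into quoted and unquoted parts by hand.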
Also: QueryParser has flag FLAG_BOOLEAN_ANY_CASE to allow boolean
operators in any case.
Would it be possible to have something similar for the prefix names
(title:xx, Title:xx, TITLE:xx would all be recognized)?
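
In the meantime, the same kind of pre-processing can fold field names to
a single case before parsing. This is a minimal sketch in Python, not
Xapian code: normalize_prefixes is a hypothetical helper, and it assumes
that the prefixes were registered in lower case.

```python
import re

# Either a whole quoted phrase (left alone) or a word followed by ':'
# and a non-space character, which is treated as a field name.
_FIELD = re.compile(r'"[^"]*"|(?<![\w:])(\w+):(?=\S)')


def normalize_prefixes(query, known_prefixes):
    """Lower-case known field names so Title:xx and TITLE:xx match title:xx."""
    known = {p.lower() for p in known_prefixes}

    def fix(match):
        name = match.group(1)
        if name is None or name.lower() not in known:
            return match.group(0)  # phrase or unknown name: leave as typed
        return name.lower() + ":"

    return _FIELD.sub(fix, query)


print(normalize_prefixes('Title:xapian TITLE:(a b) "Title: x" Other:y', ["title"]))
```

Names that are not known prefixes are left exactly as typed, so the
parser still sees what the user wrote.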
That's all!
Re-reading my mail, it's more of a big wish list than 'some remarks' ;-)
Please forgive me for being so long and for asking so much!
Cheers,
--
Daniel Ménard