[Xapian-devel] QueryParser : some remarks

Sat Nov 10 12:35:55 GMT 2007

On Thu, Nov 08, 2007 at 06:26:53PM +0100, Daniel M?nard wrote:
> 1. Adding a stopper to the query parser can make apache hangs under 
> windows (using php bindings)
> I already reported this problem in the past, see thread:
> http://thread.gmane.org/gmane.comp.search.xapian.general/4599/focus=1198
> but I did not filled a bug report and it was never addressed.

It's not been forgotten - the best way to fix this would be to have a
reference counting mechanism for Stopper (and other similar classes
like MatchDecider) but we can't use the same technique we use for most
of the API classes as it doesn't allow subclassing.  Bug#186 is about
this:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=186

But that's bound to involve ABI changes, so it's something for Xapian 1.1.

> 2. Wildcards: no limits?
> It seems that there is no limit on the number of terms a wildcard will 
> generate: the query "a*" will generate a huge query OR'ing all the terms 
> which start with an 'a' that will take lot of resources and time to 
> execute (this is a problem: a malicious user can exploit this to deny 
> access to others).
> 
> In my old parser, I had two independent limits:
> - minimum number of chars before the '*' (e.g. 3 would alllow abs* but 
> not ab*)
> - maximum number of terms a wildcard can expand to (e.g. 100= abs* is 
> allowed if there are less than 100 terms else an exception is raised)

One problem with the latter limit is that you have to do work to
discover the limit has been exceeded.  I've not profiled how expensive
the wildcard expansion is versus the time to run the match against it
so perhaps this doesn't matter.

It's perhaps worth noting that these don't really protect from an even
slightly determined attacker.  Although there are a lot of potential
combinations of 3 characters, most of these don't start terms.  For
example, in the "SOWPODS" English Scrabble word list, only 3575 of the
17576 possible lower case letter combinations actually occur, and a lot
of these only start a small number of words.

So a query for the 10 most common 3 letter prefixes could expand to
16471 terms, but only need a 49 character query.  A real database
probably won't have a lot of these words (unless the documents include
wordlists!), but will have a lot of real names, misspellings, and
garbage terms (e.g. from ASCII ART, hex codes, etc).  At any rate,
you'll generally have a Zipf-like distribution of frequencies.

> Perhaps it would be useful to add something like this to xapian, with an 
> api to allow user to change these limits like 
> qp->set_wildcard_limits(3,100) ?

Possibly, though perhaps the second limit should be on total query
length (the number of terms in the parsed Query, not the number of
characters in the query string) rather than the number of terms per
wildcard.

Bug#207 "Add ability to accelerate wildcard queries for short terms" is
relevant to this issue too:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=207

> 3. Spelling
> The new spelling stuff is fantastic!
> 
> From the doc (by the way, spelling.rst is not linked from 
> xapian.orgs/doc)

Nor from xapian.org/docs !

I've fixed that (and linked in a couple of other unlinked documents):

    http://www.xapian.org/docs/

> only non-prefixed terms are corrected: is there a plan 
> to also support spelling of prefixed terms in the future or is it 
> something which is not likely to happen? Being able to give the correct 
> spelling for an author's name, for example, would be great...

It would be nice to support.  We weren't sure how it should work, so we
didn't implement it as part of the first version - for some fields you
definitely want prefix-specific spelling (e.g. author) but for others
it might be better to merge spelling data (e.g. title and body).  The
spelling will probably work better with more data because the statistics
will be better, although you may be offered a correction to a title
which doesn't match a word which ever occurs in a title.  But if that's
the word you misspelled, perhaps that's better.  Some experimentation
is needed I think.

> Also, I wonder about how to manage the spellings on the long run:
> - if I add a document, new spellings are added in the database via 
> add_spelling().
> - if I remove a document, the spellings for that document won't go away 
> (I mean decrease frequency, delete if 0 or less), unless I call 
> remove_spelling() myself.
> However, there's no API way to get the list of spellings for that document.

That's because we don't store the spellings for a document.  We could,
but it's quite a space overhead.  The termlist could be used, but won't
necessarily match the spellings added (and you might only index all terms
stemmed).  We considered tying spellings to the document termlist, but
decided it was better to allow more flexibility (for example, you can
use a static dictionary, or preseed the spelling dictionary).

Overall we concluded that a document which had been in the database
would still reflect the topics covered by data in the database, so
keeping terms from old documents would probably be beneficial overall.

> - if I modify a document (correcting bad spellings, for example!), new 
> spellings will be added, but the old ones (corresponding to words 
> deleted from the document) won't go away.

True, but if the misspelling don't occur frequently, they won't be
suggested as corrections.

> So (if my assumptions are correct), on a frequently updated database, I 
> can get in the situation where I have spellings which do not longer 
> appear in any document, and the query parser can even suggest a bad 
> spelling if there is no better suggestion?

It's probably possible, but also unlikely in reality I think.  I think
in this case, actual behaviour is more important than what could happen.

> An answer would be to periodically clean the spellings table (hum... can 
> we iterate over them ?)

Yes - see Xapian::Database::spellings_begin():

http://www.xapian.org/docs/apidoc/html/classXapian_1_1Database.html#2f78aa0fe39a3b68e10ca865dc0f86ef

> and to re-index all the documents, but it is not very convenient...
> 
> Any thought?

It's hard to forsee all the issues when designing a feature like this
one.  We tried to consider everything, and have a rationale for the
choices made, but we did realise that we might make the wrong choice
in some cases and that there might problems that we hadn't foreseen.
But a released spelling feature which works pretty well is better for
most users than an unreleased one which might better one day, and we can
address unforseen problems with the benefit of real-world experience.

> 4. QueryParser tolerance, reporting query errors
> It seems that XapianQueryParser is very tolerant: if I parse a 'bad' 
> query (e.g. unmatched brackets, unmatched quotes, nonexistent field 
> name...), it will ignore the error and produce a query.
> I imagine that this is 'by design' and this is probably the best 
> approach for most users, but I have many cases where it does not work 
> very well for me

There's a "reparse" feature, which should be configurable, but currently
is hard wired.  The idea behind it is to cope better when users
cut-and-paste text into the query box (e.g. an error message).

Perhaps (at least in some cases) this would be better treated like
spelling correction - with a "did you mean".

> (on the left, the query given to qp.parse_query -> on the right, a 
> clean-up version of query->get_description) :
> 
> - xapian NOT (lucene OR zebra -> xapian OR not OR lucene OR or OR zebra
> Brackets are unmatched, which (in my opinion), should result in a 'bad 
> query' exception.
> [More unmatched bracket cases]

In these cases, I think we shouldn't be reparsing at all by default.

> - "hospit*" -> hospit
> Wildcard does not seem to be allowed in a phrase query and the wildcard 
> is removed, resulting in a single word query 'hospit' which matches no 
> documents (whereas I have a lot of documents about hospitals, 
> hospitalization and so on!).
> In that case, I would prefer a "not implemented" exception rather than 
> letting the user think we have nothing about that subject...

I think you got what you asked for here - quoting a single term is taken
to mean "literally", so for example stopwording and stemming don't apply.

> Would it be possible to add options to the query_parser so the user can 
> choose if she wants a tolerant parser or not?
> Perhaps it could be a bitfield, something like
> qp.report_errors(UNMATCHED_BRACKETS | UNMATCHED_QUOTES | UNKNOWN_FIELDS 
> | ...)
> with the default being no error report at all to keep intact the actual 
> behavior (although it actually reports some errors like bad operators 
> usage : a and and b).

I think it would be better to just make the reparsing configurable.  And
to sort it out to not reparse on mismatched brackets by default (I had
a quick look, but it's trickier than it initially appears).

> Would it be possible to have something like qp->add_operator(OP_OR, 
> 'ou') in the API?

Yes, that certainly ought to be possible.

> Also: QueryParser has flag FLAG_BOOLEAN_ANY_CASE to allow boolean 
> operators in any case.
> Would it be possible to have something similar for the prefix names 
> (title:xx, Title:xx, TITLE:xx would all be recognized)?

Perhaps they should always be.  For boolean operators, there's the issue
of them appearing in normal text, but that's not really an issue here.

> Re-reading my mail, it's rather a big wish list than 'some remarks' ;-)
> Please forgive me for being so long and for asking so much!

It would probably be best to file wishlist bugs for them (ideally
separate ones rather than a monster list in one bug).  If you want
to work on them yourself and submit patches, that would be even better!

Cheers,
    Olly