[Xapian-discuss] Phrase Search on Stemmed Data

Sat Jan 12 15:53:33 GMT 2008

On Sat, Jan 12, 2008 at 03:48:49PM +0100, dd wrote:
> Just an example where neither strategy works. If I lowercase the 
> whole querystring before parsing, then the terms all get stemmed, which 
> leads to my desired behaviour.

In general, QueryParser is driven by heuristics, so it's a bad idea to
try to manipulate the string the user enters.  Forcing it to lowercase
is less likely to trip you up than some other things people try though,
so as a short-term workaround, it's not too bad.
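
As a rough illustration of that workaround, here is a minimal sketch which
lowercases the ASCII letters in a query string before it would be handed to
QueryParser.  The helper name lowercase_query() is made up for this example
and is not part of Xapian, and it deliberately leaves non-ASCII bytes alone.
Note that it would also lowercase boolean operators such as AND and OR,
which QueryParser only recognises in upper case.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of Xapian): lowercase the ASCII letters
// in a query string, leaving all other bytes untouched.
std::string lowercase_query(const std::string& query) {
    std::string result(query);
    for (char& ch : result) {
        // Go via unsigned char: passing a negative char to std::tolower
        // is undefined behaviour.
        unsigned char uc = static_cast<unsigned char>(ch);
        if (uc < 0x80) ch = static_cast<char>(std::tolower(uc));
    }
    return result;
}
```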

> If a querystring with a phrase occurs 
> now, I won't get a match if words with leading capital 
> letters occurred (during indexing).

This sounds like a bug - capitalisation shouldn't make a difference in a
phrase.  If I test this, I find that both these cases parse the same:

    "Xapian QueryParser" parses queries
    "xapian queryparser" parses queries

Both parse as:

Xapian::Query(((xapian:(pos=1) PHRASE 2 queryparser:(pos=2)) OR Zpars:(pos=3) OR Zqueri:(pos=4)))

Here's the patch to queryparsertest I used:

http://oligarchy.co.uk/xapian/patches/qp-phrase-case-test.patch

Or perhaps the problem is with the indexing.  Are you using TermGenerator?

> >Not at the moment, but we should add a way, and it's not hard to do.
> >Could you please file a wishlist bug for this?
> >  
> Sure, should I create an entry in the bugtracker?

Yes.

> I've looked at queryparser_internal.cc; maybe you can point me to the 
> location where I can change the source
> (I found something like should_stem, where the decision is made whether a 
> word should be stemmed or not, but I'm no C++ expert ;-) )

"queryparser_internal.cc" is generated from "queryparser.lemony" by the
lemon parser generator (which is similar to bison/yacc, but trivially
produces reentrant parsers, and is generally easier to work with), so
"queryparser.lemony" is the file we ultimately need to patch.  But this
part is copied verbatim into the generated file, so patching the generated
file is probably OK for now, and a patch for that should be easy to transfer.

The should_stem() function is indeed what checks the initial character
of a term.
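
To make the heuristic concrete, here is a simplified sketch of that kind of
check.  It is not the actual code from queryparser.lemony: the name
should_stem_sketch() is hypothetical, and it only looks at ASCII, whereas
the real function may well handle Unicode.  The assumption, based on the
behaviour described above, is that a term which doesn't start with a
lowercase letter is left unstemmed.

```cpp
#include <string>

// Hypothetical, ASCII-only approximation of the check discussed above;
// NOT the real should_stem() from Xapian.
// Assumption: only terms starting with a lowercase letter get stemmed, so
// capitalised words (likely proper nouns) and numbers are left as typed.
bool should_stem_sketch(const std::string& term) {
    if (term.empty()) return false;
    char first = term[0];
    return first >= 'a' && first <= 'z';
}
```

Under that assumption, "Xapian" would be kept as typed while "parses" would
be passed to the stemmer, which is consistent with the Zpars term in the
parsed query above.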

If you're able to come up with a patch, that would be great.  Some
automated testcases (tests/queryparsertest.cc) would be even better.

Cheers,
    Olly