[Xapian-discuss] Stemming non-protein

Olly Betts olly at survex.com
Fri Mar 31 17:21:01 BST 2006


On Thu, Mar 30, 2006 at 11:41:17AM +0100, James Aylett wrote:
> On Wed, Mar 29, 2006 at 12:18:45PM -0500, Peter Masiar wrote:
> 
> > Say, my user queries for "protein". Document might say "non-protein". 
> > Will xapian match it? Is it possible to disable such matches?
> 
> Currently (I believe - Olly may need to correct me) what will happen
> is that both "non" and "protein" will be generated as terms (well,
> they'll be stemmed too), but someone searching for "non-protein" will
> generate a PHRASE search "non" PHRASE(n) "protein" where n is
> something appropriate (probably 2?).

Exactly.
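
As a rough illustration of the behaviour James describes, here is a toy
sketch in plain Python. It does not use the Xapian API: index_positions()
and matches_phrase() are hypothetical helpers, stemming is left out, and
the phrase check assumes strict adjacency rather than Xapian's own
window rule.

```python
import re


def index_positions(text):
    """Map each lower-cased term to the list of positions where it occurs.

    Anything that is not a letter or digit (including '-') separates terms,
    so "non-protein" yields the two terms "non" and "protein".
    """
    positions = {}
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        positions.setdefault(word, []).append(pos)
    return positions


def matches_phrase(positions, first, second):
    """True if `second` occurs at the position immediately after `first`."""
    following = set(positions.get(second, []))
    return any(p + 1 in following for p in positions.get(first, []))


doc = index_positions("A non-protein nitrogen source")
print("protein" in doc)                       # plain term query
print(matches_phrase(doc, "non", "protein"))  # query for "non-protein"
```

Because "protein" is indexed as a term in its own right, a plain query
for it finds the document, which is the match Peter is asking about.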

I don't think it's clear-cut that a document which contains the phrase
"non-protein" is not relevant to the query `protein'.  Using a more
everyday example, a news article which contains "anti-war protests" and
"Iraq" is probably relevant to the query `iraq war'.

But if you want to disable this, you could always write a tokeniser
which doesn't split certain hyphenated terms, but instead treats them as
a single term.  And similarly at query time.
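
A minimal sketch of such a tokeniser in plain Python, again not using
the Xapian API. KEEP_WHOLE and tokenise() are hypothetical names chosen
for illustration, the list of protected words is an assumption, and
stemming is omitted.

```python
import re

# Hypothetical list of hyphenated words to keep as single terms.
KEEP_WHOLE = {"non-protein"}

# A word is a run of letters/digits, optionally joined by single hyphens.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenise(text, keep_whole=KEEP_WHOLE):
    """Split text into lower-cased terms.

    Hyphenated words listed in `keep_whole` stay as one term; every other
    hyphenated word is split into its parts, as in the default behaviour.
    """
    terms = []
    for match in WORD_RE.finditer(text):
        word = match.group(0).lower()
        if word in keep_whole:
            terms.append(word)
        else:
            terms.extend(word.split("-"))
    return terms


print(tokenise("Non-protein nitrogen"))
print(tokenise("anti-war protests in Iraq"))
```

The same function would have to be run over the query text as well, so
that a search for "non-protein" produces the single term rather than a
phrase, and a search for "protein" no longer matches it.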

Cheers,
    Olly


