[Xapian-discuss] Stemming non-protein
Olly Betts
olly at survex.com
Fri Mar 31 17:21:01 BST 2006
On Thu, Mar 30, 2006 at 11:41:17AM +0100, James Aylett wrote:
> On Wed, Mar 29, 2006 at 12:18:45PM -0500, Peter Masiar wrote:
>
> > Say, my user queries for "protein". Document might say "non-protein".
> > Will xapian match it? Is it possible to disable such matches?
>
> Currently (I believe - Olly may need to correct me) what will happen
> is that both "non" and "protein" will be generated as terms (well,
> they'll be stemmed too), but someone searching for "non-protein" will
> generate a PHRASE search "non" PHRASE(n) "protein" where n is
> something appropriate (probably 2?).
Exactly.
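To make that concrete, here is a toy model in Python of the behaviour James describes. It is only a sketch and does not use Xapian's API: split_terms, matches_term and matches_phrase are hypothetical helpers, stemming is left out, and the default phrase window is simply the number of terms in the phrase, which is an assumption.

```python
import re


def split_terms(text):
    """Lowercase the text and split on anything that is not a letter or
    digit, so a hyphen separates terms just as whitespace does."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_term(doc_text, term):
    """True if the single term occurs anywhere in the document."""
    return term.lower() in split_terms(doc_text)


def matches_phrase(doc_text, phrase, window=None):
    """True if the phrase's terms occur in order within `window` positions.

    The default window (one position per phrase term) is an assumption
    made for this sketch.
    """
    doc = split_terms(doc_text)
    wanted = split_terms(phrase)
    if not wanted:
        return False
    if window is None:
        window = len(wanted)
    for start, term in enumerate(doc):
        if term != wanted[0]:
            continue
        pos = start
        found_all = True
        for nxt in wanted[1:]:
            try:
                # Earliest occurrence of the next term after the current one.
                pos = doc.index(nxt, pos + 1)
            except ValueError:
                found_all = False
                break
        if found_all and pos - start < window:
            return True
    return False
```

With this model, matches_term("A non-protein nitrogen source", "protein") is true, because the document was indexed as the separate terms "non" and "protein", and a query for "non-protein" is treated as a phrase of those two terms.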
I don't think it's clear-cut that a document which contains the phrase
"non-protein" is not relevant to the query `protein'. To take a more
everyday example, a news article which contains "anti-war protests" and
"Iraq" is probably relevant to the query `iraq war'.
But if you want to disable this, you could always write a tokeniser
which doesn't split certain hyphenated terms, but instead treats each of
them as a single term. You'd need to tokenise queries the same way, so
that the query terms match the indexed ones.
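As a rough illustration of that idea, here is a minimal sketch in plain Python. It isn't Xapian code: tokenise and KEEP_WHOLE are hypothetical names, the list of hyphenated terms to keep whole is something you would have to supply yourself, and stemming is ignored.

```python
import re

# Hypothetical list of hyphenated terms that should stay as one term.
KEEP_WHOLE = {"non-protein"}

# A run of letters/digits, optionally followed by hyphen-joined runs.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenise(text, keep_whole=KEEP_WHOLE):
    """Return the list of terms for `text`.

    Hyphenated tokens listed in `keep_whole` are kept as a single term;
    every other hyphenated token is split at its hyphens.
    """
    terms = []
    for token in TOKEN_RE.findall(text.lower()):
        if token in keep_whole:
            terms.append(token)
        else:
            terms.extend(token.split("-"))
    return terms
```

Using the same function on documents and on queries means a document containing "non-protein" yields the single term "non-protein", so the query `protein' no longer matches it, while "anti-war" is still split into "anti" and "war".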
Cheers,
Olly