[Xapian-discuss] Proper noun stemming

James Aylett james-xapian at tartarus.org
Thu Mar 27 12:26:52 GMT 2008


On Thu, Mar 27, 2008 at 12:05:05PM +0000, Colin Bell wrote:

> I was wondering if anyone had a solution for the following problem.
> 
> I use QueryParser to stem my documents before adding them to a
> database.

You may have more luck using the TermGenerator, which is intended for
use during indexing:
<http://xapian.org/docs/termgenerator.html> and
<http://xapian.org/docs/apidoc/html/classXapian_1_1TermGenerator.html>.

> During the stemming process I would like to find a way of keeping
> proper nouns that span two or more words together as a phrase.  For
> example "New York" or "Gordon Brown" or "Prime Minister" get split
> up. I see that STEM_SOME allows some operators, but I can't see how
> these might help in this situation.

What are you actually trying to achieve? Do you want the sentence "I
went to meet Gordon Brown" to generate the following terms (ignoring
all stemming):

I
went
to
meet
Gordon Brown

? That would be unusual, because (for instance) a search on 'Prime
Minister Brown' wouldn't pick the above up at all: the document would
contain no term 'Brown' on its own.

As one of the above documents says, the convention is to store
unstemmed forms with positional information, so the proximity of
'Gordon' to 'Brown' is retained in the database, and PHRASE and NEAR
searches will be able to take advantage of that. (So the search
'meeting "Gordon Brown"' should match the above well.)
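As a rough illustration of why positional information is enough, the
sketch below keeps single-word terms with their positions and answers
phrase and proximity questions from those positions. It is plain
Python, not Xapian's implementation; build_positions, phrase_match and
near_match are hypothetical names, and the window rule in near_match
is a simplifying assumption rather than Xapian's exact NEAR semantics.

```python
from collections import defaultdict


def build_positions(text):
    """Map each word to the list of (1-based) positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(text.split(), start=1):
        positions[word].append(pos)
    return positions


def phrase_match(positions, phrase):
    """True if the words of phrase occur at consecutive positions, in order."""
    words = phrase.split()
    if not words:
        return False
    return any(
        all(start + offset in positions.get(word, ())
            for offset, word in enumerate(words))
        for start in positions.get(words[0], ())
    )


def near_match(positions, a, b, window):
    """True if a and b occur within window positions of each other (assumed rule)."""
    return any(
        abs(pa - pb) <= window
        for pa in positions.get(a, ())
        for pb in positions.get(b, ())
    )


doc = build_positions("I went to meet Gordon Brown")

has_brown = "Brown" in doc                          # single-word search still works
is_phrase = phrase_match(doc, "Gordon Brown")       # adjacent and in order
is_reversed = phrase_match(doc, "Brown Gordon")     # wrong order, so no match
is_near = near_match(doc, "meet", "Brown", 2)       # positions 4 and 6
```

With this layout 'Brown' stays searchable by itself, while the phrase
"Gordon Brown" can still be distinguished from the two words merely
appearing somewhere in the same document.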

Of course, if you're rolling your own terms, you may choose to solve
these problems in another way. The conventions are helpful in many
common cases, though.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


