[Xapian-discuss] about stemming

Olly Betts olly at survex.com
Tue Apr 4 05:26:49 BST 2006


On Tue, Apr 04, 2006 at 09:08:40AM +0530, durga bidaye wrote:
> >>If you also index the unstemmed form of every term, you could transform
> >>each term T in the query into (T OR stem(T)).
> 
> On doing this, will I get results where a doc containing "T"
> will have a higher ranking than a doc containing "stem(T)"? No, I suppose?

Yes, because a document indexed by "T" will be indexed by "stem(T)"
too, so it matches both terms of the OR and gets weight from both.
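
To illustrate the point, here is a minimal sketch in plain Python.  It
does not use Xapian; toy_stem, index_terms, expand_query and score are
hypothetical helpers, and the "weight" is just a count of matched OR
alternatives rather than a real weighting scheme:

```python
# Hypothetical sketch: toy_stem and the scoring below are illustrative
# stand-ins, not Xapian's stemmer or its weighting scheme.

def toy_stem(word):
    """Very crude suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Index each word under both its unstemmed and its stemmed form."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)
        terms.add(toy_stem(word))
    return terms


def expand_query(term):
    """Turn query term T into the alternatives of (T OR stem(T))."""
    term = term.lower()
    return {term, toy_stem(term)}


def score(doc_terms, query_terms):
    """Count how many of the OR alternatives the document matches."""
    return len(doc_terms & query_terms)


docs = {
    "exact": index_terms("searching names"),   # contains T itself
    "related": index_terms("search names"),    # contains only stem(T)
}
query = expand_query("searching")
ranked = sorted(docs, key=lambda name: score(docs[name], query), reverse=True)
print(ranked)
```

The document containing the exact form matches both alternatives, the
other matches only the stem, so the exact match sorts first.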

> >>I'm not convinced it'll improve retrieval results though.  I'd suggest
> >>trying it with a quick prototype before investing a lot of time and
> >>energy into it.
> 
> I am working on a search engine where searching is done on a set of "names".
> So it makes sense to give a higher ranking to a name (result) which
> exactly matches the search query and a lower ranking to a name (result) which
> is similar to the search query.

If you're searching names, stemming is unlikely to be appropriate
(unless perhaps you're working in a language where names are inflected).

If you want to match names allowing for misspellings, then something
like soundex or metaphone is much more appropriate:

http://en.wikipedia.org/wiki/Phonetic_algorithm

But beware that most of these algorithms seem to have been developed to
handle names common in the USA...
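
As a rough illustration of what such an algorithm does, here is a
sketch of the classic American Soundex coding in plain Python.  The
function name soundex and the CODES table are my own naming, and this
is only one of several Soundex variants; names that produce the same
code would be treated as matching:

```python
# Sketch of American Soundex; one variant among several, shown only
# to illustrate phonetic matching of names.

CODES = {}
for digit, group in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                     "4": "l", "5": "mn", "6": "r"}.items():
    for letter in group:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code for name ("" if no letters)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


for name in ("John", "Jonny", "Johnathan"):
    print(name, soundex(name))
```

With this variant "John" and "Jonny" get the same code, while
"Johnathan" gets a different one, so a phonetic key alone would not
group all three.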

> Thus, suppose I search for "John": I should get results where docs containing
> "John" have a higher ranking and docs containing "Johnathan" or
> "Jonny" have a lower ranking.

But none of "Johnathan", "Jonny", or "John" stem to the same string...

Cheers,
    Olly


