[Xapian-discuss] Question on how to handle some bad result from stem algorithm?

Bruce Zhang bruce.zhang at trustgo.com
Thu Oct 6 08:49:29 BST 2011


Sorry, I will do as you said and start a new thread.

Thank you for the reply. It is very useful information for me.

Bruce

On Wed, Oct 5, 2011 at 10:27 PM, Olly Betts <olly at survex.com> wrote:

> It's confusing to start an unrelated discussion by replying to an
> existing thread - better to send a new email.
>
> On Fri, Sep 30, 2011 at 04:25:13PM +0800, Bruce Zhang wrote:
> > When using the Stem library, it works well in most cases.
> >
> > However, we have also noticed some bad results caused by stemming. Some
> > examples:
> >
> > "community", "communication" and "communicator" all match each other in
> > searches, though we think they are not the same.
>
> It sounds like you're using the "porter" stemmer which conflates these
> three (to "commun").  Use "english" instead, which produces "communiti"
> for "community", and "communic" for the other two (which seems reasonable
> as they are closely related).  The "porter" stemmer is just there for
> people who really want the original version of Martin Porter's
> algorithm.
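
A minimal sketch of the comparison described above, in Python. It assumes the
Xapian Python bindings are installed and that xapian.Stem accepts the names
"porter" and "english" as used in this thread. The helpers conflation_groups,
xapian_stemmer and demo are hypothetical names for illustration, not part of
Xapian.

```python
from collections import defaultdict


def conflation_groups(words, stem):
    """Group words by the stem that `stem` (any callable str -> str) returns."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


def xapian_stemmer(language):
    """Wrap xapian.Stem as a str -> str callable (needs the Xapian bindings)."""
    import xapian  # imported lazily so conflation_groups works without Xapian

    stemmer = xapian.Stem(language)

    def stem(word):
        result = stemmer(word)
        # The Python 3 bindings return bytes; decode so the output is readable.
        return result.decode("utf-8") if isinstance(result, bytes) else result

    return stem


def demo():
    """Print which of the words from this thread share a stem, per stemmer."""
    words = ["community", "communication", "communicator",
             "anime", "animal", "animated"]
    for language in ("porter", "english"):
        print(language, conflation_groups(words, xapian_stemmer(language)))
```

Calling demo() prints, for each stemmer, which of the six words end up with
the same stem, so other problem words can be checked the same way before
settling on a stemmer.
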
>
> > "anime", "animal" and "animated" all match each other in searches.
>
> These three are still conflated by the "english" stemmer.  The first and
> last don't seem so bad ("anime" is a particular sort of "animated"
> film) but "animal" seems rather unhelpful.
>
> We just take the algorithms from the snowball project though, so that's
> the best place to report problematic cases:
>
> http://snowball.tartarus.org/
>
> > What are your thoughts? Is there a good way to avoid this?
> > Is there any other equivalent but simpler algorithm?
>
> The "english" algorithm is a better option than "porter".
>
> I've heard just performing the first steps of the Porter algorithm is
> pretty effective, but we don't have an implementation of that currently.
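
A rough sketch of what "the first steps" might look like, in Python. It
implements only step 1a of the original Porter algorithm (plural suffixes).
Stopping after step 1a is an assumption, not something stated in this thread,
and porter_step_1a is a hypothetical name, not a Xapian or Snowball function.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm to a lowercase word.

    Rules, tried in order (the first matching suffix wins):
        sses -> ss
        ies  -> i
        ss   -> ss   (unchanged)
        s    -> ""   (dropped)
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word
```

With only this step, "community", "communication" and "communicator" keep
distinct forms, as do "anime", "animal" and "animated". The cost is that
related forms such as "animate" and "animated" are no longer conflated.
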
>
> Cheers,
>     Olly
>

