[Xapian-discuss] Question on how to handle some bad result from stem algorithm?
bruce.zhang at trustgo.com
Thu Oct 6 08:49:29 BST 2011
Sorry, I will do as you said and start a new thread.
Thank you for the reply. It is very useful information for me.
On Wed, Oct 5, 2011 at 10:27 PM, Olly Betts <olly at survex.com> wrote:
> It's confusing to start an unrelated discussion by replying to an
> existing thread - better to send a new email.
> On Fri, Sep 30, 2011 at 04:25:13PM +0800, Bruce Zhang wrote:
> > The Stem library works well in most cases, but we have also noticed
> > some bad results caused by stemming. Some examples:
> > Community, communication, and communicator all match each other in
> > searches, though we consider them different words.
> It sounds like you're using the "porter" stemmer which conflates these
> three (to "commun"). Use "english" instead, which produces "communiti"
> for "community", and "communic" for the other two (which seems reasonable
> as they are closely related). The "porter" stemmer is just there for
> people who really want the original version of Martin Porter's algorithm.
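
To make the difference concrete, here is a minimal sketch that groups the three words by the stems quoted in the reply above. The stem table is copied from the email rather than computed, so the sketch runs without Xapian installed; the names `STEMS` and `conflation_groups` are hypothetical and are not part of any Xapian API.

```python
from collections import defaultdict

# Stems exactly as quoted in the reply above. A hard-coded table stands in
# for a real stemmer call (assumption: no Xapian bindings are available).
STEMS = {
    "porter": {
        "community": "commun",
        "communication": "commun",
        "communicator": "commun",
    },
    "english": {
        "community": "communiti",
        "communication": "communic",
        "communicator": "communic",
    },
}


def conflation_groups(stemmer_name):
    """Return a dict mapping each stem to the sorted words that share it."""
    groups = defaultdict(list)
    for word, stem in STEMS[stemmer_name].items():
        groups[stem].append(word)
    return {stem: sorted(words) for stem, words in groups.items()}


if __name__ == "__main__":
    for name in ("porter", "english"):
        print(name, conflation_groups(name))
```

Words that land in the same group are the ones that can be found by searching for each other: one group of three under "porter", and two groups under "english" with "community" on its own.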
> > Anime, animal, and animated all match each other in searches.
> These three are still conflated by the "english" stemmer. The first and
> last don't seem so bad ("anime" is a particular sort of "animated"
> film) but "animal" seems rather unhelpful.
> We just take the algorithms from the snowball project though, so that's
> the best place to report problematic cases:
> > What are your thoughts? Is there a good way to avoid this?
> > Is there an equivalent but simpler algorithm?
> The "english" algorithm is a better option than "porter".
> I've heard just performing the first steps of the Porter algorithm is
> pretty effective, but we don't have an implementation of that currently.
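
As an illustration of what "the first steps" might look like, here is a rough sketch of step 1a of the original Porter algorithm, which only strips plural endings. The reply does not say which steps are meant, so stopping at step 1a is an assumption, and the function name `porter_step_1a` is hypothetical; Xapian does not provide this.

```python
def porter_step_1a(word):
    """Apply only step 1a of the original Porter algorithm (plural removal).

    Rules are tried in order and the first matching suffix wins:
        SSES -> SS, IES -> I, SS -> SS, S -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


if __name__ == "__main__":
    for w in ("caresses", "ponies", "cats", "community", "animal", "anime"):
        print(w, "->", porter_step_1a(w))
```

None of the problem words from this thread ends in "s", so this step leaves them unchanged and distinct. The trade-off is that far fewer related forms are matched than with a full stemmer.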