[Xapian-discuss] Stemmer Modifications

Olly Betts olly at survex.com
Thu Oct 27 10:01:08 BST 2005


On Tue, Oct 25, 2005 at 11:47:08AM -0400, Mike Boone wrote:
> I'm finally getting back to this topic after some distractions. I was
> reading the Snowball documentation here:
> 
> http://snowball.tartarus.org/algorithms/english/stemmer.html
> 
> Which has a section on "exceptional forms." That looks like what I want
> to do, but that language isn't familiar. I looked over the source code
> in xapian-core/languages/snowball_english.cc and the stemmer code looks
> pretty cryptic.

That's generated code, which is why it's hard to follow.  The
Snowball-to-C converter doesn't treat readability of the generated code
as important!

> Is it possible to put my exception word list into that code?

It's probably workable as a local patch, but it's not something we could
really contemplate including in the mainstream releases.  Actually, you
might find it easier to tap into Xapian::Stem::operator() in
api/omstem.cc - that's human written code and is where we call the
Snowball generated code from.
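
As a rough illustration of the "check an exception list first, then fall
back to the normal stemmer" idea, here is a minimal standalone sketch.  It
is not Xapian code: the ExceptionStemmer class, the StemFn type and the
toy_stem() function are all hypothetical names invented for the example,
and toy_stem() merely stands in for the call into the Snowball generated
code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical wrapper (not part of Xapian): consults an exception table
// before delegating to an underlying stemming function.
class ExceptionStemmer {
  public:
    using StemFn = std::function<std::string(const std::string&)>;

    ExceptionStemmer(StemFn fallback,
                     std::unordered_map<std::string, std::string> exceptions)
        : fallback_(std::move(fallback)), exceptions_(std::move(exceptions)) {
        assert(fallback_ && "fallback stemmer must be callable");
    }

    // Return the listed stem if the word is an exception; otherwise
    // hand the word to the fallback stemmer unchanged.
    std::string operator()(const std::string& word) const {
        auto it = exceptions_.find(word);
        if (it != exceptions_.end()) return it->second;
        return fallback_(word);
    }

  private:
    StemFn fallback_;
    std::unordered_map<std::string, std::string> exceptions_;
};

// Toy stand-in for the real stemmer (an assumption for this example):
// it only strips a single trailing "s".
inline std::string toy_stem(const std::string& word) {
    if (word.size() > 1 && word.back() == 's')
        return word.substr(0, word.size() - 1);
    return word;
}
```

Constructing it as ExceptionStemmer stem(toy_stem, {{"skies", "sky"}})
makes stem("skies") come from the table, while any unlisted word goes
through toy_stem().  In a local patch the same lookup would sit just before
the existing call into the generated stemmer, with the table loaded from
wherever suits you.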

Or you could modify the Snowball source for the stemmer, then regenerate
the C code in snowball_english.cc (it's really just C, but we compile it
as C++ because I found that causes fewer headaches).  The Snowball
language takes a little getting used to, but is fairly elegant once you
get the hang of it.

> That seems clearer in my mind than subclassing the Stem class and
> then exposing that subclass to PHP through the SWIG bindings (maybe
> that's easier than it sounds?).

I don't think SWIG's PHP bindings support what they call "directors" yet
(the feature which lets a class in the target language override C++
virtual methods), so no, you probably can't currently subclass in PHP.
Someone has recently picked up the reins as SWIG PHP maintainer, so it's
possible this might be implemented sometime (assuming PHP permits it).

Cheers,
    Olly


