[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"

Eugene! esizikov at gmail.com
Sun Feb 14 15:06:38 GMT 2010


Hello Xapian developers!

First of all, I'd like to thank you guys for the Xapian project.
Great work! Xapian has decent performance and is very easy to
get started with.

However, it has some "missing" features I really need, the most
noticeable being dictionary-based stemming and spelling usable from
Python code.

The current Xapian::Stem code provides no such functionality even at
the C++ level, let alone in the SWIG bindings.

That is a no-go for me, thus I have to try to find some way to deal with it.

I'm not a big fan of SWIG and have very little knowledge of it, so I
decided not to go deep into the current Xapian SWIG bindings and
instead concentrate on developing a standalone extension which would
serve as a prototype or proof of concept for my work.

The main idea I got after looking into the Xapian C++ code for
Xapian::Stem is that it is mostly ready for using custom stemming
engines right now! All that is required is to give it a vtable!
The easiest way to do that (and the best way for C++ subclassing) is to
make the Xapian::Stem::~Stem() destructor virtual.
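
As a minimal sketch of that point: declaring the destructor virtual is enough to make a class polymorphic, which is what subclassing and SWIG directors rely on. The class names below (PlainStem, VirtualStem, DerivedStem) and the helper is_derived() are hypothetical stand-ins, not the real Xapian classes.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a class with a non-virtual destructor.
struct PlainStem {
    ~PlainStem() {}
};

// Hypothetical stand-in for the patched class: only the destructor is virtual.
struct VirtualStem {
    virtual ~VirtualStem() {}
};

// A subclass of the patched class.
struct DerivedStem : VirtualStem {};

// Without any virtual member the class is not polymorphic.
static_assert(!std::is_polymorphic<PlainStem>::value,
              "no virtual members");
// A virtual destructor alone makes the class polymorphic.
static_assert(std::is_polymorphic<VirtualStem>::value,
              "virtual destructor is enough");

// dynamic_cast only compiles for polymorphic types, so this helper
// would be rejected if VirtualStem had no virtual member.
inline bool is_derived(const VirtualStem &s) {
    return dynamic_cast<const DerivedStem *>(&s) != nullptr;
}
```

With these definitions, is_derived() returns true for a DerivedStem passed by base reference and false for a plain VirtualStem.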

Also, in my first attempt I tried making Stem::operator() virtual,
but that DIDN'T work for me because of C++ type casting (e.g. for
TermGenerator):
{{{
class XAPIAN_VISIBILITY_DEFAULT TermGenerator {
  public:
    ...
    /// Set the Xapian::Stem object to be used for generating stemmed terms.
    void set_stemmer(const Xapian::Stem & stemmer);
    ...
};

class MyStem : public Xapian::Stem {
    ...
    // Xapian::Stem has been patched to declare
    // virtual std::string operator()(const std::string &word) const;
    std::string operator()(const std::string &word) const;
};

MyStem stem("english");
Xapian::TermGenerator tg;

// C++ type cast issue: stem will be treated as Xapian::Stem and our
// overridden operator() won't be used.
tg.set_stemmer(stem);
}}}
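
The effect can be reproduced without Xapian. The sketch below assumes the receiving object keeps its own copy of the stemmer by value, so only the base-class part is copied and the override is lost. BaseStem, MyStem and Generator here are hypothetical stand-ins for the patched Xapian::Stem, the subclass and TermGenerator.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a Xapian::Stem patched to have a virtual operator().
class BaseStem {
  public:
    virtual ~BaseStem() {}
    virtual std::string operator()(const std::string &word) const {
        return word + "[base]";
    }
};

// Subclass overriding the stemming call.
class MyStem : public BaseStem {
  public:
    std::string operator()(const std::string &word) const override {
        return word + "[custom]";
    }
};

// Hypothetical stand-in for TermGenerator: it is assumed to store
// the stemmer by value rather than by pointer or reference.
class Generator {
    BaseStem stemmer;
  public:
    // Assigning through the base type copies only the BaseStem part;
    // the stored object stays a BaseStem.
    void set_stemmer(const BaseStem &s) { stemmer = s; }
    std::string stem(const std::string &word) const { return stemmer(word); }
};
```

Calling MyStem directly uses the override, but after set_stemmer() the Generator calls the base implementation.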

Then I noticed the presence of Xapian::Stem::Internal
reference-counted pointer and that led me to the working solution.
What I need is a 2-step thing:
 1. Subclass Xapian::Stem::Internal to use a dictionary-based stemmer
(Hunspell in my case) in exactly the same way as it is done for
different languages in the current code
 2. Subclass Xapian::Stem in order to create an instance of my own
implementation of Xapian::Stem::Internal
 3. profit

That is all I need, because even when my derived class is treated
as Xapian::Stem it is no longer a problem: it will
continue using the `internal' attribute, which does the actual work!
Using the derived class instance and refcounting it are supported by
the copy constructor and operator=() of Xapian::Stem:
{{{
Stem::Stem(const Stem & o) : internal(o.internal) { }

void
Stem::operator=(const Stem & o)
{
    internal = o.internal;
}
}}}
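
A self-contained sketch of this arrangement follows. It is not the Xapian code: StemInternal, DictInternal, DictStem and Generator are hypothetical names, std::shared_ptr substitutes for Xapian's own reference-counted pointer, and the one-word "dictionary" stands in for a real Hunspell lookup.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for Xapian::Stem::Internal.
class StemInternal {
  public:
    virtual ~StemInternal() {}
    virtual std::string operator()(const std::string &word) const = 0;
};

// Step 1: a dictionary-backed implementation (toy lookup, not Hunspell).
class DictInternal : public StemInternal {
  public:
    std::string operator()(const std::string &word) const override {
        return word == "mice" ? "mouse" : word;
    }
};

// Stand-in for Xapian::Stem: all work is delegated to `internal`.
class Stem {
  protected:
    std::shared_ptr<StemInternal> internal;  // shared, reference-counted
    explicit Stem(std::shared_ptr<StemInternal> p) : internal(std::move(p)) {}
  public:
    Stem() {}
    virtual ~Stem() {}
    // In this sketch an empty stemmer returns the word unchanged.
    std::string operator()(const std::string &word) const {
        return internal ? (*internal)(word) : word;
    }
};

// Step 2: the subclass only installs its own internal object.
class DictStem : public Stem {
  public:
    DictStem() : Stem(std::make_shared<DictInternal>()) {}
};

// Stand-in for TermGenerator, storing the stemmer by value.
class Generator {
    Stem stemmer;
  public:
    // The copy is a plain Stem, but it shares the same internal pointer.
    void set_stemmer(const Stem &s) { stemmer = s; }
    std::string stem(const std::string &word) const { return stemmer(word); }
};
```

Here the Generator holds a plain Stem after set_stemmer(), yet stem("mice") still goes through DictInternal, because the copied object shares the internal pointer.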

The last step was to point SWIG to apply its "director" feature to
Xapian::Stem, which now has a vtable. Voilà, I have my Hunspell
stemmer being used by Xapian during indexing and query parsing from my
Python code!

To share my research and to get the trivial patches incorporated
into the Xapian trunk I've created the ticket
http://trac.xapian.org/ticket/448. But it has been closed without
being thoroughly analysed, and I am now trying to convince you guys to
take a second look at it. I've attached to the ticket my minimal
patches for Xapian-core and Xapian-bindings and my research prototype
C++/SWIG extension for Python, which demonstrates the approach.

I could continue using the patched Xapian myself, but it would be
better to go upstream, as it:
 1. doesn't change the current architecture
 2. has no side effects for the current code and usage patterns
 3. is really trivial

Any comments?


