[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"

Kevin Duraj kevinduraj at gmail.com
Sun Feb 14 20:17:58 GMT 2010


Python is an extremely slow interpreted programming language and is
definitely not suitable for running high-performance search engines.
You mention that you need to modify Xapian for your work project, and
thus your modification would affect all the millions of Xapian users
around the world. You did not mention how much of your salary you want
to share with us, but I assume that is nothing. You just want to add
some overhead to Xapian and make it slower because you need it for your
work. Interesting, but you definitely made me laugh this Saturday.

Kevin Duraj

On Sun, Feb 14, 2010 at 7:06 AM, Eugene! <esizikov at gmail.com> wrote:
> Hello Xapian developers!
> First of all, I'd like to thank you guys for the Xapian project.
> Great work! Xapian has decent performance and is very easy to
> get started with.
> However, it is missing some features I really need, and the
> most noticeable is dictionary-based stemming and spelling usable
> from Python code.
> The current code of Xapian::Stem doesn't provide such
> functionality even at the C++ level, let alone in the SWIG
> bindings.
> That is a no-go for me, so I have to find some way to deal with it.
> I'm not a big fan of SWIG and have very little knowledge of it,
> so I decided not to go deep into the current Xapian SWIG bindings
> and to concentrate on developing a standalone extension that would be
> a prototype or proof of concept for my work.
> The main idea I got after looking into the Xapian C++ code for
> Xapian::Stem is that it is mostly ready for using custom stemming
> engines right now! All that is required is to give it a vtable!
> The easiest way to do that (and the best way for C++ subclassing) is
> to make the Xapian::Stem::~Stem() destructor virtual.
> Also, in my first attempt I tried to make Stem::operator() virtual,
> but that DIDN'T work for me because of C++ type casting (e.g. for
> TermGenerator):
> {{{
> class TermGenerator {
>  public:
>    ...
>    /// Set the Xapian::Stem object to be used for generating stemmed terms.
>    void set_stemmer(const Xapian::Stem & stemmer);
>    ...
> };
> class MyStem : public Xapian::Stem {
>    ...
>    // We've patched Xapian::Stem so that this operator() is virtual.
>    std::string operator()(const std::string &word) const;
> };
> MyStem stem("english");
> Xapian::TermGenerator tg;
> // C++ type cast issue: stem will be treated as a plain Xapian::Stem,
> // so our overridden operator() won't be used.
> tg.set_stemmer(stem);
> }}}
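
The effect described in the snippet above can be reproduced in isolation. The following is only a minimal sketch with hypothetical stand-in classes: `Stem`, `MyStem` and `TermGenerator` here are toy types, not the Xapian ones, and it assumes the consumer keeps its own by-value copy of the stemmer, in which case the derived part is sliced away and the override is never called.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT the Xapian classes.
class Stem {
public:
    virtual ~Stem() {}
    virtual std::string operator()(const std::string &word) const {
        return "base:" + word;
    }
};

class MyStem : public Stem {
public:
    std::string operator()(const std::string &word) const override {
        return "custom:" + word;
    }
};

// Assumption: the consumer stores its own copy of the stemmer by value.
class TermGenerator {
    Stem stemmer;
public:
    // Copy-assignment copies only the Stem part; stemmer stays a plain Stem.
    void set_stemmer(const Stem &s) { stemmer = s; }
    std::string stem(const std::string &word) const { return stemmer(word); }
};

// What the generator produces after being handed a MyStem.
std::string stem_via_generator(const std::string &word) {
    MyStem my;
    TermGenerator tg;
    tg.set_stemmer(my);
    return tg.stem(word);
}

// What a direct call through a base-class reference produces.
std::string stem_via_reference(const std::string &word) {
    MyStem my;
    const Stem &ref = my;
    return ref(word);
}

// Exercises both paths; call it from main() to run the sketch.
void demo_slicing() {
    assert(stem_via_generator("cats") == "base:cats");    // override lost in the copy
    assert(stem_via_reference("cats") == "custom:cats");  // override kept via reference
}
```

Virtual dispatch works through a reference or pointer, but not through a copied base-class object, which is why making operator() virtual alone does not help here.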
> Then I noticed the Xapian::Stem::Internal
> reference-counted pointer, and that led me to a working solution.
> What I need is a 2-step thing:
>  1. Subclass Xapian::Stem::Internal to use a dictionary-based stemmer
> (Hunspell in my case) in exactly the same way as is done for the
> different languages in the current code
>  2. Subclass Xapian::Stem in order to create an instance of my own
> implementation of Xapian::Stem::Internal
>  3. Profit
> That is all I need, because even when my derived class is treated
> as Xapian::Stem, that is no longer a problem: it will
> continue using the `internal' attribute, which does the actual work!
> Using the derived class instance and refcounting it is supported by
> the copy constructor and operator=() of Xapian::Stem:
> {{{
> Stem::Stem(const Stem & o) : internal(o.internal) { }
> void
> Stem::operator=(const Stem & o)
> {
>    internal = o.internal;
> }
> }}}
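
Why the shared `internal' pointer survives the copy can be sketched without Xapian at all. This is a hedged illustration, not the actual patch: every class below is a hypothetical stand-in, `std::shared_ptr` is assumed here in place of Xapian's own reference-counted pointer, and `ToyInternal` is an invented backend that merely strips a trailing "s".

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration only; these are NOT the Xapian classes.
class StemInternal {
public:
    virtual ~StemInternal() {}
    virtual std::string stem(const std::string &word) const = 0;
};

// Invented toy backend standing in for a dictionary-based stemmer.
class ToyInternal : public StemInternal {
public:
    std::string stem(const std::string &word) const override {
        if (word.size() > 1 && word.back() == 's')
            return word.substr(0, word.size() - 1);
        return word;
    }
};

class Stem {
protected:
    // Assumption: std::shared_ptr plays the role of the ref-counted pointer.
    std::shared_ptr<StemInternal> internal;
    explicit Stem(std::shared_ptr<StemInternal> i) : internal(std::move(i)) {}
public:
    Stem() {}
    virtual ~Stem() {}
    // Non-virtual: the real work is delegated to whatever `internal` points at.
    std::string operator()(const std::string &word) const {
        return internal ? internal->stem(word) : word;
    }
};

// The subclass only chooses which internal implementation to install.
class MyStem : public Stem {
public:
    MyStem() : Stem(std::make_shared<ToyInternal>()) {}
};

class TermGenerator {
    Stem stemmer;  // by-value copy, so the MyStem part is sliced away
public:
    void set_stemmer(const Stem &s) { stemmer = s; }
    std::string stem(const std::string &word) const { return stemmer(word); }
};

// The sliced copy still shares the custom internal, even after `my` is gone.
std::string stem_after_slicing(const std::string &word) {
    TermGenerator tg;
    {
        MyStem my;
        tg.set_stemmer(my);
    }
    return tg.stem(word);
}

// Call it from main() to run the sketch.
void demo_internal() {
    assert(stem_after_slicing("cats") == "cat");
    assert(stem_after_slicing("dog") == "dog");
}
```

In this sketch the copy loses the derived type but keeps the pointer, so the custom behaviour travels with the object wherever it is copied.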
> The last step was to point SWIG to apply its "director" feature to
> Xapian::Stem, which now has a vtable. Voila, I have my Hunspell
> stemmer being used by Xapian during indexing and query parsing from my
> Python code!
> To share my research and to get the trivial patches incorporated
> into the Xapian trunk I've created the ticket
> http://trac.xapian.org/ticket/448. But it has been closed without
> being thoroughly analysed, and I am now trying to convince you guys to
> take a second look at it. I've attached to the ticket my minimal
> patches for Xapian-core and Xapian-bindings and my research prototype
> C++/SWIG extension for Python, which demonstrates the approach.
> I could continue to use the patched Xapian myself, but it would be
> better to go upstream if it:
>  1. doesn't change the current architecture
>  2. has no side effects for the current code and usage patterns
>  3. is really trivial
> Any comments?
> _______________________________________________
> Xapian-devel mailing list
> Xapian-devel at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-devel
