[Xapian-tickets] [Xapian] #448: Allow usage of custom stemmers

Xapian nobody at xapian.org
Tue Apr 13 15:10:40 BST 2010


#448: Allow usage of custom stemmers
-------------------------+--------------------------------------------------
 Reporter:  esizikov     |        Owner:  olly    
     Type:  enhancement  |       Status:  reopened
 Priority:  normal       |    Milestone:  1.2.x   
Component:  Library API  |      Version:  1.0.17  
 Severity:  normal       |   Resolution:          
 Keywords:               |    Blockedby:          
 Platform:  All          |     Blocking:          
-------------------------+--------------------------------------------------

Comment(by esizikov):

 I've attached a patch which makes {{{StemImplementation}}} a SWIG
 "director" class, which allows its methods to be overridden in scripting
 languages.

 Issues:

  1. I had to change the {{{get_description()}}} declaration from {{{const char *
 get_description() const = 0;}}} to {{{const std::string get_description()
 const = 0;}}}; this needs approval.

  2. I seem to have missed something related to threading: I always get the
 {{{Fatal Python error: PyEval_SaveThread: NULL tstate}}} error, and the
 process is aborted just before the Python script terminates.

 Apart from these two points, it works (as a proof of concept): I can now
 use a custom stemmer from a Python script:
 {{{
 #!python
 # -*- coding: utf-8 -*-

 import sys
 sys.path.insert(0, '/home/esizikov/svn/xapian/build/xapian-bindings/python/xapian')
 sys.path.insert(0, '/home/esizikov/svn/xapian/build/xapian-bindings/python/modern')

 import xapian
 import hunspell

 class HunspellStemmer(xapian.StemImplementation):
     def __init__(self, lang):
         super(HunspellStemmer, self).__init__()
         self._h = hunspell.HunSpell('/usr/share/myspell/%s.dic' % lang,
                                     '/usr/share/myspell/%s.aff' % lang)
         self._enc = self._h.get_dic_encoding()

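     def __call__(self, s):
         # Convert the UTF-8 input to the dictionary's encoding before stemming,
         # then decode the first suggested stem.
         word = unicode(s, 'utf-8').encode(self._enc)
         return self._h.stem(word)[0].decode(self._enc)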

 def main():
     text = 'платья из золота на продажу'

     stem_impl = HunspellStemmer('ru_RU')
     stem = xapian.Stem(stem_impl)

     print stem('платья')

     doc = xapian.Document()

     generator = xapian.TermGenerator()
     generator.set_document(doc)
     generator.set_stemmer(stem)
     generator.index_text(text)

     ti = doc.termlist_begin()
     print ti.get_term()

     query_parser = xapian.QueryParser()
     query_parser.set_stemmer(stem)
     query_parser.set_stemming_strategy(xapian.QueryParser.STEM_ALL)
     for term in query_parser.parse_query(text):
         print term,
     print

 if __name__ == '__main__':
     main()
 }}}
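
 The shape of the object that the script above passes to {{{xapian.Stem}}}
 can be sketched without Xapian or hunspell installed. The class below is a
 hypothetical stand-in: {{{SuffixStemmer}}} and its suffix list are made up
 for illustration, and it does not derive from
 {{{xapian.StemImplementation}}}. It only shows a callable that maps a word
 to its stem, plus a {{{get_description()}}} method.

```python
# Hypothetical stand-in: runs without xapian or hunspell.


class SuffixStemmer(object):
    """Toy stemmer with the two methods discussed in this ticket:
    __call__ (the stemming itself) and get_description()."""

    def __init__(self, suffixes):
        # Try longer suffixes first so that 'ing' is checked before 's'.
        self._suffixes = sorted(suffixes, key=len, reverse=True)

    def __call__(self, word):
        # Strip the first matching suffix, but never return an empty stem.
        for suffix in self._suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[:-len(suffix)]
        return word

    def get_description(self):
        return 'SuffixStemmer(%s)' % ', '.join(self._suffixes)


stem = SuffixStemmer(['ing', 'ed', 's'])
stems = [stem(w) for w in ['stemming', 'indexed', 'terms', 'query']]
```

 With the patched bindings, the same methods would presumably be defined on
 a subclass of {{{xapian.StemImplementation}}}, as {{{HunspellStemmer}}}
 does above for {{{__call__}}}.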

-- 
Ticket URL: <http://trac.xapian.org/ticket/448#comment:23>
Xapian <http://xapian.org/>