[Xapian-tickets] [Xapian] #448: Allow usage of custom stemmers
Xapian
nobody at xapian.org
Tue Apr 13 15:10:40 BST 2010
#448: Allow usage of custom stemmers
-------------------------+--------------------------------------------------
 Reporter:  esizikov     |       Owner:  olly
     Type:  enhancement  |      Status:  reopened
 Priority:  normal       |   Milestone:  1.2.x
Component:  Library API  |     Version:  1.0.17
 Severity:  normal       |  Resolution:
 Keywords:               |   Blockedby:
 Platform:  All          |    Blocking:
-------------------------+--------------------------------------------------
Comment(by esizikov):
I've attached a patch which makes {{{StemImplementation}}} a SWIG
"director" class, so that it can be subclassed and its methods overridden
in scripting languages.
Issues:
1. I had to change the {{{get_description()}}} declaration from {{{const
char * get_description() const = 0;}}} to {{{const std::string
get_description() const = 0;}}}; approval is needed for this.
2. I've missed something related to threading: the process always aborts
with {{{Fatal Python error: PyEval_SaveThread: NULL tstate}}} just before
the Python script terminates.
Apart from these two points it works as a proof of concept: I'm now able
to use a custom stemmer from a Python script:
{{{
#!python
# -*- coding: utf-8 -*-
import sys
# Use the locally built (patched) bindings rather than any installed copy.
sys.path.insert(0, '/home/esizikov/svn/xapian/build/xapian-bindings/python/xapian')
sys.path.insert(0, '/home/esizikov/svn/xapian/build/xapian-bindings/python/modern')
import xapian
import hunspell

class HunspellStemmer(xapian.StemImplementation):
    def __init__(self, lang):
        super(HunspellStemmer, self).__init__()
        self._h = hunspell.HunSpell('/usr/share/myspell/%s.dic' % lang,
                                    '/usr/share/myspell/%s.aff' % lang)
        self._enc = self._h.get_dic_encoding()

    def __call__(self, s):
        # Transcode the UTF-8 input to the dictionary encoding, stem it,
        # and decode the first candidate stem.
        return self._h.stem(unicode(s, 'utf-8').encode(self._enc))[0].decode(self._enc)

def main():
    text = 'платья из золота на продажу'
    stem_impl = HunspellStemmer('ru_RU')
    stem = xapian.Stem(stem_impl)
    print stem('платья')

    # Index the text with the custom stemmer.
    doc = xapian.Document()
    generator = xapian.TermGenerator()
    generator.set_document(doc)
    generator.set_stemmer(stem)
    generator.index_text(text)
    ti = doc.termlist_begin()
    print ti.get_term()

    # Parse the same text as a query, stemming every term.
    query_parser = xapian.QueryParser()
    query_parser.set_stemmer(stem)
    query_parser.set_stemming_strategy(xapian.QueryParser.STEM_ALL)
    for term in query_parser.parse_query(text):
        print term,
    print

if __name__ == '__main__':
    main()
}}}
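For anyone who wants to try the director class without Hunspell installed, a smaller stand-in stemmer might look like the sketch below. It is only an illustration under stated assumptions: {{{SUFFIXES}}}, {{{strip_suffix()}}} and {{{SuffixStemmer}}} are hypothetical names that are not part of Xapian or the patch, the suffix list is a toy and not a real stemming algorithm, and the subclass is only defined if the loaded bindings actually expose {{{StemImplementation}}}. It is written for Python 3, unlike the script above, and how the bindings pass the word to {{{__call__}}} (bytes or text) is an assumption, so both are accepted.

```python
# Sketch only: a dependency-free stand-in for the Hunspell-backed stemmer.
# SUFFIXES, strip_suffix() and SuffixStemmer are hypothetical names.
SUFFIXES = ("ing", "ed", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    if isinstance(word, bytes):
        # Assumption: a word handed over as bytes is UTF-8 encoded.
        word = word.decode("utf-8")
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


try:
    import xapian
except ImportError:
    xapian = None  # the bindings are optional for this sketch

if xapian is not None and hasattr(xapian, "StemImplementation"):

    class SuffixStemmer(xapian.StemImplementation):
        """Assumes StemImplementation is exposed as a SWIG director."""

        def __init__(self):
            super(SuffixStemmer, self).__init__()

        def __call__(self, word):
            # Called from the C++ side for every word to be stemmed.
            return strip_suffix(word)

        def get_description(self):
            # Assumes the std::string-returning variant from issue 1.
            return "suffix-stripper"
```

With the patched bindings, an instance would be handed over in the same way as in the script above, i.e. {{{xapian.Stem(SuffixStemmer())}}}. Keeping the stemming logic in a plain function means it can be checked on its own, separately from the threading problem described in issue 2.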
--
Ticket URL: <http://trac.xapian.org/ticket/448#comment:23>
Xapian <http://xapian.org/>