<div dir="ltr">Hello.<div><br></div><div>I'm interested in creating a Hebrew stemmer to use with Xapian. Hebrew is a complicated language to stem, as it is built on the Semitic "root" system rather than on prefixes and suffixes alone, and has many irregularities in accidence (morphology).</div>
<div><br></div><div>Fortunately, two bright fellows from the Technion in Israel have already created a Hebrew morphological analyzer as part of their Hebrew spellchecker project (hspell), which is the de facto standard free Hebrew spellchecker (used in Gmail etc.). This analyzer is heavily lexicon-based, and is therefore difficult to express as a Snowball program.</div>
<div><br></div><div>hspell offers a convenient API: give it a word, get back a list of possible stems. (Yes, Hebrew is very ambiguous, too, so a single form may have two or more possible stems, and I mean completely different words, not variations.) I therefore want to use libhspell from Xapian without going through Snowball at all.</div>
<div><br></div><div>I took a quick look at xapian-core, and I see that stem.cc seems to have some accommodation for an abstraction of a stemming algorithm. On the other hand, get_available_languages() returns LANGSTRING, which is generated in the allsnowballheaders.h file, and that file assumes Snowball is used for all stemmers.</div>
<div><br></div><div>So I'm a little confused about this. Can anyone shed light on the status of generic stemming -- is this half-written support?</div><div><br></div><div>It seems to me I could instantiate an ExternalHebrewStemmer of my own making, calling libhspell instead of Snowball. What do you think?</div>
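<div><br></div><div>To make the idea concrete, here is a rough sketch of the shape I have in mind, written in Python only for brevity. Everything in it is hypothetical: the class and method names are mine, the analyzer callback merely stands in for a libhspell binding, the toy lexicon uses made-up placeholder entries, and none of it reflects the actual Xapian or libhspell API. The one design question it tries to surface is what to do when the analyzer returns several stems but the stemmer is expected to return a single term.</div>
<pre>
```python
# Hypothetical sketch: none of these names come from Xapian or libhspell.


class ExternalHebrewStemmer:
    """Callable stemmer that delegates to an external analyzer."""

    def __init__(self, analyze):
        # `analyze` stands in for a libhspell binding: it maps a word
        # to a list of candidate stems (possibly empty).
        self._analyze = analyze

    def candidates(self, word):
        """Return every candidate stem, or [word] if the analyzer knows none."""
        stems = self._analyze(word)
        return list(stems) if stems else [word]

    def __call__(self, word):
        """Return a single stem for the word."""
        # Assumed policy: take the first candidate. A real implementation
        # might rank the candidates or index all of them instead.
        return self.candidates(word)[0]


# Toy lexicon standing in for the analyzer (placeholder entries only).
TOY_LEXICON = {
    "form_a": ["stem_x", "stem_y"],  # an ambiguous surface form
    "form_b": ["stem_z"],
}


def toy_analyze(word):
    """Look the word up in the toy lexicon; unknown words yield no stems."""
    return TOY_LEXICON.get(word, [])


stemmer = ExternalHebrewStemmer(toy_analyze)
```
</pre>
<div>With this shape, stemmer("form_a") picks the first candidate, while stemmer.candidates("form_a") exposes the full list in case the indexer wants to add every possible stem as a term. Unknown words fall through unchanged.</div>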
<div><br></div><div>Thanks,</div><div><br></div><div> Asaf Bartov, Wikimedia Israel<br clear="all">-- <br>Asaf Bartov &lt;<a href="mailto:asaf.bartov@gmail.com">asaf.bartov@gmail.com</a>&gt;<br>
</div></div>