<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>> The suggested project description mentions switching to a language-specific<br>
> segmentation algorithm; since a particular algorithm isn't specified I'm<br>
> guessing that the evaluation and selection of an algorithm would<br>
> necessarily be part of the project? I've worked with Japanese text mostly<br>
> in Python and so I've used TinySegmenter in<br>
</div>> Python<<a href="https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py" target="_blank">https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py</a>><br>
<br>
Wow, that certainly lives up to its name. How effective is it?<br></blockquote><div><br></div><div>It's a nifty little thing, for what it is. It segments
ordinary text (e.g. news articles) with reasonable accuracy, and it's
ridiculously easy to use. It tends to have trouble with long compound words,
unusual conjunctions, etc.; without dictionary support it does end up being greedy at times. It's definitely
not ideal where precision is a top priority. Here's a bit more information on
it.</div><div><br></div><div><a href="http://lilyx.net/tinysegmenter-in-python/" target="_blank">http://lilyx.net/tinysegmenter-in-python/</a></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><br>
> when<br>
> I needed segmentation; I've run across a couple of other options that were<br>
> available in C++ but I haven't really had occasion to use them. So<br>
> basically, I've dealt with Japanese segmenters before, but am interested to<br>
> know if you had anything specific in mind.<br>
<br>
</div>We don't have anything particularly in mind, though mecab has been<br>
mentioned both last year and this:<br>
<br>
<a href="http://code.google.com/p/mecab/" target="_blank">http://code.google.com/p/mecab/</a><br>
<br>
Evaluation and selection could certainly be part of the project.<br>
<br>
There's a plan to try to get to the point where we can relicense<br>
Xapian (probably as MIT/X), so segmentation libraries with a liberal<br>
licence (such as MIT/X or new BSD) or perhaps LGPL would be better.<br></blockquote><div><br></div><div>The others that I've run across are ChaSen and JUMAN, but from a quick look it does seem that MeCab has the most people saying good things about it.</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><br>
> I've been looking at the code in xapian-core/languages and<br>
> xapian-core/queryparser to get an idea of how other languages are<br>
> implemented and used. Am I correct in assuming that the ultimate goal is to<br>
> deprecate the n-gram based tokenizer in favor of individual ways of<br>
> handling Japanese and Korean (and presumably Chinese)? That is, is the idea<br>
> to make language support entirely language-specific, or would there still<br>
> be kind of a generic "CJK Stuff" class as well? Also, is there anything<br>
> else beyond what I mentioned that I should make a point of looking at in<br>
> the code base to better understand what would need to be done?<br>
<br>
</div>It probably would make sense to deprecate the n-gram approach once we<br>
have support for segmenters for all the languages which need it.<br>
<br>
There was a project implementing a Chinese segmentation algorithm in<br>
GSoC last year, which isn't merged yet, and also a patch to support a<br>
third party segmenter, which is why there's no corresponding "Chinese"<br>
idea this year - we need to consolidate what we already have there<br>
first really.<br></blockquote><div> </div><div>I'll spend some time looking at the Chinese algorithm and the third party segmenter. I looked over the records of the Chinese segmentation project; do you know of any documentation or other information about the algorithm (or rather algorithms, as I noticed that a couple were tried out) that was implemented? I'm wondering if there were any particular features or other selection criteria I'd want to keep in mind.</div>
<div><br></div><div>Thanks,</div><div>Julia</div></div>