<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>&gt; The suggested project description mentions switching to a language-specific<br>


&gt; segmentation algorithm; since a particular algorithm isn&#39;t specified I&#39;m<br>

&gt; guessing that the evaluation and selection of an algorithm would<br>

&gt; necessarily be part of the project? I&#39;ve worked with Japanese text mostly<br>

&gt; in Python and so I&#39;ve used TinySegmenter in<br>

</div>&gt; Python&lt;<a href="https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py" target="_blank">https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py</a>&gt;<br>


<br>

Wow, that certainly lives up to its name.  How effective is it?<br></blockquote><div><br></div><div>It&#39;s a nifty little thing, for what it is. It segments typical text (e.g. news articles) with reasonable accuracy, and it&#39;s very easy to use. It tends to have trouble with long compound words, unusual conjunctions, etc.; the lack of dictionary support means it can be greedy at times. It&#39;s definitely not ideal where precision is a top priority. Here&#39;s a bit more information on it.</div><div><br></div><div><a href="http://lilyx.net/tinysegmenter-in-python/" target="_blank">http://lilyx.net/tinysegmenter-in-python/</a> </div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div><br>

&gt; when<br>

&gt; I needed segmentation; I&#39;ve run across a couple of other options that were<br>

&gt; available in C++ but I haven&#39;t really had occasion to use them. So<br>

&gt; basically, I&#39;ve dealt with Japanese segmenters before, but am interested to<br>

&gt; know if you had anything specific in mind.<br>

<br>

</div>We don&#39;t have anything particularly in mind, though mecab has been<br>

mentioned both last year and this:<br>

<br>

<a href="http://code.google.com/p/mecab/" target="_blank">http://code.google.com/p/mecab/</a><br>

<br>

Evaluation and selection could certainly be part of the project.<br>

<br>

There&#39;s a plan to try to get to the point where we can relicense<br>

Xapian (probably as MIT/X), so segmentation libraries with a liberal<br>

licence (such as MIT/X or new BSD) or perhaps LGPL would be better.<br></blockquote><div><br></div><div>The others that I&#39;ve run across are ChaSen and JUMAN, but from a quick look it does seem that MeCab has the most people saying good things about it.</div>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div><br>

&gt; I&#39;ve been looking at the code in xapian-core/languages and<br>

&gt; xapian-core/queryparser to get an idea of how other languages are<br>

&gt; implemented and used. Am I correct in assuming that the ultimate goal is to<br>

&gt; deprecate the n-gram based tokenizer in favor of individual ways of<br>

&gt; handling Japanese and Korean (and presumably Chinese)? That is, is the idea<br>

&gt; to make language support entirely language-specific, or would there still<br>

&gt; be kind of a generic &quot;CJK Stuff&quot; class as well? Also, is there anything<br>

&gt; else beyond what I mentioned that I should make a point of looking at in<br>

&gt; the code base to better understand what would need to be done?<br>

<br>

</div>It probably would make sense to deprecate the n-gram approach once we<br>

have support for segmenters for all the languages which need it.<br>

<br>

There was a project implementing a Chinese segmentation algorithm in<br>

GSoC last year, which isn&#39;t merged yet, and also a patch to support a<br>

third party segmenter, which is why there&#39;s no corresponding &quot;Chinese&quot;<br>

idea this year - we need to consolidate what we already have there<br>

first really.<br></blockquote><div> </div><div>I&#39;ll spend some time looking at the Chinese algorithm and the third-party segmenter. I looked over the records of the Chinese segmentation project; do you know of any documentation or other information about the algorithm that was implemented (or rather algorithms, as I noticed that a couple were tried out)? I&#39;m wondering if there are any particular features or other selection criteria I&#39;d want to keep in mind.</div>
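<div><br></div><div>As a point of comparison for the n-gram approach discussed above, here is a minimal Python sketch of overlapping character bigrams, which is roughly the kind of output a dictionary-free n-gram tokenizer produces. The function name char_bigrams is hypothetical and this is not Xapian&#39;s actual implementation; it is only meant to illustrate why a language-specific segmenter should give cleaner terms.</div>
<pre>
```python
def char_bigrams(text):
    """Return overlapping two-character terms from a run of text.

    Illustrative only: a hypothetical helper, not Xapian's tokenizer.
    """
    # Drop whitespace so that no bigram spans a space.
    chars = [c for c in text if not c.isspace()]
    # A single character has no bigram, so emit it unchanged.
    if len(chars) == 1:
        return chars
    # Pair every character with its right-hand neighbour.
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(char_bigrams("東京都に住む"))
# ['東京', '京都', '都に', 'に住', '住む']
```
</pre>
<div>Note that the bigrams for 東京都 (Tokyo Metropolis) include 京都 (Kyoto), so a search for Kyoto would match text that is only about Tokyo. A segmenter that finds real word boundaries should avoid that kind of spurious term.</div>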

<div><br></div><div>Thanks,</div><div>Julia</div></div>