[Xapian-devel] GSoC - Improve Japanese Support

Julia Wilson jlwilson at brandeis.edu
Fri Mar 30 02:40:52 BST 2012


>
> > The suggested project description mentions switching to a
> > language-specific segmentation algorithm; since a particular
> > algorithm isn't specified I'm guessing that the evaluation and
> > selection of an algorithm would necessarily be part of the
> > project? I've worked with Japanese text mostly in Python and so
> > I've used TinySegmenter in Python
> > <https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py>
>
> Wow, that certainly lives up to its name.  How effective is it?
>

It's a nifty little thing, for what it is. It breaks up most ordinary
text (e.g. news articles) with reasonable accuracy, and it's
ridiculously easy to use. It tends to have trouble with long compound
words, unusual conjunctions, and the like; the lack of dictionary
support means it gets greedy at times, so it's definitely not ideal
where precision is a top priority. Here's a bit more information on it:

http://lilyx.net/tinysegmenter-in-python/
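
For concreteness, here's roughly how I've been calling it (a minimal
sketch; it assumes the tinysegmenter module from the link above is on
the path, and I believe the method in that port is called tokenize):

    # -*- coding: utf-8 -*-
    from tinysegmenter import TinySegmenter

    segmenter = TinySegmenter()
    # No dictionary involved: segmentation is driven entirely by
    # per-character features learned from a tagged corpus, which is
    # why it's tiny - and why it can be greedy on compounds.
    print(segmenter.tokenize(u'私の名前は中野です'))
    # -> 私 / の / 名前 / は / 中野 / です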


> > when I needed segmentation; I've run across a couple of other
> > options that were available in C++ but I haven't really had
> > occasion to use them. So basically, I've dealt with Japanese
> > segmenters before, but am interested to know if you had anything
> > specific in mind.
>
> We don't have anything particularly in mind, though mecab has been
> mentioned both last year and this:
>
> http://code.google.com/p/mecab/
>
> Evaluation and selection could certainly be part of the project.
>
> There's a plan to try to get to the point where we can relicense
> Xapian (probably as MIT/X), so segmentation libraries with a liberal
> licence (such as MIT/X or new BSD) or perhaps LGPL would be better.
>

The others I've run across are ChaSen and JUMAN, but from a quick look
MeCab does seem to have the most people saying good things about it.
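
For comparison, MeCab is just as easy to drive from Python once a
dictionary is installed (a minimal sketch assuming the mecab-python
bindings and a UTF-8 system dictionary; the C++ API is very similar):

    # -*- coding: utf-8 -*-
    import MeCab

    # "-Owakati" selects wakati-gaki output: the input echoed back
    # with word boundaries marked by spaces, which is essentially
    # what a search tokenizer needs.
    tagger = MeCab.Tagger('-Owakati')
    print(tagger.parse('私の名前は中野です'))
    # -> 私 の 名前 は 中野 です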

>
> > I've been looking at the code in xapian-core/languages and
> > xapian-core/queryparser to get an idea of how other languages are
> > implemented and used. Am I correct in assuming that the ultimate
> > goal is to deprecate the n-gram based tokenizer in favor of
> > individual ways of handling Japanese and Korean (and presumably
> > Chinese)? That is, is the idea to make language support entirely
> > language-specific, or would there still be kind of a generic "CJK
> > Stuff" class as well? Also, is there anything else beyond what I
> > mentioned that I should make a point of looking at in the code
> > base to better understand what would need to be done?
>
> It probably would make sense to deprecate the n-gram approach once we
> have support for segmenters for all the languages which need it.
>
> There was a project implementing a Chinese segmentation algorithm in
> GSoC last year, which isn't merged yet, and also a patch to support a
> third party segmenter, which is why there's no corresponding "Chinese"
> idea this year - we need to consolidate what we already have there
> first really.
>
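
That makes sense. As a toy illustration (just a sketch, not Xapian's
actual tokenizer) of why I'd expect a proper segmenter to beat n-grams
on precision: character bigrams happily produce "words" that straddle
real word boundaries.

    # -*- coding: utf-8 -*-
    def bigrams(text):
        # Every overlapping two-character window - roughly what an
        # n-gram tokenizer emits for a run of Japanese text.
        return [text[i:i + 2] for i in range(len(text) - 1)]

    print(bigrams(u'東京都に住む'))  # "lives in Tokyo Metropolis"
    # -> 東京 / 京都 / 都に / に住 / 住む
    # The spurious 京都 (Kyoto) inside 東京都 (Tokyo Metropolis) is
    # exactly the kind of false match a dictionary-based segmenter
    # avoids.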

I'll spend some time looking at the Chinese algorithm and the
third-party segmenter. I looked over the records of the Chinese
segmentation project; do you know of any documentation or other
information about the algorithm (or rather algorithms, as I noticed
that a couple were tried out) that was implemented? I'm wondering
whether there were any particular features or other selection criteria
I should keep in mind.

Thanks,
Julia