[Xapian-devel] GSoC - Improve Japanese Support

Olly Betts olly at survex.com
Thu Mar 29 03:51:39 BST 2012


On Wed, Mar 28, 2012 at 09:51:50PM -0400, Julia Wilson wrote:
> I know Japanese - I'm not a native speaker by any means, but I'm pretty
> good - and I'm really interested in the specifics of how dealing with
> Japanese text differs from dealing with English and other languages that
> use roman scripts. Currently I'm doing some research on improving Japanese
> to English translation algorithms using linguistic data (topicalization and
> pronoun resolution, specifically).

It sounds like your knowledge of Japanese wouldn't be an issue.

> The suggested project description mentions switching to a language-specific
> segmentation algorithm; since a particular algorithm isn't specified I'm
> guessing that the evaluation and selection of an algorithm would
> necessarily be part of the project? I've worked with Japanese text mostly
> in Python and so I've used TinySegmenter in
> Python<https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py>

Wow, that certainly lives up to its name.  How effective is it?

> when
> I needed segmentation; I've run across a couple of other options that were
> available in C++ but I haven't really had occasion to use them. So
> basically, I've dealt with Japanese segmenters before, but am interested to
> know if you had anything specific in mind.

We don't have anything particularly in mind, though mecab has been
mentioned both last year and this:

http://code.google.com/p/mecab/

Evaluation and selection could certainly be part of the project.
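One common way to compare candidate segmenters is boundary-based precision/recall against a gold-standard segmented corpus. Here is a minimal sketch of that metric; the example sentence and both segmentations are purely illustrative, not taken from any real corpus:

```python
# Hypothetical sketch: scoring a candidate segmentation against a
# gold-standard one using word-boundary precision, recall and F1.

def boundaries(words):
    """Return the set of character offsets at which word boundaries fall."""
    offsets, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        offsets.add(pos)
    return offsets

def score(candidate, gold):
    """Boundary precision, recall and F1 for one segmented sentence."""
    cand, ref = boundaries(candidate), boundaries(gold)
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    tp = len(cand & ref)             # boundaries both agree on
    precision = tp / len(cand)
    recall = tp / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Same sentence segmented two ways (illustrative only):
gold = ["私", "は", "日本語", "を", "勉強", "し", "ます"]
candidate = ["私", "は", "日本", "語", "を", "勉強", "します"]
print(score(candidate, gold))
```

Averaging these scores over a test corpus gives a single number per candidate segmenter, which makes the selection part of the project concrete.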

There's a plan to try to get to the point where we can relicense
Xapian (probably as MIT/X), so segmentation libraries with a liberal
licence (such as MIT/X or new BSD) or perhaps LGPL would be better.

> I've been looking at the code in xapian-core/languages and
> xapian-core/queryparser to get an idea of how other languages are
> implemented and used. Am I correct in assuming that the ultimate goal is to
> deprecate the n-gram based tokenizer in favor of individual ways of
> handling Japanese and Korean (and presumably Chinese)? That is, is the idea
> to make language support entirely language-specific, or would there still
> be kind of a generic "CJK Stuff" class as well? Also, is there anything
> else beyond what I mentioned that I should make a point of looking at in
> the code base to better understand what would need to be done?

It probably would make sense to deprecate the n-gram approach once we
have support for segmenters for all the languages which need it.
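For context, the n-gram approach indexes overlapping character n-grams rather than words, roughly like this (a simplified sketch of the general technique, not Xapian's actual implementation):

```python
# Simplified sketch of CJK n-gram tokenization (unigrams + bigrams).
# Illustrates the general approach, not Xapian's real tokenizer.

def cjk_ngrams(text):
    """Yield each character and each overlapping character bigram."""
    for i, ch in enumerate(text):
        yield ch
        if i + 1 < len(text):
            yield text[i] + text[i + 1]

print(list(cjk_ngrams("日本語")))
# ['日', '日本', '本', '本語', '語']
```

A proper segmenter would instead emit actual words (here just `日本語`), producing fewer, more precise index terms, which is the motivation for eventually deprecating the n-gram fallback.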

There was a project implementing a Chinese segmentation algorithm in
GSoC last year, which isn't merged yet, and also a patch to support a
third-party segmenter, which is why there's no corresponding "Chinese"
idea this year - we really need to consolidate what we already have
there first.

I understand segmentation is also relevant to old-style Vietnamese,
but modern Vietnamese uses a Latin alphabet:

http://en.wikipedia.org/wiki/Vietnamese_language

I don't know if there are any other languages.

Cheers,
    Olly
