[Xapian-devel] GSoC - Improve Japanese Support

Olly Betts olly at survex.com
Fri Mar 30 03:17:15 BST 2012


On Thu, Mar 29, 2012 at 09:40:52PM -0400, Julia Wilson wrote:
> > > https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py
> >
> > Wow, that certainly lives up to its name.  How effective is it?
> 
> It's kind of a nifty little thing, for what it is. It's really effective at
> breaking up a lot of average text (e.g. news articles) with reasonable
> accuracy, and obviously it's ridiculously easy to use. It tends to have
> trouble with long compound words, unusual conjunctions, etc. -  the lack of
> dictionary support means it does end up being greedy at times. It's
> definitely not ideal where precision is a top priority. Here's a bit more
> information on it.
> 
> http://lilyx.net/tinysegmenter-in-python/
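For anyone unfamiliar with what dictionary support buys you here: the
classic dictionary-based approach is greedy longest-match ("MaxMatch").
A minimal sketch, with a hypothetical mini-dictionary (English
stand-ins used so the behaviour is easy to follow):

```python
# Greedy longest-match (MaxMatch) segmentation sketch.
# The dictionary below is a tiny hypothetical example, not real data.

def max_match(text, dictionary):
    """Segment text by repeatedly taking the longest dictionary entry
    that prefixes the remaining input; characters not covered by any
    entry fall back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a one-character token
        # Try progressively shorter substrings starting at i.
        for j in range(len(text), i + 1, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

dictionary = {"search", "engine", "searchengine", "fast"}
print(max_match("searchenginefast", dictionary))
# -> ['searchengine', 'fast']
```

Note the flip side: longest-match happily swallows "searchengine" as
one token even when "search" + "engine" would be the better split,
which is the same kind of over-greedy behaviour you describe, just
driven by the dictionary instead of by its absence.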

If there's a reason why people might want to use this (e.g. perhaps it
is much faster, or better for some uses) we could offer it as an
alternative.  We provide alternative stemming algorithms for some
languages, which is conceptually similar.

> > We don't have anything particularly in mind, though mecab has been
> > mentioned both last year and this:
> >
> > http://code.google.com/p/mecab/
> 
> The others that I've run across are Chasen and Juman, but from a quick look
> it does seem that Mecab has the most people saying good things about it.

It seems Mecab is a successor of sorts to Chasen.  At least one of the
Mecab developers has an @chasen.org address for example.

> > There was a project implementing a Chinese segmentation algorithm in
> > GSoC last year, which isn't merged yet, and also a patch to support a
> > third party segmenter, which is why there's no corresponding "Chinese"
> > idea this year - we need to consolidate what we already have there
> > first really.
> 
> I'll spend some time looking at the Chinese algorithm and the third party
> segmenter. I looked over the records of the Chinese segmentation project;
> do you know of any documentation or other information about the algorithm
> (or rather algorithms, as I noticed that a couple were tried out) that was
> implemented? I'm wondering if there were any particular features or other
> selection criteria I'd want to keep in mind.

There's some information here, if you haven't already read it:

http://trac.xapian.org/wiki/GSoC2011/ChineseSegmentationAnalysis

I don't think there's much beyond that.

There didn't seem to be an existing segmentation library in a suitable
language and with a suitable licence, which is why we went the route
of trying to implement from scratch.  It's possible there was something
that was overlooked though - I've noticed these segmentation libraries
seem to often only have websites in their native language, which is
understandable, but unhelpful if you don't understand it!

But this meant the project didn't get to the stage of integrating this
into Xapian, which is a shame.  Hopefully it'll get done eventually,
but if there's a good existing segmentation library you can use, that
should help avoid this happening again.
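On the integration side, the usual approach is to pre-segment the text
and then hand the space-joined segments to Xapian's TermGenerator, which
already splits on whitespace.  A rough sketch of the pre-segmentation
step - segment() here is just a hypothetical stand-in for a real
library such as TinySegmenter or Mecab:

```python
# Sketch: pre-segment CJK text so a whitespace-aware indexer
# (e.g. Xapian's TermGenerator) treats each segment as a term.

def segment(text):
    # Placeholder segmenter: a real implementation would call a
    # segmentation library; here we just pretend every two
    # characters form one token.
    return [text[i:i + 2] for i in range(0, len(text), 2)]

def prepare_for_indexing(text):
    """Return a space-joined string of segments, ready to pass to
    something like TermGenerator::index_text()."""
    return " ".join(segment(text))

print(prepare_for_indexing("abcdef"))
# -> "ab cd ef"
```

The nice property of this shape is that the segmenter stays a
swappable component - exactly the "alternative algorithms per
language" arrangement mentioned above for stemmers.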

The third party segmenter patch is:

http://thread.gmane.org/gmane.comp.search.xapian.general/9052

With some previous discussion at:

http://thread.gmane.org/gmane.comp.search.xapian.general/9047

So that library is BSD now, though it probably had an unclear licence at
this time last year when the Chinese Segmentation GSoC project proposal
was being written.

Cheers,
    Olly
