[Xapian-discuss] Japanese / UTF-8 support

Fri Aug 11 05:42:01 BST 2006

> Ultimately it would be nice to support this kind of thing. The first
> step is UTF-8 support, which Olly has been working on.

Omega is producing excellent results for French/Italian/German/Spanish
against UTF-8 HTML files.

> > Lots of work to make this sort of thing work automatically. If anyone
> > knows about word splitting for CJK, that'd be a huge help ...

Chinese is easy, at least to a first approximation: index one term per
character. I suspect a basic Unicode lookup table (e.g. "this range is
Chinese characters") and a few simple rules would go a long way.
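
To make the lookup-table idea concrete, here is a rough sketch in Python
(not Xapian code). The names CJK_RANGES, is_cjk and split_terms are made
up for illustration, and the table only covers the two main ideograph
blocks, which is an assumption on my part. Kana, Hangul and the rarer
extension blocks would need their own rules.

```python
# Assumed lookup table: only the two main ideograph blocks are listed.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
)


def is_cjk(ch):
    """Return True if the character falls in one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def split_terms(text):
    """Emit each ideograph as its own term; group other alphanumerics."""
    terms = []
    word = []
    for ch in text:
        if is_cjk(ch):
            # An ideograph ends any pending word and is a term by itself.
            if word:
                terms.append("".join(word))
                word = []
            terms.append(ch)
        elif ch.isalnum():
            word.append(ch.lower())
        else:
            # Whitespace and punctuation only separate terms.
            if word:
                terms.append("".join(word))
                word = []
    if word:
        terms.append("".join(word))
    return terms


print(split_terms("Xapian 中文搜索 test"))
```

Treating every character as a term over-generates, since many Chinese
words span two or more characters, but it needs no dictionary, and
phrase searching can recover multi-character words at query time.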

However, if serious expertise is desired, I suspect Toshiyuki Kimura
(toshi AT apache.org) might be willing to answer questions, or will be
able to refer you to someone who can.

> And what about automatic language detection?
> That would help me also tremendously as I have about 60% English, 20%
> German, 5% French, 5% Korean, 5% Japanese, and 5% Italian.

Not sure why this is useful, except perhaps for choosing a stemmer. Even
then you will be in trouble with mixed-language documents. Seems a little
outside the scope of Xapian, at least from my newbie perspective.
Anyway, there are a couple of n-gram based language detectors in open
source land which work fairly well, but the error rate is noticeable.
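
For anyone unfamiliar with the approach, here is a toy sketch in Python
of one common n-gram scheme: build a ranked list of the most frequent
character trigrams per language, then pick the language whose ranking
is closest to the document's. The rank-distance measure and all the
names below (trigram_profile, out_of_place, guess_language, SAMPLES) are
my own illustration, not the internals of any particular detector, and
the one-sentence training samples are far too small for real use.

```python
from collections import Counter


def trigram_profile(text, top=300):
    """Return the most frequent character trigrams, most common first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen trigrams get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the key of the profile with the smallest distance."""
    doc = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical training samples; a real detector needs far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

print(guess_language(SAMPLES["de"], profiles))
```

The weak spots are exactly the ones mentioned above: short texts give
unreliable rankings, and a mixed-language document still gets a single
answer.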

> Automatic charset detection would of course help also. Aren't there any
> libraries out there?

Not that I know about, except for more probabilistic n-gram stuff
that's even flakier than language detection. I thought UTF-8 solved
this problem. Documents not using Unicode? The horror!!