[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

Wed Jun 6 02:29:30 BST 2007

On Tue, Jun 05, 2007 at 02:37:27PM -0700, Kevin Duraj wrote:
> I am looking for a Chinese, Japanese and Korean tokenizer that can
> be used to tokenize terms for CJK languages. I am not very familiar
> with these languages; however, I think that they can contain one
> or more words in one symbol, which makes it more difficult to tokenize
> them into searchable terms.

I've not investigated Japanese much or Korean at all, but I know a
little about Chinese.

Chinese "characters" are themselves words, but many words are formed
from multiple characters.  For example, the Chinese capital Beijing is
formed from two characters (which literally mean something like "North
Capital").

The difficulty is that Chinese text is usually written without any
indication of how the characters group into words, so you need an
algorithm to identify the groups if you want to index them as terms.
I understand that's quite a hard problem.
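
To give a feel for why grouping is hard, here is a rough sketch of one
simple approach, greedy longest-match against a word list.  This is only
an illustration: the function name, the toy dictionary and the max_len
limit are all made up for the example, and a real segmenter would need a
large dictionary and something smarter than greedy matching.

```python
def segment_max_match(text, dictionary, max_len=4):
    """Greedy forward longest-match segmentation (illustrative only).

    At each position, take the longest dictionary word that starts
    there; fall back to a single character if nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, just for this example.
TOY_DICTIONARY = {"北京", "大学", "北京大学", "大学生"}

print(segment_max_match("北京", TOY_DICTIONARY))        # ['北京']
print(segment_max_match("北京大学生", TOY_DICTIONARY))  # ['北京大学', '生']
```

The second result shows the problem: the greedy match grabs the first
four characters as one word and leaves the last character stranded, even
though the dictionary would also allow the grouping 北京 + 大学生.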

However, perhaps you don't need to do that.  You could just index each
symbol as a term and use phrase searching, or something like it, to
match multi-character words.
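
As a minimal sketch of that idea (plain Python, not tied to Xapian), the
code below emits one term per character along with its position, and
then checks a phrase by looking for the characters at consecutive
positions.  The function names and the single Unicode range tested are
assumptions for the example; real text would need more ranges covered.
With Xapian I'd expect the (term, position) pairs to be added with
Document.add_posting() and searched with an OP_PHRASE query, but I
haven't tried that here.

```python
def is_cjk_ideograph(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def character_postings(text):
    """Return (term, position) pairs, one per CJK character.

    Positions count every character in the text, so punctuation or
    other non-CJK characters leave a gap and break phrase adjacency.
    """
    return [(ch, pos) for pos, ch in enumerate(text, start=1)
            if is_cjk_ideograph(ch)]


def phrase_match(postings, phrase):
    """True if the characters of phrase occur at consecutive positions."""
    if not phrase:
        return False
    positions = {}
    for term, pos in postings:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + k in positions.get(ch, ()) for k, ch in enumerate(phrase))
        for start in positions.get(phrase[0], ())
    )


postings = character_postings("我爱北京")
print(phrase_match(postings, "北京"))  # True
print(phrase_match(postings, "京北"))  # False
```

The phrase check is what stops a search for Beijing from matching a
document which merely contains both characters somewhere.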

> Lucene has CJK Tokenizer ... and I am looking around if there is some
> open source that we could use with Xapian.
> 
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html

That page doesn't provide much information, but if you can find the
source code, you could analyse the algorithm used and, if it's any good,
implement it for use with Xapian.
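
For comparison, one scheme I've seen used for CJK text is to index
overlapping pairs of adjacent characters rather than single characters.
I haven't checked whether that is what the Lucene tokenizer actually
does, so treat the sketch below as an assumption about the general
technique rather than a description of their code.  The function name is
made up for the example.

```python
def overlapping_bigrams(text):
    """Split a run of CJK characters into overlapping two-character terms.

    A run of one character is returned as a single term, and an empty
    run gives no terms.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(overlapping_bigrams("北京大学"))  # ['北京', '京大', '大学']
```

This needs no dictionary, at the cost of also indexing pairs which
aren't really words (the middle term above straddles two words).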

Cheers,
    Olly