[Xapian-discuss] Chinese segmentation

Thu Apr 21 10:59:40 BST 2011

Hi,

Big5 was designed for zh_TW (Traditional Chinese), while GBK was designed
for zh_CN (Simplified Chinese).  It is better to convert everything to
Unicode before segmentation.

To convert from Big5 or GBK to UTF-8, iconv will do the job.
If you use Perl, you may also consider Encode::HanConvert:
http://search.cpan.org/dist/Encode-HanConvert/
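
As a rough illustration of the conversion step, here is a minimal sketch in
Python (my choice of language for the example, not something from this
thread). It assumes you already know the source encoding of each input; the
function name to_utf8 is made up for illustration. The "big5" and "gbk"
codec names are part of Python's standard library.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# The same two characters stored in two different legacy encodings.
big5_bytes = "中文".encode("big5")
gbk_bytes = "中文".encode("gbk")

# The raw bytes differ, but after conversion both are identical UTF-8.
assert big5_bytes != gbk_bytes
assert to_utf8(big5_bytes, "big5") == to_utf8(gbk_bytes, "gbk")
```

Once everything is UTF-8, the segmentation code only has to handle one
encoding.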

If you code in C++, you may consider cjk-tokenizer.
http://code.google.com/p/cjk-tokenizer/

Language detection at the character level is fairly easy for Chinese. You
just need to check the range of the characters. Detection for Japanese is
slightly more complicated because Japanese mixes Kanji, Hiragana, and
Katakana, but if you add some predefined rules, it is not so complicated.
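
The range check could be sketched roughly as follows in Python. This is only
an illustration under simplifying assumptions: it looks at the basic CJK
Unified Ideographs block (U+4E00 to U+9FFF) and the Hiragana and Katakana
blocks (U+3040 to U+30FF), and it uses a single rule, "any kana means
Japanese". The function names are made up for the example.

```python
def is_han(ch: str) -> bool:
    """True if ch is in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_kana(ch: str) -> bool:
    """True if ch is in the Hiragana (U+3040-U+309F) or
    Katakana (U+30A0-U+30FF) block; the two blocks are adjacent."""
    return "\u3040" <= ch <= "\u30ff"


def guess_script(text: str) -> str:
    """Very rough guess: 'japanese', 'chinese', or 'other'."""
    has_han = any(is_han(c) for c in text)
    has_kana = any(is_kana(c) for c in text)
    if has_kana:
        # Kana does not occur in ordinary Chinese text.
        return "japanese"
    if has_han:
        return "chinese"
    return "other"


assert guess_script("中文分词") == "chinese"
assert guess_script("日本語のテキスト") == "japanese"
assert guess_script("hello") == "other"
```

A Japanese string written only in Kanji would be reported as Chinese by this
sketch, which is where additional predefined rules would have to come in.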

Best,
Yung-chung Lin

2011/4/21 戴优丽 <daiyli1984 at gmail.com>

> Hello, I have finished reading the papers, and I think it is time to design
> my project.
> The first step will be to determine whether the input characters are
> Chinese. I saw in a past post that cjk-tokenizer only deals with UTF-8 and
> Unicode, but I see there are other encodings such as GBK and Big5. Should I
> just deal with UTF-8 and Unicode?