[Xapian-discuss] Chinese segmentation

戴优丽 daiyli1984 at gmail.com
Thu Apr 21 14:00:18 BST 2011


OK, I understand that now, thanks.

On 21 April 2011 at 5:59 PM, ☼ 林永忠 ☼ (Yung-chung Lin) <henearkrxern at gmail.com> wrote:

> Hi,
>
> Big5 was designed only for zh_TW, while GBK was designed only for zh_CN.
> It is better to convert everything to Unicode before segmentation.
>
> For converting from Big5/GBK to UTF-8, iconv can serve the purpose.
> If you use Perl, you may consider using Encode::HanConvert
> http://search.cpan.org/dist/Encode-HanConvert/
>
> If you code in C++, you may consider cjk-tokenizer.
> http://code.google.com/p/cjk-tokenizer/
>
> Language detection at the character level is fairly easy for Chinese: you
> just need to check the character ranges. Detection for Japanese is slightly
> more complicated because Japanese text mixes Kanji, Hiragana, and Katakana,
> but with a few predefined rules it is not so complicated.
>
> Best,
> Yung-chung Lin
>
> 2011/4/21 戴优丽 <daiyli1984 at gmail.com>
>
>> Hello, I have finished reading the papers, and I think it is time to
>> design my project.
>> The first step will be to determine whether the input characters are
>> Chinese. I saw in a past post that cjk-tokenizer only deals with UTF-8 and
>> Unicode, but I have also seen other encodings such as GBK and Big5. Should
>> I just deal with UTF-8 and Unicode?
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>
>
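For reference, the Big5/GBK to UTF-8 conversion suggested in the quoted reply can be sketched in Python with its built-in codecs, which play the same role as iconv here. The function name to_utf8 is made up for this illustration, and the sketch assumes the source encoding is already known; it does not try to guess it.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    # decode() raises UnicodeDecodeError when the bytes are not valid
    # in source_encoding.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = "中文"
    for encoding in ("big5", "gbk"):
        legacy = sample.encode(encoding)  # simulate legacy input
        converted = to_utf8(legacy, encoding)
        print(encoding, converted == sample.encode("utf-8"))
```

On the command line, the equivalent would be something like `iconv -f BIG5 -t UTF-8 input.txt > output.txt`, where the file names are placeholders.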
>
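The character-range check described in the quoted reply could look roughly like the Python sketch below. It assumes the input is already decoded to Unicode and only looks at three blocks: CJK Unified Ideographs (U+4E00 to U+9FFF), Hiragana (U+3040 to U+309F), and Katakana (U+30A0 to U+30FF). The rule "any kana means Japanese" is a simplification chosen for this example, not something taken from cjk-tokenizer, and the function names are hypothetical.

```python
def is_han(ch: str) -> bool:
    """True if ch is in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_kana(ch: str) -> bool:
    """True if ch is in the Hiragana or Katakana block."""
    return "\u3040" <= ch <= "\u30ff"


def guess_script(text: str) -> str:
    """Return 'japanese', 'chinese', or 'other' from character ranges alone."""
    has_han = False
    for ch in text:
        if is_kana(ch):
            # Assumed rule: kana does not occur in Chinese text,
            # so a single kana character decides for Japanese.
            return "japanese"
        if is_han(ch):
            has_han = True
    return "chinese" if has_han else "other"


if __name__ == "__main__":
    for sample in ("中文分词", "日本語のテキスト", "hello"):
        print(sample, guess_script(sample))
```

A Japanese string written entirely in Kanji would be reported as Chinese by this check, which is where additional predefined rules would be needed.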

