[Xapian-devel] GSOC 2011- CJK Support

yong zhang zhangyzchina at gmail.com
Thu Apr 7 16:11:13 BST 2011


   Hello everyone, I am Yongzhi Zhang, a Chinese student.

I'm interested in CJK support (that is, Chinese, Japanese, and Korean
support).

I have 6 years of experience in software development (C/C++ and Java).

I want to work on the "CJK Support" project. I come from Beijing, China.

Chinese is my native language, which is an advantage for "CJK Support".

I have fixed an indexing bug in the Chinese version of the help system
for OpenOffice. OpenOffice uses Lucene to implement the indexing.


I'll be happy to participate in this project during the Google Summer of
Code 2011 program and implement CJK support.

Chinese text does not delimit words with whitespace, so we cannot easily
tell where one word ends and the next begins. After some investigation, I
found three methods to address this issue, and I prefer the last one.

   1. Index each character as a separate term. This is what Lucene does by
      default. The class is *StandardAnalyzer*.

   2. Index every pair of adjacent characters as a term. This is what
      Lucene uses for "CJK support". The Java class name is
      CJKAnalyzer<http://svn.services.openoffice.org/opengrok/s?defs=CJKAnalyzer&project=/DEV300_m103>

   3. Use a dictionary to split the text into groups of characters (words).
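
To make the difference between the three methods concrete, here is a
minimal sketch in Python. It is only an illustration: the function names
and the toy dictionary are made up for this example, and for method 3 I
assume a greedy "forward maximum matching" rule, which is just one possible
dictionary strategy. None of this is Lucene or Xapian API.

```python
def unigrams(text):
    """Method 1: every character becomes its own term."""
    return list(text)


def bigrams(text):
    """Method 2: every pair of adjacent characters becomes a term."""
    if len(text) < 2:
        # A lone character (or empty input) cannot form a pair.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def dictionary_segment(text, dictionary, max_len=4):
    """Method 3 (assumed variant): greedy forward maximum matching."""
    terms = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always
        # matches, so the loop is guaranteed to make progress.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                terms.append(candidate)
                i += size
                break
    return terms


# Toy dictionary, invented for this example only.
TOY_DICTIONARY = {"北京", "大学", "北京大学", "学生"}

sample = "北京大学生"
print(unigrams(sample))
print(bigrams(sample))
print(dictionary_segment(sample, TOY_DICTIONARY))
```

With this toy dictionary the greedy rule keeps "北京大学" together and leaves
"生" on its own, so the result depends heavily on the dictionary and on the
matching rule. Methods 1 and 2 need no dictionary but produce more terms.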

