[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

Thu Jul 5 03:56:27 BST 2007

On Thu, Jul 05, 2007 at 10:30:10AM +0800, Yung-chung Lin wrote:
> I have altered the source code so that the tokenizer can deal with
> n-gram cjk tokenization now.
> Please go to http://code.google.com/p/cjk-tokenizer/

I have a question - if I read the code correctly, it treats Unicode code
points 0x4000 to 0x9fff as CJK characters, but that seems to omit quite
a lot of CJK characters - 0x2E80-0x3fff (with a few exceptions), and
0xf900-0xfaff:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Basic_Multilingual_Plane

Are the omitted characters not relevant here, or is this an oversight?
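
To make the question concrete, here is a minimal Python sketch of a check
that covers all three ranges mentioned above. It is not taken from
cjk-tokenizer: the names CJK_BMP_RANGES and is_cjk_bmp are hypothetical, and
the "few exceptions" inside 0x2E80-0x3fff are deliberately not excluded, so
treat it as an illustration of the idea rather than a correct classifier.

```python
# Illustrative only: ranges are the ones named in this message.
# Assumption: the "few exceptions" within 0x2E80-0x3FFF are NOT filtered out.
CJK_BMP_RANGES = (
    (0x2E80, 0x3FFF),  # range the message says is omitted
    (0x4000, 0x9FFF),  # range the tokenizer is described as handling
    (0xF900, 0xFAFF),  # second range the message says is omitted
)


def is_cjk_bmp(ch):
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_BMP_RANGES)
```

With a check like this, a character at 0xf900 would be tokenized as CJK,
whereas a test limited to 0x4000-0x9fff would let it fall through.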

Also the Supplementary Ideographic Plane is ignored, but those are
described as seldom used, so I can understand why.
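
If those characters ever did matter, the check would be a single range
test, since the Supplementary Ideographic Plane is Unicode plane 2
(0x20000-0x2ffff). A minimal Python sketch, with the hypothetical name
is_sip, which again is not something from cjk-tokenizer:

```python
# Illustrative only: plane 2 spans code points 0x20000 through 0x2FFFF.
SIP_FIRST = 0x20000
SIP_LAST = 0x2FFFF


def is_sip(ch):
    """Return True if the single character ch lies in plane 2."""
    return SIP_FIRST <= ord(ch) <= SIP_LAST
```

Note that in a UTF-16 representation these code points are stored as
surrogate pairs, so a tokenizer working on 16-bit units would need to
combine the pair before applying a test like this.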

Cheers,
    Olly