[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

☼ 林永忠 ☼ (Yung-chung Lin) henearkrxern at gmail.com
Thu Jul 5 04:03:31 BST 2007


You are correct. I forgot to include all the CJK character ranges. I will
add the missing ranges in the next revision. The macros came from an
earlier project that was hacked together in a short time. I will fix this.

Best,
Yung-chung Lin

On 7/5/07, Olly Betts <olly at survex.com> wrote:
> On Thu, Jul 05, 2007 at 10:30:10AM +0800, 林永忠 (Yung-chung Lin) wrote:
> > I have altered the source code so that the tokenizer can handle
> > n-gram CJK tokenization now.
> > Please go to http://code.google.com/p/cjk-tokenizer/
>
> I have a question - if I read the code correctly, it treats Unicode code
> points 0x4000 to 0x9fff as CJK characters, but that seems to omit quite
> a lot of CJK characters - 0x2E80-0x3fff (with a few exceptions), and
> 0xf900-0xfaff:
>
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Basic_Multilingual_Plane
>
> Are the omitted characters not relevant here, or is this an oversight?
>
> Also the Supplementary Ideographic Plane is ignored, but those are
> described as seldom used, so I can understand why.
>
> Cheers,
>     Olly
>
