[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

Thu Jul 5 10:29:08 BST 2007

The code has been updated. cjk-tokenizer now treats characters in the
following Unicode blocks as CJK characters. I will consider adding flags
to ignore unwanted character blocks during tokenization.

// 2E80..2EFF; CJK Radicals Supplement
// 3000..303F; CJK Symbols and Punctuation
// 3040..309F; Hiragana
// 30A0..30FF; Katakana
// 3100..312F; Bopomofo
// 3130..318F; Hangul Compatibility Jamo
// 3190..319F; Kanbun
// 31A0..31BF; Bopomofo Extended
// 31C0..31EF; CJK Strokes
// 31F0..31FF; Katakana Phonetic Extensions
// 3200..32FF; Enclosed CJK Letters and Months
// 3300..33FF; CJK Compatibility
// 3400..4DBF; CJK Unified Ideographs Extension A
// 4DC0..4DFF; Yijing Hexagram Symbols
// 4E00..9FFF; CJK Unified Ideographs
// A700..A71F; Modifier Tone Letters
// AC00..D7AF; Hangul Syllables
// F900..FAFF; CJK Compatibility Ideographs
// FE30..FE4F; CJK Compatibility Forms
// FF00..FFEF; Halfwidth and Fullwidth Forms
// 20000..2A6DF; CJK Unified Ideographs Extension B
// 2F800..2FA1F; CJK Compatibility Ideographs Supplement
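
For illustration, a check against the blocks above could be sketched as
follows. This is only a minimal sketch, not the actual cjk-tokenizer
code: the names CodepointRange, kCjkBlocks and is_cjk_codepoint are
hypothetical, and it assumes a C++14 compiler and input that has already
been decoded to Unicode code points.

```cpp
// One inclusive range of Unicode code points (hypothetical helper type).
struct CodepointRange {
    char32_t first;
    char32_t last;
};

// The blocks listed in this message, kept as separate entries so that
// individual blocks could later be switched off by a flag.
constexpr CodepointRange kCjkBlocks[] = {
    {0x2E80, 0x2EFF},    // CJK Radicals Supplement
    {0x3000, 0x303F},    // CJK Symbols and Punctuation
    {0x3040, 0x309F},    // Hiragana
    {0x30A0, 0x30FF},    // Katakana
    {0x3100, 0x312F},    // Bopomofo
    {0x3130, 0x318F},    // Hangul Compatibility Jamo
    {0x3190, 0x319F},    // Kanbun
    {0x31A0, 0x31BF},    // Bopomofo Extended
    {0x31C0, 0x31EF},    // CJK Strokes
    {0x31F0, 0x31FF},    // Katakana Phonetic Extensions
    {0x3200, 0x32FF},    // Enclosed CJK Letters and Months
    {0x3300, 0x33FF},    // CJK Compatibility
    {0x3400, 0x4DBF},    // CJK Unified Ideographs Extension A
    {0x4DC0, 0x4DFF},    // Yijing Hexagram Symbols
    {0x4E00, 0x9FFF},    // CJK Unified Ideographs
    {0xA700, 0xA71F},    // Modifier Tone Letters
    {0xAC00, 0xD7AF},    // Hangul Syllables
    {0xF900, 0xFAFF},    // CJK Compatibility Ideographs
    {0xFE30, 0xFE4F},    // CJK Compatibility Forms
    {0xFF00, 0xFFEF},    // Halfwidth and Fullwidth Forms
    {0x20000, 0x2A6DF},  // CJK Unified Ideographs Extension B
    {0x2F800, 0x2FA1F},  // CJK Compatibility Ideographs Supplement
};

// Returns true if cp falls inside any of the ranges in kCjkBlocks.
// A linear scan is enough for a table of this size.
constexpr bool is_cjk_codepoint(char32_t cp) {
    for (const CodepointRange& r : kCjkBlocks) {
        if (cp >= r.first && cp <= r.last) {
            return true;
        }
    }
    return false;
}
```

With a table like this, ignoring a block would amount to skipping its
entry, which is one way the flags mentioned above might work.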

Best,
Yung-chung Lin

On 7/5/07, ☼ 林永忠 ☼ (Yung-chung Lin) <henearkrxern at gmail.com> wrote:
> You are correct. I forgot to include all CJK characters. I will do it
> in the next revision. The macros came from a previous project that was
> put together in a short time. I will fix this.
>
> Best,
> Yung-chung Lin
>
> On 7/5/07, Olly Betts <olly at survex.com> wrote:
> > On Thu, Jul 05, 2007 at 10:30:10AM +0800, ☼ 林永忠 ☼ (Yung-chung Lin) wrote:
> > > I have altered the source code so that the tokenizer can deal with
> > > n-gram cjk tokenization now.
> > > Please go to http://code.google.com/p/cjk-tokenizer/
> >
> > I have a question - if I read the code correctly, it treats Unicode code
> > points 0x4000 to 0x9fff as CJK characters, but that seems to omit quite
> > a lot of CJK characters - 0x2E80-0x3fff (with a few exceptions), and
> > 0xf900-0xfaff:
> >
> > http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Basic_Multilingual_Plane
> >
> > Are the omitted characters not relevant here, or is this an oversight?
> >
> > Also the Supplementary Ideographic Plane is ignored, but those are
> > described as seldom used, so I can understand why.
> >
> > Cheers,
> >     Olly
> >
>