[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.
Olly Betts
olly at survex.com
Thu Jul 5 03:56:27 BST 2007
On Thu, Jul 05, 2007 at 10:30:10AM +0800, Yung-chung Lin wrote:
> I have altered the source code so that the tokenizer can deal with
> n-gram cjk tokenization now.
> Please go to http://code.google.com/p/cjk-tokenizer/
I have a question: if I read the code correctly, it treats Unicode code
points 0x4000 to 0x9fff as CJK characters, but that seems to omit quite
a lot of CJK characters, namely 0x2E80-0x3fff (with a few exceptions) and
0xf900-0xfaff:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Basic_Multilingual_Plane
Are the omitted characters not relevant here, or is this an oversight?
Also the Supplementary Ideographic Plane is ignored, but those are
described as seldom used, so I can understand why.
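To make the ranges above concrete, here is a minimal C++ sketch of a wider check. The function name is_cjk_codepoint and the include_sip flag are hypothetical and not taken from the cjk-tokenizer source. The sketch deliberately ignores the few non-CJK exceptions inside 0x2E80-0x3fff, and it takes the Supplementary Ideographic Plane to be plane 2 (0x20000-0x2FFFF).

```cpp
#include <cstdint>

// Hypothetical helper, not part of cjk-tokenizer: reports whether a Unicode
// code point falls in one of the ranges discussed in this message.
// include_sip is an assumed opt-in flag for the seldom-used plane 2.
bool is_cjk_codepoint(std::uint32_t cp, bool include_sip = false) {
    // 0x2E80-0x3FFF (currently omitted) joined with the 0x4000-0x9FFF range
    // the tokenizer already appears to accept. The few non-CJK exceptions
    // in the lower part are not filtered out here.
    if (cp >= 0x2E80 && cp <= 0x9FFF) return true;

    // 0xF900-0xFAFF (currently omitted).
    if (cp >= 0xF900 && cp <= 0xFAFF) return true;

    // Supplementary Ideographic Plane, only when explicitly requested.
    if (include_sip && cp >= 0x20000 && cp <= 0x2FFFF) return true;

    return false;
}
```

With this sketch, is_cjk_codepoint(0x3042) and is_cjk_codepoint(0xF900) are both true, whereas a check limited to 0x4000-0x9fff would reject both.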
Cheers,
Olly