[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

头太晕 torrycn at gmail.com
Wed Jun 6 02:58:19 BST 2007

2007/6/6, Olly Betts <olly at survex.com>:
> I've not investigated Japanese much or Korean at all, but I know a
> little about Chinese.
> Chinese "characters" are themselves words, but many words are formed
> from multiple characters.  For example, the Chinese capital Beijing is
> formed from two characters (which literally mean something like "North
> Capital").
> The difficulty is that Chinese text is usually written without any
> indication of how the symbols group, so you need an algorithm to
> identify them if you want to index such groups as terms.  I understand
> that's quite a hard problem.
> However, perhaps you don't need to do that.  You could just index each
> symbol as a word and use phrase searching, or something like it.
Hi Olly Betts,

When the Chinese text is UTF-8 encoded, QueryParser.parse_query() has a
problem: it does not output the correct Chinese terms.
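As a point of comparison, the per-character approach Olly suggests can be sketched in plain Python before any Xapian code is involved: split the text into one term per CJK character (keeping ASCII words whole), and a multi-character word like 北京 then becomes a phrase search over adjacent single-character terms. This is only an illustrative sketch, not Xapian's own tokenizer; the function name `char_terms` is made up for this example.

```python
def char_terms(text: str) -> list[str]:
    """Split text into index terms: one term per non-ASCII (e.g. CJK)
    character, while runs of ASCII letters/digits stay whole words."""
    terms, word = [], ""
    for ch in text:
        if ch.isascii() and ch.isalnum():
            word += ch          # accumulate an ASCII word
        else:
            if word:            # flush any pending ASCII word
                terms.append(word)
                word = ""
            if not ch.isascii() and not ch.isspace():
                terms.append(ch)  # each CJK character is its own term
    if word:
        terms.append(word)
    return terms

# 北京 ("Beijing", literally "North Capital") yields two adjacent terms,
# so a search for 北京 can be run as a phrase query on ["北", "京"].
print(char_terms("北京 is the capital"))
```

With terms produced this way, the query side would combine the characters of a Chinese word with a phrase operator (in Xapian, `OP_PHRASE`) so that only documents where the characters are adjacent and in order match.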
