[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

☼ 林永忠 ☼ (Yung-chung Lin) henearkrxern at gmail.com
Thu Jul 5 03:30:10 BST 2007


Hi,

I have altered the source code so that the tokenizer can now perform
n-gram CJK tokenization.
Please go to http://code.google.com/p/cjk-tokenizer/
Thank you.
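For readers unfamiliar with the technique: n-gram tokenization slides a fixed-size window over a run of CJK characters and emits each overlapping slice as a term, sidestepping the need for a dictionary-based word segmenter. The sketch below (in Python, purely illustrative; it is not the actual cjk-tokenizer code, and the function name is hypothetical) shows character bigrams, the most common choice:

```python
def cjk_ngrams(text, n=2):
    """Split a run of CJK characters into overlapping n-grams.

    Runs shorter than n are emitted as a single term so that
    isolated characters remain searchable.
    """
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# "我是中国人" (5 characters) yields 4 overlapping bigrams:
print(cjk_ngrams("我是中国人"))  # ['我是', '是中', '中国', '国人']
```

A query is tokenized the same way at search time, so any bigram shared between query and document produces a match, at the cost of a larger index than dictionary-based segmentation would give.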

Best,
Yung-chung Lin

On 6/6/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> Hi,
>
> I am looking for a Chinese, Japanese, and Korean tokenizer that can
> be used to tokenize terms for CJK languages. I am not very familiar
> with these languages, but I understand that a single symbol can
> contain one or more words, which makes it more difficult to tokenize
> text into searchable terms.
>
> Lucene has a CJK Tokenizer ... and I am looking around to see if
> there is some open source implementation that we could use with Xapian.
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
>
> Cheers
>   -Kevin Duraj
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>