[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

Fri Jun 29 03:20:50 BST 2007

Ah, I forgot one point: the tokenizer depends on libunicode.
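
For readers who do not have the attachment at hand, here is a rough sketch of what bigram tokenization of CJK text means. It is not the attached tokenizer and does not use libunicode; it is a minimal Python illustration, and the function name bigram_tokenize and the character ranges chosen (CJK Unified Ideographs, Hiragana, Katakana, Hangul syllables) are assumptions made for the example. Each run of CJK characters is split into overlapping two-character terms, a lone CJK character is kept as a single term, and runs of ASCII letters and digits are kept whole and lowercased.

```python
import re

# Assumed character ranges for this sketch only:
# CJK Unified Ideographs, Hiragana + Katakana, Hangul syllables.
_CJK = "\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7a3"
_TOKEN_RE = re.compile(f"[{_CJK}]+|[A-Za-z0-9]+")
_CJK_RE = re.compile(f"[{_CJK}]")


def bigram_tokenize(text):
    """Return index terms: overlapping bigrams for CJK runs, whole words otherwise."""
    tokens = []
    for match in _TOKEN_RE.finditer(text):
        run = match.group()
        if _CJK_RE.match(run):
            if len(run) == 1:
                # A single CJK character cannot form a bigram; keep it as is.
                tokens.append(run)
            else:
                # Slide a two-character window over the run.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            # ASCII word or number: keep whole, lowercased.
            tokens.append(run.lower())
    return tokens


if __name__ == "__main__":
    print(bigram_tokenize("全文検索 Xapian"))
```

Overlapping bigrams avoid the need for a dictionary-based word segmenter: a query is tokenized the same way as the documents, so any two adjacent characters in the query match the same pair in the indexed text.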

Best,
Yung-chung Lin

On 6/29/07, ☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. Kaspar or xern)
<henearkrxern at gmail.com> wrote:
> A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it. Thanks.
>
> Best,
> Yung-chung Lin
>
> On 6/6/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> > Hi,
> >
> > I am looking for a Chinese, Japanese, and Korean tokenizer that can
> > be used to tokenize terms for CJK languages. I am not very familiar
> > with these languages; however, I think that they can pack one or more
> > words into a single symbol, which makes it more difficult to tokenize
> > text into searchable terms.
> >
> > Lucene has a CJK tokenizer ... and I am looking around to see whether
> > there is an open source one that we could use with Xapian.
> >
> > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
> >
> > Cheers
> >   -Kevin Duraj