[Xapian-discuss] chinese/japanese index support

Tue Feb 26 12:31:27 GMT 2008

On Tue, 26 Feb 2008 09:48:29 +0000,  Olly Betts <olly at survex.com> wrote:
>  On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
>  > chun yu wrote:
>  > > I am wondering whether version 1.0.5 supports Chinese/Japanese
>  > > indexing.
>
>  There's nothing specific to Chinese or Japanese currently, although we
>  do support all of Unicode in the character classification code, so
>  Chinese and Japanese characters should be correctly identified as part
>  of words.
>
>  > > Or how can I implement support for indexing Chinese?
>
>  The usual approaches are based on n-gram matching.  Someone posted a
>  link to some code they'd written (and I think were using with Xapian)
>  on the list, but I've not had a chance to study it yet.
>
Yung-chung Lin wrote a CJKV n-gram tokenizer. The source is here:
http://svn.berlios.de/wsvn/dijon/trunk/cjkv/?rev=0&sc=1
It's not tied to Xapian in particular. It needs libunicode 0.4 or glib.

I make use of it in Pinot, to generate terms when indexing CJKV documents,
and at search time to pre-process CJKV queries before feeding them to the
QueryParser.
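
To give an idea of what n-gram term generation looks like, here is a
minimal sketch in Python. It is not Yung-chung Lin's tokenizer and not the
code Pinot uses; the function names (is_cjk, char_class, ngram_terms), the
choice of overlapping bigrams, and the short list of Unicode ranges are all
assumptions made for illustration.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough CJK test over a few common Unicode blocks (assumed, not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


def char_class(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def ngram_terms(text, n=2):
    """Return terms: overlapping n-grams for CJK runs, lowercased words otherwise."""
    terms = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "cjk":
            if len(run) <= n:
                # A run too short to split is kept as a single term.
                terms.append(run)
            else:
                terms.extend(run[i:i + n] for i in range(len(run) - n + 1))
        elif cls == "word":
            terms.append(run.lower())
        # Separators (spaces, punctuation) produce no terms.
    return terms


# The same function can serve both sides: index-time term generation and
# query pre-processing, so that documents and queries yield matching terms.
doc_terms = ngram_terms("Xapian 中文索引")
query_text = " ".join(ngram_terms("中文索引"))
```

With something along these lines, the terms from doc_terms would be added
to the document at indexing time, and a string like query_text would be
what gets passed on to the query parser in place of the raw CJKV query.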

Fabrice