[Xapian-discuss] chinese/japanese index support

Tue Feb 26 09:48:29 GMT 2008

A quick answer as I have almost no spare time this week...

On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
> chun yu wrote:
> > I am wondering whether version 1.0.5 supports Chinese/Japanese
> > indexing.

There's nothing specific to Chinese or Japanese currently, although we
do support all of Unicode in the character classification code, so
Chinese and Japanese characters should be correctly identified as part
of words.

> > Or how can I implement support for indexing Chinese?

The usual approaches are based on n-gram matching.  Someone posted a
link on the list to some code they'd written (and, I think, were using
with Xapian), but I've not had a chance to study it yet.
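
To make the n-gram idea concrete, here is a minimal sketch in plain
Python.  It is not the code mentioned above and it does not use the
Xapian API.  The names CJK_RANGES, is_cjk, ngrams and cjk_terms are
hypothetical, and the code point ranges are an assumption covering only
the most common CJK blocks.  Each run of CJK characters is split into
overlapping bigrams, which can then be indexed as ordinary terms.

```python
# Illustrative only: the ranges below are an assumption and cover just
# the most common CJK blocks, not every CJK character in Unicode.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def is_cjk(ch):
    """Return True if ch falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def ngrams(run, n=2):
    """Return overlapping n-grams of run; a shorter run is kept whole."""
    if len(run) < n:
        return [run] if run else []
    return [run[i:i + n] for i in range(len(run) - n + 1)]


def cjk_terms(text, n=2):
    """Split each run of CJK characters in text into n-grams.

    Non-CJK characters only act as separators here; a real indexer
    would hand them to the normal word-based tokenizer instead.
    """
    terms, run = [], []
    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            terms.extend(ngrams("".join(run), n))
            run = []
    terms.extend(ngrams("".join(run), n))
    return terms


print(cjk_terms("中文搜索"))
```

Overlapping bigrams mean no dictionary or word segmenter is needed, at
the cost of a larger index and some false matches.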

> I haven't yet successfully used Xapian for indexing any character from 
> the CJK set in a production environment, but from my experience so far 
> it's not so convenient to use it for such a thing (no stemming support 
> that I can see, and significance of spaces in many cases!).

My understanding is that stemming isn't really meaningful for Chinese.
I'm not aware of a suitably licensed Japanese stemming algorithm.

Spaces are only significant to TermGenerator and QueryParser.  The best
approach to addressing this might be to have variants of these designed
specifically for languages which don't generally use whitespace to
signify word breaks.  The important thing is that the two split text
into terms in the same way, so if both generate the same n-grams,
indexing and searching should work together.
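
As a rough illustration of why the two sides need to agree, here is a
toy inverted index in plain Python.  It is only a sketch under stated
assumptions: ToyIndex and bigrams are hypothetical names, nothing here
is the Xapian API, and a query is treated as a simple AND of its
bigrams rather than a true phrase match.  The point is that add() and
search() call the same bigrams() function, so a query produces exactly
the terms that were stored at index time.

```python
from collections import defaultdict


def bigrams(text):
    """Drop whitespace, then return overlapping two-character terms."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class ToyIndex:
    """Hypothetical in-memory index mapping each term to document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index side: store every bigram of the document text.
        for term in bigrams(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query side: same bigrams() call, then AND the posting lists.
        terms = bigrams(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = ToyIndex()
index.add(1, "我喜欢全文搜索")
index.add(2, "今天天气很好")
print(index.search("全文搜索"))
```

If the query side split on whitespace while the index side stored
bigrams, the query "全文搜索" would become one four-character term that
was never indexed, and nothing would match.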

Cheers,
    Olly