[Xapian-discuss] chinese/japanese index support

Tue Feb 26 10:15:34 GMT 2008

Olly Betts wrote:
> A quick answer as I have almost no spare time this week...
>
> On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
>   
>> chun yu wrote:
>>     
>>> I am wondering whether version 1.0.5 supports Chinese/Japanese
>>> indexing.
>>>       
>
> There's nothing specific to Chinese or Japanese currently, although we
> do support all of Unicode in the character classification code, so
> Chinese and Japanese characters should be correctly identified as part
> of words.
>   
Exact matches work, and in my personal tests those account for the 
majority of searches.  More on that later.

>   
>>> or how can I implement support for indexing Chinese?
>>>       
>
> The usual approaches are based on n-gram matching.  Someone posted a
> link to some code they'd written (and I think were using with Xapian)
> on the list, but I've not had a chance to study it yet.
>   

senna supports it well, in this particular case.  More later, yet again :)
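
For anyone following along, here is a minimal sketch of what the n-gram 
idea looks like for CJK text, using overlapping bigrams.  This is not the 
code mentioned above and it is not how senna or Xapian do it; the function 
name cjk_bigrams and the character ranges are my own assumptions for 
illustration only.

```python
import re

# Assumption for this sketch: treat CJK Unified Ideographs (U+4E00-U+9FFF)
# plus Hiragana and Katakana (U+3040-U+30FF) as "CJK". Real code would
# need a fuller set of ranges (extensions, Hangul, and so on).
CJK_RUN = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff]+")


def cjk_bigrams(text):
    """Return overlapping 2-character terms for each CJK run in text.

    Non-CJK text is ignored here; it would go through the normal
    whitespace-based word splitting instead.
    """
    terms = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            # A lone CJK character becomes a term on its own.
            terms.append(run)
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms
```

With this, a four-character run such as "中文搜索" yields three terms 
("中文", "文搜", "搜索"), so no dictionary or word segmenter is needed, at 
the cost of a larger index and some terms that are not real words.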

>   
>> I haven't yet successfully used Xapian for indexing any character from 
>> the CJK set in a production environment, but from my experience so far 
>> it's not so convenient to use it for such a thing (no stemming support 
>> that I can see, and significance of spaces in many cases!).
>>     
>
> My understanding is that stemming isn't really meaningful for Chinese.
> I'm not aware of a suitably licensed Japanese stemming algorithm.
>
>   
[snip]
> Spaces are only significant to TermGenerator and QueryParser.  
[/snip]
Case in point: TermGenerator and QueryParser can be quite significant 
here (as you know).
> The best
> approach to addressing this might be to have variants of these designed
> specifically for languages which don't generally use whitespace to
> signify word breaks.  The important thing is that they work together so
> if both use n-grams, everything should work.
>
> Cheers,
>     Olly
>   
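
To illustrate the point about both sides working together, here is a toy 
sketch in plain Python (no Xapian involved).  The ToyIndex class and the 
bigrams helper are hypothetical names made up for this example; the only 
thing it is meant to show is that the indexer and the query side split 
text with the same function.

```python
from collections import defaultdict


def bigrams(text):
    """Overlapping 2-character terms; whitespace is dropped first."""
    chars = "".join(text.split())
    if len(chars) < 2:
        return [chars] if chars else []
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


class ToyIndex:
    """A tiny in-memory inverted index (illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        # Indexing side: split the document into bigram terms.
        for term in bigrams(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query side: split the query with the SAME function, then
        # return the documents that contain every query term.
        terms = bigrams(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = ToyIndex()
index.add(1, "東京都に住んでいます")
index.add(2, "京都は古い町です")
```

Searching for "東京" finds only document 1, but searching for "京都" finds 
both documents, because "東京都" contains the bigram "京都".  That is the 
usual trade-off with n-grams: nothing is missed, but some matches are not 
what a native reader would call the same word.  Note also that this sketch 
only ANDs the terms together; it does not check that they are adjacent.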

Stemming in its proper form is not particularly meaningful for Chinese, 
in my own limited experience (I will ask tomorrow just to see if such a 
concept exists :p).  Things get hairy in Japanese, and a bit in Korean 
(and in all honesty, even in Chinese).  The Japanese language, bless its 
heart, is actually relatively simple if handled 100% properly.  
Unfortunately, much like the English language itself, there are many 
permutations of how words are handled, and it goes pretty deep.  Xapian, 
as a matter of [unfortunate?] fact, does not handle the Japanese language 
for beans when it comes down to the nitty-gritty.

I don't mean to sound negative at all with my previous statement; I'd go 
so far as to say that Xapian, at its core, should probably avoid 
catering to CJK sets at all due to their inherent complexity (but I'd 
not whine if proper support were implemented in an elegant way; it'd 
just surprise me if it were possible).

I do make some attempt to make sure I'm not propagating FUD with my 
statements, so please call me out if I'm heading in the wrong direction :)

Kind Regards,
Rick