[Xapian-discuss] Japanese / UTF-8 support

Thu Aug 10 20:25:51 BST 2006

James Aylett wrote:
> On Wed, Aug 09, 2006 at 11:43:34PM -0700, Jeff Breidenbach wrote:
> 
>> I tried running omindex on the following file, which is a
>> UTF-8 web page with mixed English and Japanese text.
> 
> [...]
> 
>> Any comments? I was really surprised, since Omega did so well
>> in an earlier test against a similar UTF-8 document written in Danish.
>> Is this a matter of polish or are there deeper barriers, like a lack of
>> word splitting capability for languages like Chinese/Japanese/Korean?
> 
> omindex (and the QueryParser) has somewhat primitive,
> European-centric, word splitting. The tricky bit is actually for the
> query parser ... you could either make it so you have to specify the
> language you're searching in, and set splitting and stemming
> appropriately (or auto-detect the language), or parse it in all possible
> ways (based on which languages exist in your database) and merge the
> results somehow.
> 
> Ultimately it would be nice to support this kind of thing. The first
> step is UTF-8 support, which Olly has been working on. On top of that
> we'd need a word splitting algorithm for CJK (and anything else that we
> can't throw English-like rules at). My understanding is that there
> isn't a good stemming strategy for CJK, so we'd just disable it there.
> 
> Lots of work to make this sort of thing work automatically. If anyone
> knows about word splitting for CJK, that'd be a huge help ...
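
As a minimal sketch of one possible answer to the CJK splitting question
above: without a dictionary, a common fallback is to index overlapping
character bigrams for CJK runs and to keep ordinary word splitting for
everything else. The Python below is illustrative only. The names `is_cjk`,
`char_class` and `split_terms` are hypothetical, the code point ranges are a
rough approximation, and none of this is Xapian or omindex code.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough check (assumption): Han, kana, or Hangul syllable ranges only."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def char_class(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def split_terms(text):
    """Split text into lowercased words and overlapping CJK bigrams."""
    terms = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "word":
            terms.append(run.lower())
        elif cls == "cjk":
            if len(run) == 1:
                # A lone CJK character is indexed as a single-character term.
                terms.append(run)
            else:
                # Overlapping bigrams: "ABC" yields "AB" and "BC".
                terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


print(split_terms("Xapian 日本語 search"))
```

The same splitting would have to be applied to the query string, so that a
query produces the same bigrams that were indexed.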

And what about automatic language detection?
That would also help me tremendously, as I have about 60% English, 20%
German, 5% French, 5% Korean, 5% Japanese, and 5% Italian.
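
To make the idea concrete, here is a rough sketch of what such detection
could look like for a mix like this one: use the script to spot Korean and
Japanese, then count a few very common words for the European languages. It
is a heuristic under stated assumptions, not a tested detector. The function
name `guess_language` and the tiny stopword lists are made up for
illustration, and a mixed English/Japanese page would simply be reported as
"ja".

```python
import re

# Hypothetical, deliberately tiny stopword lists; a real detector needs more.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "with"},
    "de": {"der", "die", "und", "ist", "nicht"},
    "fr": {"le", "les", "et", "est", "dans"},
    "it": {"il", "che", "di", "non", "sono"},
}


def guess_language(text):
    """Return a language code guess, or None if nothing matches."""
    # Script check first: Hangul syllables suggest Korean, kana Japanese.
    if any(0xAC00 <= ord(c) <= 0xD7A3 for c in text):
        return "ko"
    if any(0x3040 <= ord(c) <= 0x30FF for c in text):
        return "ja"
    # Otherwise count stopword hits per language and take the best score.
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(w in stop for w in words)
        for lang, stop in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("Der Index ist nicht vollständig"))
```

The guessed code could then be used to pick the stemmer, or to switch
stemming off for Korean and Japanese.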

Automatic charset detection would of course also help. Aren't there any
libraries out there for this?
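
Even without a dedicated library, a crude form of charset detection is trial
decoding: attempt a strict decode with each candidate encoding and take the
first that succeeds. The sketch below assumes Python's standard codecs; the
name `guess_charset` and the candidate order are my own assumptions. Several
legacy encodings accept almost any byte sequence, so the order matters and a
wrong guess is possible.

```python
# Candidate order is an assumption: strict UTF-8 first, since it rejects
# most byte sequences that are not really UTF-8.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "euc_kr"]


def guess_charset(data, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    return None  # no candidate fits; the caller has to pick a fallback


print(guess_charset("日本語".encode("shift_jis")))
```

Once an encoding is guessed, the bytes can be converted to UTF-8 before
indexing.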

-- 
Reini Urban
http://phpwiki.org/  http://murbreak.at/
http://helsinki.at/  http://spacemovie.mur.at/