[Xapian-discuss] Japanese / UTF-8 support

Sat Aug 26 15:54:26 BST 2006

On Thu, Aug 10, 2006 at 09:42:01PM -0700, Jeff Breidenbach wrote:
> >Ultimately it would be nice to support this kind of thing. The first
> >step is UTF-8 support, which Olly has been working on.
> 
> Omega is producing excellent results for French/Italian/German/Spanish
> against UTF-8 HTML files.

I think the problem you're seeing with Japanese is with omindex's term
generation.  The omega CGI should work fine, but I haven't patched
omindex yet, as I have a separate indexer for gmane which knows about
UTF-8.

The plan is to have everything working with UTF-8 for Xapian 1.0, so
this is actually going to be my main focus once I've dealt with my
email backlog.

> >> Lots of work to make this sort of thing work automatically. If anyone
> >> knows about word splitting for CJK, that'd be a huge help ...
> 
> Chinese is easy; one word per character. I suspect a basic unicode
> lookup table (e.g. this range is for Chinese characters) and a few simple
> rules would go a long way.

Chinese isn't really as simple as one word per character.  Chinese
characters are themselves words, but many words are formed from multiple
characters.  For example, the Chinese capital Beijing is formed from two
characters (which literally mean something like "North Capital").
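
To make that concrete, here is a rough Python sketch of the "lookup table
plus one word per character" rule.  It is only an illustration, not
Xapian or omindex code: the function names are made up, and it assumes
that checking the basic CJK Unified Ideographs block (U+4E00 to U+9FFF)
is enough, which ignores the extension blocks.

```python
# Hypothetical sketch, not Xapian or omindex code.
# Assumption: only the basic CJK Unified Ideographs block is checked.
CJK_UNIFIED_START = 0x4E00
CJK_UNIFIED_END = 0x9FFF


def is_cjk_ideograph(ch):
    """Return True if ch lies in the basic CJK Unified Ideographs block."""
    return CJK_UNIFIED_START <= ord(ch) <= CJK_UNIFIED_END


def split_one_word_per_character(text):
    """Naive rule: every CJK ideograph becomes its own term."""
    return [ch for ch in text if is_cjk_ideograph(ch)]


# "Beijing" is written with two ideographs, so the naive rule yields
# two separate terms rather than one word.
print(split_one_word_per_character("\u5317\u4eac"))
```

With this rule the two characters of Beijing are indexed as unrelated
terms, so nothing in the index records that together they form one word.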

> >And what about automatic language detection?
> 
> Not sure why this is useful, except perhaps for stemming. Even then
> you will be in trouble for mixed language documents.

I think it only really matters for stemming, and also if you want to
allow users to filter a query to just show results in a particular
language.

While different languages may have different ideas of word splitting, I
think in reality there are few, if any, characters which are a word
character in one language but a word-breaking character in another.
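
If that holds, a single language-independent splitter is plausible.  As
a rough Python sketch (again not Xapian code; the names are made up, and
it assumes that Unicode letters, combining marks and decimal digits are
the word characters while everything else breaks a word):

```python
import unicodedata


def is_word_char(ch):
    """Assumption: letters (L*), marks (M*) and decimal digits (Nd)
    are word characters; every other character breaks a word."""
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nd"


def split_words(text):
    """Split text into runs of word characters, with no language input."""
    words = []
    current = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# The same rule is applied to French and German text alike.
print(split_words("Omega: tr\u00e8s bien, \u00fcber alles"))
```

Note that CJK ideographs are classed as letters too, so a run of Chinese
or Japanese text comes out of this as one long "word".  It does nothing
for the CJK splitting problem discussed above.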

For gmane, I just don't use stemming currently.  As you say, the error
rate of language identifiers is noticeable, and they perform worse
on very short texts, so you really can't use them to detect what
language a query is in.

Cheers,
    Olly