[Xapian-discuss] Japanese / UTF-8 support

James Aylett james-xapian at tartarus.org
Fri Aug 18 12:27:49 BST 2006


On Thu, Aug 10, 2006 at 09:42:01PM -0700, Jeff Breidenbach wrote:

> >Ultimately it would be nice to support this kind of thing. The first
> >step is UTF-8 support, which Olly has been working on.
> 
> Omega is producing excellent results for French/Italian/German/Spanish
> against UTF-8 HTML files.

Is that with the patch? Unless I've missed something, I don't think we
have released support yet.

> >> Lots of work to make this sort of thing work automatically. If anyone
> >> knows about word splitting for CJK, that'd be a huge help ...
> 
> Chinese is easy; one word per character. I suspect a basic unicode
> lookup table (e.g. this range is for Chinese characters) and a few simple
> rules would go a long way.

Erm ... lookup tables aren't good for Unicode, are they? I'd have
thought a tiny predicate function that just checks the codepoint
against known Chinese ranges would work better.
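A minimal sketch of that predicate idea (the ranges below are my own
reading of the Unicode block assignments, not anything Xapian ships,
and the one-token-per-ideograph splitter is just the "one word per
character" suggestion from above taken literally):

```python
# Assumed CJK ideograph ranges -- a predicate beats a full lookup table
# because Unicode codepoints are sparse and the ranges are contiguous.
CJK_RANGES = [
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0x3400, 0x4DBF),   # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),   # CJK Compatibility Ideographs
]

def is_cjk(codepoint):
    """Return True if the codepoint falls in a known CJK ideograph range."""
    return any(lo <= codepoint <= hi for lo, hi in CJK_RANGES)

def split_cjk(text):
    """Split text so each CJK ideograph becomes its own term;
    non-CJK runs are split on whitespace as usual."""
    words, current = [], []
    for ch in text:
        if is_cjk(ord(ch)):
            if current:
                words.append(''.join(current))
                current = []
            words.append(ch)          # one "word" per ideograph
        elif ch.isspace():
            if current:
                words.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append(''.join(current))
    return words
```

Whether one term per ideograph is actually good enough for Chinese
retrieval is a separate question, but it would at least produce
searchable terms.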

> >And what about automatic language detection?
> >That would help me also tremendously as I have about 60% english, 20%
> >german, 5% french, 5% korean, 5% japanese, and 5% italian.
> 
> Not sure why this is useful, except perhaps for stemming. Even then
> you will be in trouble for mixed language documents. Seems a little
> outside the scope of Xapian, at least from my newbie perspective.
> Anyway, there's a couple n-gram based language detectors in open
> source land which work fairly well, but the error rate is noticible.

The 'right' way is to mark up the language use in the
document. However language detectors as a fallback would be neat. In
general the way of approaching this that I'd favour would be to
generate scriptindex input files, so your indexing setup is the only
thing that needs to care about whether you're using detection or
reading it out of xml:lang attributes or whatever.
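To make that concrete, here's a rough sketch of what I mean: a small
generator that emits scriptindex-style input records (field=value
lines, blank line between documents). The `language` field name and
the `detect_language()` helper are purely illustrative assumptions,
not part of Omega:

```python
def detect_language(text):
    # Placeholder: plug in an n-gram detector, or read the value
    # out of an xml:lang attribute upstream. Hard-coded here.
    return "en"

def make_record(url, text, language=None):
    """Build one scriptindex-style input record. If no language is
    supplied (e.g. from document markup), fall back to detection --
    so only this step cares where the language came from."""
    lang = language or detect_language(text)
    lines = [
        "url=%s" % url,
        "language=%s" % lang,
        "text=%s" % text,
    ]
    return "\n".join(lines) + "\n\n"   # blank line ends the record
```

The point is that the index script downstream just sees a `language`
field and never needs to know whether it was detected or declared.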

> >Automatic charset detection would of course help also. Aren't there any
> >libraries out there?
> 
> Not that I know about, except for more probabilistic n-gram stuff
> that's even flakier than language detection. I thought UTF-8 solved
> this problem. Documents not using unicode? The horror!!


<grins>

Firefox's auto-detection of charsets is regularly fooled. If you
assume UTF-8, then ISO-8859-1, then some multibyte options, you might
be able to do it, but it's not a good idea. Again, marking up the
document explicitly is always going to be better.
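For what it's worth, that try-in-order fallback is about a dozen lines
(the encoding list below is my own guess at a sensible order, not a
recommendation -- and as said above, explicit markup beats all of this):

```python
# Assumed fallback order: strict UTF-8 first, then some multibyte
# encodings, then ISO-8859-1 last -- it accepts every byte sequence,
# so it always "succeeds", rightly or wrongly.
FALLBACK_ENCODINGS = ["utf-8", "shift_jis", "euc-kr", "iso-8859-1"]

def decode_with_fallback(data):
    """Return (text, encoding) for the first encoding in the list
    that decodes the bytes without error."""
    for enc in FALLBACK_ENCODINGS:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # iso-8859-1 never raises, so this is unreachable
    raise AssertionError("unreachable")
```

The obvious failure mode is exactly the one Firefox hits: plenty of
legacy 8-bit text happens to be valid in an earlier encoding on the
list, so it gets mislabelled silently.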

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


