[Xapian-discuss] Japanese / UTF-8 support

James Aylett james-xapian at tartarus.org
Sat Aug 19 12:46:54 BST 2006


On Fri, Aug 18, 2006 at 08:13:50PM -0700, Jeff Breidenbach wrote:

> I tried 0.9.6 unpatched, and also with xapian-qp-utf8-0.9.5.patch
> applied. Either way, UTF-8 Danish worked fine, UTF-8 Japanese was
> a disaster.

I suspect that's largely luck, then.

> Although Japanese got a little better once I put a META tag in
> the search form to tell the browser to think in UTF-8.  Previously
> Firefox was ... converting the Japanese query string into HTML
> numerical entity references at form submission time!

Yeah, as far as I'm aware there's no standard on what you should do if
your form's submission charset (which tends to default to the document
charset, which is daft but there you go) can't cope with characters
you're submitting. Note that the multipart/form-data enctype copes with
this properly, because it allows different form fields to have
different encodings.
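
To make the entity-reference behaviour concrete, here is a rough Python
sketch. It assumes the form charset is ISO-8859-1 and that the browser
falls back to something like Python's xmlcharrefreplace error handler;
both are assumptions for illustration, and a given browser may well do
something else. The function names are hypothetical.

```python
import html

def submit_as_latin1(query):
    """Mimic a browser encoding a form value in ISO-8859-1 (assumed charset).

    Characters that do not fit are replaced by numeric character references.
    """
    return query.encode("iso-8859-1", errors="xmlcharrefreplace")

def recover_query(raw):
    """Server-side guess at the original text: decode, then undo the references."""
    return html.unescape(raw.decode("iso-8859-1"))

raw = submit_as_latin1("日本語")   # b'&#26085;&#26412;&#35486;'
print(raw)
print(recover_query(raw))          # 日本語
```

The recovery step is only a guess: the server can't tell these
references apart from a user who literally typed "&#26085;" into the
search box, which is one reason to serve the form as UTF-8 instead.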

> >Erm ... lookup tables aren't good for Unicode, are they? I'd have
> >thought a tiny predicate function that just checks the codepoint
> >against known Chinese ranges would work better.
> 
> You are right, I recently read more about Unicode and became very,
> very, very scared. The 'NJ' codepoint for Croatian is completely
> insane!  I'm pretty much terrified to do anything other than library
> calls. And I don't see a word break library call in glib/gunicode.h
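
For what it's worth, the "tiny predicate function" idea quoted above
might look something like this in Python. It's only a sketch: the
ranges are a handful of blocks from the Unicode block list, not an
exhaustive set, I've thrown in the kana blocks because the thread is
about Japanese, and the name is_cjk_codepoint is made up.

```python
# A few Unicode blocks used for Chinese and Japanese text (not exhaustive).
CJK_RANGES = (
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)

def is_cjk_codepoint(cp):
    """Return True if the integer codepoint falls in one of the ranges above."""
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def contains_cjk(text):
    """Return True if any character of text is in a listed CJK range."""
    return any(is_cjk_codepoint(ord(ch)) for ch in text)

print(contains_cjk("日本語"))      # True
print(contains_cjk("smørrebrød"))  # False
```

This only says whether a character is in one of those blocks; it does
nothing about where the word breaks are, which is the hard part.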

Unicode is big and complex, but that's because it's trying to do an
insanely difficult job. I have Unicode 3.0 at work, and I have
actually read pretty much all the rules and bits and pieces. Of
course, now it's got that little bit more complex... :)

Tim Bray recommended some good resources for starting to grok Unicode
a while back; they should be findable on his blog.

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org