[Xapian-discuss] Japanese / UTF-8 support

Jeff Breidenbach breidenbach at gmail.com
Sat Aug 19 04:13:50 BST 2006


> > Omega is producing excellent results for French/Italian/German/Spanish
> > against UTF-8 HTML files.
>
> Is that with the patch? Unless I've missed something, I don't think we
> have released support yet.

I tried 0.9.6 unpatched, and also with xapian-qp-utf8-0.9.5.patch
applied. Either way, UTF-8 Danish worked fine and UTF-8 Japanese was
a disaster. Japanese did get a little better once I put a META tag
in the search form to tell the browser to think in UTF-8. Previously
Firefox was ... converting the Japanese query string into HTML
numeric character references at form submission time!

http://www.mail-archive.com/cgi-bin/omega/omega?P=alts%C3%A5&DB=brygforum%40lists.haandbryg.dk
http://www.mail-archive.com/cgi-bin/omega/omega?P=%E6%A7%98&DB=axis-user-ja%40ws.apache.org
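
To make the two query strings above concrete, here is a rough sketch of
what the P= parameter decodes to once the form is submitted as UTF-8.
The helper names (hex_value, percent_decode, first_code_point) are
made up for illustration and are not part of Omega or Xapian. Without
the UTF-8 declaration the browser would instead send something like
the entity &#27096; for the same character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if it is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Undo %XX escaping in a query-string value ('+' becomes a space).
// Malformed escapes are copied through unchanged.
std::string percent_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += (in[i] == '+') ? ' ' : in[i];
    }
    return out;
}

// Decode the first code point of a UTF-8 string. Returns U+FFFD on
// malformed input. Sketch only: no overlong or surrogate checks.
std::uint32_t first_code_point(const std::string& s) {
    if (s.empty()) return 0xFFFD;
    unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::uint32_t cp;
    std::size_t extra;
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else return 0xFFFD;
    if (s.size() < extra + 1) return 0xFFFD;
    for (std::size_t i = 1; i <= extra; ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
```

So percent_decode("%E6%A7%98") yields the three bytes E6 A7 98, which
first_code_point turns into U+69D8, the single Japanese character in
the second URL.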

> Erm ... lookup tables aren't good for unicode are they? I'd have
> thought a tiny predicate function that just checks the codepoint
> against known Chinese ranges would work better.

You are right. I recently read more about Unicode and became very,
very, very scared. The 'NJ' codepoint for Croatian is completely insane!
I'm pretty much terrified to do anything other than library calls. And I don't
see a word break library call in glib/gunicode.h.
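
For reference, the kind of tiny predicate suggested in the quote might
look roughly like the sketch below. The function name is_cjk_ideograph
is invented here, and the list covers only the main ideograph blocks
(it is an assumption that these are the ranges that matter; kana,
Hangul and the later extension blocks are deliberately left out).

```cpp
#include <cstdint>

// Hypothetical predicate: true if cp falls in one of the main CJK
// ideograph blocks. Not exhaustive; kana and Hangul are excluded.
bool is_cjk_ideograph(std::uint32_t cp) {
    return (cp >= 0x3400  && cp <= 0x4DBF)    // CJK Extension A
        || (cp >= 0x4E00  && cp <= 0x9FFF)    // CJK Unified Ideographs
        || (cp >= 0xF900  && cp <= 0xFAFF)    // CJK Compatibility Ideographs
        || (cp >= 0x20000 && cp <= 0x2A6DF);  // CJK Extension B
}
```

This only answers "is this codepoint an ideograph"; it does not say
where the word boundaries are, which is the part I would still want a
library for.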


