[Xapian-devel] ICU
Jean-Francois Dockes
jean-francois.dockes at wanadoo.fr
Thu Apr 13 08:36:39 BST 2006
Olly Betts writes:
> On Wed, Apr 12, 2006 at 08:38:42AM +0200, Jean-Francois Dockes wrote:
> > What's wrong with iconv for encoding conversion ?
>
> The main problem is iconv_open. As the Linux iconv_open man page puts it:
>
> The values permitted for fromcode and tocode and the supported
> combinations are system dependent.
True, I think it's part of a more general problem with locale/charsets naming
(for example, would you believe that on Solaris, the charset name returned
by nl_langinfo(CODESET) in the C locale is "646" ...)
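To see what a given platform reports, something like the following can be used. It is only a minimal sketch: c_locale_codeset is a made-up helper name, and the string it returns is whatever the local C library chooses.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Hypothetical helper: return the charset name reported for the "C" locale.
// Note that this switches the process-wide locale to "C" as a side effect.
std::string c_locale_codeset() {
    std::setlocale(LC_ALL, "C");
    // The result is platform dependent; the name is not standardised.
    return nl_langinfo(CODESET);
}
```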
> The problem is that there's no standard accompanying API for discovering
> what values are supported or which combinations. So perhaps on some
> platform I can't convert from encoding X to utf-8, but I could convert
> from encoding X to Y and then Y to utf-8. Or utf-8 may not be supported
> at all. I've read before that these are genuine problems with trying to
> use iconv.
Is there really a reasonably current platform with no support for
conversion to utf-8 ? What do you want to support beyond
Linux/xBSD/Solaris/AIX/HP-UX ?
> It's also not portably documented how to spell any particular encoding -
> for GNU libiconv, it appears utf-8 is "UTF-8", but there's no assurance
> that name will work on another implementation even if utf-8 is supported.
It's also probably true that the encoding names that you retrieve from the
source documents will be quite variable too.
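One way to cope with the spelling problem on the iconv side is to try several candidate names until iconv_open accepts one. A rough sketch, assuming the caller supplies the list of spellings (open_with_aliases is a hypothetical name, not an existing API):

```cpp
#include <iconv.h>
#include <string>
#include <vector>

// Hypothetical helper: try each candidate spelling of the target encoding in
// turn and return the first descriptor iconv_open() accepts.
// Returns (iconv_t)-1 if none of the spellings is supported.
iconv_t open_with_aliases(const std::vector<std::string>& to_aliases,
                          const std::string& from) {
    for (const std::string& to : to_aliases) {
        iconv_t cd = iconv_open(to.c_str(), from.c_str());
        if (cd != (iconv_t)-1)
            return cd;  // caller must release it with iconv_close()
    }
    return (iconv_t)-1;
}
```

A call such as open_with_aliases({"UTF-8", "UTF8", "utf8"}, from) would then cover the spellings one might guess at, although which of them a particular implementation accepts remains system dependent.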
> The GNU implementation seems pretty decent - it supports a lot of
> encodings and can convert between any given pair. So one option is to
> use iconv where it's known to be decent, but use other code elsewhere.
Another option might be to always use iconv, but carry GNU libiconv as a
dependency on systems where the native implementation proves to be really
deficient ? In any case, encoding conversion can be wrapped in a few method
calls, so it might not be a big issue to switch to ICU if really needed.
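To illustrate how small such a wrapper can be, here is a sketch of a single conversion function built on iconv. The name to_utf8 and the error handling are my own assumptions, and it assumes an iconv whose input argument is declared char** as in POSIX. Callers only see the function signature, so the body could later be replaced by ICU calls.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: convert `input`, encoded as `from`, to UTF-8.
// Throws std::runtime_error if the conversion is unsupported or the input
// contains an invalid or truncated sequence.
std::string to_utf8(const std::string& input, const char* from) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error(std::string("iconv_open failed for ") + from);

    std::string output;
    char buf[4096];
    // iconv() does not modify the input bytes, but POSIX declares char**.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buf;
        std::size_t out_left = sizeof buf;
        std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        int err = errno;  // save before any call that might change it
        output.append(buf, sizeof buf - out_left);
        // E2BIG only means the output buffer filled up: loop and continue.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or incomplete input sequence");
        }
    }
    iconv_close(cd);
    return output;
}
```

For example, to_utf8("caf\xE9", "ISO-8859-1") should yield the bytes "caf\xC3\xA9", provided the local iconv knows the name "ISO-8859-1".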
I think that glib relies on libiconv, so it's only a candidate as a
wrapper (at least libiconv is required for building/installing glib on
FreeBSD).
After having a look at the ICU documentation, it does appear to be much
more complete than anything else, but also quite a large dependency to
carry.
Do you know how the different web browsers handle this issue ? I think that
OpenOffice uses ICU, and Mozilla uses all of the above plus internal code :)
Jf