[Xapian-devel] ICU

Jean-Francois Dockes jean-francois.dockes at wanadoo.fr
Thu Apr 13 08:36:39 BST 2006


Olly Betts writes:
 > On Wed, Apr 12, 2006 at 08:38:42AM +0200, Jean-Francois Dockes wrote:
 > > What's wrong with iconv for encoding conversion ?
 > 
 > The main problem is iconv_open.  As the Linux iconv_open man page puts it:
 > 
 >     The values permitted for fromcode and tocode and the supported
 >     combinations are system dependent.

True, I think it's part of a more general problem with locale/charset naming
(for example, would you believe that on Solaris, the charset name returned
by nl_langinfo(CODESET) in the C locale is "646" ...)
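
For anyone who wants to check what their own system reports, a minimal
sketch along these lines should do. The helper name c_locale_codeset is
made up for illustration, and the string you get back is whatever the local
C library chooses to call the charset, so it will differ between platforms.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: report the charset name the C library uses for the
 * "C" locale. Note that this switches the whole process to the "C" locale.
 * The returned pointer belongs to the C library and may be overwritten by
 * later calls to nl_langinfo() or setlocale(). */
static const char *c_locale_codeset(void)
{
    if (setlocale(LC_ALL, "C") == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Printing the result with puts(c_locale_codeset()) on a few different
systems is a quick way to see how much the names vary.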

 > The problem is that there's no standard accompanying API for discovering
 > what values are supported or which combinations.  So perhaps on some
 > platform I can't convert from encoding X to utf-8, but I could convert
 > from encoding X to Y and then Y to utf-8.  Or utf-8 may not be supported
 > at all.  I've read before that these are genuine problems with trying to
 > use iconv.

Is there really a reasonably current platform with no support for
conversion to utf-8? What do you want to support beyond
Linux/xBSD/Solaris/AIX/HP-UX?

 > It's also not portably documented how to spell any particular encoding -
 > for GNU libiconv, it appears utf-8 is "UTF-8", but there's no assurance
 > that name will work on another implementation even if utf-8 is supported.

It's also probably true that the encoding names that you retrieve from the
source documents will be quite variable too.
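
Since there is no API for listing what iconv supports, about the only
portable approach I can see is to probe with iconv_open() and try several
spellings of the same encoding. The sketch below assumes a caller-supplied
list of candidate names; the function name open_first_supported and the
idea of keeping an alias list are illustrative, not anything Xapian has.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: try each candidate spelling of the source encoding
 * in turn and return the first conversion descriptor that iconv_open()
 * accepts. Returns (iconv_t)-1 if no spelling is accepted. The caller
 * must iconv_close() a successful result. */
static iconv_t open_first_supported(const char *tocode,
                                    const char *const *candidates,
                                    size_t ncandidates)
{
    for (size_t i = 0; i < ncandidates; ++i) {
        iconv_t cd = iconv_open(tocode, candidates[i]);
        if (cd != (iconv_t)-1)
            return cd;
    }
    return (iconv_t)-1;
}
```

A caller might pass something like { "ISO-8859-1", "latin1", "ISO8859-1" }
for a document that declares itself as Latin-1. Which spellings are
accepted still depends on the implementation, so the list itself is
guesswork.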

 > The GNU implementation seems pretty decent - it supports a lot of
 > encodings and can convert between any given pair.  So one option is to
 > use iconv where it's known to be decent, but use other code elsewhere.

Another option might be to always use iconv, but carry GNU libiconv as a
dependency on systems where the native implementation proves to be really
deficient? In any case, encoding conversion can be wrapped in a few method
calls, so it might not be a big issue to switch to ICU if really needed.
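
To give an idea of how small such a wrapper can be, here is a rough sketch
of a single conversion function on top of iconv. The name convert_encoding
and its signature are invented for the example. It assumes the POSIX
prototype where the input argument is a char **; some older
implementations declare it as const char **, which would need a cast or a
configure check.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: convert inlen bytes at in from fromcode to tocode.
 * Returns a malloc()ed buffer (caller frees) with a trailing NUL byte and
 * stores the converted length in *outlen. Returns NULL on any failure,
 * including unsupported encoding names and invalid input. */
static char *convert_encoding(const char *tocode, const char *fromcode,
                              const char *in, size_t inlen, size_t *outlen)
{
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = inlen * 4 + 16;          /* first guess, grown on E2BIG */
    char *out = malloc(cap + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)in;               /* iconv() does not modify input */
    size_t inleft = inlen;
    char *outp = out;
    size_t outleft = cap;

    while (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        if (errno != E2BIG) {             /* EILSEQ or EINVAL: give up */
            free(out);
            iconv_close(cd);
            return NULL;
        }
        size_t used = (size_t)(outp - out);
        cap *= 2;
        char *bigger = realloc(out, cap + 1);
        if (bigger == NULL) {
            free(out);
            iconv_close(cd);
            return NULL;
        }
        out = bigger;
        outp = out + used;
        outleft = cap - used;
    }

    /* Flush any pending shift sequence for stateful encodings. For
     * simplicity a failure here is treated like any other error. */
    if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }

    iconv_close(cd);
    *outlen = (size_t)(outp - out);
    out[*outlen] = '\0';
    return out;
}
```

With an interface like this, the callers never see iconv_t, so replacing
the body with ICU calls later would not touch the rest of the code.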

I think that glib relies on libiconv, so it's only a candidate as a
wrapper (at least libiconv is required for building/installing glib on
FreeBSD). 

After having a look at the ICU documentation, it does appear to be much
more complete than anything else, but also quite a large dependency to
carry.

Do you know how the different web browsers handle this issue? I think that
OpenOffice uses ICU, and Mozilla uses all plus internal code :)

Jf



