[Xapian-devel] ICU

Olly Betts olly at survex.com
Thu Apr 13 15:14:15 BST 2006


On Thu, Apr 13, 2006 at 09:36:39AM +0200, Jean-Francois Dockes wrote:
> Olly Betts writes:
>  > The problem is that there's no standard accompanying API for discovering
>  > what values are supported or which combinations.  So perhaps on some
>  > platform I can't convert from encoding X to utf-8, but I could convert
>  > from encoding X to Y and then Y to utf-8.  Or utf-8 may not be supported
>  > at all.  I've read before that these are genuine problems with trying to
>  > use iconv.
> 
> Is there really a reasonably current platform with no support for
> conversion to utf-8 ?

I don't know.  I'd have to try every reasonably current platform to find
out, and (thanks to the way iconv is specified) I need to try converting
every supported encoding to utf-8 on each platform to be sure, except
there's no API to discover the names of every supported encoding.  Or
what "utf-8" is called.

It's not insurmountable (and I'd hope that each iconv implementation has
documentation to say what the supported encodings are and perhaps even
which pairs of conversions are supported), but it really ought to have
been standardised.
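
As a rough illustration of the probing this forces on you, here is a minimal
sketch. It assumes a POSIX-style <iconv.h>; the function name find_utf8_name
and the list of candidate spellings are made up for the example and don't come
from any standard. All it can do is ask iconv_open() whether a particular pair
of names is accepted.

```cpp
#include <iconv.h>
#include <string>

// Hypothetical helper: return the first candidate spelling of UTF-8 that
// iconv_open() accepts as the target when converting from `from`, or an
// empty string if none of them works on this platform.
std::string find_utf8_name(const std::string& from) {
    // Guessed spellings; an implementation may use a different name entirely.
    static const char* const candidates[] = { "UTF-8", "utf-8", "UTF8", "utf8" };
    for (const char* name : candidates) {
        // iconv_open() takes the target encoding first, then the source.
        iconv_t cd = iconv_open(name, from.c_str());
        if (cd != (iconv_t)-1) {
            iconv_close(cd);
            return name;
        }
    }
    return std::string();
}
```

An empty result only says that none of the guessed names worked for that one
source encoding, so it would have to be repeated for each encoding of interest.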

> What do you want to support beyond Linux/xBSD/Solaris/AIX/HP-UX ?

Incidentally I've never heard from anyone who's tried Xapian on AIX (and
IBM don't seem to have any sort of "developer access" program, which is
a little surprising given they seem very supportive of Open Source in
other ways).

But there's also Darwin/OS X (which I guess might be covered by xBSD),
DEC OSF1, IRIX, and MS Windows.

I do have access to most of these; the exceptions are AIX, IRIX, and MS
Windows.  And SourceForge's OS X boxes have been offline for months now.

>  > It's also not portably documented how to spell any particular encoding -
>  > for GNU libiconv, it appears utf-8 is "UTF-8", but there's no assurance
>  > that name will work on another implementation even if utf-8 is supported.
> 
> It's also probably true that the encoding names that you retrieve from the
> source documents will be quite variable too.

But at least there are standards which specify most of those, and it's
not a potentially different problem on every platform.

>  > The GNU implementation seems pretty decent - it supports a lot of
>  > encodings and can convert between any given pair.  So one option is to
>  > use iconv where it's known to be decent, but use other code elsewhere.
> 
> Another option might be to always use iconv, but carry GNU libiconv as a
> dependency on systems where the native implementation proves to be really
> deficient ?

That's definitely worth considering.

> In any case, encoding conversion can be wrapped in a few method
> calls, so it might not be a big issue to switch to ICU if really needed.

Yeah, that was my plan.  I wonder if ultimately we might want to have
wrappers to support several different alternatives as they probably each
have good and bad points.
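
A minimal sketch of what such a wrapper might look like, with iconv as one
backend. The class names Utf8Converter and IconvUtf8Converter are hypothetical
and not part of any existing Xapian API. It assumes the POSIX iconv() prototype
taking char** for the input buffer (some implementations declare const char**
instead), and that the target name "UTF-8" is accepted unless the caller passes
another spelling.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical interface: an ICU-based class or internal code could
// implement it just as well as the iconv backend below.
class Utf8Converter {
  public:
    virtual ~Utf8Converter() {}
    // Convert `input` from the converter's source encoding to UTF-8.
    virtual std::string to_utf8(const std::string& input) = 0;
};

// One possible backend, built on iconv.
class IconvUtf8Converter : public Utf8Converter {
    iconv_t cd;

  public:
    explicit IconvUtf8Converter(const std::string& from,
                                const std::string& utf8_name = "UTF-8")
        : cd(iconv_open(utf8_name.c_str(), from.c_str())) {
        if (cd == (iconv_t)-1)
            throw std::runtime_error("iconv can't convert " + from +
                                     " to " + utf8_name);
    }
    ~IconvUtf8Converter() { iconv_close(cd); }

    // The descriptor is owned by this object, so forbid copying.
    IconvUtf8Converter(const IconvUtf8Converter&) = delete;
    IconvUtf8Converter& operator=(const IconvUtf8Converter&) = delete;

    std::string to_utf8(const std::string& input) override {
        // Reset the conversion state left over from any previous call.
        iconv(cd, nullptr, nullptr, nullptr, nullptr);
        std::string out;
        char buf[4096];
        char* in = const_cast<char*>(input.data());
        size_t in_left = input.size();
        while (in_left > 0) {
            char* outp = buf;
            size_t out_left = sizeof(buf);
            size_t r = iconv(cd, &in, &in_left, &outp, &out_left);
            // Keep whatever was converted before iconv() stopped.
            out.append(buf, outp - buf);
            // E2BIG just means buf filled up; anything else is bad input.
            if (r == (size_t)-1 && errno != E2BIG)
                throw std::runtime_error("invalid or incomplete input");
        }
        return out;
    }
};
```

Callers would only see Utf8Converter, so which backend gets built could be
decided per platform without touching the calling code.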

> After having a look at the ICU documentation, it does appear to be much
> more complete than anything else, but also quite a large dependency to
> carry.

Yes, it does seem very comprehensive.

> Do you know how the different web browsers handle this issue ? I think that
> openoffice uses ICU, and Mozilla uses all plus internal code :)

OpenOffice definitely uses ICU.  I don't know about anything else.

Cheers,
    Olly


