[Xapian-devel] ICU

Mon Apr 10 04:44:21 BST 2006

I've just been looking at ICU with an eye to reworking the Unicode
queryparser patch to use it.  A few things have jumped out so far which
make me wonder if it's the best option.  I don't really know what the
alternatives are though (currently QueryParser uses glib's Unicode
routines).

The first is that there seems to be bad version skew.  Ubuntu breezy
(the latest release) has ICU 2.1 and 2.8 packaged, as does Debian sarge
(the latest stable release).  The latest ICU version is 3.4.1 (and
Debian unstable only has this version).  I can't seem to find what's
changed between ICU versions (except for release notes for 3.2 and 3.4
versions), so I worry this is going to be a hassle.

The second is that all the multi-statement macro definitions in their
headers are just enclosed in a block "{...}" instead of using the
familiar "do {...} while (0)" trick to avoid surprises when used in
places where an extra ";" matters.

This doesn't seem to rate a mention in the user guide, but e.g.
/usr/include/unicode/utf8.h says:

 * <em>Usage:</em>
 * ICU coding guidelines for if() statements should be followed when using these macros.
 * Compound statements (curly braces {}) must be used  for if-else-while...
 * bodies and all macro statements should be terminated with semicolon.

I don't really like the attitude that *I* have to follow *their* coding
guidelines in my own code!  If I'm contributing code to their project
then I agree it's reasonable to expect adherence to their coding
standards, but not just to use their library.

By eschewing the standard idiom for wrapping multi-statement macro
definitions, they're forcing the risk of silent miscompilation on their users.
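
To make the difference concrete, here is a minimal sketch of the two macro
styles.  SWAP_BLOCK and SWAP_DO are hypothetical macros written for this
illustration; they are not copied from the ICU headers.  The brace-only form
is fine when the caller braces the if body (as the ICU comment asks), while
the do/while(0) form also works as the unbraced body of an if/else.

```cpp
// Hypothetical macros for illustration only; not taken from ICU.

// Brace-only form: "SWAP_BLOCK(a, b);" expands to "{ ... };", that is,
// a block followed by an extra empty statement.
#define SWAP_BLOCK(a, b) { int tmp_ = (a); (a) = (b); (b) = tmp_; }

// do/while(0) form: "SWAP_DO(a, b);" expands to exactly one statement.
#define SWAP_DO(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

int already_ordered = 0;

// Unbraced if/else: works with SWAP_DO.  Substituting SWAP_BLOCK here
// would leave a stray ";" before the "else" and fail to compile.
void order_pair(int& lo, int& hi) {
    if (lo > hi)
        SWAP_DO(lo, hi);
    else
        ++already_ordered;
}

// Braced if/else: the brace-only macro is usable, but only because the
// caller wrapped the body in a compound statement.
void order_pair_braced(int& lo, int& hi) {
    if (lo > hi) {
        SWAP_BLOCK(lo, hi);
    } else {
        ++already_ordered;
    }
}
```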

Finally, they use UTF-16 as their internal representation whereas we
want to use UTF-8.  For the queryparser, this isn't an issue as
there are macros for decoding UTF-8 characters and for saying if a
Unicode code point is upper case, etc.  But in omindex we want to
be able to convert between encodings, and it looks like we have to
go via UTF-16.  I suspect we'd end up writing our own ISO-8859-1
to UTF-8 converter (that's probably the most common conversion we'd
need).
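
For what it's worth, a hand-rolled ISO-8859-1 to UTF-8 conversion is small,
because every ISO-8859-1 byte maps to the Unicode code point with the same
value, so each input byte becomes either one or two UTF-8 bytes.  A rough
sketch follows; latin1_to_utf8 is a hypothetical name, not an existing
Xapian, omindex or ICU function.

```cpp
#include <string>

// Hypothetical helper for illustration; not part of Xapian or ICU.
// Converts ISO-8859-1 (Latin-1) bytes to UTF-8.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(in[i]);
        if (ch < 0x80) {
            // ASCII range: encoded as the same single byte.
            out += static_cast<char>(ch);
        } else {
            // U+0080..U+00FF: two bytes, 110000xx 10xxxxxx.
            out += static_cast<char>(0xC0 | (ch >> 6));
            out += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return out;
}
```

For example, the Latin-1 byte 0xE9 (e-acute) comes out as the two bytes
0xC3 0xA9.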

Cheers,
    Olly