[Xapian-discuss] Re: Japanese / UTF-8 support

Olly Betts olly at survex.com
Wed Sep 6 13:44:01 BST 2006


On Sun, Aug 27, 2006 at 01:29:01AM +0100, Olly Betts wrote:
> On Sat, Aug 12, 2006 at 09:34:50PM -0700, Jeff Breidenbach wrote:
> >  * How can I best help with CJK ? The more concrete the suggestion,
> >     the better.
> 
> One useful job which doesn't require particular knowledge of Xapian is
> to check all the filtering tools which omindex can use and discover the
> runes required to get them to produce UTF-8 output (or failing that,
> UTF-16 or UTF-32 but I suspect Unix tools are more likely to produce
> UTF-8 if they do Unicode at all).

I've now pretty much done this.  The worst gap is that there doesn't
seem to be a PostScript-to-text converter which handles anything beyond
ISO-8859-1.

In my reworked code, omindex now converts everything to UTF-8.  What
mostly remains is adjusting the word tokenisation to match the UTF-8
QueryParser patch, and deciding what to do about character sets in
scriptindex.
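
As a rough illustration of the "convert everything to UTF-8" step, here
is a minimal Python sketch.  It is not the omindex code (which is C++);
the function name to_utf8 and the fall-back to ISO-8859-1 are
assumptions made purely for the example.

```python
def to_utf8(raw: bytes, charset: str = "iso-8859-1") -> bytes:
    """Re-encode filter output from `charset` into UTF-8.

    Hypothetical helper for illustration only; the fall-back policy
    is an assumption, not what omindex actually does.
    """
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or invalid bytes: fall back to
        # ISO-8859-1, which maps every byte to a code point and so
        # can never fail to decode.
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")
```

For instance, to_utf8(data, "shift_jis") would turn Shift_JIS output
from a filter into UTF-8 before the text is tokenised.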

Further work is still needed to handle wide-character HTML files
(such as UTF-16), or indeed HTML in any encoding which doesn't have
ASCII as a subset.
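
The difficulty is that a charset declared in a META tag can only be
found by scanning the bytes as ASCII, which fails for UTF-16.  One
possible approach, sketched below in Python, is to check for a byte
order mark before parsing.  This is only an illustration under stated
assumptions: the name html_to_utf8 and the default charset are invented
for the example, and files without a BOM are not handled specially.

```python
import codecs

# Longer BOMs must be tested first: the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def html_to_utf8(data: bytes, default_charset: str = "iso-8859-1") -> bytes:
    """Return `data` re-encoded as UTF-8, using a BOM if one is present.

    Hypothetical helper; `default_charset` is an assumed fall-back for
    files with no BOM.
    """
    for bom, charset in _BOMS:
        if data.startswith(bom):
            # Strip the BOM, decode with the matching codec, re-encode.
            text = data[len(bom):].decode(charset, errors="replace")
            return text.encode("utf-8")
    # No BOM: assume an ASCII-compatible encoding.
    return data.decode(default_charset, errors="replace").encode("utf-8")
```

Once the text is UTF-8, the existing ASCII-based HTML parsing can run
on it unchanged.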

Anyway, I'll create a "unicode" branch in SVN soon so people can try out
the new code.

Cheers,
    Olly


