[Xapian-discuss] Re: Japanese / UTF-8 support
Olly Betts
olly at survex.com
Wed Sep 6 13:44:01 BST 2006
On Sun, Aug 27, 2006 at 01:29:01AM +0100, Olly Betts wrote:
> On Sat, Aug 12, 2006 at 09:34:50PM -0700, Jeff Breidenbach wrote:
> > * How can I best help with CJK ? The more concrete the suggestion,
> > the better.
>
> One useful job which doesn't require particular knowledge of Xapian is
> to check all the filtering tools which omindex can use and discover the
> runes required to get them to produce UTF-8 output (or failing that,
> UTF-16 or UTF-32 but I suspect Unix tools are more likely to produce
> UTF-8 if they do unicode at all).
I've now pretty much done this. The worst gap is that there doesn't
seem to be a PostScript to text converter which handles anything beyond
ISO-8859-1.
The current state of my reworked code is that everything is being
converted to UTF-8 in omindex. That mostly leaves adjusting the
word tokenisation in line with the UTF-8 QueryParser patch, and
deciding what to do about character sets in scriptindex.
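
To illustrate the "convert everything to UTF-8" step, here is a minimal
sketch in Python. It is not the omindex code (omindex is written in C++);
the function name to_utf8 and the declared_charset parameter are
hypothetical, and the fallback order (declared charset, then UTF-8, then
ISO-8859-1) is an assumption rather than what the reworked code does.

```python
from typing import Optional


def to_utf8(raw: bytes, declared_charset: Optional[str] = None) -> bytes:
    """Re-encode filter output as UTF-8 (hypothetical helper, not omindex's API)."""
    candidates = []
    if declared_charset:
        # Trust the charset the filter or document claims, if there is one.
        candidates.append(declared_charset)
    # Otherwise (or if that fails) check whether the bytes are valid UTF-8.
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes invalid in that charset.
            continue
    # Assumed last resort: ISO-8859-1 maps every byte to a code point,
    # so this decode cannot fail (though the result may be wrong).
    return raw.decode("iso-8859-1").encode("utf-8")
```

For example, to_utf8(data, "euc-jp") would re-encode EUC-JP output from a
filter, while bytes with no usable charset information fall through to the
ISO-8859-1 guess.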
Further work is still needed to handle HTML files in wide-character
encodings (such as UTF-16), or indeed HTML in any encoding which doesn't
have ASCII as a subset.
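
The difficulty with such files is that a charset declared in a META tag
cannot be read by scanning for ASCII bytes. One possible approach, sketched
below in Python, is to look for a byte order mark before parsing. This is
an assumption about how it could be done, not a description of the planned
code; sniff_bom and html_to_utf8 are hypothetical names, and files without
a BOM are simply left for some other detection step.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def html_to_utf8(raw):
    """Convert BOM-marked HTML to UTF-8; return None when there is no BOM."""
    name, skip = sniff_bom(raw)
    if name is None:
        # No BOM: the caller has to fall back to other detection.
        return None
    # Drop the BOM itself, decode the rest, and re-encode as UTF-8.
    return raw[skip:].decode(name).encode("utf-8")
```

Once the document is in UTF-8, the existing ASCII-oriented HTML parsing can
proceed as it does for other input.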
Anyway, I'll create a "unicode" branch in SVN soon so people can try out
the new code.
Cheers,
Olly