[Xapian-discuss] Re: Japanese / UTF-8 support

Olly Betts olly at survex.com
Sun Aug 27 01:29:01 BST 2006


On Sat, Aug 12, 2006 at 09:34:50PM -0700, Jeff Breidenbach wrote:
>  * The patch is still too crude to submit, but I've beaten htmlparse.cc
>   into respecting <!--htdig_noindex--><!--/htdig_noindex-->

Oops, I've already committed a patch for this.

> * Getting filesize and last modification date in summary results is
>    nice to have, but not critical. Putting on backburner.

These were trivial to add, so I've just done them.

>  * How can I best help with CJK ? The more concrete the suggestion,
>     the better.

One useful job which doesn't require particular knowledge of Xapian is
to check all the filtering tools which omindex can use and work out the
options required to get them to produce UTF-8 output (or failing that,
UTF-16 or UTF-32, but I suspect Unix tools are more likely to produce
UTF-8 if they do Unicode at all).
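
One way to check what a filter actually emits is to run its output
through a strict UTF-8 validity test.  Below is a minimal sketch in C++;
the name is_valid_utf8 is made up for illustration and is not part of
Xapian or omindex.  Note that pure ASCII output passes trivially, so the
test document needs some non-ASCII text in it to tell you anything.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xapian or omindex): return true if s is
// well-formed UTF-8.  Rejects stray or missing continuation bytes, overlong
// encodings, UTF-16 surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    // Smallest code point which legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {
            // Plain ASCII byte.
            ++i;
            continue;
        }
        std::size_t len;
        unsigned long cp;
        if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07;
        } else {
            // Continuation byte in lead position, or an invalid lead byte.
            return false;
        }
        if (n - i < len) return false;  // Sequence is truncated.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // Overlong form.
        if (cp > 0x10FFFF) return false;                  // Beyond Unicode.
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // Surrogate.
        i += len;
    }
    return true;
}
```

A filter which emits iso-8859-1 will usually fail this as soon as it
outputs an accented character, since a lone byte such as 0xE9 isn't a
valid UTF-8 sequence.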

If any can't, seek out alternative tools which can and check if they do
as good a job of dumping text for indexing.  Failing that, we can still
support formats where the converters only support iso-8859-1 by just
converting the output to UTF-8 (perhaps some formats don't support
Unicode anyway).
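
The iso-8859-1 fallback is cheap because every iso-8859-1 byte is the
Unicode code point of the same value, so the conversion needs no lookup
table.  A minimal sketch in C++, assuming the filter output has already
been read into a std::string (latin1_to_utf8 is a made-up name for
illustration, not an existing omindex function):

```cpp
#include <string>

// Hypothetical helper (not part of Xapian or omindex): convert iso-8859-1
// bytes to UTF-8.  Bytes below 0x80 are copied unchanged; bytes 0x80-0xFF
// become the two-byte UTF-8 sequence for the code point of the same value.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);  // Worst case: every byte doubles in size.
    for (unsigned char ch : in) {
        if (ch < 0x80) {
            out += static_cast<char>(ch);
        } else {
            out += static_cast<char>(0xC0 | (ch >> 6));    // Lead byte.
            out += static_cast<char>(0x80 | (ch & 0x3F));  // Continuation.
        }
    }
    return out;
}
```

For example, the single byte 0xE9 (e-acute) comes out as the two bytes
0xC3 0xA9.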

If any other tasks come to mind, I'll let the list know.

Cheers,
    Olly


