[Xapian-discuss] Japanese / UTF-8 support

Sun Aug 27 15:18:44 BST 2006

On Sat, Aug 26, 2006 at 04:09:52PM +0100, Olly Betts wrote:

> > Yeah, as far as I'm aware there's no standard on what you should do if
> > your form enctype (which tends to default to the document charset,
> > which is daft but there you go) can't cope with characters you're
> > submitting. Note that multipart/form-data copes with this properly,
> > because it allows different form fields to have different encodings.
> 
> The problem is that search forms usually want to use METHOD=GET so
> that users can bookmark the results page.

Indeed :-/

> The best approach I've found is to simply ensure that the document with
> the search form in has an encoding which can handle all unicode
> characters.  UTF-8 is the best choice since at least unaccented latin
> characters appear in human readable form in the query URL.

UTF-8 is generally the right encoding to use for most general purpose
applications these days. It has its problems, not least the political
ones (largely inherited from Unicode), but there is good support
around and it deals with a lot more of the problems than anything else
I'm aware of.

If your pages are served as charset=utf-8 (which they certainly should
be for XHTML, and is a very good idea for HTML 4), then your HTML forms
should transmit all Unicode code points through successfully.
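
As a rough illustration of what ends up in a GET query URL when the form
page is UTF-8, here is a minimal Python sketch. The base URL, the "q"
parameter name and the two helper functions are made up for the example;
they are not anything Xapian or Omega defines. It only shows that the
non-ASCII characters are percent-encoded as UTF-8 bytes while the
unaccented Latin ones stay readable, and that the query round-trips.

```python
from urllib.parse import urlencode, parse_qs


def build_search_url(base, query):
    """Build a bookmarkable GET URL; non-ASCII text is percent-encoded as UTF-8."""
    # "q" is a hypothetical parameter name chosen for this example.
    return base + "?" + urlencode({"q": query}, encoding="utf-8")


def read_query(url):
    """Recover the original query text, decoding the percent-escapes as UTF-8."""
    query_string = url.split("?", 1)[1]
    return parse_qs(query_string, encoding="utf-8")["q"][0]


# example.org is a placeholder host, not a real search service.
url = build_search_url("http://example.org/search", "日本語 xapian")
print(url)              # the Latin word stays readable in the URL
print(read_query(url))  # the Japanese text survives the round trip
```

The backend still has to know (or assume) that the bytes it receives are
UTF-8, which is why keeping the form page itself in UTF-8 matters.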

Whether you can conveniently work with them in the backend is another
matter entirely (although Python, Java, C# and C++ all make this
fairly easy; it's possible but sometimes awkward in PHP). In Ruby it's
possible too: my understanding is that there are some good libraries,
and detailed Unicode support is now being designed in.

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org