[Xapian-discuss] Japanese / UTF-8 support
Jeff Breidenbach
breidenbach at gmail.com
Thu Aug 10 07:43:34 BST 2006
I tried running omindex on the following file, which is a
UTF-8 web page with mixed English and Japanese text.
http://www.mail-archive.com/axis-user-ja@ws.apache.org/msg00058.html
An English query with Omega mostly worked. The only problem was
that the result summaries were displayed as gibberish; it looked like
UTF-8 data rendered against a Latin-1 character set. I suspect this is
easily fixed by adding a UTF-8 META tag to the search interface.
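To illustrate the kind of gibberish described here, the following is a minimal Python sketch of UTF-8 bytes being decoded as Latin-1. It is not taken from Omega or omindex; the sample string and variable names are assumptions chosen for the example.

```python
# Illustrative only: shows what happens when UTF-8 bytes are
# interpreted as Latin-1. The sample text is an assumption.
text = "皆様"

# Each of these two characters takes three bytes in UTF-8.
utf8_bytes = text.encode("utf-8")

# Latin-1 maps every byte to exactly one character, so decoding
# never fails; it just yields one wrong character per byte.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were dropped.
restored = garbled.encode("latin-1").decode("utf-8")

print(len(text), len(utf8_bytes), len(garbled))
print(restored == text)
```

If the page declares its character set as UTF-8, the browser decodes the same bytes correctly and this mismatch should not arise.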
More seriously, Japanese searches didn't seem to work at all. Cutting
and pasting a few words into the browser yielded no results. Additionally,
the UTF-8 query was escaped into character entity references; e.g.
a query for 皆様 got me a blank result page with the query listed as
&#30342;&#27096;
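The escaping described above is consistent with text being encoded into a character set that cannot represent it, with numeric character references substituted for the missing characters. Whether that is what actually happened here is an assumption. The sketch below reproduces the effect in Python using the standard `xmlcharrefreplace` error handler and then reverses it with `html.unescape`.

```python
import html

query = "皆様"

# Encoding to Latin-1 cannot represent these characters; the
# "xmlcharrefreplace" handler substitutes decimal character references.
escaped = query.encode("latin-1", errors="xmlcharrefreplace")

# Reversing the substitution recovers the original query text.
recovered = html.unescape(escaped.decode("ascii"))

print(escaped)
print(recovered == query)
```

A search backend that receives the escaped form would need to undo the references before looking up terms; otherwise it is searching for the literal ASCII string rather than the Japanese word.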
Any comments? I was really surprised, since Omega did so well
in an earlier test against a similar UTF-8 document written in Danish.
Is this a matter of polish or are there deeper barriers, like a lack of
word splitting capability for languages like Chinese/Japanese/Korean?
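To make the word-splitting concern concrete, here is a small Python sketch of a naive splitter that breaks text on non-word characters. This is a hypothetical tokenizer written for illustration, not the one Xapian or Omega uses, and the sample sentences are assumptions.

```python
import re

def naive_terms(text):
    """Hypothetical splitter: runs of word characters become terms."""
    return re.findall(r"\w+", text)

english = naive_terms("Hello world")
japanese = naive_terms("皆様こんにちは")

# English is separated by spaces, so each word becomes its own term.
print(english)

# Japanese is written without spaces, so the whole phrase stays as
# a single term and a lookup for the word 皆様 alone finds nothing.
print(japanese)
print("皆様" in japanese)
```

With a splitter like this, a query for a single Japanese word would only match if the indexed term happened to be exactly that word, which would explain empty results independently of any encoding problem.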