[Xapian-discuss] Japanese / UTF-8 support

Thu Aug 10 10:43:02 BST 2006

On Wed, Aug 09, 2006 at 11:43:34PM -0700, Jeff Breidenbach wrote:

> I tried running omindex on the following file, which is a
> UTF-8 web page with mixed English and Japanese text.

[...]

> Any comments? I was really surprised, since Omega did so well
> in an earlier test against a similar UTF-8 document written in Danish.
> Is this a matter of polish or are there deeper barriers, like a lack of
> word splitting capability for languages like Chinese/Japanese/Korean?

omindex (and the QueryParser) have somewhat primitive,
European-centric word splitting. The tricky bit is actually the
query parser: you could either require the user to specify the
language being searched and set splitting and stemming
appropriately (or auto-detect the language), or parse the query every
possible way (based on which languages exist in your database) and
merge the results somehow.
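
To make the "auto-detect" option concrete, here is a minimal sketch in
Python. It is not how omindex or the QueryParser work; the function
names (is_cjk, choose_strategy) and the choice of Unicode ranges are
assumptions made for illustration. It only looks at which Unicode
blocks the query's characters fall in and picks a splitting strategy
from that.

```python
def is_cjk(ch: str) -> bool:
    """Rough test for a CJK character (hypothetical helper).

    Covers only a few common blocks; a real implementation would
    need a more complete list.
    """
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x309F      # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


def choose_strategy(query: str) -> str:
    """Pick a splitting strategy name for a query string.

    Returns "cjk" if any character looks CJK, else "european".
    The strategy names are placeholders, not Xapian options.
    """
    return "cjk" if any(is_cjk(c) for c in query) else "european"
```

A mixed query such as "日本語 search" would be classified as "cjk"
here, which is one reason the "parse it every way and merge" option
may still be needed for mixed-language input.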

Ultimately it would be nice to support this kind of thing. The first
step is UTF-8 support, which Olly has been working on. On top of that
we'd need a word splitting algorithm for CJK (and anything else that we
can't throw English-like rules at). My understanding is that there
isn't a good stemming strategy for CJK, so we'd just disable it there.
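
As a sketch of what a CJK-aware splitter could look like, the Python
below uses overlapping character bigrams for CJK runs, which is a
common fallback when no dictionary-based segmenter is available. This
is an assumption on my part, not something Xapian does at the time of
writing, and split_terms is a hypothetical name. Non-CJK handling is
deliberately crude (ASCII letters and digits only, lowercased), and no
stemming is applied to the CJK terms.

```python
import re

# Hiragana + Katakana, CJK Unified Ideographs, Hangul Syllables.
# This range list is an illustrative assumption, not exhaustive.
_CJK = "\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"

# Group 1: a run of CJK characters. Group 2: an ASCII word or number.
_TOKEN = re.compile(f"([{_CJK}]+)|([A-Za-z0-9]+)")


def split_terms(text):
    """Split text into index terms (hypothetical helper).

    CJK runs become overlapping two-character terms; a run of a
    single character is kept as is. ASCII words are lowercased.
    Everything else (punctuation, accented letters) is skipped.
    """
    terms = []
    for match in _TOKEN.finditer(text):
        cjk_run, word = match.groups()
        if cjk_run:
            if len(cjk_run) == 1:
                terms.append(cjk_run)
            else:
                # Overlapping bigrams: "ABC" -> "AB", "BC".
                terms.extend(
                    cjk_run[i:i + 2] for i in range(len(cjk_run) - 1)
                )
        else:
            terms.append(word.lower())
    return terms
```

The same function would have to be applied to both documents and
queries so that the bigrams generated at index time match those
generated at search time.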

It's a lot of work to make this sort of thing work automatically. If
anyone knows about word splitting for CJK, that'd be a huge help.

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org