[Xapian-discuss] Indexing Chinese

Wed Jun 28 03:40:08 BST 2006

On Tue, Jun 27, 2006 at 01:17:46PM -0400, Alex Deucher wrote:
> Has anyone ever indexed documents of Chinese characters?  What's the
> best way to break down the text for indexing?  I know context is
> important.

I understand it's possible to algorithmically split a string of Chinese
characters into words to some extent, but that it's a bit complex and
error-prone.

> My current plan is to index each character and then do
> phrase queries on combinations of characters.  Is there a better
> approach?

That could be slow, though it depends on the data.  I don't know
enough about Chinese to say if it's likely to be OK or not.

You could try using an n-gram approach: just index adjacent pairs (or
triples, etc.) of characters as terms, and perform the same process on
the query.  Essentially the same idea as yours, except that the
combinations of characters are indexed as terms themselves rather than
matched with phrase queries.
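
As a rough illustration of the n-gram idea, here is a minimal sketch in
plain Python.  It does not use Xapian at all: the in-memory dictionary
stands in for a real index, and the names char_ngrams, build_index and
search are made up for this example.  It also assumes whitespace can
simply be dropped before splitting, which may not suit real data.

```python
def char_ngrams(text, n=2):
    """Return the overlapping n-character terms of text, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return []
    if len(chars) < n:
        # Too short to form a full n-gram: keep what there is as one term.
        return ["".join(chars)]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in char_ngrams(text, n):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents containing every n-gram of the query."""
    terms = char_ngrams(query, n)
    if not terms:
        return set()
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "我喜欢中文搜索",
    2: "中文分词很复杂",
    3: "搜索引擎",
}
index = build_index(docs)
matches = search(index, "中文搜索")
```

Note that requiring every pair to be present is looser than a phrase
match: a document containing all the pairs in a different order or in
different places would also match, so this is only an approximation of
what a phrase query gives you.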

Cheers,
    Olly