[Xapian-discuss] Indexing Chinese

Alex Deucher alexdeucher at gmail.com
Wed Jun 28 03:54:01 BST 2006


On 6/27/06, Olly Betts <olly at survex.com> wrote:
> On Tue, Jun 27, 2006 at 01:17:46PM -0400, Alex Deucher wrote:
> > Has anyone ever indexed documents of Chinese characters?  What's the
> > best way to break down the text for indexing?  I know context is
> > important.
>
> I understand it's possible to algorithmically split a string of Chinese
> characters into words to some extent, but that it's a bit complex and
> error prone.
>
> > My current plan is to index each character and then do
> > phrase queries on combinations of characters.  Is there a better
> > approach?
>
> That could be slow, though it depends on the data.  I don't know
> enough about Chinese to say if it's likely to be OK or not.
>

heh... neither do I.  I suppose I should enquire with someone who
knows more about Chinese.

> You could try using an n-gram approach - just index adjacent pairs (or
> triples, etc) of characters as terms, and perform the same process on
> the query.  Essentially the same idea as yours really, except indexing
> the combinations of characters as terms.

Not a bad idea.
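As a rough illustration of the n-gram suggestion above, here is a minimal sketch in Python (illustrative only — this is not the Xapian API; term prefixes, mixed Latin/CJK text, and positional data are all left out):

```python
def ngrams(text, n=2):
    """Split a run of CJK characters into overlapping n-grams.

    Applied identically at index time and query time, matching
    documents and queries end up sharing the same n-gram terms,
    so no word segmentation is needed.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# e.g. the four-character string "ABCD" yields the bigram
# terms "AB", "BC", "CD" -- each stored as an ordinary term.
```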

Thanks,

Alex

>
> Cheers,
>     Olly
>
