[Xapian-discuss] Indexing Chinese
Alex Deucher
alexdeucher at gmail.com
Wed Jun 28 03:54:01 BST 2006
On 6/27/06, Olly Betts <olly at survex.com> wrote:
> On Tue, Jun 27, 2006 at 01:17:46PM -0400, Alex Deucher wrote:
> > Has anyone ever indexed documents of Chinese characters? What's the
> > best way to break down the text for indexing? I know context is
> > important.
>
> I understand it's possible to algorithmically split a string of Chinese
> characters into words to some extent, but that it's a bit complex and
> error prone.
>
> > My current plan is to index each character and then do
> > phrase queries on combinations of characters. Is there a better
> > approach?
>
> That could be slow, though it depends on the data. I don't know
> enough about Chinese to say if it's likely to be OK or not.
>
heh... neither do I. I suppose I should enquire with someone who
knows more about Chinese.
> You could try using an n-gram approach - just index adjacent pairs (or
> triples, etc) of characters as terms, and perform the same process on
> the query. Essentially the same idea as yours really, except indexing
> the combinations of characters as terms.
Not a bad idea.
Thanks,
Alex
>
> Cheers,
> Olly
>
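For illustration, the n-gram scheme Olly describes might be sketched as follows. This is a hypothetical example in plain Python (the `ngrams` helper is made up for this sketch, and the term-generation step is shown without Xapian itself): adjacent pairs of characters are indexed as terms, and the query is split the same way.

```python
def ngrams(text, n=2):
    """Split a run of characters into overlapping n-grams.

    With n=2, each adjacent pair of characters becomes one term,
    as suggested in the thread. The same split is applied to both
    documents and queries.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Index side: each bigram of the document text becomes a term.
doc_terms = set(ngrams("中文信息检索"))
# -> {"中文", "文信", "信息", "息检", "检索"}

# Query side: apply the same process, then require every query
# bigram to be present (an AND over the generated terms).
query_terms = set(ngrams("信息"))
matches = query_terms <= doc_terms  # True for this document
```

The trade-off against indexing single characters is that bigram terms are more selective, so phrase-style matching needs no positional lookups, at the cost of a larger vocabulary.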