Indexing Chinese?

Robert Stepanek rsto at fastmailteam.com
Thu Oct 4 08:27:18 BST 2018


We are using a fork of Xapian for this at the Cyrus IMAP project [1]. It uses the word segmentation from the ICU (International Components for Unicode) library for Chinese, Japanese and Korean [2]. We have been running it in production at FastMail for about two years and are generally happy with it: the search results improved compared to using ngrams. There's a pull request open to merge the patch upstream [3], but it is still to be decided how best to add this to Xapian. Currently, the upstream patch doesn't build cleanly on the master branch, but I'll look into making it compile cleanly next week.
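
To illustrate why word segmentation can beat ngrams, here is a minimal Python sketch of the ngram baseline. It is not Xapian's tokenizer and not the code from our fork. The function names are made up for this example, and the character check covers only the basic CJK Unified Ideographs block, which is a simplifying assumption.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block counts as CJK here.
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text: str) -> list:
    """Return overlapping two-character terms for every run of CJK characters.

    A run consisting of a single character is emitted as a one-character term.
    Non-CJK text is ignored in this sketch.
    """
    terms = []
    run = []
    # The trailing space is a sentinel that flushes a run ending at end of text.
    for ch in text + " ":
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            terms.append(run[0])
        else:
            # Empty runs produce nothing; longer runs produce len(run) - 1 bigrams.
            terms.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run = []
    return terms


if __name__ == "__main__":
    print(cjk_bigrams("我喜欢北京"))
```

For the input above the bigram terms include "欢北", which straddles two words and carries no meaning of its own. A dictionary-based segmenter such as the one in ICU would be expected to produce word-like terms instead (something like 我 / 喜欢 / 北京), so such spurious terms are not indexed at all.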

Cheers,
Robert

[1] https://github.com/cyrusimap/xapian
[2] http://site.icu-project.org/
[3] https://github.com/xapian/xapian/pull/114

On Thu, Oct 4, 2018, at 05:20, Eric Abrahamsen wrote:
> My second (and hopefully last) question: is there any more news on
> indexing Chinese characters and words? Searching online mostly returns
> results from a decade ago or more, with nothing very conclusive. How
> close is this to possible?
> 
> For the time being I'm doing some pre-processing on long strings of
> Chinese, breaking on punctuation in order to avoid errors. But I have
> some large corpora of Chinese texts that in the future I'd like to index
> properly.
> 
> Thanks,
> Eric
> 
>



More information about the Xapian-discuss mailing list