Indexing Chinese?

Thu Oct 4 16:31:12 BST 2018

That's a coincidence! And very good news. I've subscribed to the PR, and
will look forward to seeing it land!

Thanks a lot,
Eric

On 10/04/18 03:27 AM, Robert Stepanek wrote:
> We are using a fork of Xapian for this at the Cyrus IMAP project [1],
> relying on the ICU library's word segmentation for Chinese, Japanese
> and Korean [2]. We have been running it in production at FastMail for
> about 2 years and are generally happy with it: the search results
> improved compared to using ngrams. There's a pull request open to
> merge the patch upstream [3], but how best to add this to Xapian is
> still to be decided. Currently, the upstream patch doesn't build
> cleanly on the master branch, but I'll look into making it compile
> cleanly next week.
>
> Cheers,
> Robert
>
> [1] https://github.com/cyrusimap/xapian
> [2] http://site.icu-project.org/
> [3] https://github.com/xapian/xapian/pull/114
>
> On Thu, Oct 4, 2018, at 05:20, Eric Abrahamsen wrote:
>> My second (and hopefully last) question: is there any more news on
>> indexing Chinese characters and words? Searching online mostly returns
>> results from a decade ago or more, with nothing very conclusive. How
>> close is this to being possible?
>> 
>> For the time being I'm doing some pre-processing on long strings of
>> Chinese, breaking on punctuation in order to avoid errors. But I have
>> some large corpora of Chinese texts that in the future I'd like to index
>> properly.
>> 
>> Thanks,
>> Eric
>> 
>>
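
For readers following the thread, here is a rough sketch of the two
stopgap techniques mentioned above: breaking long runs of Chinese on
punctuation, and generating character ngrams (bigrams here) as index
terms. It is only an illustration, not code from Xapian, Cyrus or
either poster. The function names and the punctuation set are
assumptions made up for this example, and it does not attempt the
ICU-based word segmentation described in the patch.

```python
import re

# Assumption: a small, non-exhaustive set of CJK and ASCII punctuation
# marks (plus whitespace) to break on. Extend it for real corpora.
_BREAKS = re.compile(r"[，。！？；：、,.!?;:\s]+")


def split_on_punctuation(text):
    """Split text into chunks separated by punctuation or whitespace."""
    return [chunk for chunk in _BREAKS.split(text) if chunk]


def char_bigrams(chunk):
    """Return overlapping two-character terms for one chunk.

    A chunk of a single character is returned as a one-item list.
    """
    if len(chunk) < 2:
        return [chunk] if chunk else []
    return [chunk[i:i + 2] for i in range(len(chunk) - 1)]


def bigram_terms(text):
    """Combine both steps: split on punctuation, then emit bigrams."""
    terms = []
    for chunk in split_on_punctuation(text):
        terms.extend(char_bigrams(chunk))
    return terms
```

Splitting first keeps bigrams from spanning a sentence boundary. Proper
word segmentation, as in the patch discussed above, would replace
char_bigrams with a dictionary-based segmenter.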