How to set environment variable XAPIAN_CJK_NGRAM?

Robert Stepanek rsto at fastmailteam.com
Tue Feb 13 08:26:00 GMT 2018


On Tue, Feb 13, 2018, at 02:32, Peter Zhao wrote:
> At 2018-02-12 20:00:02, xapian-discuss-request at lists.xapian.org wrote:
> >There's also a patch to add support for using libicu to find word
> >boundaries:
> >
> >https://github.com/xapian/xapian/pull/114
> >
> >That'll get merged soon hopefully (mostly we need to sort out how to
> >manage the libicu dependency - do we make it a hard dependency, or an
> >option for how to build xapian-core, etc) but if you're happy to build
> >xapian-core from source please try it and give feedback on how well
> >it works.

We have been running the CJK word boundary segmentation patch in production at FastMail for over a year and are happy with it. That said, I just realised that the PR no longer merges cleanly with the latest upstream Xapian branch. I'll fix the merge conflicts and push an update to the pull request tomorrow.

BTW: For a quick look at how ICU segments arbitrary CJK text, I wrote a small wrapper around libicu and exposed it as a web tool: https://cjkwords.com/
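
For comparison with word segmentation, here is a rough sketch of the n-gram idea behind the XAPIAN_CJK_NGRAM mode this thread asks about. It is not Xapian's actual tokenizer and does not call ICU. The function name overlapping_ngrams and the sample string are made up for illustration. It only shows how overlapping bigrams cut a run of CJK characters at every position, whereas a word-boundary segmenter tries to return dictionary words instead.

```python
def overlapping_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every overlapping n-character slice of text, in order.

    Illustrative only: a simplified stand-in for n-gram tokenisation,
    not the code Xapian runs when XAPIAN_CJK_NGRAM is set.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string shorter than n yields no n-grams at all.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    sample = "中华人民共和国"  # hypothetical sample text, 7 characters
    for gram in overlapping_ngrams(sample):
        print(gram)
```

Each bigram becomes a candidate term whether or not it is a real word, which is the trade-off that word-boundary segmentation is meant to avoid.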

Cheers,
Robert



