How to let Xapian support Chinese searching

Olly Betts olly at survex.com
Sun Feb 11 20:34:44 GMT 2018


On Sat, Feb 10, 2018 at 08:26:52PM +0800, Peter Zhao wrote:
> I installed EPrints, but it cannot search Chinese. EPrints uses
> Xapian to index data. How can I make Xapian support Chinese searching?

Current releases can index CJK text as ngrams. To enable this, pass
FLAG_CJK_NGRAM to TermGenerator when indexing and to QueryParser when
searching.
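
For illustration, here is a minimal sketch of setting the flag on both
sides using the Python bindings. It assumes the xapian module for a 1.4.x
release is installed; the helper names index_texts and search are
hypothetical, and the import is guarded only so the file still loads
where the bindings are missing. EPrints itself uses the Perl bindings, so
treat this as a model of the idea rather than a patch for EPrints.

```python
try:
    import xapian
except ImportError:  # bindings not installed; the helpers below need them
    xapian = None


def index_texts(db, texts):
    """Index each string as one document, with CJK ngram terms enabled."""
    termgen = xapian.TermGenerator()
    # Ask the term generator to split CJK runs into ngrams.
    termgen.set_flags(xapian.TermGenerator.FLAG_CJK_NGRAM)
    for text in texts:
        doc = xapian.Document()
        doc.set_data(text)
        termgen.set_document(doc)
        termgen.index_text(text)
        db.add_document(doc)


def search(db, query_string, limit=10):
    """Parse the query with the matching flag and return stored texts."""
    parser = xapian.QueryParser()
    parser.set_database(db)
    # The same flag is needed here so the query is split the same way.
    flags = xapian.QueryParser.FLAG_DEFAULT | xapian.QueryParser.FLAG_CJK_NGRAM
    query = parser.parse_query(query_string, flags)
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    return [m.document.get_data().decode("utf-8")
            for m in enquire.get_mset(0, limit)]
```

The point to take away is that the flag has to be set in both places: if
only the indexer or only the query parser splits CJK text into ngrams,
the terms will not line up.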

You can also activate this flag without code changes by setting
environment variable XAPIAN_CJK_NGRAM to a non-empty value (don't forget
to export it if you're setting it via the shell).
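
If the indexer or search front end is started from a script rather than
an interactive shell, the variable has to end up in that process's
environment. A small sketch of one way to do that from Python follows;
the helper name run_with_cjk_ngrams is hypothetical, and the command you
pass is whatever starts your indexer.

```python
import os
import subprocess


def run_with_cjk_ngrams(command):
    """Run a command with XAPIAN_CJK_NGRAM set in its environment."""
    env = dict(os.environ)
    # Any non-empty value enables the flag; "1" is an arbitrary choice.
    env["XAPIAN_CJK_NGRAM"] = "1"
    return subprocess.run(command, env=env, capture_output=True,
                          text=True, check=True)
```

The variable must be set for both the process that indexes and the
process that searches, for the same reason the flag must be passed to
both TermGenerator and QueryParser.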

There's also a patch to add support for using libicu to find word
boundaries:

https://github.com/xapian/xapian/pull/114

Hopefully that'll get merged soon. Mostly we need to sort out how to
manage the libicu dependency (do we make it a hard dependency, or an
option for how to build xapian-core, etc). If you're happy to build
xapian-core from source, please try it and give feedback on how well
it works.

An algorithm to identify word boundaries should result in a
significantly smaller database than indexing ngrams, but it's reliant on
the algorithm finding the correct boundaries.  If the wrong boundaries
are identified that can lead to both false positives and false
negatives.
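
To make the trade-off concrete, here is a toy model in plain Python. It
is a simplification and not Xapian's actual tokenizer: it assumes ngram
indexing emits every single character plus every overlapping pair, and
the "segmented" term lists are written by hand to stand in for the output
of a word boundary algorithm.

```python
def cjk_ngrams(text, n=2):
    """Return single characters plus overlapping n-grams of the text."""
    terms = list(text)
    terms += [text[i:i + n] for i in range(len(text) - n + 1)]
    return terms


def matches(query_terms, doc_terms):
    """A document matches if it contains every query term."""
    return set(query_terms) <= set(doc_terms)


doc = "中文搜索"

# ngram indexing: 4 characters + 3 pairs = 7 terms for 4 characters.
ngram_terms = cjk_ngrams(doc)

# Hand-written segmentations (hypothetical algorithm output).
good_segmentation = ["中文", "搜索"]      # 2 terms, correct boundaries
bad_segmentation = ["中", "文搜", "索"]   # wrong boundaries

query = "搜索"
found_by_ngrams = matches(cjk_ngrams(query), ngram_terms)   # True
found_by_good = matches([query], good_segmentation)         # True
found_by_bad = matches([query], bad_segmentation)           # False
```

The segmented index stores far fewer terms, but the last case shows a
false negative: once the boundaries are wrong, the word "搜索" is simply
not in the index, whereas the ngram index still finds it.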

Cheers,
    Olly
