Pull requests: CJK words and Snippet generator

rsto at paranoia.at
Tue Jul 26 14:06:07 BST 2016


Hi,

The Cyrus IMAP mail server uses Xapian as its search engine. Recently,
FastMail sponsored the implementation of two Xapian features: CJK word
splitting and a generator for search snippets. I've been working on both
features, and we would be happy to get them merged into Xapian master.

The CJK word tokenizer uses the word segmentation algorithms of the
International Components for Unicode library (ICU), which brings support
for Japanese, Korean and Thai, among others. The feature co-exists with
n-grams (which remain the default for CJK text) and the code is
unit-tested [1]. In the feature branch, libicu is currently required for
the build, but that would be easy to make optional.
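
To illustrate why word segmentation can matter alongside n-grams, here
is a minimal sketch of overlapping character bigrams over a run of CJK
code points. It is not Xapian's implementation and not the code in the
feature branch; the function name cjk_bigrams is hypothetical and the
input is assumed to be already decoded to UTF-32.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the overlapping character bigrams of a run of CJK code points.
// A run of exactly one code point is returned as a single term.
// (Hypothetical helper for illustration; not part of Xapian.)
std::vector<std::u32string> cjk_bigrams(const std::u32string& run) {
    std::vector<std::u32string> out;
    if (run.size() < 2) {
        if (!run.empty()) out.push_back(run);
        return out;
    }
    // Slide a two-code-point window over the run, one step at a time.
    for (std::size_t i = 0; i + 1 < run.size(); ++i) {
        out.push_back(run.substr(i, 2));
    }
    return out;
}
```

For the run 東京都庁 this yields 東京, 京都 and 都庁. The middle bigram
happens to be a word of its own (Kyoto) that the text does not contain,
which is the kind of spurious match a dictionary-based word segmenter is
meant to avoid.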

The search snippet generator was developed independently of Xapian's
MSet::snippet generator. It orders snippets within a document by their
relevance to the search terms, supports CJK and handles punctuation. The
unit tests in the commit [2] outline its main capabilities.
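
As a rough idea of what ordering snippets by relevance means, the
following sketch splits a text into sentence-like fragments and ranks
them by how many distinct search terms each one contains. It is a
simplified illustration under stated assumptions, not the code in [2]:
the names normalise, score and rank_snippets are hypothetical, terms are
assumed to be lowercase ASCII, and CJK text is not handled.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Lowercase ASCII letters and drop everything that is not alphanumeric,
// so that "Search." compares equal to "search".
static std::string normalise(const std::string& word) {
    std::string out;
    for (unsigned char c : word) {
        if (std::isalnum(c)) out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Score a fragment by the number of distinct search terms it contains.
static std::size_t score(const std::string& fragment,
                         const std::set<std::string>& terms) {
    std::set<std::string> hits;
    std::istringstream words(fragment);
    std::string w;
    while (words >> w) {
        const std::string n = normalise(w);
        if (terms.count(n)) hits.insert(n);
    }
    return hits.size();
}

// Split text at '.', '!' and '?', drop fragments without any match and
// return the rest, best match first. Ties keep document order.
std::vector<std::string> rank_snippets(const std::string& text,
                                       const std::set<std::string>& terms) {
    std::vector<std::string> fragments;
    std::string current;
    for (char c : text) {
        // Skip whitespace at the start of a fragment.
        if (current.empty() && std::isspace(static_cast<unsigned char>(c)))
            continue;
        current += c;
        if (c == '.' || c == '!' || c == '?') {
            fragments.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) fragments.push_back(current);

    fragments.erase(
        std::remove_if(fragments.begin(), fragments.end(),
                       [&](const std::string& f) { return score(f, terms) == 0; }),
        fragments.end());
    std::stable_sort(fragments.begin(), fragments.end(),
                     [&](const std::string& a, const std::string& b) {
                         return score(a, terms) > score(b, terms);
                     });
    return fragments;
}
```

Calling rank_snippets with the terms {"cyrus", "search", "xapian"} on a
text puts the fragment that mentions all three ahead of fragments that
mention only one or two, and omits fragments that mention none.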

Would you be interested in these features? Just let us know what would
be required to get them merged. At a minimum, I'd rebase the current
forks onto the latest master. I'll be happy to answer any questions and
address change requests.

Cheers,
Robert

[1] CJK word splitter:
https://github.com/rsto/xapian/commit/16dd9b232eb9b6e7346184db0790b6655180492c
[2] Snippet generator:
https://github.com/rsto/xapian/commit/979757c161ec912c98f2fe87595d7529740e3247#diff-832f4feb83e5ba60ebb64b4d8b93d93fR1


