Pull requests: CJK words and Snippet generator

James Aylett james-xapian at tartarus.org
Wed Jul 27 23:22:55 BST 2016


On Tue, Jul 26, 2016 at 03:06:07PM +0200, rsto at paranoia.at wrote:

> The Cyrus IMAP mail server uses Xapian as search engine. Recently,
> FastMail has sponsored implementation of two Xapian features: CJK word
> splitting and a generator for search snippets. I've been working on both
> features and we would be happy to get them merged into Xapian master.
> 
> Would you be interested in these features? Just let us know what would
> be required to get them merged. As a minimum I'd rebase the current
> forks against latest master. I'll be happy to answer any questions or
> change requests.

This sounds great! I know sufficiently little about CJK that I won't
try to comment on that at all :)

I think I'm right in saying that your snippet generator:

a. needs driving separately (so it's not integrated in the way
Xapian::MSet::snippet() is). Is the intention that it replaces the
current snippet system as something more sophisticated?

I wonder if we can arrange suitable defaults to use your
implementation with the older API, and come up with a newer API that
allows a SnippetGenerator class to be used from the MSet.

(That might allow us to refactor the existing implementation and
provide both, if they have different strengths. I can't remember much
detail of the current one, offhand.)
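
To make the shape of that idea concrete, here is a minimal sketch of an
older-style call that keeps working through a default, next to a newer
overload that takes a generator object. Everything in it is hypothetical:
SnippetGenerator, TruncatingGenerator, HighlightingGenerator and MockMSet
are stand-ins invented for illustration, not Xapian classes, and neither
generator reflects how the existing or the proposed implementation
actually works.

```cpp
#include <cstddef>
#include <string>

// Hypothetical interface (not part of Xapian): anything that can turn a
// document text into a snippet of roughly `length` bytes.
class SnippetGenerator {
  public:
    virtual ~SnippetGenerator() {}
    virtual std::string generate(const std::string& text,
                                 std::size_t length) const = 0;
};

// Stand-in for "the existing behaviour": plain truncation.
class TruncatingGenerator : public SnippetGenerator {
  public:
    std::string generate(const std::string& text,
                         std::size_t length) const override {
        if (text.size() <= length) return text;
        return text.substr(0, length) + "...";
    }
};

// Stand-in for "a more sophisticated generator": wraps the first
// occurrence of a term in pre_match/post_match markers.
class HighlightingGenerator : public SnippetGenerator {
    std::string term_, pre_match_, post_match_;

  public:
    HighlightingGenerator(const std::string& term,
                          const std::string& pre_match,
                          const std::string& post_match)
        : term_(term), pre_match_(pre_match), post_match_(post_match) {}

    std::string generate(const std::string& text,
                         std::size_t length) const override {
        // Naive byte-based truncation; a real implementation would have to
        // avoid cutting a multi-byte UTF-8 sequence in half.
        std::string out = text.substr(0, length);
        std::size_t pos =
            term_.empty() ? std::string::npos : out.find(term_);
        if (pos == std::string::npos) return out;
        // Insert the closing marker first so `pos` stays valid.
        out.insert(pos + term_.size(), post_match_);
        out.insert(pos, pre_match_);
        return out;
    }
};

// Mock result set, only here to show the two entry points side by side.
class MockMSet {
  public:
    // Older-style call: no generator argument, a default one is used.
    std::string snippet(const std::string& text,
                        std::size_t length = 500) const {
        return snippet(text, TruncatingGenerator(), length);
    }

    // Newer-style call: the caller supplies the generator.
    std::string snippet(const std::string& text,
                        const SnippetGenerator& generator,
                        std::size_t length = 500) const {
        return generator.generate(text, length);
    }
};
```

With this arrangement existing callers of the short form would not need
to change, while a caller who wants different behaviour passes a
generator explicitly, e.g. a HighlightingGenerator built with "<b>" and
"</b>" as its markers.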

b. only works with UTF-8 (I assume that the pre_match & post_match
strings, and inter_snippet, should also be in UTF-8?)

This probably just needs noting in the docstrings.
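
For callers who want to check their marker strings before handing them
over, a small well-formedness test along these lines might be enough.
This is only a sketch: is_valid_utf8 is a hypothetical helper written
for this example, not a function provided by Xapian or by the snippet
generator, and whether the generator itself validates its inputs is not
something I know.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects truncated sequences, stray continuation bytes, overlong
// encodings, UTF-16 surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {  // ASCII
            ++i;
            continue;
        }
        std::size_t extra;  // number of continuation bytes expected
        unsigned long cp;   // code point being decoded
        if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;
            cp = c & 0x1F;
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
            cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
            cp = c & 0x07;
        } else {
            // Stray continuation byte, overlong lead (C0/C1) or lead > F4.
            return false;
        }
        if (n - i <= extra) return false;  // sequence runs off the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;  // overlong 3-byte form
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += extra + 1;
    }
    return true;
}
```

A caller could run this over pre_match, post_match and inter_snippet
once at setup time and refuse to proceed if any of them fails.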

A good start would certainly be rebasing against master and opening a
pull request for each on GitHub (this will trigger Travis CI builds,
which is a helpful first pass in making sure everything is good; it runs
against both G++ and Clang, which can expose some weirdnesses).

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org


