Pull requests: CJK words and Snippet generator

Mon Sep 19 10:27:01 BST 2016

Olly, sorry for my delayed reply.

On Mon, 12 Sep 2016, at 05:32, Olly Betts wrote:
> On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > I think my main concerns are about efficiency [...]
> > For the proposed term coverage, the implementation looks up and inserts
> > terms into a map. That makes it slightly less efficient with an overall
> > complexity of O(n*log n).
> By "efficiency", I'm meaning in terms of wall-clock time, not the
> computational complexity of the algorithms.
> I'm not quite clear what your "n" above is -

n is the number of terms in a document. I haven't done any systematic
testing of wall-clock time for the new feature. If that is crucial for
going ahead with the patch, I could create a couple of benchmarks.
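
To make the complexity claim concrete, here is a minimal sketch of the
lookup-and-insert pattern I mean. It is not the code from the patch:
`count_terms` is a hypothetical name, and I'm assuming the terms are
already available as a vector of strings.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not part of the patch or of Xapian's API.
// Counts how often each term occurs among the n terms of a document.
std::map<std::string, std::size_t>
count_terms(const std::vector<std::string>& terms)
{
    std::map<std::string, std::size_t> counts;
    for (const std::string& term : terms) {
        // operator[] looks the term up and inserts it with a zero count
        // if it is missing. On a std::map that takes O(log n) comparisons,
        // so the whole loop needs O(n*log n) of them.
        ++counts[term];
    }
    return counts;
}
```

Calling `count_terms({"cat", "dog", "cat"})` gives a map with two
entries. This only illustrates the asymptotic behaviour; it says nothing
about wall-clock time, which would still need the benchmarks.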

> The tokenisation of the snippet uses the same code as indexing does, so
> CJK should just work automatically, though it looks like there aren't
> currently any testcases for this, so it would be worth checking (and
> worth adding some)
> 
> Normalisation could perhaps be done with a custom stemming algorithm.
> The indexing pipeline doesn't currently have a separate stage for
> normalisation and for stemming.

I'll investigate both options with tests and will merge those tests into
Xapian's unit tests where it makes sense. I won't be able to get to it
until next week, though.

> The main issue is that new codepoints get added (and the odd one changes
> category) in each new Unicode version, so if you're using different
> Unicode versions at index time and at search time, the terms you get
> won't match each other.  [...]
> If Xapian's CJK::codepoint_is_cjk() and ICU have different ideas of
> what's in CJK, the results might be odd, and will likely vary depending
> on the exact combination of Unicode versions

Since ICU currently only word-breaks text that `codepoint_is_cjk` has
already identified as CJK, there shouldn't be a gap between search and
indexing. Still, I understand your concerns about having two Unicode
implementations. Our specific experience aside, migrating Xapian's
Unicode handling to ICU might be a good choice, and I could support
that. Admittedly, its modules are a long way from what Xapian's
UTF8Iterator currently provides.
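
To illustrate the gating I mean, here is a rough sketch. It does not
call ICU and is not Xapian's code: `is_cjk_sketch` is a hypothetical
stand-in for `codepoint_is_cjk` that only knows a few blocks, and
`split_runs` is a made-up helper name.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for Xapian's CJK::codepoint_is_cjk().
// Assumption: it only covers a few blocks, far fewer than the real one.
bool is_cjk_sketch(char32_t ch)
{
    return (ch >= 0x3040 && ch <= 0x30FF) ||  // Hiragana and Katakana
           (ch >= 0x4E00 && ch <= 0x9FFF) ||  // CJK Unified Ideographs
           (ch >= 0xAC00 && ch <= 0xD7A3);    // Hangul syllables
}

// Split text into maximal runs and flag each run as CJK or not.
// Only runs flagged true would be handed to a word breaker.
std::vector<std::pair<bool, std::u32string>>
split_runs(const std::u32string& text)
{
    std::vector<std::pair<bool, std::u32string>> runs;
    for (char32_t ch : text) {
        bool cjk = is_cjk_sketch(ch);
        // Start a new run whenever the classification changes.
        if (runs.empty() || runs.back().first != cjk)
            runs.emplace_back(cjk, std::u32string());
        runs.back().second.push_back(ch);
    }
    return runs;
}
```

As long as indexing and search share the same predicate, both hand the
same runs to the word breaker. Whether the breaker then splits a given
run identically under different Unicode versions is a separate question,
and that is the part your concern is about.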

Cheers,
Robert