Pull requests: CJK words and Snippet generator

Mon Sep 12 04:32:48 BST 2016

On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > I think my main concerns are about efficiency (since that a major
> > motivation for the current implementation, so slowing it down would be
> > annoying), and whether we can just make this the standard behaviour
> > rather than adding an option.
> 
> The current implementation is O(n) and I took care to keep it at that.
> For the proposed term coverage, the implementation looks up and inserts
> terms into a map. That makes it slightly less efficient with an overall
> complexity of O(n*log n). I could change this to use an unordered_map
> (which is on average constant), but this this could degenerate to O(n^2)
> in worst case. If n*log n is acceptable, one could keep the current
> linear heuristic as default and let users choose the slightly less
> performant snippet generator with a flag?

By "efficiency", I'm meaning in terms of wall-clock time, not the
computational complexity of the algorithms.

I'm not quite clear what your "n" above is - complexity can be a
concern, but some things aren't really able to grow without bound (e.g.
number of terms in the query) and so performance for smaller n can
actually be what matters, and O() deliberately only considers the limit
as n grows ever larger.

> > What are the other features the fastmail snippet generator has which
> > the current one lacks?  I did study the fastmail one, but that was some
> > time ago and I don't remember clearly.
> 
> Off the top of my head: normalization of terms and CJK support. With
> normalization I mean that the API allows to inject a custom preprocessor
> for document and search terms before they are matched (that's mainly
> useful due to a quirk in Cyrus search). To be honest, I am not sure if
> these features even need to be migrated. I'll run a couple of tests if
> the current Xapian snippet generator covers them already.

The tokenisation of the snippet uses the same code as indexing does, so
CJK should just work automatically, though it looks like there aren't
currently any testcases for this, so it would be worth checking (and
worth adding some).

Normalisation could perhaps be done with a custom stemming algorithm.
The indexing pipeline doesn't currently have a separate stage for
normalisation and for stemming.

> > For the CJK segmentation, the ICU dependency makes things more complex,
> > so I suspect that'll take longer to sort out.  For example, Xapian
> > currently has its own Unicode support, but that presumably means we
> > could end up using two different versions of Unicode, so perhaps we
> > ought to use ICU for everything if we're using it at all.
> 
> At the moment, the pull request only builds on ICU's word segmentation
> and keeps using Xapian for character set conversions so I don't see much
> risk of conflicting implementations.

Xapian-core doesn't actually convert character sets - text is assumed to
be UTF-8 (when it's not handled opaquely).

The main issue is that new codepoints get added (and the odd one changes
category) in each new Unicode version, so if you're using different
Unicode versions at index time and at search time, the terms you get
won't match each other.  We've so far addressed that by sticking to the
same Unicode version within each release branch, but once you add ICU to
the mix there's potentially a second version in play, and with a
different update schedule.

If Xapian's CJK::codepoint_is_cjk() and ICU have different ideas of
what's in CJK, the results might be odd, and will likely vary depending
on the exact combination of Unicode versions.

> In a similar scenario, I recently
> replaced Cyrus' custom charset support with ICU but we noticed
> performance degradation for our specific use cases. We ended up porting
> back the fast, custom-built codepaths for UTF-8 and fall back to ICU for
> other charsets. That's not to say that ICU isn't a viable choice, but
> it'd require a thorough assessment.

Hmm, not the most promising data point.  Anyway, will try to get the
snippet patch merged first.

Cheers,
    Olly