Pull requests: CJK words and Snippet generator

Tue Oct 4 00:37:49 BST 2016

On Mon, 19 Sep 2016, at 20:27, rsto at paranoia.at wrote:
> Olly, sorry for my delayed reply.
> 
> On Mon, 12 Sep 2016, at 05:32, Olly Betts wrote:
> > On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > > I think my main concerns are about efficiency [...]
> > > For the proposed term coverage, the implementation looks up and inserts
> > > terms into a map. That makes it slightly less efficient with an overall
> > > complexity of O(n*log n).
> > By "efficiency", I'm meaning in terms of wall-clock time, not the
> > computational complexity of the algorithms.
> > I'm not quite clear what your "n" above is -
> 
> n is the number of terms in a document. I haven't done systematic
> testing of wall-clock time for the new feature. If it is crucial to go
> ahead with the patch, I could create a couple of benchmarks.

Is there a good dataset to run benchmarks against?  We'll be testing this shortly on FastMail, but there will be enough confounding factors that it won't be a realistic benchmark of just the individual changes to Xapian.

> > The tokenisation of the snippet uses the same code as indexing does, so
> > CJK should just work automatically, though it looks like there aren't
> > currently any testcases for this, so it would be worth checking (and
> > worth adding some)
> > 
> > Normalisation could perhaps be done with a custom stemming algorithm.
> > The indexing pipeline doesn't currently have a separate stage for
> > normalisation and for stemming.
> 
> I'll investigate both options with tests and will merge them into
> Xapian's unit tests where it makes sense. I won't be able to come up
> with it until next week, though.
> 
> > The main issue is that new codepoints get added (and the odd one changes
> > category) in each new Unicode version, so if you're using different
> > Unicode versions at index time and at search time, the terms you get
> > won't match each other.  [...] If Xapian's CJK::codepoint_is_cjk() and ICU have different ideas of
> > what's in CJK, the results might be odd, and will likely vary depending
> > on the exact combination of Unicode versions

I guess my question is: how much churn is there in reality?  Assuming that existing codepoints never change CJKness and you're always using a newer version of Unicode at search time than at index time, I think this risk goes away. The only codepoints the two versions can disagree on are ones added after the index was built, and those were never indexed in the first place.
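
One rough way to put a number on that churn, as a sketch rather than a real measurement: CPython's unicodedata module ships a frozen Unicode 3.2.0 table (unicodedata.ucd_3_2_0) next to its current one, so you can diff the general category of every codepoint between the two. General category is only a proxy for CJKness, the result depends on which Unicode version your Python was built with, and churn() is a name made up for this example.

```python
import sys
import unicodedata

OLD = unicodedata.ucd_3_2_0  # frozen Unicode 3.2.0 database shipped with CPython
NEW = unicodedata            # whatever Unicode version this Python was built against


def churn(first=0, last=sys.maxunicode):
    """Split codepoints whose general category differs between OLD and NEW.

    Returns (newly_assigned, reclassified):
      newly_assigned - unassigned ("Cn") in OLD, assigned in NEW
      reclassified   - assigned in OLD, but with a different category in NEW
    """
    newly_assigned = []
    reclassified = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        before = OLD.category(ch)
        after = NEW.category(ch)
        if before == after:
            continue
        if before == "Cn":
            newly_assigned.append(cp)
        else:
            reclassified.append(cp)
    return newly_assigned, reclassified


if __name__ == "__main__":
    added, changed = churn()
    print("assigned since 3.2.0:", len(added))
    print("changed category since 3.2.0:", len(changed))
```

If the second list turns out to be short relative to the first, that would support the assumption above; the same diff restricted to whatever property defines CJKness would be the real test.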

Making sure Xapian and ICU agree on what is CJK is necessary of course, but hopefully that could be done in a few hours of machine time just by throwing every possible codepoint at both libraries and asking them :)
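
The brute-force comparison could look roughly like this. The two classifiers below are hypothetical stand-ins with made-up ranges; the real check would call Xapian's CJK::codepoint_is_cjk() and the corresponding ICU property lookup instead.

```python
import sys


def disagreement_ranges(is_cjk_a, is_cjk_b, limit=sys.maxunicode):
    """Return inclusive (start, end) codepoint ranges where the two predicates differ."""
    ranges = []
    start = None
    for cp in range(limit + 1):
        differs = is_cjk_a(cp) != is_cjk_b(cp)
        if differs and start is None:
            start = cp                      # a run of disagreement begins
        elif not differs and start is not None:
            ranges.append((start, cp - 1))  # the run ended on the previous codepoint
            start = None
    if start is not None:
        ranges.append((start, limit))       # run extends to the last codepoint
    return ranges


# Hypothetical stand-ins, NOT the real Xapian or ICU tables.
def classifier_a(cp):
    return 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF


def classifier_b(cp):
    return 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x309F


if __name__ == "__main__":
    for lo, hi in disagreement_ranges(classifier_a, classifier_b):
        print("U+%04X..U+%04X" % (lo, hi))
```

Reporting ranges rather than individual codepoints keeps the output readable when a whole block differs.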

Robert is in Australia visiting the FastMail office to co-work with us for a couple of months, and I'd love to get this Xapian integration work done during this time.  We're also looking to release Cyrus IMAPd version 3.0 some time in the next few months, and it would be great to not depend on too many custom patches!  Ideally I'd like to be running vanilla upstream Xapian libraries on FastMail's production rather than keeping a separate branch as well.

Cheers,

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm