Pull requests: CJK words and Snippet generator

Olly Betts olly at survex.com
Tue Sep 6 08:16:15 BST 2016


On Thu, Aug 18, 2016 at 03:31:46PM +0200, rsto at paranoia.at wrote:
> On Thu, Aug 11, 2016, at 13:19, rsto at paranoia.at wrote:
> > The CJK word segmentation and snippet pull requests both pass Travis
> > since middle/end of last week. Did you find time to look at them?
> 
> just checking in if you found time to look at the PRs?

I've not yet managed to find time to do more than skim through them -
I'm still a bit back-logged from my post-1.4.0-release break.

But GSoC has now wrapped up, so hopefully I can finish catching up soon.

> It'd be nice to know a tentative timeline, so I can plan if to build
> next features on top of our local fork or the upstream PRs.

The snippet patch seems like something we should be able to get merged
and backported for 1.4.x fairly efficiently.  1.4.1 is already overdue,
so that's not a realistic target, but 1.4.2 or 1.4.3 might be.

I think my main concerns are about efficiency (since that's a major
motivation for the current implementation, so slowing it down would be
annoying), and whether we can just make this the standard behaviour
rather than adding an option.  It's not a deliberate choice that it
doesn't do this already - when I was implementing it I actually wanted
to favour snippets containing more different terms over ones with
repetitions of the same term, but I failed to come up with an efficient
way to do that.  Your approach looks very promising from the quick look
I took.
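
To make "favour more different terms over repetitions" concrete, here is
a rough sketch of the idea only.  It is neither the current Xapian code
nor the approach in the pull request; the function name best_window and
the fixed-width sliding window are assumptions made purely for
illustration.  It scores each window by the number of distinct query
terms it contains and uses the total match count only to break ties,
while keeping the work per word constant.

```python
def best_window(words, query_terms, width):
    """Return (start, (distinct, total)) for the best width-word window.

    Hypothetical helper for illustration: a window's score is the number
    of distinct query terms in it, with the total number of matches used
    only as a tie-breaker.
    """
    query = set(query_terms)
    counts = {}  # query term -> occurrences inside the current window
    total = 0    # total query-term occurrences inside the current window
    best = (0, (0, 0))
    for i, word in enumerate(words):
        # Add the word entering the window on the right.
        if word in query:
            counts[word] = counts.get(word, 0) + 1
            total += 1
        # Remove the word leaving the window on the left, if any.
        drop = i - width
        if drop >= 0 and words[drop] in query:
            old = words[drop]
            counts[old] -= 1
            total -= 1
            if counts[old] == 0:
                del counts[old]
        start = max(0, i - width + 1)
        score = (len(counts), total)
        # Tuples compare element by element, so distinct terms win first.
        if score > best[1]:
            best = (start, score)
    return best


words = "cat cat cat sat on the mat near the dog and cat".split()
start, score = best_window(words, ["cat", "dog"], 3)
print(words[start:start + 3], score)  # ['dog', 'and', 'cat'] (2, 2)
```

The window "cat cat cat" has three matches but only one distinct term,
so it loses to "dog and cat", which has two.  Each word is added to and
removed from the counts once, so the scan stays linear in the length of
the text.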

What are the other features the fastmail snippet generator has which
the current one lacks?  I did study the fastmail one, but that was some
time ago and I don't remember clearly.

For the CJK segmentation, the ICU dependency makes things more complex,
so I suspect that'll take longer to sort out.  For example, Xapian
currently has its own Unicode support, but that presumably means we
could end up using two different versions of Unicode, so perhaps we
ought to use ICU for everything if we're using it at all.

Cheers,
    Olly


