Pull requests: CJK words and Snippet generator

rsto at paranoia.at rsto at paranoia.at
Wed Sep 7 13:30:16 BST 2016


On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> I think my main concerns are about efficiency (since that's a major
> motivation for the current implementation, so slowing it down would be
> annoying), and whether we can just make this the standard behaviour
> rather than adding an option.

The current implementation is O(n) and I took care to keep it that way.
For the proposed term coverage, the implementation looks up and inserts
terms in a map. That makes it slightly less efficient, with an overall
complexity of O(n*log n). I could change this to use an unordered_map
(whose operations are constant time on average), but this could
degenerate to O(n^2) in the worst case. If n*log n is acceptable, one
could keep the current linear heuristic as the default and let users
choose the slightly slower snippet generator with a flag?
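
To illustrate where the log factor comes from, here is a minimal
sketch. It is not the code from the pull request: the function name
term_coverage and its signature are hypothetical, and it assumes
coverage means "number of distinct query terms seen in the text".

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of Xapian or the pull request.
// Returns how many distinct query terms occur in doc_terms.
std::size_t term_coverage(const std::vector<std::string>& doc_terms,
                          const std::set<std::string>& query_terms) {
    // Ordered map: each lookup-or-insert costs O(log n), so the loop
    // is O(n*log n) overall. Swapping in std::unordered_map would make
    // each step O(1) on average but O(n) in the worst case.
    std::map<std::string, std::size_t> seen;  // term -> occurrences
    for (const std::string& term : doc_terms) {
        if (query_terms.count(term) != 0) {
            ++seen[term];
        }
    }
    return seen.size();
}
```

For the document terms {"a", "b", "a", "c"} and the query terms
{"a", "c", "z"} this counts two covered terms ("a" and "c").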

> What are the other features the fastmail snippet generator has which
> the current one lacks?  I did study the fastmail one, but that was some
> time ago and I don't remember clearly.

Off the top of my head: normalization of terms and CJK support. By
normalization I mean that the API allows injecting a custom preprocessor
for document and search terms before they are matched (that's mainly
useful due to a quirk in Cyrus search). To be honest, I am not sure
whether these features even need to be migrated. I'll run a couple of
tests to see whether the current Xapian snippet generator covers them
already.
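
As a rough sketch of what such an injectable preprocessor could look
like (this is not the fastmail or Xapian API; Normalizer, ascii_lower
and terms_match are hypothetical names, and ASCII lowercasing merely
stands in for whatever normalization a caller needs):

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <string>

// Hypothetical hook type: maps a raw term to its normalized form.
using Normalizer = std::function<std::string(const std::string&)>;

// Example normalizer (an assumption for illustration): ASCII lowercase.
std::string ascii_lower(const std::string& term) {
    std::string out(term);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return out;
}

// Applies the same normalizer to the document term and the search term
// before comparing them. An empty normalizer means exact matching.
bool terms_match(const std::string& doc_term,
                 const std::string& query_term,
                 const Normalizer& normalize) {
    if (!normalize) {
        return doc_term == query_term;
    }
    return normalize(doc_term) == normalize(query_term);
}
```

With ascii_lower injected, "Xapian" matches "xapian"; with an empty
Normalizer the two terms are compared byte for byte and do not match.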

> For the CJK segmentation, the ICU dependency makes things more complex,
> so I suspect that'll take longer to sort out.  For example, Xapian
> currently has its own Unicode support, but that presumably means we
> could end up using two different versions of Unicode, so perhaps we
> ought to use ICU for everything if we're using it at all.

At the moment, the pull request only builds on ICU's word segmentation
and keeps using Xapian for character set conversions, so I don't see
much risk of conflicting implementations. In a similar scenario, I
recently replaced Cyrus' custom charset support with ICU, but we noticed
performance degradation for our specific use cases. We ended up porting
back the fast, custom-built codepaths for UTF-8 and falling back to ICU
for other charsets. That's not to say that ICU isn't a viable choice,
but it'd require a thorough assessment.
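
The fast-path-plus-fallback arrangement can be sketched roughly as
follows. This is not the Cyrus code: to_utf8, is_utf8_name and the
Converter callback are hypothetical, and the callback only stands in
for a generic converter such as one built on ICU.

```cpp
#include <cctype>
#include <functional>
#include <string>

// Hypothetical fallback: converts bytes in the named charset to UTF-8.
// In a real build this is where a general library would be called.
using Converter =
    std::function<std::string(const std::string& bytes,
                              const std::string& charset)>;

// Case-insensitive check for the two common spellings of the name.
bool is_utf8_name(const std::string& charset) {
    std::string lower;
    for (unsigned char c : charset) {
        lower.push_back(static_cast<char>(std::tolower(c)));
    }
    return lower == "utf-8" || lower == "utf8";
}

// Fast path for UTF-8 input, generic fallback for everything else.
// This sketch passes UTF-8 through unchanged; a real fast path would
// also validate the byte sequence.
std::string to_utf8(const std::string& bytes,
                    const std::string& charset,
                    const Converter& fallback) {
    if (is_utf8_name(charset)) {
        return bytes;
    }
    return fallback(bytes, charset);
}
```

The point of the split is that the common case never reaches the
general converter, so its cost is only paid for the other charsets.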

Cheers,
Robert





