[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator

Tue Jul 19 16:31:31 BST 2011

#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  richard  
     Type:  enhancement  |       Status:  assigned 
 Priority:  normal       |    Milestone:  1.2.x    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  normal       |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------

Comment(by olly):

 Thanks for your work on this - the latest patch is a big step forwards.

 A couple of critical issues though:

  * This really needs test coverage to make sure it is actually working as
 intended, and continues to do so in the future.

  * The new code and changes to existing code don't match the coding style
 used in the Xapian codebase.

 And some other thoughts, in no particular order:

  * queryparser/cjk/cjk-tokenizer.h isn't listed in
 queryparser/Makefile.mk, so won't get shipped.  The cjk-LICENSE file won't
 get shipped either, but I think it would be better to insert that text at
 the start of each of the files it applies to, so it's completely clear
 what the licence of each file is.

  * I've not done any benchmarking, but I'm a little worried at the cost of
 CJK::codepoint_is_cjk() since it does a lot of comparisons and it looks
 like it gets called for every character.  However, I notice that many of
 the ranges checked are actually contiguous and so could be merged into a
 smaller number of larger ranges - the compiler may do this for us, but it
 would be better not to rely on that.  Also the last range is there twice!
 That reduces 23 ranges to 9 - if this still proves to be an issue, we can
 probably add an extra flag bit to the Unicode tables.

  * unicode_char_t - types ending _t are reserved by POSIX, so we shouldn't
 go creating new ones.  Elsewhere we just use unsigned for this, so I'd
 suggest just doing the same.

  * ~tokenizer() is just the default empty destructor, so there's no
 benefit to explicitly defining it.

  * Generating the token stream TERM AND TERM ... AND TERM seems
 problematic - for example, there are higher precedence operators than AND,
 so there are probably cases which won't be parsed as you intend, and parse
 errors might end up confusingly referring to an AND when none appears in
 the query string.  I'd suggest generating a CJKTERM token (or perhaps a
 TERM token with a "cjk" flag set) and then generating the n-grams when
 this token gets converted to a Query object.

-- 
Ticket URL: <http://trac.xapian.org/ticket/180#comment:17>
Xapian <http://xapian.org/>
Xapian