[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator

Sat Jul 23 15:25:20 BST 2011

#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  richard  
     Type:  enhancement  |       Status:  assigned 
 Priority:  normal       |    Milestone:  1.2.x    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  normal       |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------

Comment(by olly):

 The codepoint_is_cjk() overhead is constant, but that constant is '''per
 character processed''', so "constant" is misleading - it's really O(n),
 where n is the size of the text processed.  And we often process a lot of
 characters.  But indeed the main concern right now is not to introduce a
 speed regression for everyone, and the early exit for codepoints
 < 0x2e80 pretty much guarantees that.
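
 To make the early exit concrete, here is a minimal sketch of such a check.
 Only the function name and the 0x2e80 threshold come from this ticket; the
 ranges listed are real Unicode blocks, but which blocks the actual patch
 tests is an assumption made for illustration.

```cpp
#include <cstdint>

// Illustrative sketch only, not the code from the patch under review.
// The set of ranges is an assumption; only the 0x2E80 cut-off is from the
// ticket discussion.
constexpr bool codepoint_is_cjk(std::uint32_t ch) {
    // Early exit: any codepoint below U+2E80 (ASCII, Latin, Cyrillic, ...)
    // fails the first comparison, so non-CJK text pays for one test only.
    return ch >= 0x2E80 && (
        ch <= 0x2EFF ||                     // CJK Radicals Supplement
        (ch >= 0x3040 && ch <= 0x30FF) ||   // Hiragana and Katakana
        (ch >= 0x4E00 && ch <= 0x9FFF) ||   // CJK Unified Ideographs
        (ch >= 0xAC00 && ch <= 0xD7AF));    // Hangul Syllables
}
```

 The cost is still paid once per character, so it stays O(n) overall, but
 for text with no CJK content each character costs a single comparison.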

 I already gave my suggested approach to the !QueryParser tokenisation
 above, i.e.:

 > I'd suggest generating a CJKTERM token (or perhaps a TERM token with a
 > "cjk" flag set) and then generating the n-grams when this token gets
 > converted to a Query object.

 In other words, don't generate the n-grams in the lexer; do it when the
 parser accepts this CJKTERM (or TERM with is_cjk set) and then just
 generate a Query object directly from the generated set of n-grams.
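
 A rough sketch of that split, kept independent of the Xapian API: the
 lexer hands over one token carrying the whole CJK run, and the n-grams are
 only produced when that token is turned into terms.  The Token struct,
 split_utf8() and terms_for_token() are hypothetical names invented for
 this example, and the default of n = 2 is an assumption.  How the
 resulting terms get combined into a Query object is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: the lexer emits ONE token for a whole CJK run.
struct Token {
    std::string text;  // UTF-8 text of the token
    bool is_cjk;       // set by the lexer for a run of CJK characters
};

// Split UTF-8 text into one string per codepoint (assumes valid UTF-8).
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        if (lead >= 0xF0) len = 4;
        else if (lead >= 0xE0) len = 3;
        else if (lead >= 0xC0) len = 2;
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}

// Called when the parser accepts the token: n-grams are generated here,
// not in the lexer, so the parser never sees a faked token stream.
std::vector<std::string> terms_for_token(const Token& tok, std::size_t n = 2) {
    assert(n > 0);
    std::vector<std::string> terms;
    if (tok.text.empty()) return terms;
    if (!tok.is_cjk) {
        terms.push_back(tok.text);  // ordinary term, passed through as-is
        return terms;
    }
    std::vector<std::string> chars = split_utf8(tok.text);
    if (chars.size() < n) {
        terms.push_back(tok.text);  // run shorter than n: keep it whole
        return terms;
    }
    // Overlapping n-grams: characters [i, i + n) for each valid start i.
    for (std::size_t i = 0; i + n <= chars.size(); ++i) {
        std::string gram;
        for (std::size_t j = 0; j < n; ++j) gram += chars[i + j];
        terms.push_back(gram);
    }
    return terms;
}
```

 With this shape the grammar only needs to know about one extra token
 kind (or one flag on TERM), and the n-gram size can change without
 touching the lexer or the parser rules.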

 I think any approach which involves faking a token stream is potentially
 problematic.

 A "CJK" operator might work, but just generating a single token is
 simpler.  Why generate a special token stream in the lexer only to have to
 recognise it in the parser?

-- 
Ticket URL: <http://trac.xapian.org/ticket/180#comment:19>
Xapian <http://xapian.org/>