[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator

Xapian nobody at xapian.org
Fri Jul 22 04:29:37 BST 2011


#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  richard  
     Type:  enhancement  |       Status:  assigned 
 Priority:  normal       |    Milestone:  1.2.x    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  normal       |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------

Comment(by bschaefer):

 I took a look at CJK::codepoint_is_cjk() and reduced it to 9 checks, and
 also added 'if (p < 0x2E80) return false' at the top. So now, when it runs
 into more common (non-CJK) text, it only performs a single check. It is
 still constant time, so I don't see this being an issue when it comes to
 speed, but I do see why it is very important when there is no CJK text.
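
 To illustrate the early-exit idea, here is a minimal sketch. The function
 name and the particular ranges are assumptions for illustration (a handful
 of well-known Unicode blocks), not the 9 checks from the actual patch; only
 the 'p < 0x2E80' guard is taken from the description above.

```cpp
// Sketch only: is_cjk_codepoint_sketch and its range list are hypothetical,
// not the ranges used by CJK::codepoint_is_cjk() in the patch.
bool is_cjk_codepoint_sketch(unsigned p) {
    // Early exit: Latin, Greek, Cyrillic, etc. all lie below U+2E80, so
    // common non-CJK text costs a single comparison.
    if (p < 0x2E80) return false;
    return (p <= 0x2EFF) ||                 // CJK Radicals Supplement
           (p >= 0x3000 && p <= 0x9FFF) ||  // CJK Symbols and Punctuation
                                            // through CJK Unified Ideographs
           (p >= 0xAC00 && p <= 0xD7AF) ||  // Hangul Syllables
           (p >= 0xF900 && p <= 0xFAFF) ||  // CJK Compatibility Ideographs
           (p >= 0x20000 && p <= 0x2A6DF);  // CJK Unified Ideographs Ext. B
}
```

 With the guard first, ASCII input never reaches the range comparisons.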

 One question I have is about your last statement. It looks like wherever
 you generate the n-grams, there is going to need to be an OP in between
 them. If AND is not what you are looking for, what else would you suggest
 putting between all the newly generated terms? A couple of ideas:

  * Use the highest-precedence OP, or make a new CJK OP that also has the
 highest precedence, to put in between the terms after generating the
 n-grams. Any errors can then be reported against the CJK OP.

  * Surround it with brackets ( BRA, KET ), although those are not in the
 original query string, and use the default OP. This should keep whatever
 is generated stuck together even if a higher-precedence OP surrounds it.

  * Example (default OP_OR ): Xapian::Query((hello:(pos=1) AND (万:(pos=2)
 OR 万众:(pos=3) OR 众:(pos=4))))
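
 For reference, the terms in the example above follow a simple pattern:
 each character, followed by the bigram starting at that character. A
 minimal sketch of that pattern, assuming the text has already been decoded
 to code points (the function name is hypothetical and this is not the
 patch's tokenizer):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: cjk_ngrams_sketch is a hypothetical helper that emits, for
// each position, the single character and then the two-character sequence
// starting there (if one exists).
std::vector<std::u32string> cjk_ngrams_sketch(const std::u32string& text) {
    std::vector<std::u32string> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        out.push_back(text.substr(i, 1));      // unigram
        if (i + 1 < text.size())
            out.push_back(text.substr(i, 2));  // overlapping bigram
    }
    return out;
}
```

 For the two-character input 万众 this yields 万, 万众, 众, in the same order
 as the positions in the example query.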

 The problem I see is what to put in between the generated terms. I think
 the key point is that whatever is generated from the n-grams can't be
 accidentally attached to something else with a higher precedence, which
 brackets should solve. It might also help to keep a CJK flag around, so
 that when an error occurs it can refer to that instead of the OP put in
 between.
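
 The bracketing idea can be sketched on plain strings. This only shows the
 grouping, not the real parser: the helper name is hypothetical, and the
 actual change would emit BRA/KET tokens inside the query parser rather than
 build text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: group_terms_sketch is a hypothetical helper. It joins the
// generated terms with one operator and wraps the result in brackets so an
// enclosing higher-precedence operator cannot capture just one of them.
std::string group_terms_sketch(const std::vector<std::string>& terms,
                               const std::string& op) {
    std::string out = "(";
    for (std::size_t i = 0; i < terms.size(); ++i) {
        if (i != 0) out += " " + op + " ";
        out += terms[i];
    }
    out += ")";
    return out;
}
```

 Without the brackets, 'hello AND 万 OR 万众 OR 众' would bind 'hello AND 万'
 first whenever AND has higher precedence than OR, which is the accidental
 attachment described above.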

 I could also be overlooking something, but I am currently going through
 the code to look for other options.

 I'll also start adding test cases to get good test coverage and show that
 it is working as intended, and will keep adding to them. I will also try
 to break it or produce unintended results from the query parser.

 Everything else you mentioned has already been changed, and I will look
 out for any improper coding style to fix.

 Hope this can get fixed soon :)

-- 
Ticket URL: <http://trac.xapian.org/ticket/180#comment:18>
Xapian <http://xapian.org/>