[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator

Xapian nobody at xapian.org
Wed Aug 24 15:39:33 BST 2011


#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  richard  
     Type:  enhancement  |       Status:  assigned 
 Priority:  normal       |    Milestone:  1.3.0    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  normal       |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------

Comment(by olly):

 I've applied the work from the git branch to SVN trunk (r16052 and r16053
 to fix a memory leak I missed) and backported to 1.2 (r16054 and r16055).
 Currently this new code is only enabled if the environment variable
 XAPIAN_CJK_NGRAM is set to a non-empty value.  Ubuntu were keen to get
 this support for their upcoming release, and this way we can provide a
 1.2.x package which stays compatible for other applications, yet easily
 lets the applications which want CJK support enable it.  We'll still need
 to sort out a proper API for enabling this (and any additional CJK modes
 which are added in the future).
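
 As an illustration of the "set to a non-empty value" condition, here is a
 minimal C++ sketch.  The helper name cjk_ngram_enabled() is hypothetical
 and this is not necessarily how the committed code performs the check;
 only the variable name XAPIAN_CJK_NGRAM comes from the comment above.

```cpp
#include <cstdlib>

// Hypothetical helper (not part of the Xapian API): returns true only if
// XAPIAN_CJK_NGRAM is present in the environment AND is not the empty string.
bool cjk_ngram_enabled() {
    const char* value = std::getenv("XAPIAN_CJK_NGRAM");
    return value != nullptr && *value != '\0';
}
```

 Under this reading, an application would opt in by being started with the
 variable set to any non-empty value (for example XAPIAN_CJK_NGRAM=1), while
 an unset or empty variable leaves the existing behaviour unchanged.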

 The substantial changes from the patch here are:

  * Fixed dereferencing of an end Utf8Iterator (probably harmless in
 practice)
  * Removed an unnecessary extra check for a character being CJK
  * Term positions are now only assigned to 1-grams (gives natural
 positions and reduces database size)
  * Quoted CJK phrases now work
  * No intermediate vector of n-grams is built, which saves space and is
 faster (~11% saving in total indexing time in a quick test)
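
 To make the positions and "no intermediate vector" points concrete, here
 is a rough, self-contained C++ sketch.  It is not the committed code: the
 names NgramSink and emit_cjk_ngrams are hypothetical, it assumes n-grams of
 length 1 and 2 only, it works on already-decoded code points rather than a
 Utf8Iterator, and it uses position 0 to mean "no position" purely as a
 convention of this example.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical callback type: receives each n-gram and its term position.
// In this sketch a position of 0 means "index without positional data".
using NgramSink = std::function<void(const std::u32string&, unsigned)>;

// Stream the 1-grams and 2-grams of `text` straight to `sink`, so no
// container of n-grams is ever built.  Only 1-grams are given a position,
// so consecutive characters end up at consecutive positions.
void emit_cjk_ngrams(const std::u32string& text, const NgramSink& sink,
                     unsigned first_pos = 1) {
    unsigned pos = first_pos;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // 1-gram: the single character, with a position.
        sink(text.substr(i, 1), pos++);
        // 2-gram: this character plus the next one, without a position.
        if (i + 1 < text.size()) {
            sink(text.substr(i, 2), 0);
        }
    }
}
```

 With a three-character input, this sketch passes the sink three 1-grams at
 positions 1, 2 and 3, interleaved with two 2-grams carrying no position.
 Having adjacent characters at adjacent positions is the kind of layout that
 lets a quoted phrase be matched on the 1-grams alone.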

-- 
Ticket URL: <http://trac.xapian.org/ticket/180#comment:29>
Xapian <http://xapian.org/>


