[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator

Fri Sep 25 11:01:02 BST 2009

#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  richard  
     Type:  enhancement  |       Status:  assigned 
 Priority:  high         |    Milestone:  1.2.0    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  normal       |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------

Old description:

> Some code to do this kind of tokenisation is now available at
> http://code.google.com/p/cjk-tokenizer/ which should probably be used as
> the basis for supporting this in Xapian.
>
> We could add this as a QueryParser/TermGenerator option without breaking
> API compatibility.  Marking for considering later in 1.1.x, but it could
> probably go in 1.2.x as it's likely to be ABI compatible too.

New description:

 Some code to do this kind of tokenisation is now available at
 http://code.google.com/p/cjk-tokenizer/ which should probably be used as
 the basis for supporting this in Xapian.

 We could add this as a !QueryParser/!TermGenerator option without breaking
 API compatibility.  Marking for considering later in 1.1.x, but it could
 probably go in 1.2.x as it's likely to be ABI compatible too.

--

Comment(by olly):

 Thanks for the patch - certainly a step forward.

 There seem to be quite a lot of whitespace changes which make it harder to
 read.  Can you regenerate it adding {{{-bB}}} to the diff options?

 The new header shouldn't be under "include/xapian", since that's for the
 installed public API headers, but that's easy enough to fix.

 It would be better to use Xapian's Unicode and UTF-8 support rather than
 adding a dependency on glib.  Not just because avoiding unnecessary
 dependencies is generally better, but also because there's scope for
 getting confused results if glib and Xapian's routines give different
 answers (as they might legitimately do if they are supporting different
 Unicode versions, or if invalid UTF-8 sequences are encountered).

 I think it's probably better to have the user select "CJKV-mode".
 Exploding every string being indexed into a vector and then scanning it to
 see if CJKV characters are present is going to add a lot of overhead to
 everyone, even those indexing non-CJKV text.  It also seems we don't want
 to completely change how we index (e.g.) English text with a Chinese name
 in it.  Alternatively, we could perhaps switch mode within a text string
 when we hit CJKV, and switch back when we hit non-CJKV.
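
 To make the mode-switching idea concrete, here is a rough standalone
 sketch.  It is not taken from the patch and deliberately uses neither glib
 nor Xapian's own UTF-8 routines, so it compiles on its own.  The names
 is_cjkv(), next_code_point() and split_runs() are hypothetical, the
 code point ranges are only an approximation of "CJKV", and the handling
 of malformed UTF-8 is an arbitrary choice made for the sketch.  It walks
 the string once, without first exploding it into a vector of characters,
 and starts a new run whenever it crosses a CJKV/non-CJKV boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not Xapian API): approximate CJKV test covering only
// a few common blocks.  A real implementation would want a proper table.
static bool is_cjkv(std::uint32_t cp) {
    return (cp >= 0x2E80 && cp <= 0x9FFF)     // radicals, kana, unified ideographs (approx.)
        || (cp >= 0xAC00 && cp <= 0xD7AF)     // Hangul syllables
        || (cp >= 0xF900 && cp <= 0xFAFF)     // compatibility ideographs
        || (cp >= 0x20000 && cp <= 0x2FFFF);  // supplementary ideographic plane
}

// Decode one code point starting at s[i] and advance i past it.
// Sketch-only policy: a malformed or truncated sequence is consumed one
// byte at a time.  Overlong forms and surrogates are not rejected.
static std::uint32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;
    std::uint32_t cp = b0;
    if (b0 >= 0xF0 && b0 <= 0xF4) { extra = 3; cp = b0 & 0x07; }
    else if (b0 >= 0xE0 && b0 < 0xF0) { extra = 2; cp = b0 & 0x0F; }
    else if (b0 >= 0xC2 && b0 < 0xE0) { extra = 1; cp = b0 & 0x1F; }
    if (extra == 0 || i + extra >= s.size()) { ++i; return b0; }
    for (std::size_t k = 1; k <= extra; ++k) {
        unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return b0; }
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// A maximal stretch of text that is either all CJKV or all non-CJKV.
struct Run {
    bool cjkv;
    std::string text;
};

// Single pass over the UTF-8 input: start a new run each time the
// CJKV/non-CJKV classification of the current code point changes.
std::vector<Run> split_runs(const std::string& s) {
    std::vector<Run> runs;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t start = i;
        bool cjkv = is_cjkv(next_code_point(s, i));
        if (runs.empty() || runs.back().cjkv != cjkv)
            runs.push_back(Run{cjkv, std::string()});
        runs.back().text.append(s, start, i - start);
    }
    return runs;
}
```

 With something along these lines, the non-CJKV runs could go through the
 existing word-based path unchanged, and only the CJKV runs would be handed
 to the n-gram style tokeniser, so English text containing a Chinese name
 would be indexed as before apart from the name itself.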

-- 
Ticket URL: <http://trac.xapian.org/ticket/180#comment:8>
Xapian <http://xapian.org/>