[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator

Sun Oct 11 20:48:15 BST 2009

#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  richard  
     Type:  enhancement  |       Status:  assigned 
 Priority:  high         |    Milestone:  1.2.0    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  normal       |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------

Comment(by olly):

 Sorry for not responding sooner; I'm insanely busy this month.

 > 1. Where should I put the CJKV header/source files?

 I'd suggest sticking the cjkv support in its own "cjkv" subdirectory,
 since it's essentially its own subsystem.  Certainly "include" is only for
 headers visible to the end user, so that's not suitable.

 > 2. Yes, the glib2 dependency is not good, because Xapian already has a
 > Unicode/UTF-8 API. I agree, but I have no time at the moment to completely
 > rework the CJKV code, so I've integrated Dijon's code "as is". One thing:
 > the Dijon/glib2 code will be used only if a document has CJKV sequences,
 > i.e. it is 99% backward compatible for non-CJKV documents :).

 I think this needs to be done before we can put this patch in a release,
 though I can probably sort it out when I'm less busy.

 > 3. How and where should the user select CJKV-mode? What if the user just
 > has a big folder with many files which is updated every day, and every day
 > this big folder is indexed? Or another example: international forums.
 > There is no way to say "index this file/topic with CJKV-mode". We can try
 > to optimize the process of scanning for and detecting CJKV sequences.

 In many cases the user knows they are handling particular languages, and
 then checking for CJKV is a waste of time.  Conversely, you may '''only'''
 be handling CJKV, in which case checking is also pointless.

 But in the "might be CJKV or might not" case, we certainly could be more
 efficient than converting the whole string to a vector and then scanning
 that.  {{{Xapian::Utf8Iterator}}} would be a better approach.
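
 As a rough illustration of the streaming check described above, here is a
 minimal C++ sketch that walks a UTF-8 string one code point at a time and
 stops at the first CJKV character, without building an intermediate vector.
 To keep it dependency-free it uses a small hand-rolled decoder as a stand-in
 for {{{Xapian::Utf8Iterator}}}; the helper names ({{{is_cjk_codepoint}}},
 {{{next_codepoint}}}, {{{contains_cjk}}}) and the list of code point ranges
 are assumptions made for illustration, not part of the patch or of Xapian.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if ch lies in one of a few common CJK blocks.
// The ranges are illustrative and deliberately not exhaustive.
inline bool is_cjk_codepoint(unsigned ch) {
    return (ch >= 0x2E80 && ch <= 0x2EFF) ||   // CJK Radicals Supplement
           (ch >= 0x3040 && ch <= 0x30FF) ||   // Hiragana and Katakana
           (ch >= 0x3400 && ch <= 0x4DBF) ||   // CJK Unified Ideographs Ext. A
           (ch >= 0x4E00 && ch <= 0x9FFF) ||   // CJK Unified Ideographs
           (ch >= 0xAC00 && ch <= 0xD7AF) ||   // Hangul Syllables
           (ch >= 0xF900 && ch <= 0xFAFF) ||   // CJK Compatibility Ideographs
           (ch >= 0x20000 && ch <= 0x2A6DF);   // CJK Unified Ideographs Ext. B
}

// Simplified UTF-8 decoder (stand-in for a real iterator): returns the code
// point starting at s[pos] and advances pos past it.  A malformed or
// truncated sequence yields the lead byte's value and advances one byte.
inline unsigned next_codepoint(const std::string& s, std::size_t& pos) {
    unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t len = 1;
    unsigned ch = b0;
    if (b0 >= 0xC2 && b0 < 0xE0)      { len = 2; ch = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 < 0xF0) { len = 3; ch = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 < 0xF5) { len = 4; ch = b0 & 0x07; }
    if (len > 1) {
        if (pos + len > s.size()) { ++pos; return b0; }
        for (std::size_t i = 1; i < len; ++i) {
            unsigned char b = static_cast<unsigned char>(s[pos + i]);
            if ((b & 0xC0) != 0x80) { ++pos; return b0; }
            ch = (ch << 6) | (b & 0x3F);
        }
    }
    pos += len;
    return ch;
}

// Scan lazily and return as soon as the first CJK code point is seen, so a
// document that starts with CJK text costs almost nothing to classify.
inline bool contains_cjk(const std::string& utf8) {
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        if (is_cjk_codepoint(next_codepoint(utf8, pos))) return true;
    }
    return false;
}
```

 The early return is the point of the sketch: only a document with no CJKV
 characters at all is scanned to the end.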

 > 4. About your alternative: it's already done in the patch (if I understand
 > you correctly). If the indexable string doesn't have CJKV, the old
 > algorithm will be used.

 I'm thinking of the case of a mixed document (a document without any CJKV
 characters is obviously easy to deal with, and similarly a document which
 is only CJKV is easy too).

 I'm suggesting (perhaps) that if a document is in (say) English with
 quoted Chinese text, the English parts will be indexed as they currently
 are while the Chinese parts would be indexed with the CJKV rules, with the
 tokenizer switching between CJKV-mode and non-CJKV-mode as it goes.  That
 avoids the need to decide whether such documents are "CJKV" or "non-CJKV",
 so there's no need to pre-scan them prior to actually indexing them.
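
 The mode-switching idea can be sketched roughly as below: a single pass over
 the UTF-8 input that splits it into alternating CJKV and non-CJKV runs, each
 of which could then be handed to the appropriate indexing rules.  This is
 only an illustration under stated assumptions: {{{Run}}}, {{{is_cjk}}},
 {{{decode_at}}} and {{{split_runs}}} are hypothetical names, the code point
 ranges are a simplified subset, and whitespace and punctuation are simply
 treated as non-CJKV.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical run type: a maximal stretch of text in one mode.
struct Run {
    bool cjk;          // true if every code point in the run is CJK
    std::string text;  // the run's bytes, still UTF-8 encoded
};

// Simplified, non-exhaustive CJK test (illustrative ranges only).
inline bool is_cjk(unsigned ch) {
    return (ch >= 0x3040 && ch <= 0x30FF) ||   // Hiragana and Katakana
           (ch >= 0x3400 && ch <= 0x9FFF) ||   // Ext. A through Unified Ideographs
           (ch >= 0xAC00 && ch <= 0xD7AF) ||   // Hangul Syllables
           (ch >= 0x20000 && ch <= 0x2A6DF);   // Ext. B
}

// Decode the code point at s[pos]; len receives its length in bytes.
// Malformed or truncated sequences are treated as a single byte.
inline unsigned decode_at(const std::string& s, std::size_t pos, std::size_t& len) {
    unsigned char b0 = static_cast<unsigned char>(s[pos]);
    unsigned ch = b0;
    std::size_t want = 1;
    len = 1;
    if (b0 >= 0xC2 && b0 < 0xE0)      { want = 2; ch = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 < 0xF0) { want = 3; ch = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 < 0xF5) { want = 4; ch = b0 & 0x07; }
    if (want == 1 || pos + want > s.size()) return b0;
    for (std::size_t i = 1; i < want; ++i) {
        unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return b0;
        ch = (ch << 6) | (b & 0x3F);
    }
    len = want;
    return ch;
}

// One pass, no pre-scan: start a new run whenever the mode changes.
inline std::vector<Run> split_runs(const std::string& utf8) {
    std::vector<Run> runs;
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        std::size_t len = 1;
        bool cjk = is_cjk(decode_at(utf8, pos, len));
        if (runs.empty() || runs.back().cjk != cjk) {
            runs.push_back(Run{cjk, std::string()});
        }
        runs.back().text.append(utf8, pos, len);
        pos += len;
    }
    return runs;
}
```

 With this shape, an English sentence containing a quoted Chinese phrase
 comes out as a non-CJKV run, a CJKV run, and another non-CJKV run, so the
 document as a whole never has to be labelled one way or the other.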

 > Put simply: "No CJKV - the patch will not be used and everything stays as
 > is. If there is CJKV - we will use the modified queryparser/termgenerator
 > code".

 I think Xapian should have some sort of CJKV support, and this patch is a
 good start, but I do think it needs further work.

 There's also the issue of the licence.  Xapian is currently GPL, but we'd
 like to get to a position where we can relicense in the future.  LGPL is a
 possible choice for the new licence, though we might want to go to a more
 liberal licence than that.  I suspect this isn't a blocker, but we'd need
 to check with Fabrice.

-- 
Ticket URL: <http://trac.xapian.org/ticket/180#comment:10>
Xapian <http://xapian.org/>