[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator
Xapian
nobody at xapian.org
Wed Aug 24 15:39:33 BST 2011
#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
Reporter: richard | Owner: richard
Type: enhancement | Status: assigned
Priority: normal | Milestone: 1.3.0
Component: QueryParser | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Comment(by olly):
I've applied the work from the git branch to SVN trunk (r16052, plus r16053
to fix a memory leak I missed) and backported it to 1.2 (r16054 and r16055).
Currently this new code is only enabled if the environment variable
XAPIAN_CJK_NGRAM is set to a non-empty value. Ubuntu were keen to get
this support for their upcoming release, and this way we can provide a
compatible 1.2.x package for other applications, while still letting the
applications which want CJK support enable it easily. We'll want to sort
out an API for enabling this (and any additional CJK modes which are added
in the future), of course.
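To illustrate the "set to a non-empty value" rule, here is a minimal sketch of such a check in Python. It is not the code from the patch: the helper name cjk_ngram_enabled and its environ parameter are hypothetical, and only the variable name XAPIAN_CJK_NGRAM comes from the comment above.

```python
import os
from typing import Mapping, Optional


def cjk_ngram_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if XAPIAN_CJK_NGRAM is set to a non-empty value.

    Hypothetical helper for illustration only. The optional `environ`
    mapping is an assumption added so the check can be exercised
    without touching the real process environment.
    """
    env = os.environ if environ is None else environ
    # An unset variable and an empty string both count as "disabled".
    return bool(env.get("XAPIAN_CJK_NGRAM", ""))
```

Note that under this rule any non-empty string enables the feature, so XAPIAN_CJK_NGRAM=0 would also turn it on.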
The substantial changes from the patch here are:
* Fixed dereferencing of an end Utf8Iterator (probably harmless in
  practice)
* Removed an unnecessary extra check for a character being CJK
* Term positions are now only assigned to 1-grams (this gives natural
  positions and reduces database size)
* Quoted CJK phrases now work
* No intermediate vector of n-grams is built, which saves space and is
  faster (~11% saving in total indexing time in a quick test)
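The two n-gram points above (positions only on 1-grams, and no intermediate vector) can be sketched roughly as follows. This is an illustrative Python generator, not the actual C++ implementation: the function name cjk_ngrams, the max_n default of 2, and the 1-based positions are all assumptions, and the input is taken to be a run of CJK characters that has already been identified.

```python
from typing import Iterator, Optional, Tuple


def cjk_ngrams(text: str, max_n: int = 2) -> Iterator[Tuple[str, Optional[int]]]:
    """Lazily yield (term, position) pairs for a run of CJK characters.

    Hypothetical sketch: terms are produced one at a time rather than
    collected into a list first, and only 1-grams carry a position
    (1-based here, which is an assumption); longer n-grams get None.
    """
    for start in range(len(text)):
        for n in range(1, max_n + 1):
            if start + n > len(text):
                break  # not enough characters left for this n-gram
            term = text[start:start + n]
            position = start + 1 if n == 1 else None
            yield term, position
```

Because it is a generator, a caller can feed each term straight to the indexer as it is produced, and the positions on the 1-grams alone are consecutive, which is what lets a quoted phrase be matched character by character.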
--
Ticket URL: <http://trac.xapian.org/ticket/180#comment:29>
Xapian <http://xapian.org/>