[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator
Xapian
nobody at xapian.org
Sat Jul 23 15:25:20 BST 2011
#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
Reporter: richard | Owner: richard
Type: enhancement | Status: assigned
Priority: normal | Milestone: 1.2.x
Component: QueryParser | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Comment(by olly):
The codepoint_is_cjk() overhead is constant, but that constant is '''per
character processed''', so "constant" is misleading: it's really O(n),
where n is the size of the text processed, and we often process a lot of
characters. But indeed the main concern right now is not to introduce a
speed regression for everyone, and the early exit if the character is
< 0x2e80 pretty much guarantees that.
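To illustrate why the early exit keeps the per-character cost small for
non-CJK text, here is a minimal sketch. The function name
codepoint_is_cjk_sketch() and the list of ranges are assumptions made for
illustration only; they are not the table used by the actual patch.

```cpp
// Hypothetical sketch: the ranges below are coarse and illustrative,
// not the ranges used by the real codepoint_is_cjk().
inline bool codepoint_is_cjk_sketch(unsigned ch) {
    // Early exit: Latin, Cyrillic, Greek, etc. all lie below U+2E80, so
    // text in those scripts pays for exactly one comparison per character.
    if (ch < 0x2E80) return false;
    return (ch <= 0x2EFF) ||                    // CJK Radicals Supplement
           (ch >= 0x3000 && ch <= 0x9FFF) ||    // CJK symbols, kana, unified ideographs (coarse)
           (ch >= 0xAC00 && ch <= 0xD7A3) ||    // Hangul syllables
           (ch >= 0xF900 && ch <= 0xFAFF) ||    // CJK compatibility ideographs
           (ch >= 0xFF00 && ch <= 0xFFEF) ||    // Halfwidth and fullwidth forms
           (ch >= 0x20000 && ch <= 0x2A6DF);    // CJK Unified Ideographs Extension B
}
```

The total work is still proportional to the number of characters
processed, but for text with no CJK in it the added work is a single
compare-and-branch per character.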
I already gave my suggested approach to the !QueryParser tokenisation
above, i.e.:
> I'd suggest generating a CJKTERM token (or perhaps a TERM token with a
> "cjk" flag set) and then generating the n-grams when this token gets
> converted to a Query object.
In other words, don't generate the n-grams in the lexer; do it when the
parser accepts this CJKTERM (or TERM with is_cjk set), and then just
generate a Query object directly from the generated set of n-grams.
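A rough sketch of that split, kept independent of the Xapian headers so it
stands alone: the lexer only marks the token, and the n-grams are produced
when the token is turned into query terms. The Token struct, the helper
names, and the choice of unigrams plus bigrams are all assumptions for
illustration, not what the patch does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: the lexer emits ONE token for a whole CJK run
// and only flags it; no n-grams exist at this stage.
struct Token {
    std::string text;   // UTF-8 text of the token
    bool is_cjk;        // set by the lexer for a run of CJK characters
};

// Split (assumed valid) UTF-8 into one string per code point.
static std::vector<std::string> split_codepoints(const std::string& s) {
    std::vector<std::string> out;
    for (unsigned char c : s) {
        // Any byte that is not a continuation byte (10xxxxxx) starts a
        // new character.
        if ((c & 0xC0) != 0x80 || out.empty()) out.emplace_back();
        out.back() += static_cast<char>(c);
    }
    return out;
}

// Called when the parser accepts the token: expand it into the terms the
// query would be built from. Non-CJK tokens yield themselves unchanged.
std::vector<std::string> terms_for_token(const Token& tok,
                                         std::size_t max_n = 2) {
    if (!tok.is_cjk) return {tok.text};
    const std::vector<std::string> chars = split_codepoints(tok.text);
    std::vector<std::string> ngrams;
    for (std::size_t i = 0; i < chars.size(); ++i) {
        std::string gram;
        // Emit every n-gram of length 1..max_n starting at position i.
        for (std::size_t n = 0; n < max_n && i + n < chars.size(); ++n) {
            gram += chars[i + n];
            ngrams.push_back(gram);
        }
    }
    return ngrams;
}
```

In Xapian itself the resulting terms could be passed to the
iterator-range constructor of Xapian::Query; which operator to combine
them with is a separate design decision that this sketch doesn't address.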
I think any approach which involves faking a token stream is potentially
problematic.
A "CJK" operator might work, but just generating a single token is
simpler. Why generate a special token stream in the lexer only to have to
recognise it in the parser?
--
Ticket URL: <http://trac.xapian.org/ticket/180#comment:19>
Xapian <http://xapian.org/>