[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator
Xapian
nobody at xapian.org
Tue Jul 19 16:31:31 BST 2011
#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
Reporter: richard | Owner: richard
Type: enhancement | Status: assigned
Priority: normal | Milestone: 1.2.x
Component: QueryParser | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Comment(by olly):
Thanks for your work on this - the latest patch is a big step forwards.
A couple of critical issues though:
* This really needs test coverage to make sure it is actually working as
intended, and continues to do so in the future.
* The new code and changes to existing code don't match the coding style
used in the Xapian codebase.
And some other thoughts, in no particular order:
* queryparser/cjk/cjk-tokenizer.h isn't listed in
queryparser/Makefile.mk, so won't get shipped. The cjk-LICENSE file won't
get shipped either, but I think it would be better to insert that text at
the start of each of the files it applies to, so it's completely clear
what the licence of each file is.
* I've not done any benchmarking, but I'm a little worried at the cost of
CJK::codepoint_is_cjk() since it does a lot of comparisons and it looks
like it gets called for every character. However, I notice that many of
the ranges checked are actually contiguous and so could be merged into a
smaller number of larger ranges - the compiler may do this for us, but it
would be better not to rely on that. Also the last range is there twice!
That reduces 23 ranges to 9 - if this still proves to be an issue, we can
probably add an extra flag bit to the Unicode tables.
* unicode_char_t - types ending _t are reserved by POSIX, so we shouldn't
go creating new ones. Elsewhere we just use unsigned for this, so I'd
suggest just doing the same.
* ~tokenizer() is just the default empty destructor, so there's no
benefit to explicitly defining it.
* Generating the token stream TERM AND TERM ... AND TERM seems
problematic - for example, there are higher precedence operators than AND,
so there are probably cases which won't be parsed as you intend, and parse
errors might end up confusingly referring to an AND when none appears in
the query string. I'd suggest generating a CJKTERM token (or perhaps a
TERM token with a "cjk" flag set) and then generating the n-grams when
this token gets converted to a Query object.
--
Ticket URL: <http://trac.xapian.org/ticket/180#comment:17>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list