[Xapian-tickets] [Xapian] #180: Add support for CJK text to queryparser and termgenerator
Xapian
nobody at xapian.org
Fri Jul 22 04:29:37 BST 2011
#180: Add support for CJK text to queryparser and termgenerator
-------------------------+--------------------------------------------------
Reporter: richard | Owner: richard
Type: enhancement | Status: assigned
Priority: normal | Milestone: 1.2.x
Component: QueryParser | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Comment(by bschaefer):
I took a look at CJK::codepoint_is_cjk() and reduced it to 9 checks, and
also added 'if (p < 0x2E80) return false' at the top. Now when it runs into
more common (non-CJK) text it only makes a single check. The cost is still
constant either way, so I don't see it being a speed issue, but I do see
why this matters a lot when there is no CJK text.
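
For illustration, the shape of that check could look roughly like the
sketch below. The function name codepoint_is_cjk_sketch and the list of
ranges are assumptions made for this example (a subset of well-known
Unicode blocks), not the nine checks from the actual patch; only the
'p < 0x2E80' early exit is taken from the comment above.

```cpp
// Hypothetical sketch: the ranges below are an illustrative subset of
// Unicode blocks, NOT the nine checks in the real patch.
bool codepoint_is_cjk_sketch(unsigned p) {
    // Early exit described above: everything below U+2E80 (Latin,
    // Cyrillic, Greek, ...) is rejected after a single comparison.
    if (p < 0x2E80) return false;
    return (p <= 0x2EFF) ||                   // CJK Radicals Supplement
           (p >= 0x3040 && p <= 0x30FF) ||    // Hiragana and Katakana
           (p >= 0x3400 && p <= 0x4DBF) ||    // CJK Unified Ideographs Ext. A
           (p >= 0x4E00 && p <= 0x9FFF) ||    // CJK Unified Ideographs
           (p >= 0xAC00 && p <= 0xD7A3) ||    // Hangul Syllables
           (p >= 0xF900 && p <= 0xFAFF) ||    // CJK Compatibility Ideographs
           (p >= 0x20000 && p <= 0x2A6DF);    // CJK Unified Ideographs Ext. B
}
```

With this layout, plain ASCII or Latin text costs one comparison per
character, and only code points at or above U+2E80 reach the range tests.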
One question I have is about your last statement. It looks like wherever
you generate the n-grams, an OP is going to be needed in between them. If
AND is not what you are looking for, what else would you suggest putting
between all the newly generated terms? A couple of ideas:
* Use the highest precedence OP, or make a new CJK OP that also has the
highest precedence, to put in between the n-grams after generating them.
Any errors can then be reported against the CJK OP.
* Surround the generated terms with brackets ( BRA, KET ), even though
they are not in the query string, and use the default OP. This should keep
whatever is generated stuck together even if a high precedence OP
surrounds it.
* Example (default OP_OR ): Xapian::Query((hello:(pos=1) AND (万:(pos=2)
OR 万众:(pos=3) OR 众:(pos=4))))
The problem I see is what to put in between the generated terms. I think
the key point is that whatever is generated from the n-grams can't be
accidentally attached to something else with a higher precedence, which
brackets should solve. It might also help to keep a CJK flag around, so
that when an error occurs it can refer to that instead of the OP put in
between.
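
To make the bracket idea concrete, here is a small standalone sketch. It
does not use the Xapian API; cjk_ngrams and group_terms are hypothetical
helpers invented for this example, and the query is only rendered as a
string. It assumes the CJK run has already been split into one UTF-8
string per character, and that unigrams and bigrams are the n-grams
wanted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: emit each character as a unigram, followed by the
// bigram it forms with the next character (if there is one).
std::vector<std::string> cjk_ngrams(const std::vector<std::string>& chars) {
    std::vector<std::string> terms;
    for (std::size_t i = 0; i < chars.size(); ++i) {
        terms.push_back(chars[i]);
        if (i + 1 < chars.size())
            terms.push_back(chars[i] + chars[i + 1]);
    }
    return terms;
}

// Hypothetical helper: join the generated terms with `op` and wrap them
// in brackets, so the group behaves as one unit whatever surrounds it.
std::string group_terms(const std::vector<std::string>& terms,
                        const std::string& op) {
    std::string out = "(";
    for (std::size_t i = 0; i < terms.size(); ++i) {
        if (i > 0) out += " " + op + " ";
        out += terms[i];
    }
    out += ")";
    return out;
}
```

For the two characters 万 and 众 this generates 万, 万众 and 众, and
grouping them with OR gives one bracketed unit that an outer AND cannot
split, which is the same shape as the example query above.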
I could also be overlooking something, but I am currently going through
the code to look for other options.
I'll also start adding test cases to get good test coverage, to show it
is working as intended and continues to. I will also try to break it or
produce unintended results from the query parser.
Everything else you mentioned has already been changed, and I will look
out for any improper coding style to fix.
Hope this can get fixed soon :)
--
Ticket URL: <http://trac.xapian.org/ticket/180#comment:18>
Xapian <http://xapian.org/>