[Xapian-tickets] [Xapian] #699: Better tokenisation of mixed CJK numbers
Xapian
nobody at xapian.org
Sat Dec 12 04:18:13 GMT 2015
#699: Better tokenisation of mixed CJK numbers
--------------------------------+------------------------
Reporter: olly | Owner: olly
Type: defect | Status: new
Priority: normal | Milestone:
Component: QueryParser | Version: git master
Severity: normal | Keywords:
Blocked By: | Blocking:
Operating System: All |
--------------------------------+------------------------
From comment:28:ticket:180:
Dai Youli noted on IRC that mixed numbers like 2千3百 (two thousand three
hundred) get indexed as four separate terms. That's not terrible, since
the same happens at search time, but it's not ideal either: searching for
2千3百 would also find 3千2百, as well as documents containing those
characters nowhere near each other.
Perhaps digits among CJK characters should be included in the span of text
passed for n-gramming.
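
To illustrate the difference, here is a rough Python sketch (not Xapian's
actual tokeniser). It assumes "CJK" means only the CJK Unified Ideographs
block U+4E00..U+9FFF, that only ASCII digits are considered, and that each
span yields unigrams plus overlapping bigrams. The names tokenise(),
ngrams() and include_digits are made up for this example.

```python
import re

# Assumption: "CJK" is just the CJK Unified Ideographs block here.
CJK = r"\u4e00-\u9fff"


def ngrams(span):
    """Return unigrams and overlapping bigrams for one span of text."""
    terms = []
    for i, ch in enumerate(span):
        terms.append(ch)
        if i + 1 < len(span):
            terms.append(span[i:i + 2])
    return terms


def tokenise(text, include_digits):
    """Split text into terms, n-gramming the CJK spans.

    include_digits=False: a span is a run of CJK characters only.
    include_digits=True: a span is a run of CJK characters and ASCII
    digits which contains at least one CJK character.
    """
    if include_digits:
        span = rf"[0-9{CJK}]*[{CJK}][0-9{CJK}]*"
    else:
        span = rf"[{CJK}]+"
    # Anything else that is a word character becomes an ordinary term.
    pattern = re.compile(rf"(?P<span>{span})|(?P<word>[^\W{CJK}]+)")
    terms = []
    for m in pattern.finditer(text):
        if m.group("span") is not None:
            terms.extend(ngrams(m.group("span")))
        else:
            terms.append(m.group("word"))
    return terms


# Digits break the span: four unrelated single-character terms.
print(tokenise("2千3百", include_digits=False))

# Digits inside the span: bigrams now record the order of the characters.
print(tokenise("2千3百", include_digits=True))
print(tokenise("3千2百", include_digits=True))
```

With digits in the span, 2千3百 and 3千2百 still share their unigrams but
have no bigram in common, so a query which makes use of the bigram terms
could tell them apart. A run of digits with no CJK character in it (such
as a year in English text) is still treated as an ordinary term in this
sketch.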
--
Ticket URL: <http://trac.xapian.org/ticket/699>
Xapian <http://xapian.org/>