Ask for advice on exact requirements to fix #699 mixed CJK numbers

Fri Mar 8 13:29:14 GMT 2019

Sorry for the verbose text in my last email.

I have created a PR against master.
The code partially fixes the problem described in #699:
it supports mixed Chinese numbers passed to CJKNgramIterator.
For example, these test cases now pass:
> { "", "有2千3百", "2千3百:1 有[1]"},
> { "", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

But it doesn't handle mixed numbers whose preceding character is not a
CJK character, because the first digit is consumed by the TermGenerator.
For example, the cases below fail:
> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
The current output is "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"
> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}
The current output is "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"

I'm not sure whether these failing cases should be supported:
under my current plan, that would require modifying the TermGenerator
to check what follows every Latin digit.
I'm unsure whether supporting these unusual cases is worth the cost.
If you have a better way to solve this, please give me some tips.
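
To make the "check after every Latin digit" idea concrete, here is a minimal sketch in Python. It is not Xapian's TermGenerator code: the helper names and the set of CJK numeral characters are assumptions chosen for illustration only. Starting at a Latin digit, it scans forward and reports where a mixed number ends, or None if the run contains no CJK numeral.

```python
# Illustrative only: hypothetical helpers, not part of Xapian.
# Assumed set of CJK numeral characters; a real list may differ.
CJK_NUMERALS = set("零〇一二三四五六七八九十百千万亿")


def is_latin_digit(ch):
    """True for ASCII digits 0-9 only."""
    return "0" <= ch <= "9"


def mixed_number_end(text, start):
    """Return the end index (exclusive) of a mixed Latin/CJK number
    beginning at text[start], or None if there is no such number there."""
    if start >= len(text) or not is_latin_digit(text[start]):
        return None
    end = start
    seen_cjk = False
    while end < len(text):
        ch = text[end]
        if ch in CJK_NUMERALS:
            seen_cjk = True
        elif not is_latin_digit(ch):
            break  # first character that is neither kind of digit
        end += 1
    # A run of plain Latin digits is not a mixed number.
    return end if seen_cjk else None
```

For the first failing case, mixed_number_end("我有 2千3百块钱", 3) returns 7, so the slice [3:7] is "2千3百". The extra work is one forward scan per Latin digit run, which is the cost in question.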

Besides, I'm not sure whether treating the whole mixed number as one
token is suitable, as users would have to enter the whole number to get
relevant results. I think we could emit both the whole token and the
n-gram results during tokenisation. Comments are welcome.
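
As a rough illustration of emitting both forms, the Python sketch below returns the whole mixed number followed by its character unigrams and bigrams. The function name is hypothetical, and the exact n-gram scheme (character-level unigrams and bigrams, no positions) is an assumption for illustration; it is not taken from CJKNgramIterator.

```python
# Illustrative only: hypothetical function, not part of Xapian.
def terms_for_mixed_number(token):
    """Return the whole token, then its character unigrams and bigrams,
    with duplicates removed while preserving order."""
    unigrams = list(token)
    bigrams = [token[i:i + 2] for i in range(len(token) - 1)]
    # dict.fromkeys keeps first occurrences in insertion order.
    return list(dict.fromkeys([token] + unigrams + bigrams))
```

For example, terms_for_mixed_number("2千3百") gives ["2千3百", "2", "千", "3", "百", "2千", "千3", "3百"], so a query for part of the number could still match, at the price of more terms per number.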

(Because of the time difference and my limited English,
I might not reply promptly; please forgive me.)

Cheers,
outdream