<div dir="ltr"><div dir="ltr"><div dir="ltr">Sorry for the verbose text in my last email...<br><br>I have created a PR against master.<br>The code partially fixes the problem mentioned in #699:<br>it supports mixed Chinese numbers (Latin digits combined with Chinese numerals) sent to CJKNgramIterator.<br>For example, these test cases would pass:<br>> { "", "有2千3百", "2千3百:1 有[1]"},<br>> { "", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},<br><br>However, it doesn't handle mixed numbers whose preceding character is not a<br>CJK character, because the first digit is consumed by the TermGenerator.<br>For example, the cases below would fail:<br>> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},<br>The current output is "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"<br>> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}<br>The current output is "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"<br><br>I'm not sure whether these failing cases should be supported:<br>my current plan would require modifying the TermGenerator<br>to add a check after every Latin digit.<br>I doubt the cost of supporting these unusual cases is worth it.<br>If you know a better way to solve this, please give me some tips.<br><br>Besides, I'm not sure whether treating the whole mixed number as one token is suitable,<br>as users would have to enter the whole number to get relevant results.<br>I think we could emit both the whole token and the ngram results during<br>tokenisation. Please let me know what you think.<br><br>(Because of the time difference and my limited English,<br>I might not reply promptly; please forgive me.)<br><br><br>Cheers,<br>outdream<br></div></div></div>