<div dir="ltr">I am working on "#699 Better tokenisation of mixed CJK numbers",<br>and have implemented a partial patch for Chinese for this ticket.<br>The current code works well with my specific test cases, and<br>all tests in xapian-core still pass.<br><br>But I'm unsure about the exact requirements of the ticket,<br>how much performance we can afford to pay to support more cases,<br>and whether there are better ways to do this.<br><br><br>---<br>Below are details of the current implementation,<br>the potential requirements I have thought of, and my guesses about<br>Google's solution based on its search results.<br><br>---<br><br><br>Current Implementation<br>===<br>As I am still unclear about the exact requirements,<br>I haven't opened a pull request against the main repository; I have only pushed the<br>code to my own fork, at<br>> <a href="https://github.com/outdream/xapian/tree/defect699-mixed_Chinese_num">https://github.com/outdream/xapian/tree/defect699-mixed_Chinese_num</a><br><br>I have also attached the 'git diff' output as an alternative.<br>(If it's impolite to add attachments on the mailing list, please tell me.)<br><br>(Sorry for the misaligned code. I was confused about the tab size before,<br>and found the answer in the documentation only after pushing to GitHub.<br>As writing this email has used up my time, I will fix the formatting in the next commit.)<br><br>If it's better to create a pull request, please tell me.<br><br>(Below is my explanation of the code, in case it is not clear to read.)<br><br>The current code only supports the case where a mixed Chinese number<br>is embedded in a run of CJK characters that is sent to CJKNgramIterator.<br>It extracts the whole number as one token instead of as 1-grams.<br><br>The code was added to operator++ of CJKNgramIterator in cjk-tokenizer.cc,<br>to minimise both the changes to existing code and<br>the harm to modularity.<br><br>The current implementation passes the test cases below:<br>> { "", "有2千3百", "2千3百:1 有[1]"},<br>> { 
"", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},<br><br>The conditions for this behaviour to apply are:<br>- the number must start with a Latin digit<br>- there must be a CJK character before the first Latin digit,<br>so that the number is sent to CJKNgramIterator.<br><br>The mapping between Chinese digits and Unicode looks like this:<br>> Chinese 1-5: 一二三四五<br>> in Unicode: \u4E00\u4E8C\u4E09\u56DB\u4E94<br><br>I can't find any pattern in the Unicode code points of the Chinese digits,<br>and I almost believe the designers of the encoding didn't consider it :(.<br><br>So I check whether a character is a Chinese digit using a static set that stores them.<br>This lookup affects performance, so a mixed number is<br>only checked for if it starts with a Latin digit.<br>(If anyone knows the pattern behind the code points, please tell me, thanks.)<br><br><br>Potential Requirements<br>===<br>Below are some test cases I wrote which my implementation does not handle.<br>They show potential requirements I have thought of<br>but left unsupported for performance reasons.<br>I have numbered them and refer to them as ex-1, ex-2 and so on.<br>(The output shown comes from my current definitions and code.)<br><br>(1)<br>> { "", "我有两千3百块钱", "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},<br><br>> Expected output and expect to be equal, were:<br>> "3百:1 两[3] 两千:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 有两:1 钱[6]"<br>> "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"<br><br>(2)<br>> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},<br><br>> Expected output and expect to be equal, were:<br>> "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"<br>> "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"<br><br>(3)<br>> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}<br><br>> Expected output and expect to be equal, were:<br>> "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"<br>> "3万4千零1:1 apples[4] are[2] there[1] "<br><br>ex-1 shows the case where the mixed number starts with a Chinese digit.<br>To enable it, my current plan would need to check whether every CJK character<br>is a Chinese 
digit, and the cost seems unacceptable.<br><br>ex-2 and ex-3 show the cases where a non-CJK character comes before<br>the first Latin digit, so the digit is consumed by the TermGenerator<br>and is never sent to CJKNgramIterator.<br>To enable these cases, under my plan, the mixed numbers would have<br>to be handled in the TermGenerator. However, this would hurt both<br>performance and modularity.<br><br>With these considerations, I'm unsure whether these cases<br>should be supported.<br><br><br>Google's Solution?<br>===<br>To try to come up with a better definition of the interface,<br>I made some guesses based on Google's search results for "2千3百".<br><br>I suppose they use both the number token<br>and its n-gram results as keywords.<br>Judging from the results and the highlighted text,<br>the list of searched keywords probably contains,<br>besides the whole number token,<br>the n-grams of that number token as well.<br><br>I also believe they apply some mapping (or stemming?) to the number,<br>as the transformed keywords '三百' (3百) and '二千' (2千) appear<br>frequently in the highlighted text.<br><br><br>However, even with all this, I still can't decide what this interface<br>should look like. Please give me some advice on the exact requirements<br>and on better methods for solving the problem.<br><br>Cheers,<br>outdream<br></div>