Asking for advice on the exact requirements for fixing #699 (mixed CJK numbers)

Thu Mar 7 16:55:48 GMT 2019

I am working on "#699 Better tokenisation of mixed CJK numbers",
and have implemented a partial patch covering Chinese for this ticket.
The current code works for my specific test cases, and
all existing tests in xapian-core still pass.

However, I am unsure about the exact requirements of the ticket:
how much performance we are willing to trade for supporting more cases,
and whether there are better ways to do this.

---
Below are details of the current implementation,
the potential requirements I have thought of, and my guesses about
Google's solution based on its search results.

---

Current Implementation
===
As I am still unclear about the exact requirements,
I haven't opened a pull request against the main repository; I have only
pushed the code to my own fork, in this branch:
> https://github.com/outdream/xapian/tree/defect699-mixed_Chinese_num

I have also attached the 'git diff' output as an alternative.
(If it's impolite to send attachments to the mailing list, please tell me.)

(Sorry for the misaligned code: I was confused about the tab size before,
and only found the answer in the documentation after pushing to GitHub.
As I ran out of time writing this email, I will fix the indentation in the
next commit.)

If it's better to create a pull request, please tell me.

(Below is an explanation of the code, in case it is not clear to read.)

The current code only supports cases where the mixed Chinese number
is embedded in CJK text that is sent to CJKNgramIterator.
It extracts the whole number as one token instead of splitting it into n-grams.

The code was added to operator++ of CJKNgramIterator in
cjk-tokenizer.cc,
to minimise changes to the existing code and
to avoid harming modularity.

The current implementation passes the test cases below:
> { "", "有2千3百", "2千3百:1 有[1]"},
> { "", "我有2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

The conditions for this behaviour to apply are:
- the number must start with a Latin digit
- there must be a CJK character before the first Latin digit, so that
  the number is sent to CJKNgramIterator.
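
To make the rule above concrete, here is a minimal sketch of the scan,
written over a std::u32string instead of the real iterator. The function
name and the caller-supplied predicate are made up for illustration; this is
not the code from the patch, and whether a run of Latin digits alone should
count as a number is an assumption here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not the code from the patch.
// Returns the length (in code points) of the number starting at `pos`,
// or 0 if text[pos] is not a Latin (ASCII) digit.
// `is_cjk_numeral` is a caller-supplied predicate for Chinese numerals.
template<typename Pred>
std::size_t latin_led_number_length(const std::u32string& text,
                                    std::size_t pos,
                                    Pred is_cjk_numeral)
{
    auto is_latin_digit = [](char32_t ch) {
        return ch >= U'0' && ch <= U'9';
    };
    // Condition 1: the number must start with a Latin digit.
    if (pos >= text.size() || !is_latin_digit(text[pos])) return 0;
    // Extend the run over any mix of Latin digits and Chinese numerals.
    std::size_t end = pos + 1;
    while (end < text.size() &&
           (is_latin_digit(text[end]) || is_cjk_numeral(text[end]))) {
        ++end;
    }
    return end - pos;
}
```

For "有2千3百" with a predicate accepting 千 and 百, calling this at index 1
gives 4 (the whole "2千3百"), and calling it at index 0 gives 0.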

The mapping between Chinese digits and Unicode code points looks like this:
> Chinese 1-5: 一二三四五
> in Unicode: \u4E00\u4E8C\u4E09\u56DB\u4E94

I can't see any pattern in the code points of the Chinese digits,
and I suspect the designers of the encoding didn't take it into account :(.

So I check whether a character is a Chinese digit using a static set that
stores them. This affects performance, so a mixed number is
only checked for when it starts with a Latin digit.
(If anyone knows the rule behind the code points, please tell me, thanks.)
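
As far as I know, CJK Unified Ideographs are ordered by radical and stroke
count rather than by numeric value, which would explain why the digits are
scattered. One possible form of the lookup is a small sorted table searched
with std::binary_search, sketched below. The table contents are my
assumption of a reasonable numeral set, not necessarily the set used in the
patch, and the names are hypothetical.

```cpp
#include <algorithm>
#include <iterator>

// Assumed numeral set, sorted by code point so that binary search works.
static const unsigned CHINESE_NUMERALS[] = {
    0x3007, // 〇
    0x4E00, // 一
    0x4E03, // 七
    0x4E07, // 万
    0x4E09, // 三
    0x4E24, // 两
    0x4E5D, // 九
    0x4E8C, // 二
    0x4E94, // 五
    0x4EBF, // 亿
    0x516B, // 八
    0x516D, // 六
    0x5341, // 十
    0x5343, // 千
    0x56DB, // 四
    0x767E, // 百
    0x96F6, // 零
};

// Returns true if `ch` is one of the code points in the table above.
inline bool is_chinese_numeral(unsigned ch)
{
    return std::binary_search(std::begin(CHINESE_NUMERALS),
                              std::end(CHINESE_NUMERALS), ch);
}
```

With 17 entries this is at most five comparisons per call; I have not
measured how it compares with the set used in the patch.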

Potential Requirements
===
Below are some test cases I wrote that my implementation does not handle.
They show potential requirements I have thought of
but left unsupported for performance reasons.
I have numbered them and refer to them as ex-1, ex-2 and ex-3.
(In each failure message, the first string is the output of my current
code and the second is the expected value from the test case.)

(1)
> { "", "我有两千3百块钱", "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

> Expected output and expect to be equal, were:
> "3百:1 两[3] 两千:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 有两:1 钱[6]"
> "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"

(2)
> { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},

> Expected output and expect to be equal, were:
> "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"
> "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"

(3)
> { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}

> Expected output and expect to be equal, were:
> "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"
> "3万4千零1:1 apples[4] are[2] there[1] "

ex-1 shows a mixed number that starts with a Chinese digit.
To support it, my current approach would need to check whether every CJK
character is a Chinese digit, and the cost seems unacceptable.
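
For discussion, here is a rough sketch of what supporting ex-1 could look
like. It lets a run start with either kind of digit, but only treats it as a
mixed number if it contains at least one Latin digit and at least one
Chinese numeral, so that ordinary words such as "万一" are left to the
normal n-gram path. That rule is my assumption, not something the ticket
states, and the names are hypothetical. The per-character cost is one
predicate call, which I have not benchmarked.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not the code from the patch.
// Returns the length of the run of Latin digits and Chinese numerals
// starting at `pos`, but only if the run mixes both kinds; otherwise 0.
template<typename Pred>
std::size_t mixed_number_length(const std::u32string& text,
                                std::size_t pos,
                                Pred is_cjk_numeral)
{
    bool seen_latin = false;
    bool seen_cjk = false;
    std::size_t end = pos;
    for (; end < text.size(); ++end) {
        char32_t ch = text[end];
        if (ch >= U'0' && ch <= U'9') {
            seen_latin = true;
        } else if (is_cjk_numeral(ch)) {
            seen_cjk = true;
        } else {
            break; // first character that cannot belong to a number
        }
    }
    return (seen_latin && seen_cjk) ? end - pos : 0;
}
```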

ex-2 and ex-3 show cases where a non-CJK character precedes
the first Latin digit, so the digit is consumed by the TermGenerator
and never sent to CJKNgramIterator.
To support these cases, as far as I can see, mixed numbers would have
to be handled in the TermGenerator. However, this would affect both
performance and modularity.

With these considerations, I am unsure whether these cases
should be supported.

Google's Solution?
===
To arrive at a better definition of the behaviour,
I made some guesses based on Google's search results for "2千3百".

I suppose they use both the whole number token
and its n-gram results as keywords.
Judging from the results and the highlighted text,
the keyword list seems to contain not only the whole number token
but also the n-grams derived from it.

I also believe they apply some mapping (or stemming?) to the number,
as the transformed keywords '三百' (3百) and '二千' (2千) appear
frequently in the highlighted text.
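
To illustrate the kind of mapping I mean (this is only my guess at the
behaviour, not anything Google documents), a normalisation step could
replace each Latin digit in a token with the corresponding Chinese numeral.
The function name and the choice of 零 for 0 are my own assumptions.

```cpp
#include <string>

// Hypothetical normalisation: map each Latin digit to a Chinese numeral
// and leave every other character unchanged.
std::u32string latin_digits_to_chinese(const std::u32string& token)
{
    static const char32_t NUMERALS[10] = {
        0x96F6, // 0 -> 零 (assumed; 〇 would be another choice)
        0x4E00, // 1 -> 一
        0x4E8C, // 2 -> 二
        0x4E09, // 3 -> 三
        0x56DB, // 4 -> 四
        0x4E94, // 5 -> 五
        0x516D, // 6 -> 六
        0x4E03, // 7 -> 七
        0x516B, // 8 -> 八
        0x4E5D, // 9 -> 九
    };
    std::u32string out;
    out.reserve(token.size());
    for (char32_t ch : token) {
        out += (ch >= U'0' && ch <= U'9') ? NUMERALS[ch - U'0'] : ch;
    }
    return out;
}
```

This maps "3百" to "三百" and "2千" to "二千". It is only meaningful when
each Latin digit stands alone before a unit: a multi-digit group such as
"23百" would come out as "二三百", which is not a correct reading.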

Even with all this, I still can't decide what the behaviour
should be. Please give me some advice on the exact requirements
and on better ways to solve the problem.

Cheers,
outdream
-------------- next part --------------
A non-text attachment was scrubbed...
Name: diff_mixed_CJK_numbers.diff
Type: application/octet-stream
Size: 3413 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20190308/dc21b48e/attachment.obj>