Ask for advice on exact requirements to fix #699 mixed CJK numbers
Olly Betts
olly at survex.com
Sat Mar 9 00:18:26 GMT 2019
On Fri, Mar 08, 2019 at 12:55:48AM +0800, outdream wrote:
> I am working on "#699 Better tokenisation of mixed CJK numbers",
> and have implemented a partial patch for Chinese for this ticket.
> The current code works well with my special test cases, and all the
> existing tests in xapian-core still pass.
>
> But I'm unsure about the exact requirements here: how much
> performance we can afford to pay to enable more cases, and whether
> there are better ways to do this.
We don't really have exact requirements here.
The history of the ticket is that back in 2011 Dai Youli (one of our
GSoC students that year) pointed out on IRC that 2千3百 was split into
4 terms, which results in a lot of false matches (3千2百 is the obvious
one, but also documents which have those terms nowhere near each other).
I noted that comment on the ticket so it didn't get forgotten.
So really we just have a note that we should perhaps handle this
case "better".
It does seem that such mixed Latin and Chinese numbers aren't very
common - we've not had any other feedback about them in the last 8
years, and you said on IRC that you'd rarely seen them. So possibly
the resolution for this ticket is to conclude that it's not worth
changing anything here.
We recently merged a segmentation option for CJK which uses ICU
(http://site.icu-project.org/). I tweaked the code for testing so that
ICU gets passed 2千3百, and it seems ICU currently splits this case into
4 words too, while 二千三百 is split into 二千 and 三百.
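If you want to reproduce that experiment, a standalone test along
these lines should show the boundaries ICU picks (this is just an
untested sketch against ICU's public C++ API, not the actual tweak I
made to the Xapian code):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        // Word-boundary iterator; the root locale is enough for a
        // quick test.
        std::unique_ptr<icu::BreakIterator> bi(
            icu::BreakIterator::createWordInstance(icu::Locale::getRoot(),
                                                   status));
        if (U_FAILURE(status)) return 1;

        icu::UnicodeString text = icu::UnicodeString::fromUTF8("2千3百");
        bi->setText(text);

        // Walk the boundaries and print each segment ICU reports.
        int32_t start = bi->first();
        for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
             start = end, end = bi->next()) {
            icu::UnicodeString word;
            text.extractBetween(start, end, word);
            std::string utf8;
            word.toUTF8String(utf8);
            std::cout << '[' << utf8 << "]\n";
        }
    }

Build with something like "g++ seg.cc $(pkg-config --cflags --libs
icu-uc)" (the exact flags depend on how ICU is installed).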
> The mapping between Unicode and the Chinese digits looks like this:
> > Chinese 1-5: 一二三四五
> > in Unicode: \u4E00\u4E8C\u4E09\u56DB\u4E94
>
> I can't figure out the rules behind the Unicode codepoints of the
> Chinese digits, and am almost convinced the encoding's designers
> didn't consider them :(.
I think sometimes the codepoint allocations match the order in an
existing encoding, but I'm not sure that's the case here. It
would certainly be more logical for the digits to be consecutive
codepoints, as they are in ASCII.
> So I check whether a character is a Chinese digit using a static
> set which stores them.
I think we probably want to avoid a static std::set for such a check -
it's likely to need to be initialised at runtime (at least with current
compilers). Given the list of characters is known at compile-time we
can probably build some sort of fixed mapping for checking these, e.g. a
sorted static const unsigned int array can be searched using
std::binary_search() in O(log(n)), which is the same O() as std::set
gives.
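Something like this is what I have in mind (just a sketch - the list
below covers the common digit and unit characters but is probably not
exhaustive):

    #include <algorithm>
    #include <iterator>

    // Codepoints of common Chinese numerals, sorted so that
    // std::binary_search() can be used.  Covers the digits plus the
    // unit characters 十, 百, 千, 万 and 亿.
    static const unsigned int chinese_digits[] = {
        0x3007, // 〇 0
        0x4E00, // 一 1
        0x4E03, // 七 7
        0x4E07, // 万 10^4
        0x4E09, // 三 3
        0x4E24, // 两 2 (colloquial)
        0x4E5D, // 九 9
        0x4E8C, // 二 2
        0x4E94, // 五 5
        0x4EBF, // 亿 10^8
        0x516B, // 八 8
        0x516D, // 六 6
        0x5341, // 十 10
        0x5343, // 千 10^3
        0x56DB, // 四 4
        0x767E, // 百 10^2
        0x96F6, // 零 0
    };

    static bool is_chinese_digit(unsigned int ch) {
        return std::binary_search(std::begin(chinese_digits),
                                  std::end(chinese_digits), ch);
    }

This needs no runtime initialisation, and the whole array fits in a
cache line or two, so in practice the lookup should be at least as
fast as a std::set.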
> (1)
> > { "", "我有两千3百块钱", "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
>
> > The two outputs, which were expected to be equal, were:
> > "3百:1 两[3] 两千:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 有两:1 钱[6]"
> > "两千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"
>
> (2)
> > { "", "我有 2千3百块钱", "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"},
>
> > The two outputs, which were expected to be equal, were:
> > "2[3] 3百:1 千[4] 块[5] 块钱:1 我[1] 我有:1 有[2] 钱[6]"
> > "2千3百:1 块[3] 块钱:1 我[1] 我有:1 有[2] 钱[4]"
>
> (3)
> > { "", "there are 3万4千零1 apples", "3万4千零1:1 apples[4] are[2] there[1] "}
>
> > The two outputs, which were expected to be equal, were:
> > "3[3] 4千零1:1 apples[5] are[2] there[1] 万[4]"
> > "3万4千零1:1 apples[4] are[2] there[1] "
>
> ex-1 shows the case where the mixed number starts with a Chinese
> digit; to enable it, my current plan would need to check every CJK
> character to see if it is a Chinese digit, and that cost seems
> unacceptable.
>
> ex-2 and ex-3 show the cases where there is a non-CJK character
> before the first Latin digit, so the Latin digit is consumed by the
> TermGenerator and never gets sent to CJKNgramIterator.
> To enable these cases, my plan would need the mixed numbers to be
> handled in the TermGenerator itself. However, this would hurt both
> performance and modularity.
>
> With these considerations in mind, I'm unsure whether these cases
> should be supported.
I think the questions to ask are whether these cases occur in practice,
and how well they work without special handling vs with special
handling.
If we can come up with a set of "use cases" of numbers which seems we
should be able to handle better then we can think about what we can
achieve to improve things, and how to implement that cleanly and
efficiently.
> I have some suspicions based on Google's search results for "2千3百".
>
> I suppose they use both the whole number token and its ngrams as
> keywords. From the results and the highlighted text, it looks like
> the searched keyword list contains the ngrams of the number token
> as well as the whole token.
>
> And I also believe they apply some mapping (or stemming?) to the
> number, as the transformed keywords '三百' (3百) and '二千' (2千)
> appear frequently in the highlighted text.
I'm not sure exactly what they're doing.
But I think a plausible approach along those sorts of lines would be
to aim to normalise numbers written in different scripts to a single
form, so 2千3百, 二千三百, and 2300 would all be indexed as
the same term (and eventually so would 2300 represented in Arabic and
other scripts), so the user could search for any of these and find
documents using the others. Sort of like how terms from English text
are case-folded and stemmed.
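To make that concrete, here's a toy sketch of such a normaliser (the
helpers value_of() and normalise_number() are hypothetical, not
anything in xapian-core; it handles simple forms like the examples
above, but gets compound prefixes such as 二十三万 wrong, and ignores
error handling):

    #include <string>

    // Numeric value of a codepoint: 0-9 for ASCII or Chinese digits,
    // a multiplier for unit characters, or -1 if it's not numeric.
    static long value_of(char32_t ch) {
        if (ch >= U'0' && ch <= U'9') return ch - U'0';
        switch (ch) {
            case U'〇': case U'零': return 0;
            case U'一': return 1;
            case U'二': case U'两': return 2;
            case U'三': return 3;
            case U'四': return 4;
            case U'五': return 5;
            case U'六': return 6;
            case U'七': return 7;
            case U'八': return 8;
            case U'九': return 9;
            case U'十': return 10;
            case U'百': return 100;
            case U'千': return 1000;
            case U'万': return 10000;
        }
        return -1;
    }

    // normalise_number(U"2千3百") == normalise_number(U"二千三百")
    //                             == normalise_number(U"2300") == 2300
    static unsigned long normalise_number(const std::u32string& s) {
        unsigned long total = 0, current = 0;
        for (char32_t ch : s) {
            long v = value_of(ch);
            if (v < 0) return 0; // not a number we understand
            if (v < 10) {
                current = current * 10 + v; // accumulate digits
            } else {
                // Apply the multiplier; a bare unit means 1 of it.
                total += (current ? current : 1) * v;
                current = 0;
            }
        }
        return total + current;
    }

The indexer could then emit the decimal form (2300) as the term for
all three spellings, much as stemming maps "fishing" and "fished" to
a single term.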
Cheers,
Olly