Ask for advice on exact requirements to fix #699 mixed CJK numbers

Mon Mar 11 01:09:55 GMT 2019

On Sat, Mar 09, 2019 at 11:41:08AM +0800, outdream wrote:
> Thanks for your patience.
> I'm still confused about what I should do next.
> 
> If it's not worth changing anything here as it's a rare case,
[...]

I'm not sure I'm really able to judge this part, as I know very little
Chinese...

> Or roll back the current modification to cjk-tokenizer and
> try to do some work with the stemming?

...but a way to normalise numbers when indexing and searching seems like
it would address not only the situation noted in the ticket, but also
the wider problem of searching for numbers across languages, as well as
within a language which has multiple ways of writing a number.  So this
seems like a better solution.

While this normalising of numbers is analogous to stemming of words, I
don't think the number normalising wants to be done in the stemmers as
it's not directly connected to stemming words in the language.

I'd suggest, at least to start with, just hard-coding the special
handling of numbers (there's already some special handling, such as
check_infix() vs check_infix_digit()).
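
To illustrate the sort of hard-coded handling I mean, here's a rough
sketch.  It is not existing Xapian code: digit_value() and
normalise_number() are made-up names, and it assumes the term has
already been decoded to code points.  It only handles positional digit
strings (fullwidth digits, or forms like 二〇一九), and deliberately
ignores multiplier forms using 十/百/千, which would need real parsing.

```cpp
#include <string>

// Hypothetical helper: return the value 0-9 of a code point we treat as
// a decimal digit, or -1 if it isn't one.
int digit_value(char32_t ch) {
    if (ch >= U'0' && ch <= U'9')
        return static_cast<int>(ch - U'0');
    if (ch >= 0xFF10 && ch <= 0xFF19)      // FULLWIDTH DIGIT ZERO..NINE
        return static_cast<int>(ch - 0xFF10);
    switch (ch) {
        case 0x3007: return 0;  // 〇
        case 0x4E00: return 1;  // 一
        case 0x4E8C: return 2;  // 二
        case 0x4E09: return 3;  // 三
        case 0x56DB: return 4;  // 四
        case 0x4E94: return 5;  // 五
        case 0x516D: return 6;  // 六
        case 0x4E03: return 7;  // 七
        case 0x516B: return 8;  // 八
        case 0x4E5D: return 9;  // 九
    }
    return -1;
}

// Hypothetical helper: if every code point in `term` is a recognised
// digit, set `out` to the equivalent ASCII digits and return true.
// Otherwise return false and leave `out` unchanged.
bool normalise_number(const std::u32string& term, std::string& out) {
    if (term.empty())
        return false;
    std::string result;
    for (char32_t ch : term) {
        int d = digit_value(ch);
        if (d < 0)
            return false;
        result += static_cast<char>('0' + d);
    }
    out = result;
    return true;
}
```

The same function would need to be applied to query terms at search
time, so that both sides agree on the normalised form.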

It may make sense to abstract out the number normalisation somehow (some
sort of separate "number stemmer" maybe?), but if we try to abstract it
out to start with it'll take longer to get something working, and it's
quite likely we'll find we got the abstraction wrong and have to rework
it anyway.

Cheers,
    Olly