<div dir="ltr"><div>Thanks for your
<span class="gmail-dictBing-CdefItem_Def">considerate</span>
suggestion.</div><div>I think it may be the most suitable approach for the current case. <br></div><div><br></div><div>I plan to
fix the issue by adding cases to
check_infix() and check_infix_digit().</div><div> Mixed numbers like '2千3百', which <br></div><div>start with an Arabic digit such as 2, would then be tokenized as a single token.
<div>And since the
<span class="gmail-dictBing-CdefItem_Def">compiler optimizes 'switch' statements,<br></span></div><div><span class="gmail-dictBing-CdefItem_Def">I think the
<span class="gmail-dictBing-CdefItem_Def">efficiency</span> should also be sufficient.</span>
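<div><br></div><div>To make the idea concrete, here is a rough sketch of what I have in mind. The names check_cjk_number_unit() and number_token_length() are hypothetical and only for illustration; they are not existing Xapian functions, and the real change would go into check_infix() and check_infix_digit() instead. I also assume the input is already decoded to UTF-32, and only a handful of common numeral units are listed.</div>

```cpp
// Hypothetical sketch, not Xapian's actual code.

// Returns ch if it is a Chinese numeral unit that may appear inside a
// number which started with an Arabic digit, otherwise returns 0.
// The list of units here is illustrative, not complete.
inline unsigned check_cjk_number_unit(unsigned ch) {
    switch (ch) {
        case 0x5341: // 十 (ten)
        case 0x767E: // 百 (hundred)
        case 0x5343: // 千 (thousand)
        case 0x4E07: // 万 (ten thousand)
        case 0x4EBF: // 亿 (hundred million)
            return ch;
    }
    return 0;
}

inline bool is_ascii_digit(unsigned ch) {
    return ch >= '0' && ch <= '9';
}

// Length (in code points) of the number token at the start of a
// NUL-terminated UTF-32 string, or 0 if the string does not start with an
// ASCII digit. ASCII digits and numeral units are both kept in the token,
// so '2千3百' stays together as one token.
inline unsigned number_token_length(const char32_t* text) {
    if (!is_ascii_digit(text[0])) return 0;
    unsigned len = 1;
    while (is_ascii_digit(text[len]) || check_cjk_number_unit(text[len])) {
        ++len;
    }
    return len;
}
```

<div>With this sketch, number_token_length() on '2千3百' covers all four characters, while a string that starts with a numeral unit such as '千' is left alone, since only tokens beginning with an Arabic digit are extended.</div>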
</div></div><div><br></div><div>If you have more tips for the implementation, <br></div><div>or if I have misunderstood anything, please let me know.</div><div><br></div><div>Cheers,</div><div>outdream<br>
</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Olly Betts <<a href="mailto:olly@survex.com">olly@survex.com</a>> wrote on Mon, Mar 11, 2019 at 9:09 AM:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Sat, Mar 09, 2019 at 11:41:08AM +0800, outdream wrote:<br>
> Thanks for your patience.<br>
> I'm still confused about what I should do next.<br>
> <br>
> If it's not worth changing anything here as it's a rare case,<br>
[...]<br>
<br>
I'm not sure I'm really able to judge this part, as I know very little<br>
Chinese...<br>
<br>
> Or rollback current modification to cjk-tokenizer and<br>
> try to do some work with the stemming?<br>
<br>
...but a way to normalise numbers when indexing and searching seems like<br>
it would address the situation noted in the ticket, but also address the<br>
wider problem of searching for numbers across languages as well as<br>
within a language which has multiple ways of writing a number. So this<br>
seems like a better solution.<br>
<br>
While this normalising of numbers is analogous to stemming of words, I<br>
don't think the number normalising wants to be done in the stemmers as<br>
it's not directly connected to stemming words in the language.<br>
<br>
I'd suggest, at least to start with, just hard-coding the special<br>
handling of numbers (there's already some special handling, such as<br>
check_infix() vs check_infix_digit()).<br>
<br>
It may make sense to abstract out the number normalisation somehow (some<br>
sort of separate "number stemmer" maybe?), but if we try to abstract it<br>
out to start with it'll take longer to get something working, and it's<br>
quite likely we'll find we got the abstraction wrong and have to rework<br>
it anyway.<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div>