<div dir="ltr"><div>Thanks for your

<span class="gmail-dictBing-CdefItem_Def">considerate</span>


 suggestion.</div><div>I think it may be the most suitable approach for the current case. <br></div><div><br></div><div>I plan to 

fix the issue by adding cases to 

check_infix() and check_infix_digit().</div><div> A mixed number like '2千3百', which <br></div><div>starts with an Arabic digit such as 2, would then be tokenized as a single token. 

<div>And with the 

<span class="gmail-dictBing-CdefItem_Def">compiler's optimization of 'switch' statements,<br></span></div><div><span class="gmail-dictBing-CdefItem_Def">I think the 

<span class="gmail-dictBing-CdefItem_Def">efficiency</span> would also be sufficient.</span>
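<div><br></div><div>A minimal sketch of the idea, to make the plan concrete. This is not the actual Xapian code: the helper names (is_ascii_digit, is_cjk_number_infix, number_token_length) and the list of numeral characters are assumptions made only for illustration.</div>
<pre>
```cpp
// Hypothetical sketch, not the real check_infix()/check_infix_digit():
// all names and the character list below are illustrative assumptions.

// Return true for ASCII digits.
inline bool is_ascii_digit(char32_t ch) {
    switch (ch) {
        case U'0': case U'1': case U'2': case U'3': case U'4':
        case U'5': case U'6': case U'7': case U'8': case U'9':
            return true;
    }
    return false;
}

// Return true for Chinese numeral unit characters that may appear inside
// a number which started with an Arabic digit.
inline bool is_cjk_number_infix(char32_t ch) {
    switch (ch) {
        case 0x5341: // 十 (ten)
        case 0x767E: // 百 (hundred)
        case 0x5343: // 千 (thousand)
        case 0x4E07: // 万 (ten thousand)
        case 0x4EBF: // 亿 (hundred million)
            return true;
    }
    return false;
}

// Length in code points of the number token starting at p, or 0 if p does
// not start with an ASCII digit. p must point to a NUL-terminated string.
inline unsigned number_token_length(const char32_t* p) {
    if (!is_ascii_digit(*p)) return 0;
    unsigned len = 1;
    while (is_ascii_digit(p[len]) || is_cjk_number_infix(p[len])) ++len;
    return len;
}
```
</pre>
<div>With these assumptions, '2千3百' is consumed as one token of four characters, while text that starts with '千' is not treated as a number token and is left to the CJK tokenizer.</div>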


</div></div><div><br></div><div>If you have more tips for the implementation, <br></div><div>or if I have misunderstood anything, please let me know.</div><div><br></div><div>Cheers,</div><div>outdream<br>


</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Olly Betts <<a href="mailto:olly@survex.com">olly@survex.com</a>> wrote on Mon, Mar 11, 2019 at 9:09 AM:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Sat, Mar 09, 2019 at 11:41:08AM +0800, outdream wrote:<br>

> Thanks for your patience.<br>

> I'm still confused about what I should do next.<br>

> <br>

> If it's not worth changing anything here as it's a rare case,<br>

[...]<br>

<br>

I'm not sure I'm really able to judge this part, as I know very little<br>

Chinese...<br>

<br>

> Or rollback current modification to cjk-tokenizer and<br>

> try to do some work with the stemming?<br>

<br>

...but a way to normalise numbers when indexing and searching seems like<br>

it would address the situation noted in the ticket, but also address the<br>

wider problem of searching for numbers across languages as well as<br>

within a language which has multiple ways of writing a number.  So this<br>

seems like a better solution.<br>

<br>

While this normalising of numbers is analogous to stemming of words, I<br>

don't think the number normalising wants to be done in the stemmers as<br>

it's not directly connected to stemming words in the language.<br>

<br>

I'd suggest, at least to start with, just hard-coding the special<br>

handling of numbers (there's already some special handling, such as<br>

check_infix() vs check_infix_digit()).<br>

<br>

It may make sense to abstract out the number normalisation somehow (some<br>

sort of separate "number stemmer" maybe?), but if we try to abstract it<br>

out to start with it'll take longer to get something working, and it's<br>

quite likely we'll find we got the abstraction wrong and have to rework<br>

it anyway.<br>

<br>

Cheers,<br>

    Olly<br>

</blockquote></div>
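<div><br></div><div>For the number normalisation discussed in the quoted message, a hard-coded first version might look roughly like the sketch below. Everything here is an assumption for illustration: the names is_ascii_digit, unit_value and normalise_number are hypothetical, only the units 十, 百 and 千 are handled, and larger units such as 万 (which multiply a whole group) and overflow are deliberately left out.</div>
<pre>
```cpp
// Hypothetical sketch of normalising a mixed number to its numeric value.
// Names and supported characters are illustrative assumptions only.

// Return true for ASCII digits.
inline bool is_ascii_digit(char32_t ch) {
    switch (ch) {
        case U'0': case U'1': case U'2': case U'3': case U'4':
        case U'5': case U'6': case U'7': case U'8': case U'9':
            return true;
    }
    return false;
}

// Multiplier for a Chinese numeral unit, or 0 if ch is not a supported unit.
inline unsigned long unit_value(char32_t ch) {
    switch (ch) {
        case 0x5341: return 10;   // 十
        case 0x767E: return 100;  // 百
        case 0x5343: return 1000; // 千
    }
    return 0;
}

// Convert a number such as "2千3百" or "2300" to its value. Parsing stops
// at the first code point that is neither an ASCII digit nor a supported
// unit. p must point to a NUL-terminated string.
inline unsigned long normalise_number(const char32_t* p) {
    unsigned long total = 0;   // sum of completed "digits times unit" groups
    unsigned long pending = 0; // digits read since the last unit
    for (; *p; ++p) {
        unsigned long unit = unit_value(*p);
        if (is_ascii_digit(*p)) {
            pending = pending * 10 + (*p - U'0');
        } else if (unit) {
            // A bare unit such as "十" counts as one times the unit.
            total += (pending ? pending : 1) * unit;
            pending = 0;
        } else {
            break;
        }
    }
    return total + pending;
}
```
</pre>
<div>Under these assumptions, '2千3百' and '2300' both normalise to 2300, so indexing and searching could use the same term for either spelling.</div>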