Ask for advice on exact requirements to fix #699 mixed CJK numbers
outdream
a13700bc at gmail.com
Tue Mar 12 07:06:12 GMT 2019
Thanks for your considerate suggestion.
I think it may be the most suitable approach for the current case.
I plan to fix the issue by adding cases to check_infix() and
check_infix_digit().
Mixed numbers like '2千3百', which
start with an Arabic digit such as 2, would then be tokenized as one token.
Since the compiler can optimize a 'switch' statement,
I think the efficiency would also be sufficient.
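To make the idea concrete, here is a rough standalone sketch, not a patch
against the real code. The names check_cjk_number_char() and
mixed_number_length() are made up for illustration and are not existing
Xapian functions, and the list of unit characters is only an assumed
starting point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an existing Xapian function): returns ch if it is
// a CJK numeral unit that may continue a number, otherwise 0.
inline unsigned check_cjk_number_char(unsigned ch) {
    switch (ch) {
        case 0x5341: // 十 (ten)
        case 0x767E: // 百 (hundred)
        case 0x5343: // 千 (thousand)
        case 0x4E07: // 万 (ten thousand, simplified)
        case 0x842C: // 萬 (ten thousand, traditional)
            return ch;
    }
    return 0;
}

// Returns the length (in code points) of the number token starting at pos,
// or 0 if s[pos] is not an ASCII digit. Only tokens that start with an
// Arabic digit are extended across CJK unit characters.
inline std::size_t mixed_number_length(const std::u32string& s,
                                       std::size_t pos) {
    if (pos >= s.size() || s[pos] < U'0' || s[pos] > U'9') return 0;
    std::size_t end = pos + 1;
    while (end < s.size()) {
        char32_t ch = s[end];
        bool is_digit = (ch >= U'0' && ch <= U'9');
        if (!is_digit && !check_cjk_number_char(ch)) break;
        ++end;
    }
    return end - pos;
}
```

With this sketch, '2千3百' is covered as a single token of four characters,
while text that does not start with an Arabic digit is left alone.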
If you have more tips for the implementation,
or if I have misunderstood anything, please tell me.
Cheers,
outdream
On Mon, Mar 11, 2019 at 9:09 AM, Olly Betts <olly at survex.com> wrote:
> On Sat, Mar 09, 2019 at 11:41:08AM +0800, outdream wrote:
> > Thanks for your patience.
> > I'm still confused about what I should do next.
> >
> > If it's not worth changing anything here as it's a rare case,
> [...]
>
> I'm not sure I'm really able to judge this part, as I know very little
> Chinese...
>
> > Or rollback current modification to cjk-tokenizer and
> > try to do some work with the stemming?
>
> ...but a way to normalise numbers when indexing and searching seems like
> it would address the situation noted in the ticket, but also address the
> wider problem of searching for numbers across languages as well as
> within a language which has multiple ways of writing a number. So this
> seems like a better solution.
>
> While this normalising of numbers is analogous to stemming of words, I
> don't think the number normalising wants to be done in the stemmers as
> it's not directly connected to stemming words in the language.
>
> I'd suggest, at least to start with, just hard-coding the special
> handling of numbers (there's already some special handling, such as
> check_infix() vs check_infix_digit()).
>
> It may make sense to abstract out the number normalisation somehow (some
> sort of separate "number stemmer" maybe?), but if we try to abstract it
> out to start with it'll take longer to get something working, and it's
> quite likely we'll find we got the abstraction wrong and have to rework
> it anyway.
>
> Cheers,
> Olly
>