[Xapian-tickets] [Xapian] #150: Enhancements to Unicode support

Xapian nobody at xapian.org
Fri Jan 19 03:24:09 GMT 2024


#150: Enhancements to Unicode support
-------------------------+-------------------------------
 Reporter:  Olly Betts   |             Owner:  Olly Betts
     Type:  enhancement  |            Status:  assigned
 Priority:  normal       |         Milestone:  2.0.0
Component:  QueryParser  |           Version:  git master
 Severity:  minor        |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Changes (by Olly Betts):

 * version:  SVN trunk => git master

Comment:

 Re Unicode Normalisation:

 I think the workable approach is to provide an "opinionated"
 implementation where we pick one normalisation and only support that (we
 essentially do that for encodings - Xapian features which care about an
 encoding only support UTF-8).

 A composed form is probably the more sensible choice here:

 * Snowball stemmers all support that and few (maybe none) support
 decomposed forms
 * It makes for smaller terms
 * It seems to be by far the dominant form in real-world data
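
 As a rough illustration of the "smaller terms" point, here is a minimal
 sketch using Python's standard unicodedata module (not Xapian code; the
 sample word is just an assumption picked for the example) comparing the
 UTF-8 size of a composed and a decomposed form:

```python
import unicodedata

# "café" written with a precomposed U+00E9 (escaped so the source file's
# own encoding does not matter).
word = "caf\u00e9"

nfc = unicodedata.normalize("NFC", word)  # "é" stays one code point
nfd = unicodedata.normalize("NFD", word)  # "é" becomes "e" + U+0301

# Terms are stored as UTF-8, so compare encoded byte lengths.
nfc_len = len(nfc.encode("utf-8"))
nfd_len = len(nfd.encode("utf-8"))

print(nfc_len, nfd_len)  # prints "5 6"
```

 The difference is one byte per accented character here, which adds up for
 languages that use a lot of diacritics.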

 That means NFC or NFKC - the latter seems helpful in some cases (e.g.
 ligatures: "oﬃce" -> "office") but less so in others (e.g. "4²" -> "42").
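
 The two cases above can be checked with Python's unicodedata module. This
 is only a sketch to show the behaviour of the standard forms, not anything
 Xapian does today; the ligature is assumed to be U+FB03 (LATIN SMALL
 LIGATURE FFI):

```python
import unicodedata

samples = {
    "ligature": "o\ufb03ce",    # "o" + U+FB03 (ffi ligature) + "ce"
    "superscript": "4\u00b2",   # "4" + U+00B2 SUPERSCRIPT TWO
}

# Map each sample to its (NFC, NFKC) pair.
results = {
    name: (unicodedata.normalize("NFC", text),
           unicodedata.normalize("NFKC", text))
    for name, text in samples.items()
}

for name, (nfc, nfkc) in results.items():
    # NFC leaves both inputs alone; NFKC folds them to plain ASCII.
    print(name, ascii(nfc), ascii(nfkc))
```

 NFC leaves both strings unchanged, while NFKC gives "office" and "42", so
 the superscript loses its meaning as an exponent.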

 I think this needs a deeper analysis, but possibly we could define a
 subset of the Unicode compatibility equivalent forms to use here.
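
 One possible shape for such a subset, sketched in Python rather than as a
 proposal for the actual implementation: the Unicode character database
 tags each compatibility mapping (e.g. <compat>, <super>), so a filter could
 apply only the tags we choose and then compose. The function name
 fold_selected and the choice of allowed tags are hypothetical, and <compat>
 is itself a coarse tag covering much more than ligatures, so a real subset
 would likely need a finer list.

```python
import unicodedata

# Hypothetical policy: fold mappings tagged <compat> (which includes the
# Latin ligatures) but leave others such as <super> untouched.
ALLOWED_TAGS = frozenset({"<compat>"})


def fold_selected(text, allowed=ALLOWED_TAGS):
    """Apply only the allowed compatibility mappings, then compose (NFC)."""
    out = []
    for ch in text:
        # decomposition() returns e.g. "<compat> 0066 0066 0069", or
        # "0065 0301" for a canonical mapping, or "" if there is none.
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in allowed:
            mapped = "".join(chr(int(cp, 16)) for cp in parts[1:])
            # Mappings can nest, so fold the result as well.
            out.append(fold_selected(mapped, allowed))
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


print(ascii(fold_selected("o\ufb03ce")))  # ligature is expanded
print(ascii(fold_selected("4\u00b2")))    # superscript is preserved
```

 With this policy the ligature becomes "office" while "4²" is left alone,
 and canonically equivalent input still ends up in composed form.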
-- 
Ticket URL: <https://trac.xapian.org/ticket/150#comment:13>
Xapian <https://xapian.org/>