[Xapian-tickets] [Xapian] #150: Enhancements to Unicode support
Xapian
nobody at xapian.org
Fri Jan 19 03:24:09 GMT 2024
#150: Enhancements to Unicode support
-------------------------+-------------------------------
 Reporter:  Olly Betts   |             Owner:  Olly Betts
     Type:  enhancement  |            Status:  assigned
 Priority:  normal       |         Milestone:  2.0.0
Component:  QueryParser  |           Version:  git master
 Severity:  minor        |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Changes (by Olly Betts):
* version: SVN trunk => git master
Comment:
Re Unicode Normalisation:
I think the workable approach is to provide an "opinionated"
implementation where we pick one normalisation and only support that (we
essentially do that for encodings - Xapian features which care about an
encoding only support UTF-8).
A composed form is probably the more sensible choice here:
* Snowball stemmers all support that, and few (maybe none) support
  decomposed forms
* It makes for smaller terms
* It seems to be by far the dominant form that data is actually in
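To illustrate the "smaller terms" point, here is a minimal sketch using Python's standard unicodedata module (not Xapian code; the helper name utf8_len is made up for this example). It compares the UTF-8 size of the same word in composed and decomposed form:

```python
import unicodedata

def utf8_len(term: str, form: str) -> int:
    """Return the UTF-8 byte length of term after normalising it to form."""
    return len(unicodedata.normalize(form, term).encode("utf-8"))

word = "caf\u00e9"                  # "café" with a precomposed e-acute
composed = utf8_len(word, "NFC")    # e-acute stays one code point, U+00E9
decomposed = utf8_len(word, "NFD")  # becomes "e" followed by U+0301
print(composed, decomposed)
```

Each decomposed accented letter costs at least one extra byte here, so the saving grows with the number of accented characters in a term.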
That means NFC or NFKC - the latter seems helpful in some cases (e.g.
ligatures: "oﬃce" -> "office") but less so in others (e.g. "4²" -> "42").
I think this needs a deeper analysis, but possibly we could define a
subset of the Unicode compatibility equivalent forms to use here.
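As a rough sketch of what such a subset could look like (again plain Python with the standard unicodedata module, not anything Xapian implements), the code below folds only characters whose compatibility decomposition carries one of a chosen set of tags. The FOLDED_TAGS set and the selective_nfkc name are assumptions invented for this example; note that the "<compat>" tag covers much more than ligatures, so a real policy would need the deeper analysis mentioned above.

```python
import unicodedata

# Hypothetical policy: compatibility decomposition tags we choose to fold.
# "<super>" and "<sub>" are deliberately left out so "4²" keeps its exponent.
FOLDED_TAGS = {"<compat>", "<wide>", "<narrow>"}

def selective_nfkc(text: str) -> str:
    """Apply NFC, plus NFKC folding only for characters with an allowed tag."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomp = unicodedata.decomposition(ch)  # e.g. "<compat> 0066 0066 0069"
        tag = decomp.split(" ", 1)[0] if decomp.startswith("<") else ""
        if tag in FOLDED_TAGS:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    # Recompose in case folding produced new combining sequences.
    return unicodedata.normalize("NFC", "".join(out))

sample = "o\ufb03ce 4\u00b2"  # "oﬃce 4²" written with escapes
full = unicodedata.normalize("NFKC", sample)
partial = selective_nfkc(sample)
print(full)
print(partial)
```

With this policy the ligature is expanded while the superscript is preserved, whereas full NFKC folds both.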
--
Ticket URL: <https://trac.xapian.org/ticket/150#comment:13>
Xapian <https://xapian.org/>