[Xapian-tickets] [Xapian] #150: Enhancements to Unicode support
Xapian
nobody at xapian.org
Mon Apr 6 14:20:09 BST 2009
#150: Enhancements to Unicode support
-------------------------+--------------------------------------------------
 Reporter:  olly         |       Owner:  olly
     Type:  enhancement  |      Status:  assigned
 Priority:  normal       |   Milestone:  2.0.0
Component:  QueryParser  |     Version:  SVN trunk
 Severity:  minor        |  Resolution:
 Keywords:               |   Blockedby:
 Platform:  All          |    Blocking:
-------------------------+--------------------------------------------------
Description changed by olly:
Old description:
> This bug is intended to just gather together enhancements we'd like to
> make to our Unicode support.
>
> Currently I'm aware of:
>
> * Special cases for case conversion:
> http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Case_Mappings
> and in particular:
> http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing.txt
>
> * Normalisation (mostly combining accents):
> http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Decompositions_and_Normalization
>
> * Unicode has rules for identifying word boundaries, which we should
> investigate and perhaps use more of. For example, we currently handle a
> space followed by a non-spacing mark wrongly.
>
> I'd imagine we would probably want to target most such changes at a ".0"
> release, for reasons of database compatibility. There are probably cases
> where it would be reasonable to implement such changes sooner though - if
> we build a different database in a case where the existing behaviour is
> poor, or the difference isn't problematic for some other reason, say.
New description:
This bug is intended to just gather together enhancements we'd like to
make to our Unicode support.
Currently I'm aware of:
 * Special cases for case conversion:
   http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Case_Mappings
   and in particular:
   http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing.txt
 * Normalisation (mostly combining accents):
   http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Decompositions_and_Normalization
 * Unicode has rules for identifying word boundaries, which we should
   investigate and perhaps use more of. For example, we currently handle a
   space followed by a non-spacing mark wrongly.
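To make the three items concrete, here is a minimal sketch using Python's
standard unicodedata module. It is not Xapian code and says nothing about
what Xapian's own C++ routines currently return; the sample strings and
variable names are assumptions chosen purely for illustration.

```python
import unicodedata

# Special casing: one code point can map to several when case-converted.
sharp_s = "\u00df"                  # LATIN SMALL LETTER SHARP S
upper_sharp_s = sharp_s.upper()     # full mapping, as listed in SpecialCasing.txt

# Normalisation: base letter + combining accent versus the precomposed form.
decomposed = "e\u0301"              # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"              # LATIN SMALL LETTER E WITH ACUTE
nfc = unicodedata.normalize("NFC", decomposed)

# Word boundaries: find a non-spacing mark (category Mn) right after a space.
text = "a \u0301b"
mark_after_space = [
    i for i in range(1, len(text))
    if text[i - 1].isspace() and unicodedata.category(text[i]) == "Mn"
]

print(len(sharp_s), len(upper_sharp_s))               # lengths before/after upper-casing
print(decomposed == precomposed, nfc == precomposed)  # raw comparison vs. after NFC
print(mark_after_space)                               # positions of the awkward marks
```

The last check only locates the sequence mentioned above; deciding which
word (if any) the mark belongs to is the part the Unicode word boundary
rules would have to answer.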
I'd imagine we would probably want to target most such changes at a ".0"
release, for reasons of database compatibility. There are probably cases
where it would be reasonable to implement such changes sooner though - if
we build a different database in a case where the existing behaviour is
poor, or the difference isn't problematic for some other reason, say.
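The compatibility concern can be sketched with a toy index. This is only
an illustration under stated assumptions: old_term and new_term are
hypothetical stand-ins for "existing" and "enhanced" term generation, and
a Python set stands in for the database. Neither reflects Xapian's actual
implementation.

```python
import unicodedata

def old_term(word):
    # Hypothetical existing behaviour: lower-case only, no normalisation.
    return word.lower()

def new_term(word):
    # Hypothetical enhanced behaviour: normalise to NFC, then lower-case.
    return unicodedata.normalize("NFC", word).lower()

# Toy "database": the set of terms generated at index time.
document_words = ["Cafe\u0301"]     # 'e' + COMBINING ACUTE ACCENT (decomposed)
old_index = {old_term(w) for w in document_words}

query = "cafe\u0301"                # the same decomposed spelling

# Old rules against the old index: the terms agree.
matched_old = old_term(query) in old_index

# New rules against the old index: the query term is now precomposed,
# so it no longer matches what was stored.
matched_new_on_old = new_term(query) in old_index

# New rules against a rebuilt index: the terms agree again.
new_index = {new_term(w) for w in document_words}
matched_after_reindex = new_term(query) in new_index

print(matched_old, matched_new_on_old, matched_after_reindex)
```

In this sketch a query that worked before the change stops matching until
the database is rebuilt, which is the kind of break that argues for a
".0" release.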
--
Ticket URL: <http://trac.xapian.org/ticket/150#comment:10>
Xapian <http://xapian.org/>