[Xapian-tickets] [Xapian] #150: Enhancements to Unicode support

Mon Apr 6 14:20:09 BST 2009

#150: Enhancements to Unicode support
-------------------------+--------------------------------------------------
 Reporter:  olly         |        Owner:  olly     
     Type:  enhancement  |       Status:  assigned 
 Priority:  normal       |    Milestone:  2.0.0    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  minor        |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------
Description changed by olly:

Old description:

> This bug is intended to just gather together enhancements we'd like to
> make to our Unicode support.
>
> Currently I'm aware of:
>
> * Special cases for case conversion:
> http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Case_Mappings
> and in particular:
> http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing.txt
>
> * Normalisation (mostly combining accents):
> http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Decompositions_and_Normalization
>
> * Unicode has rules for identifying word boundaries, which we should
> investigate and perhaps use more of.  For example, we currently handle a
> space followed by a non-spacing mark wrongly.
>
> I'd imagine we would probably want to target most such changes at a ".0"
> release, for reasons of database compatibility.  There are probably cases
> where it would be reasonable to implement such changes sooner though - if
> we build a different database in a case where the existing behaviour is
> poor, or the difference isn't problematic for some other reason, say.

New description:

 This bug is intended to just gather together enhancements we'd like to
 make to our Unicode support.

 Currently I'm aware of:

   * Special cases for case conversion:
   http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Case_Mappings
   and in particular:
   http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing.txt

   * Normalisation (mostly combining accents):
   http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Decompositions_and_Normalization

   * Unicode has rules for identifying word boundaries, which we should
   investigate and perhaps use more of.  For example, we currently handle a
   space followed by a non-spacing mark wrongly.
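
 As a rough illustration of the three items above, here is a small Python
 sketch.  It is not Xapian code and says nothing about how Xapian's C++
 implementation behaves; it only uses Python's standard str methods and the
 unicodedata module to show the kind of input each item is about.  The
 function names are made up for this example, and Python's Unicode tables
 are newer than the 5.0.0 data linked above.

```python
import unicodedata


def full_upper(text):
    """Upper-case using Python's built-in case mapping, which applies the
    unconditional multi-character mappings from SpecialCasing.txt."""
    return text.upper()


def nfc(text):
    """Compose base characters and combining accents (normalisation form C)."""
    return unicodedata.normalize("NFC", text)


def space_then_mark_positions(text):
    """Return the index of each whitespace character that is immediately
    followed by a non-spacing mark (general category Mn)."""
    return [
        i
        for i in range(len(text) - 1)
        if text[i].isspace() and unicodedata.category(text[i + 1]) == "Mn"
    ]


# Special casing: U+00DF (sharp s) upper-cases to two characters, so a
# one-code-point-in, one-code-point-out mapping cannot express it.
word = "stra\u00dfe"
print(len(word), len(full_upper(word)))

# Normalisation: "e" followed by U+0301 (combining acute) and the single
# precomposed code point U+00E9 compare equal only after normalising.
decomposed = "e\u0301"
precomposed = "\u00e9"
print(decomposed == precomposed, nfc(decomposed) == nfc(precomposed))

# Word boundaries: locate a space followed by a non-spacing mark, the
# sequence the ticket says is currently handled wrongly.
print(space_then_mark_positions("a \u0301b"))
```

 Text like the last example would make a reasonable test case when
 comparing any change against the Unicode word boundary rules.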

 I'd imagine we would probably want to target most such changes at a ".0"
 release, for reasons of database compatibility.  There are probably cases
 where it would be reasonable to implement such changes sooner though - if
 we build a different database in a case where the existing behaviour is
 poor, or the difference isn't problematic for some other reason, say.

-- 
Ticket URL: <http://trac.xapian.org/ticket/150#comment:10>
Xapian <http://xapian.org/>