[Xapian-tickets] [Xapian] #150: Enhancements to Unicode support

Xapian nobody at xapian.org
Tue Jun 20 01:08:43 BST 2023


#150: Enhancements to Unicode support
-------------------------+-------------------------------
 Reporter:  Olly Betts   |             Owner:  Olly Betts
     Type:  enhancement  |            Status:  assigned
 Priority:  normal       |         Milestone:  2.0.0
Component:  QueryParser  |           Version:  SVN trunk
 Severity:  minor        |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Comment (by Olly Betts):

 > omindex assumes text files are UTF-8 (although the UTF-8 parsing falls
 back to ISO-8859-1 for invalid UTF-8 sequences and is used for both term
 and sample generation). But we could use "libmagic" to do "charset
 detection"

 I had a quick look at doing so, but libmagic isn't actually useful for
 what we want - it only ever seems to report `binary`, `us-ascii`,
 `iso-8859-1`, `utf-8` or `unknown-8bit` (for some files in cp-1252, the
 Microsoft embrace-and-extend superset of iso8859-1).  The binary files
 aren't text files, and omindex should already handle the rest correctly
 because it falls back to treating invalid UTF-8 text as cp-1252.
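
 As a rough illustration of that kind of fallback (a sketch in Python, not
 omindex's actual code), the following decodes a byte string as UTF-8 and
 reinterprets only the invalid byte sequences as cp-1252.  The handler name
 `cp1252_fallback` and the function `decode_text` are made up for this
 example.

```python
import codecs


def _cp1252_fallback(err):
    """Codec error handler: reinterpret invalid UTF-8 bytes as cp-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # A few byte values (e.g. 0x81) are undefined in cp-1252; those
    # become U+FFFD rather than raising.
    return bad.decode("cp1252", errors="replace"), err.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_text(data: bytes) -> str:
    """Decode as UTF-8, falling back to cp-1252 for invalid sequences."""
    return data.decode("utf-8", errors="cp1252_fallback")


if __name__ == "__main__":
    print(decode_text(b"caf\xc3\xa9"))   # valid UTF-8
    print(decode_text(b"caf\xe9"))       # single-byte e-acute
    print(decode_text(b"\x93hi\x94"))    # cp-1252 curly quotes
```

 With a fallback like this, pure ASCII, valid UTF-8 and typical
 ISO-8859-1 or cp-1252 input all decode sensibly, which is why the labels
 libmagic reports add nothing for those cases.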

 To be useful here we need something which can actually detect non-Unicode
 encodings, and ideally also which iso8859-N is in use.
-- 
Ticket URL: <https://trac.xapian.org/ticket/150#comment:12>
Xapian <https://xapian.org/>

