[Xapian-tickets] [Xapian] #150: Enhancements to Unicode support
Xapian
nobody at xapian.org
Tue Jun 20 01:08:43 BST 2023
#150: Enhancements to Unicode support
-------------------------+-------------------------------
Reporter: Olly Betts | Owner: Olly Betts
Type: enhancement | Status: assigned
Priority: normal | Milestone: 2.0.0
Component: QueryParser | Version: SVN trunk
Severity: minor | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+-------------------------------
Comment (by Olly Betts):
> omindex assumes text files are UTF-8 (although the UTF-8 parsing falls
> back to ISO-8859-1 for invalid UTF-8 sequences and is used for both term
> and sample generation). But we could use "libmagic" to do "charset
> detection"
I had a quick look at doing so, but libmagic isn't actually useful for
what we want - it only seems to report `binary`, `us-ascii`,
`iso-8859-1`, `utf-8` or `unknown-8bit` (the last for some files in
cp-1252, the Microsoft embrace-and-extend superset of iso-8859-1). The
binary files aren't text files, and omindex should already handle the rest
correctly because it falls back to treating invalid UTF-8 text as cp-1252.
To be useful here we need something which can actually detect non-Unicode
encodings, and ideally also which iso-8859-N is in use.
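
For illustration, the fallback behaviour described above can be sketched in
Python roughly as follows. This is not omindex's actual code (omindex is
C++); the handler name `cp1252_fallback` and the function `decode_text` are
made up for this example, and the choice to use Latin-1 for the few bytes
cp-1252 leaves undefined is an assumption.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: reinterpret bytes that are not valid UTF-8 as cp-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    try:
        text = bad.decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) are undefined in Python's cp1252 codec;
        # assume Latin-1 for the whole invalid run in that case.
        text = bad.decode("latin-1")
    # Resume decoding as UTF-8 just after the invalid bytes.
    return text, err.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_text(data: bytes) -> str:
    """Decode as UTF-8, falling back to cp-1252 only for invalid sequences."""
    return data.decode("utf-8", errors="cp1252_fallback")
```

With this approach pure ASCII and valid UTF-8 input pass through unchanged,
and typical cp-1252 or iso-8859-1 text also decodes sensibly, which is why
libmagic's answers add nothing. It does not help with other iso-8859-N or
non-Unicode encodings, which is the gap noted above.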
--
Ticket URL: <https://trac.xapian.org/ticket/150#comment:12>
Xapian <https://xapian.org/>