[Xapian-tickets] [Xapian] #46: zero byte cleanliness in C# and Java bindings
Xapian
nobody at xapian.org
Thu Mar 23 03:09:57 GMT 2023
#46: zero byte cleanliness in C# and Java bindings
-----------------------------+-------------------------------
Reporter: Olly Betts         | Owner: Olly Betts
Type: defect                 | Status: assigned
Priority: normal             | Milestone: 1.4.x
Component: Xapian-bindings   | Version: SVN trunk
Severity: minor              | Resolution:
Keywords:                    | Blocked By:
Blocking:                    | Operating System: All
-----------------------------+-------------------------------
Old description:
> Current status:
>
> Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as
> \0 in Java when returned
>
> Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when
> returned
>
> C#: Truncates at \0 on input
>
> ----
> ''Original description:''
>
> Check for zero byte cleanness wherever strings are used. There are a
> number of c_str()s in the code, but I believe all in the core library
> are harmless at 2002-04-29. There may be other zero
> byte issues though. xapian-applications/dbtools also uses c_str() where it
> should probably use data() and length(). xapian-bindings hasn't been
> checked.
New description:
Current status:
Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0
in Java when returned; a character >= U+10000 is misencoded as a surrogate
pair, with each surrogate encoded separately in UTF-8
Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when
returned
C#: Truncates at \0 on input, which seems to be a SWIG or P/Invoke
limitation; characters >= U+10000 seem to work (at least on Linux)
----
''Original description:''
Check for zero byte cleanness wherever strings are used. There are a
number of c_str()s in the code, but I believe all in the core library
are harmless at 2002-04-29. There may be other zero
byte issues though. xapian-applications/dbtools also uses c_str() where it
should probably use data() and length(). xapian-bindings hasn't been
checked.
--
Comment (by Olly Betts):
Testing with current git master:
* Java: Still passes UTF-8 containing `\xc0\x80` for a zero byte, and
codepoints >= U+10000 are encoded as separate high and low surrogates in
UTF-8, which is invalid
* C#: Still truncates at a zero byte; codepoints >= U+10000 seem to
work (on Linux at least)
* Tcl: Still passes UTF-8 containing `\xc0\x80` for a zero byte
A note on testing this: the output of `doc.get_description()` can reveal
a zero byte being passed with an invalid encoding, since we get
`Document(docid=0, data=a\xc0\x80b)` for input `a\0b`.
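The Java symptoms above match the encoding Java documents as "modified UTF-8", which writes U+0000 as the two bytes `\xc0\x80` and encodes each half of a surrogate pair as its own three-byte sequence. Whether the SWIG-generated wrappers actually take that path is an assumption here, not something stated in this ticket. The sketch below uses only the standard library (`DataOutputStream.writeUTF`, which also emits modified UTF-8) to show the byte patterns next to standard UTF-8; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of xapian-bindings.
class ModifiedUtf8Demo {
    // Format bytes[from..] as space-separated lowercase hex.
    static String hex(byte[] bytes, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < bytes.length; i++) {
            if (i > from) sb.append(' ');
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    // Encode s as Java "modified UTF-8" via DataOutputStream.writeUTF.
    static String modifiedUtf8Hex(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // writeUTF prepends a two-byte length; skip it to see the payload.
        return hex(buf.toByteArray(), 2);
    }

    // Encode s as standard UTF-8 for comparison.
    static String standardUtf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8), 0);
    }

    public static void main(String[] args) {
        String nul = "a\0b";
        String astral = new String(Character.toChars(0x1F600)); // a codepoint >= U+10000

        System.out.println(modifiedUtf8Hex(nul));     // 61 c0 80 62
        System.out.println(standardUtf8Hex(nul));     // 61 00 62
        System.out.println(modifiedUtf8Hex(astral));  // ed a0 bd ed b8 80
        System.out.println(standardUtf8Hex(astral));  // f0 9f 98 80
    }
}
```

The first line of output corresponds to the `a\xc0\x80b` seen in `get_description()` for input `a\0b`.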
The same trick didn't work for surrogate pairs because `get_description()`
didn't escape them and they converted back to the same Java string we fed
in. However this is really a bug - we should treat surrogate pair code
units encoded as UTF-8 as invalid, like we already do for overlong
sequences. [d390be27767642fae189b10d82e902d5cb2f30ed] implements this
change, and this trick now works for testing the surrogate case too.
[b8509b41c920e36ecfdd8a1a0b95e73f5f738b66] adds testcases where we don't
already have them (with a crude XPASS report if the cases we know to fail
should start to pass).
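To make the rule concrete, here is a minimal sketch of a strict UTF-8 check that rejects both overlong sequences and UTF-8-encoded surrogates. It is not the code from the commits referenced above (Xapian's own UTF-8 handling lives in the C++ core); `StrictUtf8` and `isStrictUtf8` are hypothetical names used only for illustration.

```java
// Hypothetical illustration; not Xapian's implementation.
class StrictUtf8 {
    // Returns true only for well-formed UTF-8: no overlong forms,
    // no surrogates (U+D800..U+DFFF), nothing above U+10FFFF.
    static boolean isStrictUtf8(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int lead = b[i] & 0xff;
            if (lead < 0x80) {          // ASCII, including a real zero byte
                i++;
                continue;
            }
            int len, cp, min;
            if ((lead & 0xe0) == 0xc0) {
                len = 2; cp = lead & 0x1f; min = 0x80;
            } else if ((lead & 0xf0) == 0xe0) {
                len = 3; cp = lead & 0x0f; min = 0x800;
            } else if ((lead & 0xf8) == 0xf0) {
                len = 4; cp = lead & 0x07; min = 0x10000;
            } else {
                return false;           // stray continuation byte or 0xf8+
            }
            if (i + len > b.length) return false;   // truncated sequence
            for (int k = 1; k < len; k++) {
                int cont = b[i + k] & 0xff;
                if ((cont & 0xc0) != 0x80) return false;
                cp = (cp << 6) | (cont & 0x3f);
            }
            if (cp < min) return false;                      // overlong, e.g. c0 80
            if (cp >= 0xd800 && cp <= 0xdfff) return false;  // encoded surrogate
            if (cp > 0x10ffff) return false;                 // beyond Unicode
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] overlongNul = {'a', (byte) 0xc0, (byte) 0x80, 'b'};
        byte[] surrogates = {(byte) 0xed, (byte) 0xa0, (byte) 0xbd,
                             (byte) 0xed, (byte) 0xb8, (byte) 0x80};
        byte[] valid = {(byte) 0xf0, (byte) 0x9f, (byte) 0x98, (byte) 0x80};
        System.out.println(isStrictUtf8(overlongNul)); // false
        System.out.println(isStrictUtf8(surrogates));  // false
        System.out.println(isStrictUtf8(valid));       // true
    }
}
```

With a check along these lines, the surrogate-pair bytes are flagged as invalid in the same way as `\xc0\x80`, which is what lets the `get_description()` trick expose both cases.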
--
Ticket URL: <https://trac.xapian.org/ticket/46#comment:24>
Xapian <https://xapian.org/>