[Xapian-tickets] [Xapian] #46: zero byte cleanliness in C# and Java bindings

Thu Mar 23 03:09:57 GMT 2023

#46: zero byte cleanliness in C# and Java bindings
-----------------------------+-------------------------------
 Reporter:  Olly Betts       |             Owner:  Olly Betts
     Type:  defect           |            Status:  assigned
 Priority:  normal           |         Milestone:  1.4.x
Component:  Xapian-bindings  |           Version:  SVN trunk
 Severity:  minor            |        Resolution:
 Keywords:                   |        Blocked By:
 Blocking:                   |  Operating System:  All
-----------------------------+-------------------------------

Old description:

> Current status:
>
> Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as
> \0 in Java when returned
>
> Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when
> returned
>
> C#: Truncates at \0 on input
>
> ----
> ''Original description:''
>
> Check for zero byte cleanness wherever strings are used.  There are a
> number of c_str()s in the code, but I believe all in the core library
> are harmless as of 2002-04-29.  There may be other zero byte issues
> though.  xapian-applications/dbtools also uses c_str() where it should
> probably use data() and length().  xapian-bindings hasn't been checked.

New description:

 Current status:

 Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0
 in Java when returned; a character >= U+10000 is misencoded as a surrogate
 pair, with each surrogate codepoint separately encoded into UTF-8

 Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when
 returned

 C#: Truncates at \0 on input, which seems to be a SWIG or pinvoke
 limitation; characters >= U+10000 seem to work (at least on Linux)

 ----
 ''Original description:''

 Check for zero byte cleanness wherever strings are used.  There are a
 number of c_str()s in the code, but I believe all in the core library
 are harmless as of 2002-04-29.  There may be other zero byte issues
 though.  xapian-applications/dbtools also uses c_str() where it should
 probably use data() and length().  xapian-bindings hasn't been checked.

--
Comment (by Olly Betts):

 Testing with current git master:

 * Java: Still passes `\0` as `\xc0\x80` in UTF-8, and still encodes
 codepoints >= U+10000 as separate high and low surrogate codepoints in
 UTF-8, which is invalid
 * C#: Still truncates at a zero byte; codepoints >= U+10000 seem to
 work (on Linux at least)
 * Tcl: Still passes `\0` as `\xc0\x80` in UTF-8
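
 The Java results above are consistent with what a "modified UTF-8" style
 encoder produces.  A minimal Python sketch of such an encoder, for
 comparison with standard UTF-8 (`modified_utf8_encode` is a hypothetical
 helper written for illustration; it is not part of Xapian or its
 bindings):

```python
def modified_utf8_encode(text: str) -> bytes:
    """Emulate the encoding seen from Java above (illustrative only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL becomes the overlong two-byte form instead of a zero byte.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then encode each surrogate
            # as its own three-byte sequence (invalid in standard UTF-8).
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    for sample in ("a\0b", "\U0001F600"):
        standard = sample.encode("utf-8").hex(" ")
        modified = modified_utf8_encode(sample).hex(" ")
        print(standard, "->", modified)
```

 For `a\0b` this gives the bytes `a\xc0\x80b`, and for U+1F600 it gives six
 bytes (`ed a0 bd ed b8 80`) where standard UTF-8 uses four (`f0 9f 98 80`).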

 A note on testing this: `doc.get_description()` can reveal zero bytes
 being passed in an invalid encoding, since it gives `Document(docid=0,
 data=a\xc0\x80b)` for input `a\0b`.

 The same trick didn't work for surrogate pairs because `get_description()`
 didn't escape them and they converted back to the same Java string we fed
 in.  However this is really a bug - we should treat surrogate pair code
 units encoded as UTF-8 as invalid, like we already do for overlong
 sequences.  [d390be27767642fae189b10d82e902d5cb2f30ed] implements this
 change, and this trick now works for testing the surrogate case too.
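
 The intended rule can be cross-checked outside Xapian: Python's strict
 UTF-8 decoder already rejects both overlong sequences and encoded
 surrogates, so it can serve as a reference for which byte strings should
 count as invalid.  A sketch (the helper names `is_valid_utf8` and
 `escape_invalid` are hypothetical, and the escaping only approximates
 what `get_description()` prints; this is not Xapian's implementation):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed standard UTF-8."""
    try:
        data.decode("utf-8")  # strict: rejects overlongs and surrogates
    except UnicodeDecodeError:
        return False
    return True


def escape_invalid(data: bytes) -> str:
    """Decode data, showing each invalid byte as a \\xNN escape."""
    return data.decode("utf-8", errors="backslashreplace")


if __name__ == "__main__":
    samples = (
        b"a\xc0\x80b",                 # overlong encoding of NUL
        b"\xed\xa0\xbd\xed\xb8\x80",   # surrogate pair encoded as UTF-8
        b"\xf0\x9f\x98\x80",           # U+1F600 in standard UTF-8
    )
    for raw in samples:
        print(is_valid_utf8(raw), escape_invalid(raw))
```

 The first two samples are reported as invalid and come back fully
 escaped, which is the property that makes the escaped description usable
 as a test for the surrogate case.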

 [b8509b41c920e36ecfdd8a1a0b95e73f5f738b66] adds testcases where we didn't
 already have them (with a crude XPASS report if any of the cases we know
 to fail starts to pass).
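
 The XPASS idea in general terms: a case marked as a known failure is
 expected to give the wrong answer, and an unexpected pass is reported
 distinctly so the expectation gets updated.  A generic Python sketch (the
 `classify` helper and its result labels are illustrative assumptions, not
 the code from the commit above):

```python
def classify(actual: str, expected: str, known_failure: bool = False) -> str:
    """Label one test result, flagging unexpected passes as XPASS."""
    if actual == expected:
        # A known-failing case that now passes should be noticed.
        return "XPASS" if known_failure else "PASS"
    return "XFAIL" if known_failure else "FAIL"


if __name__ == "__main__":
    # Hypothetical escaped outputs for the input a\0b.
    print(classify("a\\xc0\\x80b", "a\\x00b", known_failure=True))
    print(classify("a\\x00b", "a\\x00b", known_failure=True))
```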
-- 
Ticket URL: <https://trac.xapian.org/ticket/46#comment:24>
Xapian <https://xapian.org/>