[Xapian-tickets] [Xapian] #46: zero byte cleanliness in C# and Java bindings

Xapian nobody at xapian.org
Mon Jul 25 13:12:54 BST 2011


#46: zero byte cleanliness in C# and Java bindings
-----------------------------+----------------------------------------------
 Reporter:  olly             |        Owner:  olly     
     Type:  defect           |       Status:  assigned 
 Priority:  normal           |    Milestone:  1.2.x    
Component:  Xapian-bindings  |      Version:  SVN trunk
 Severity:  minor            |   Resolution:           
 Keywords:                   |    Blockedby:           
 Platform:  All              |     Blocking:           
-----------------------------+----------------------------------------------

Old description:

> Current status:
>
> Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which seems to
> disappear from the returned string on output
>
> Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl on
> when returned
>
> C#: Truncates at \0 on input
>
> ----
> ''Original description:''
>
> Check for zero byte cleanness wherever strings are used.  There are a
> number of c_str()s in the code, but I believe all in the core library
> are harmless at 2002-04-29.  There may be other zero
> byte issues though.  xapian-applications/dbtools also uses c_str() where
> it
> should probably use data() and length().  xapian-bindings hasn't been
> checked.

New description:

 Current status:

 Java (SWIG-based): \0 in Java -> \xc0\x80 in Xapian, which reappears as \0
 in Java when returned

 Tcl: \0 in Tcl -> \xc0\x80 in Xapian, which reappears as \0 in Tcl when
 returned

 C#: Truncates at \0 on input

 ----
 ''Original description:''

 Check for zero byte cleanness wherever strings are used.  There are a
 number of c_str()s in the code, but I believe all in the core library
 are harmless at 2002-04-29.  There may be other zero byte issues though.
 xapian-applications/dbtools also uses c_str() where it should probably
 use data() and length().  xapian-bindings hasn't been checked.

--

Comment(by olly):

 I must have had an unclean tree or something.  SWIG-based Java bindings
 are just like Tcl - from the Java side they appear zero-byte clean, but
 actually in C++ we see \xc0\x80 for Java \0.
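
 For illustration, the \xc0\x80 encoding can be reproduced in pure Java.
 This is only a sketch: it assumes the bytes seen in C++ are Java's
 "modified UTF-8" (the encoding JNI's GetStringUTFChars() returns), and it
 uses DataOutputStream.writeUTF(), which is documented to use the same
 encoding, as a stand-in.  The class and helper names are hypothetical and
 are not part of the bindings.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encode a string as Java "modified UTF-8" via DataOutputStream.writeUTF().
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            // writeUTF() prefixes the data with a 2-byte length; skip it.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\0b";
        // Modified UTF-8 encodes U+0000 as the two bytes c0 80.
        System.out.println(hex(modifiedUtf8(s)));                    // 61 c0 80 62
        // Standard UTF-8 keeps it as a single zero byte.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // 61 00 62
    }
}
```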

 I tried a quick patch to use GetStringCritical() and convert to UTF-8
 ourselves using Xapian's Unicode support, which would mean \0 in Java
 <-> \0 in C++, and would also convert surrogate pairs in Java's
 representation properly to/from UTF-8 in C++.  However, timing some
 operations which do a lot of string passing shows this is twice as slow,
 so I'm parking it for now.  I'll attach it here so it doesn't get lost.
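
 To show why the surrogate pair handling matters, here is a small pure-Java
 sketch (not the attached patch).  It compares standard UTF-8 with Java's
 modified UTF-8 for one character outside the BMP, again assuming that
 DataOutputStream.writeUTF() is a fair stand-in for what the JNI layer
 hands to C++.  The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SurrogatePairDemo {
    // Standard UTF-8: a surrogate pair becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8: each surrogate is encoded separately as 3 bytes.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            // Skip the 2-byte length prefix written by writeUTF().
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F600 is stored in a Java String as the pair D83D DE00.
        String s = "\uD83D\uDE00";
        System.out.println(standardUtf8(s).length); // 4
        System.out.println(modifiedUtf8(s).length); // 6
    }
}
```

 The 6-byte form is not valid UTF-8, so C++ code which expects UTF-8 would
 not see it as a single character.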

 Added a Java testcase in r15917, which just checks that the roundtripping
 works.

-- 
Ticket URL: <http://trac.xapian.org/ticket/46#comment:21>
Xapian <http://xapian.org/>