[Xapian-discuss] UTF-8 Corruption

Fri Mar 14 23:14:56 GMT 2008

Hi All

I was wondering if anyone has ever come across a problem I seem to be
having. I'm indexing text files using some basic code written in C++.
The text files may be in UTF-8, ISO 8859-1 or possibly (but very
rarely) some other encoding - I have no way of knowing.

Question is, does Xapian convert non-UTF-8 characters when it stores
the document? I think I read that UTF-8 is the default encoding for
Xapian, which is exactly what I am after.

The reason I'm asking is that I am getting some seriously corrupted
characters in the index. When Tomcat tries to display the search
results I get a "sun.io.MalformedInputException". I have set the
page's charset to UTF-8, and apparently this error is thrown when the
stream reader detects byte sequences that are not valid UTF-8.
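
For illustration, here is a minimal sketch of the kind of check that
could be run on each file's text before it is handed to the indexer.
It assumes that anything which is not well-formed UTF-8 is ISO 8859-1,
which is only a guess and will be wrong for the rare files in some
other encoding. The function names (is_valid_utf8, latin1_to_utf8,
to_utf8_guess) are hypothetical helpers, not part of the Xapian API.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects overlong forms,
// UTF-16 surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }          // plain ASCII
        std::size_t len;
        unsigned long cp;
        if (c >= 0xC2 && c <= 0xDF)      { len = 2; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
        else return false;   // stray continuation or invalid lead byte
        if (i + len > n) return false;            // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;    // overlong
        if (len == 4 && cp < 0x10000) return false;  // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;             // out of range
        i += len;
    }
    return true;
}

// Converts ISO 8859-1 bytes to UTF-8. Every Latin-1 byte maps directly
// to the Unicode code point with the same value.
std::string latin1_to_utf8(const std::string& s) {
    std::string out;
    out.reserve(s.size() * 2);
    for (char ch : s) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// ASSUMPTION: input that is not valid UTF-8 is treated as ISO 8859-1.
std::string to_utf8_guess(const std::string& raw) {
    return is_valid_utf8(raw) ? raw : latin1_to_utf8(raw);
}
```

Passing each file's contents through to_utf8_guess() before indexing
would at least mean that only well-formed UTF-8 reaches the index,
although text in some third encoding would still come out wrong.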

I know my query may seem naive, but I would really appreciate any
insight you may be willing to offer on this.

Many thanks

Colin