[Xapian-discuss] UTF-8 Corruption
Colin Bell
colinabell at gmail.com
Fri Mar 14 23:14:56 GMT 2008
Hi All
I was wondering if anyone has ever come across a problem I seem to be
having. I'm indexing text files using some basic code written in C++.
The text files may be in UTF-8, ISO 8859-1 or possibly (but very
rarely) some other encoding - I have no way of knowing.
My question is: does Xapian convert non-UTF-8 characters to UTF-8 when
it stores the document? I think I read that UTF-8 is the default
encoding for Xapian, which is exactly what I am after.
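For illustration only, here is a minimal sketch of one way the input
could be normalised before it is handed to the indexer, assuming the
indexer stores whatever bytes it is given. The helper names
(is_valid_utf8, latin1_to_utf8, to_utf8_for_indexing) are hypothetical
and are not Xapian functions. Treating every input that is not valid
UTF-8 as ISO 8859-1 is also an assumption, and it will produce wrong
characters for files in any other encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong
// forms, UTF-16 surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }          // plain ASCII byte
        std::size_t len = 0;
        unsigned long cp = 0;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                        // invalid lead byte
        if (i + len > n) return false;            // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;       // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;          // outside Unicode range
        i += len;
    }
    return true;
}

// Hypothetical helper: re-encodes ISO 8859-1 bytes as UTF-8.
// Each Latin-1 byte maps directly to code point U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& s) {
    std::string out;
    out.reserve(s.size() * 2);
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Assumption: anything that is not valid UTF-8 is ISO 8859-1.
std::string to_utf8_for_indexing(const std::string& raw) {
    return is_valid_utf8(raw) ? raw : latin1_to_utf8(raw);
}
```

With helpers like these, the text read from each file would be passed
through to_utf8_for_indexing() before being given to the indexer, so
that only well-formed UTF-8 ends up in the database.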
The reason I'm asking is that I am getting some seriously corrupted
characters in the index. When the search results are displayed on
Tomcat I get a "sun.io.MalformedInputException". I have set the page's
charset to UTF-8, and apparently this error is thrown when the stream
reader detects bytes that are not valid UTF-8.
I know my query may seem naive, but I would really appreciate any
insight you may be willing to offer on this.
Many thanks
Colin