[Xapian-tickets] [Xapian] #292: incorrect translation of non-english HTMLS when charset entry is after title in head.
Xapian
nobody at xapian.org
Tue Sep 2 13:53:50 BST 2008
#292: incorrect translation of non-english HTMLS when charset entry is after
title in head.
--------------------+-------------------------------------------------------
Reporter: rssh | Owner: olly
Type: defect | Status: new
Priority: normal | Milestone:
Component: Omega | Version:
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
--------------------+-------------------------------------------------------
Comment(by rssh):
Replying to [comment:3 olly]:
> Sorry, I don't understand your comment.
Sorry, my fault.
Long description:
1 -- Suppose we have an HTML page whose http-equiv charset declaration
comes after the title:
<html>
<title> Щось рідною мовою (Something in my native language) </title>
<description content="Моя сторінка (My page)" </description>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</html>
When the charset declaration comes after the title, the HTML parser calls
process_text on the title (htmlparse.cc line 218) with a value that is
actually in windows-1251 but has already been converted to UTF-8 as if it
were ISO-8859-1 (i.e. totally incorrect).
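
To illustrate the effect described above, here is a minimal Python sketch
(not Omega's code; the variable names are made up for this example). It
encodes the sample title as windows-1251, then decodes those bytes as
ISO-8859-1, which is the wrong assumption the parser effectively makes
before the charset declaration has been seen.

```python
# Sample title from the report, as it would be stored in a windows-1251 page.
title = "Щось рідною мовою"
raw = title.encode("windows-1251")

# Wrong: treat the bytes as ISO-8859-1 and convert that to UTF-8.
# Decoding never fails, because every byte is a valid ISO-8859-1 character,
# so the mistake goes unnoticed.
wrong = raw.decode("iso-8859-1")
wrong_utf8 = wrong.encode("utf-8")

# Right: decode with the charset the page actually declares.
right = raw.decode("windows-1251")

print(repr(wrong))  # accented Latin letters instead of Cyrillic
print(repr(right))
```

Because the wrong decoding raises no error, the garbled title is indexed
silently.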
To prevent this, we must pass myhtmlparser the original text, not yet
converted to UTF-8 (i.e. either move the conversion to UTF-8 into
myhtmlparser, or change process_text to receive two arguments: the UTF-8
text and the original text).
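
The first option (keep the raw bytes and convert only once the charset is
known) could look roughly like the Python sketch below. This is only an
illustration of the idea under simplifying assumptions: extract_title, the
two regular expressions and the ISO-8859-1 fallback are hypothetical and
are not how Omega's parser is written.

```python
import re

# Hypothetical, deliberately crude patterns; a real parser would tokenise.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)
TITLE_RE = re.compile(rb'<title[^>]*>(.*?)</title>',
                      re.IGNORECASE | re.DOTALL)


def extract_title(html_bytes, default_charset="iso-8859-1"):
    """Return the title, decoded only after the whole head has been scanned."""
    meta = META_CHARSET_RE.search(html_bytes)
    charset = meta.group(1).decode("ascii") if meta else default_charset
    found = TITLE_RE.search(html_bytes)
    if found is None:
        return ""
    raw_title = found.group(1)  # still undecoded bytes at this point
    try:
        return raw_title.decode(charset, errors="replace").strip()
    except LookupError:
        # Unknown charset name: fall back to the default.
        return raw_title.decode(default_charset, errors="replace").strip()


# The page from the report: the charset declaration follows the title.
page = (
    '<html>\n'
    '<title> Щось рідною мовою </title>\n'
    '<meta http-equiv="Content-Type" content="text/html;\n'
    'charset=windows-1251">\n'
    '</html>\n'
).encode("windows-1251")

print(extract_title(page))
```

The point of the sketch is only the ordering: nothing is converted to
UTF-8 until the declared charset has been found, so the position of the
meta tag relative to the title no longer matters.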
P.S. About the difference between ISO-8859-1 and Unicode:
Here is the mapping:
http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
It is the identity from 00 to FF, i.e. for the first 256 code points
(the first 128 are ASCII).
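
A quick way to check this claim, sketched in Python using its built-in
"iso-8859-1" codec (the variable names are just for this example):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# Each byte value n decodes to the Unicode code point U+00nn.
identity = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The first 128 values coincide with ASCII.
ascii_prefix = all_bytes[:128].decode("ascii") == decoded[:128]

print(identity, ascii_prefix)
```

This is also why the wrong conversion never fails: any byte sequence is
valid ISO-8859-1, so windows-1251 text passes through without an error.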
--
Ticket URL: <http://trac.xapian.org/ticket/292#comment:4>
Xapian <http://xapian.org/>