[Xapian-tickets] [Xapian] #292: incorrect translation of non-english HTMLS when charset entry is after title in head.

Xapian nobody at xapian.org
Tue Sep 2 13:53:50 BST 2008


#292: incorrect translation of non-english HTMLS when charset entry is after
title in head.
--------------------+-------------------------------------------------------
 Reporter:  rssh    |        Owner:  olly
     Type:  defect  |       Status:  new 
 Priority:  normal  |    Milestone:      
Component:  Omega   |      Version:      
 Severity:  normal  |   Resolution:      
 Keywords:          |    Blockedby:      
 Platform:  All     |     Blocking:      
--------------------+-------------------------------------------------------

Comment(by rssh):

 Replying to [comment:3 olly]:
 > Sorry, I don't understand your comment.

 Sorry, my fault.

 Long description:
   1 -- Suppose we have HTML where the http-equiv charset declaration comes
 below the title:
 <html>
  <title> Щось рідною мовою (Something in my native language) </title>
  <meta name="description" content="Моя сторінка (My page)">
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
 </html>
  When the charset is declared below the title, the HTML parser calls
 process_text on the title (htmlparse.cc line 218) with a value that is
 actually in windows-1251 but has already been converted to UTF-8 as if it
 were ISO-8859-1 (i.e. totally incorrect).
 To prevent this we must pass myhtmlparser the original text, not yet
 converted to UTF-8 (i.e. either move the conversion to UTF-8 into
 myhtmlparser, or change process_text to take two arguments: the UTF-8 text
 and the original text).
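
 To make the failure concrete, here is a small Python sketch (not Omega
 code). It first reproduces the wrong conversion on the first word of the
 title, then shows one way to keep the original bytes until the charset is
 known. The helper names sniff_charset and decode_after_sniffing are
 hypothetical, the regular expression is deliberately naive (it does not
 handle a quoted value such as charset="..."), and the ISO-8859-1 default is
 only assumed from the behaviour described above.

```python
import re

# Bytes of the title word "Щось" as a windows-1251 page would store them.
raw_title = "Щось".encode("windows-1251")

# The behaviour described in the ticket: the bytes are read as ISO-8859-1.
wrong = raw_title.decode("iso-8859-1")
# Decoding with the charset the page actually declares.
right = raw_title.decode("windows-1251")

assert wrong == "Ùîñü"   # mojibake, and this is what would reach the index
assert right == "Щось"


def sniff_charset(raw_html: bytes, default: str = "iso-8859-1") -> str:
    """Hypothetical helper: find an unquoted charset=... in the raw bytes."""
    match = re.search(rb"charset\s*=\s*([A-Za-z0-9_.:-]+)", raw_html, re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default


def decode_after_sniffing(raw_html: bytes) -> str:
    """Hypothetical helper: decode only once the charset has been found."""
    return raw_html.decode(sniff_charset(raw_html))


# The page from the ticket, with the charset declared after the title.
page = (
    "<html><title>Щось рідною мовою</title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
    "</html>"
).encode("windows-1251")

assert "Щось рідною мовою" in decode_after_sniffing(page)
```

 The point is only the order of operations: the text is converted once, after
 the whole head has been seen, instead of being converted up front with a
 guessed charset.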


 P.S. About the difference between ISO-8859-1 and Unicode:
 Here is the mapping:
 http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
 The two are identical from 00 to FF, i.e. for the first 256 code points
  (the first 128 are ASCII).
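
 The identity mapping can be checked with a few lines of Python (again only
 an illustration, using Python's built-in codecs rather than anything from
 Omega). It also shows why the wrong conversion never raises an error: every
 byte is a valid ISO-8859-1 character, so windows-1251 input is accepted and
 silently turned into the wrong letters.

```python
# Every byte 0x00..0xFF decodes to the code point with the same number.
identity = all(ord(bytes([b]).decode("iso-8859-1")) == b for b in range(256))

# The lower half (0x00..0x7F) coincides with ASCII.
ascii_half = all(
    bytes([b]).decode("ascii") == bytes([b]).decode("iso-8859-1")
    for b in range(128)
)

print(identity, ascii_half)  # True True
```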

-- 
Ticket URL: <http://trac.xapian.org/ticket/292#comment:4>
Xapian <http://xapian.org/>