[Xapian-tickets] [Xapian] #550: Omega script enhancement: $prettyurl

Fri Jun 13 11:11:07 BST 2014

#550: Omega script enhancement: $prettyurl
-------------------------+-----------------------------
 Reporter:  catkin       |             Owner:  olly
     Type:  enhancement  |            Status:  assigned
 Priority:  normal       |         Milestone:  1.3.3
Component:  Omega        |           Version:
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-----------------------------

Comment (by james):

 Replying to [comment:10 olly]:

 > So we do have to deal with an authority section, but we only need to
 worry about decoding, not encoding.  None of {{{[]@}}} are valid in
 hostnames IIRC, but they could be seen in a username or password.  Having
 those in search result links seems unlikely, but perhaps we should do some
 basic parsing of the URL and limit what we decode here.

 {{{[]}}} are only used for IP literals, so always decoding them is
 probably safe, as no one seems to use IPv6 literals in URLs anyway.
 However, if we were considering parsing the URL, we could probably follow
 RFC 3986 more precisely, since it reserves different characters in
 different components.
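
 A minimal sketch of what per-component decoding could look like, in
 Python rather than Omega's C++. The function name, the component names
 and the exact "safe" sets are assumptions made for illustration, not
 Omega's behaviour; the path set also assumes the URL has an explicit
 scheme, so a decoded {{{:}}} cannot be mistaken for a scheme delimiter.

```python
import re
import string

# RFC 3986 "unreserved" characters: decoding these never changes meaning.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Extra characters treated as safe to show decoded, per component.
# These sets are an illustrative assumption, not a policy from the ticket.
SAFE_EXTRA = {
    "path": set(":@"),     # literal ':' and '@' are allowed in path segments
    "query": set(":@/?"),  # a query may also contain literal '/' and '?'
}

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def prettify_component(text, component):
    """Undo %XX escapes only where the decoded character is safe here."""
    safe = UNRESERVED | SAFE_EXTRA[component]

    def replace(match):
        char = chr(int(match.group(1), 16))
        # Anything outside the safe set (including bytes >= 0x80) is kept
        # percent-encoded.
        return char if char in safe else match.group(0)

    return _ESCAPE.sub(replace, text)


print(prettify_component("/a%3Ab%2Fc%20d", "path"))
print(prettify_component("q=a%2Fb%26c", "query"))
```

 An escaped {{{/}}} stays encoded in the path but is decoded in the query,
 which is the kind of difference a single global safe list cannot express.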

 > I'm aware {{{http:bad.html}}} is valid - it just doesn't mean the same
 as {{{http%3Abad.html}}} (the "bad" is that it's bad to undo the percent
 encoding there).  And {{{http:http:bad.html}}} was a test to see if an
 unencoded {{{:}}} works if there is an explicit scheme (which it seems
 to).

 My understanding of {{{http:http:bad.html}}} is that it gets parsed as
 {{{scheme=http:[relative-path=http:bad.html]}}}, with an empty authority
 and other pieces, because {{{:}}} doesn't need escaping in path segments
 (unless it's in the first segment and there's no scheme, which doesn't
 apply here). (The collected ABNF in RFC 3986 seems to actually spell this
 out, although it's considerably less clear if you read through the RFC
 from top to bottom. Sigh.)
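
 The split described above can be checked quickly with Python's
 {{{urllib.parse.urlsplit}}}. This is only a sanity check: that parser is
 not a strict RFC 3986 implementation, and {{{split3}}} is a helper name
 made up for this example.

```python
from urllib.parse import urlsplit


def split3(url):
    """Return (scheme, authority, path) as urlsplit sees them."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path)


for url in ("http:http:bad.html", "http:bad.html", "http%3Abad.html"):
    scheme, authority, path = split3(url)
    print(f"{url!r}: scheme={scheme!r} authority={authority!r} path={path!r}")
```

 The first URL comes back with scheme {{{http}}}, an empty authority and
 the path {{{http:bad.html}}}. The third has no scheme at all, so undoing
 its {{{%3A}}} would turn a relative path into a different URL.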

 > Probably the next step should actually be to try to handle top-bit-set
 characters.  For these, I think we just need to make sure that they're
 valid for the character set the page is in, though I've not done any tests
 yet.

 There are also IRIs (RFC 3987) for going full Unicode, and IDNA (RFC 5890
 et al.) for internationalised domain names (in the authority). However, I
 suspect that these may conflict with, e.g., a page in ISO-8859-2 and a
 query string that has been encoded for the page (which will probably "just
 work"). It may be that we need one filter that interprets escapes as UTF-8
 and reverses IRI/IDNA escaping for prettifying, separate from one that
 works in the page's own character set.
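
 To make the possible conflict concrete, here is a small Python sketch in
 which the same character is escaped differently depending on the charset.
 The function names are hypothetical, and Python's built-in {{{idna}}}
 codec implements the older IDNA 2003 rules rather than RFC 5890, so treat
 the host part as an approximation.

```python
from urllib.parse import unquote


def decode_escapes(text, charset):
    """Decode %XX escapes as `charset`; return None if the bytes are invalid."""
    try:
        return unquote(text, encoding=charset, errors="strict")
    except UnicodeDecodeError:
        return None


def pretty_host(host):
    """Turn ASCII "xn--" labels back into Unicode (IDNA 2003 rules)."""
    return host.encode("ascii").decode("idna")


# U+0142 is %C5%82 when escaped from UTF-8 but %B3 from ISO-8859-2.
print(decode_escapes("%C5%82", "utf-8"))
print(decode_escapes("%B3", "iso-8859-2"))
print(decode_escapes("%B3", "utf-8"))  # not valid UTF-8, so None
print(pretty_host("xn--bcher-kva.example"))
```

 A filter that assumes UTF-8 has to leave {{{%B3}}} alone, while a
 charset-aware one can decode it. Note that a single-byte charset such as
 ISO-8859-2 accepts almost any byte, so a validity check mainly helps for
 UTF-8.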

 > Incidentally, I also tested with the browser on my Android phone, and
 results are in line with the other mainstream browsers I tried.  I'm not
 sure what this browser is called (the "about" dialog just shows the
 useragent string, which seems to include the name of just about every web
 browser I can think of).

 The Android browser is a variant of Chrom[e|ium], I believe.

--
Ticket URL: <http://trac.xapian.org/ticket/550#comment:11>
Xapian <http://xapian.org/>