[Xapian-tickets] [Xapian] #550: Omega script enhancement: $prettyurl
Xapian
nobody at xapian.org
Fri Jun 13 11:11:07 BST 2014
#550: Omega script enhancement: $prettyurl
-------------------------+-----------------------------
Reporter: catkin | Owner: olly
Type: enhancement | Status: assigned
Priority: normal | Milestone: 1.3.3
Component: Omega | Version:
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+-----------------------------
Comment (by james):
Replying to [comment:10 olly]:
> So we do have to deal with an authority section, but we only need to
> worry about decoding, not encoding. None of {{{[]@}}} are valid in
> hostnames IIRC, but they could be seen in a username or password. Having
> those in search result links seems unlikely, but perhaps we should do some
> basic parsing of the URL and limit what we decode here.
{{{[]}}} are only for IP literals, so always decoding them is probably
safe, as no one seems to use them for IPv6 anyway. However, if we were
considering parsing the URL, we could probably follow the RFC more
precisely, which has different reserved characters for different portions.
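As a rough sketch of what "different reserved characters for different
portions" could mean in practice (Python for brevity, not Omega code): the
function name `decode_component` and the `KEEP_ENCODED` sets are
hypothetical, and the sets are a deliberately conservative assumption, not
a complete reading of RFC 3986. It also ignores the special case of a
{{{:}}} in the first segment of a scheme-less relative path.

```python
import re

# Characters left percent-encoded in each component, because decoding them
# could change how the URL is parsed. Illustrative assumption only.
KEEP_ENCODED = {
    "authority": set("/?#@:[]%"),
    "path": set("/?#%"),
    "query": set("#&=+%"),
}

def decode_component(text, component):
    """Undo %XX escapes for printable ASCII that is safe in this component."""
    keep = KEEP_ENCODED[component]

    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Leave space, control characters and top-bit-set bytes alone here;
        # those need charset-aware handling.
        if ch in keep or not (0x20 < ord(ch) < 0x7F):
            return match.group(0)
        return ch

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)

print(decode_component("a%3Ab%2Fc", "path"))
print(decode_component("user%40host", "authority"))
```

With these sets, an escaped {{{:}}} in a path is decoded but an escaped
{{{/}}} is not, and an escaped {{{@}}} in the authority is left as it is.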
> I'm aware {{{http:bad.html}}} is valid - it just doesn't mean the same
> as {{{http%3Abad.html}}} (the "bad" is that it's bad to undo the percent
> encoding there). And {{{http:http:bad.html}}} was a test to see if an
> unencoded {{{:}}} works if there is an explicit scheme (which it seems
> to).
My understanding of {{{http:http:bad.html}}} is that it gets parsed as
{{{scheme=http:[relative-path=http:bad.html]}}}, with an empty authority
and other pieces, because {{{:}}} doesn't need escaping in path segments
(unless it's the first one and there's no scheme, which doesn't apply
here). (The collected ABNF in RFC 3986 seems to actually spell this out,
although it's considerably less clear if you read through the RFC from top
to bottom. Sigh.)
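For illustration, the generic splitting regex given in RFC 3986 Appendix B
gives the same split. This is a sketch in Python rather than anything from
Omega; `split_uri` is a hypothetical helper name, and the regex only
splits a reference into components, it does not validate them.

```python
import re

# Generic URI-splitting regex from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

for uri in ("http:http:bad.html", "http:bad.html", "http%3Abad.html"):
    print(uri, "->", split_uri(uri))
```

The first case yields scheme {{{http}}}, no authority and path
{{{http:bad.html}}}; the percent-encoded form has no scheme at all and is
just a relative path, which is why undoing that escape changes the meaning.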
> Probably the next step should actually be to try to handle top-bit-set
> characters. For these, I think we just need to make sure that they're
> valid for the character set the page is in, though I've not done any tests
> yet.
There are also IRIs (RFC 3987) for going full Unicode, and IDNA (RFC 5890
et al) for internationalised domain names (in the authority). However, I
suspect that these may conflict with, e.g., a page in ISO-8859-2 and a
query string that has been encoded for the page (which will probably "just
work"). It may be that we need one filter that interprets as UTF-8 and
reverses IRI/IDNA escaping for prettifying, separate from one that can work
in codepages.
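A small sketch of the conflict, in Python rather than Omega code. The
helper names `decode_escapes` and `decode_idna_host` are hypothetical, and
note that Python's built-in {{{idna}}} codec implements the older IDNA 2003
(RFC 3490), not RFC 5890, so it only stands in for the idea here.

```python
from urllib.parse import unquote

def decode_escapes(text, charset):
    """Percent-decode text as charset; return None if the bytes are invalid."""
    try:
        return unquote(text, encoding=charset, errors="strict")
    except UnicodeDecodeError:
        return None

def decode_idna_host(host):
    """Turn an ASCII (punycode) hostname back into its Unicode form."""
    return host.encode("ascii").decode("idna")

# The same letter, escaped for an ISO-8859-2 page and for UTF-8.
print(decode_escapes("q=%B3", "iso-8859-2"))
print(decode_escapes("q=%B3", "utf-8"))      # None: not valid UTF-8
print(decode_escapes("q=%C5%82", "utf-8"))
print(decode_idna_host("xn--bcher-kva.example"))
```

The escape written for the ISO-8859-2 page is rejected when read as UTF-8,
while a single-byte charset will accept nearly any byte sequence, so a
UTF-8 filter and a codepage filter cannot reliably be the same one.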
> Incidentally, I also tested with the browser on my Android phone, and
> results are in line with the other mainstream browsers I tried. I'm not
> sure what this browser is called (the "about" dialog just shows the
> useragent string, which seems to include the name of just about every web
> browser I can think of).
Android browser is a variant of Chrom[e|ium], I believe.
--
Ticket URL: <http://trac.xapian.org/ticket/550#comment:11>
Xapian <http://xapian.org/>