Opened 10 years ago

Last modified 8 years ago

#14 new defect

update URI to IRI to match IRI to URI

Reported by: lmm@…
Owned by: masinter@…
Priority: major
Milestone:
Component: 3987bis
Version:
Severity: -
Keywords:
Cc:

Description

763,768c799,805
< <t>In some situations, for presentation and further processing,
< it is desirable to convert a URI into an equivalent IRI in which
< natural characters are represented directly rather than
< percent encoded. Of course, every URI is already an IRI in
< its own right without any conversion, and in general there
< This section gives one such procedure for this conversion.
---

<t>In some situations, for presentation and further processing, it is
desirable to convert a URI into an equivalent IRI in which natural
characters are represented directly. Of course, every URI is already
an IRI in its own right without any conversion.
Conversion from a URI to an IRI is an optional process, and
in general there are many possible such transformations; this
section gives just one such procedure.

================================================
(22) editorial, trying to make wording clearer

774,778c811,815
< conversion (except for potential case differences in percent-encoding
< and for potential percent-encoded unreserved characters).
<
< However, the IRI resulting from this conversion may differ
< from the original IRI (if there ever was one).</t>
---

conversion (string-equivalent except for potential case differences
in percent-encoding and for potential percent-encoded unreserved
characters). In the case of a URI originally generated by conversion
from an IRI, the IRI resulting from this conversion may differ from the
original one.</t>
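
To illustrate the equivalence described in this hunk, here is a minimal Python sketch that compares two URIs while ignoring the hex-digit case of percent-encodings and the percent-encoding of unreserved characters. The helper names `normalize_pct` and `string_equivalent` are hypothetical and do not come from the draft; the unreserved set is the one defined in RFC 3986.

```python
import re

# "unreserved" characters as defined in RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def normalize_pct(uri):
    """Uppercase percent-encoding hex digits and decode unreserved characters.

    Hypothetical helper: it only removes the two differences the text
    calls out and performs no other normalization.
    """
    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch  # a percent-encoded unreserved character
        return "%" + match.group(1).upper()  # only the hex case may differ

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)


def string_equivalent(a, b):
    """True if a and b differ only in the two ways named above."""
    return normalize_pct(a) == normalize_pct(b)
```

Under this sketch, `http://example.org/%7euser/D%c3%bcrst` and `http://example.org/~user/D%C3%BCrst` compare as equivalent, while `a%2Fb` and `a/b` do not, because "/" is a reserved character.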

============================================
(23) Trying to make the conversion from URI to IRI also based
on converting parsed components and reassembling, rather than
on the whole string. Also, just don't decode things, rather than
first decoding and then re-encoding them (is that wise?)

804,828c842,843
< </list></t>
<
< <t>Conversion from a URI to an IRI MAY be done by using the following
< steps:
<
< <list style="hanging">
< <t hangText="1.">Represent the URI as a sequence of octets in
< US-ASCII.</t>
<
< <t hangText="2.">Convert all percent-encodings ("%" followed by two
< hexadecimal digits) to the corresponding octets, except those
< corresponding to "%", characters in "reserved", and characters
< in US-ASCII not allowed in URIs.</t>
<
< <t hangText="3.">Re-percent-encode any octet produced in step 2 that
< is not part of a strictly legal UTF-8 octet sequence.</t>
<
<
< <t hangText="4.">Re-percent-encode all octets produced in step 3 that
< in UTF-8 represent characters that are not appropriate according
< to <xref target="abnf"/>, <xref target="visual"/>, and <xref
< target="limitations"/>.</t>
<
< <t hangText="5.">Interpret the resulting octet sequence as a sequence
< of characters encoded in UTF-8.</t>
---

</list>

</t>

830,833c845,847
< <t hangText="6.">URIs known to contain domain names in the reg-name
< component SHOULD convert punycode-encoded domain name labels to
< the corresponding characters using the ToUnicode procedure. </t>
< </list></t>
---

<t>

Steps for URI to IRI transformation, starting with a sequence
of characters (taken from the 7-bit US-ASCII repertoire).

835,843c849,910
< <t>This procedure will convert as many percent-encoded characters as
< possible to characters in an IRI. Because there are some choices when
< step 4 is applied (see <xref target="limitations"/>), results may
< vary.</t>
<
< <t>Conversions from URIs to IRIs MUST NOT use any character
< encoding other than UTF-8 in steps 3 and 4, even if it might be
< possible to guess from the context that another character encoding
< than UTF-8 was used in the URI. For example, the URI
---

<list style="hanging">

<t hangText="1.">Parse the URI according to the parsing rules
of section 3 of <xref target="RFC3986"/>.</t>

<t hangText="2.">Converting components to Unicode repertoire.
For each resulting parsed component from step 1:

<list style="hanging">

<t hangText="2.a. Query Component">

Because of special processing applied to query components
in some web browsers, leaving query components encoded
is advisable in some circumstances: if the
scheme is "http" or "https", and
the resulting IRI is not for use in a Unicode-only
environment, leave any percent-encoded query component
encoded.
This will avoid ambiguity when IRIs with query components
are used in contexts sensitive to the document character set.
</t>

<t hangText="2.b. Reg-name without %">

If the reg-name of the URI contains no "%", and the
ToUnicode mapping of [reference-to-idnabis] succeeds
on the reg-name, use the result of ToUnicode as the
new ireg-name component. (This will undo special
IDNA ToASCII conversions.) If the reg-name component
does contain a "%", use the "2.c. Other Components" step instead.
(This will undo any previous inappropriate %-encoding.)</t>

<t hangText="2.c. Other Components">

Convert all percent-encodings ("%" followed by two
hexadecimal digits) to the corresponding octets, except those that:

<list style="symbols">

<t>correspond to "%"</t>
<t>would result in a character in "reserved"</t>
<t>would result in a character in US-ASCII not allowed in URIs</t>
<t>would result in an octet not part of a strictly legal UTF-8
octet sequence</t>

<t>would result in an octet sequence that, in UTF-8,
represents a character that is not appropriate according to
<xref target="abnf"/>, <xref target="visual"/>, and <xref
target="limitations"/>.</t>

</list>

</t>

</list>

<t hangText="3.">Reassemble the translated parsed components
using the original punctuation used as delimiters in
step 1.</t>


<t>For most URIs that were originally converted from
an IRI (one with no "%" or percent-encoding, and no other inadvisable
characters), this will translate back into the original IRI.
Because there are some choices when
step 2 is applied (see <xref target="limitations"/>), results may
vary.</t>
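
The component-wise procedure proposed in this hunk (steps 1, 2.a, 2.c, and 3) can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the draft's reference implementation: the function names are hypothetical, the authority is passed through unchanged (step 2.b is not covered here), and `may_decode` uses Unicode general categories as a crude stand-in for the checks the draft defers to its "abnf", "visual", and "limitations" sections. The splitting regular expression is the one given in RFC 3986, Appendix B.

```python
import re
import unicodedata

# RFC 3986, Appendix B: splits a URI reference into its components.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def may_decode(ch):
    """Assumption: approximates the exception list of step 2.c."""
    if ord(ch) < 0x80:
        # Keeps "%", "reserved", and ASCII not allowed in URIs encoded.
        return ch in UNRESERVED
    # Stand-in for abnf/visual/limitations: keep control, format,
    # private-use, unassigned, and separator characters encoded.
    return unicodedata.category(ch)[0] not in "CZ"


def utf8_length(lead):
    """Sequence length implied by a UTF-8 lead octet, or 0 if not a lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_run(match):
    """Decode one run of percent-encodings, leaving exceptions untouched."""
    text = match.group(0)
    octets = bytes.fromhex(text.replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        n = utf8_length(octets[i])
        ch = None
        if n:
            try:
                ch = octets[i:i + n].decode("utf-8")  # strict UTF-8 only
            except UnicodeDecodeError:
                ch = None
        if ch is not None and may_decode(ch):
            out.append(ch)
            i += n
        else:
            out.append(text[3 * i:3 * i + 3])  # keep the original "%XX"
            i += 1
    return "".join(out)


def decode_component(component):
    return PCT_RUN.sub(decode_run, component)


def uri_to_iri(uri, unicode_only=False):
    """Steps 1, 2.a, 2.c, 3; the authority is left as-is in this sketch."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    out = []
    if scheme is not None:
        out.append(scheme + ":")
    if authority is not None:
        out.append("//" + authority)
    out.append(decode_component(path))
    if query is not None:
        # Step 2.a: leave http(s) queries encoded unless Unicode-only.
        keep = scheme is not None and scheme.lower() in ("http", "https") and not unicode_only
        out.append("?" + (query if keep else decode_component(query)))
    if fragment is not None:
        out.append("#" + decode_component(fragment))
    return "".join(out)
```

With this sketch, `uri_to_iri("http://www.example.org/D%C3%BCrst")` yields the path `/Dürst`, while `D%FCrst` is returned unchanged because the octet FC is not part of a legal UTF-8 sequence. Octets that are not decoded keep their original hex-digit case, since nothing is re-encoded.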

<t>Note that this process explicitly does not use any
character encoding other than UTF-8 in step 2.c,
even if it might be tempting
to guess from the context that another character encoding
than UTF-8 might have been used in the URI. For example, the URI

846,847c913,914
< iso-8859-1. It must not be converted to an IRI containing these
< e-acute characters. Otherwise, in the future the IRI will be mapped to
---

iso-8859-1. It MUST NOT be converted to an IRI containing these
e-acute characters. Otherwise, in the future the IRI might be mapped to

850a918

852a921,923

<t>[[NOTE: This section needs rewrite to match changes to
the conversion algorithm.]] </t>

854,855c925
< Each example shows the result after each of the steps 1 through 6 is
< applied. XML Notation is used for the final result. Octets are
---

XML Notation is used for the final result. Octets are

==========================================================
(24) need to make examples match algorithm (once algorithm
is finished)

865,870c935,938
< <t hangText="1.">http://www.example.org/D%C3%BCrst</t>
< <t hangText="2.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
< <t hangText="3.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
< <t hangText="4.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
< <t hangText="5.">http://www.example.org/D&amp;#xFC;rst</t>
< <t hangText="6.">http://www.example.org/D&amp;#xFC;rst</t>
---

<t hangText="0.">http://www.example.org/D%C3%BCrst</t>
<t hangText="1.">http www.example.org D&lt;c3&gt;&lt;bc&gt;rst</t>
<t hangText="2.">http www.example.org D&lt;c3&gt;&lt;bc&gt;rst</t>
<t hangText="3.">http://www.example.org/D&amp;#xFC;rst</t>

880,881c948
< sequence, it is re-percent-encoded in step 3.
<
---

sequence, it is left percent-encoded.

884,889c951,954
< <t hangText="1.">http://www.example.org/D%FCrst</t>
< <t hangText="2.">http://www.example.org/D&lt;fc&gt;rst</t>
< <t hangText="3.">http://www.example.org/D%FCrst</t>
< <t hangText="4.">http://www.example.org/D%FCrst</t>
< <t hangText="5.">http://www.example.org/D%FCrst</t>
< <t hangText="6.">http://www.example.org/D%FCrst</t>
---

<t hangText="0.">http://www.example.org/D%FCrst</t>
<t hangText="1.">http www.example.org D&lt;fc&gt;rst</t>
<t hangText="2.">http://www.example.org/D%FCrst</t>

896c961
< corresponding octets are re-percent-encoded in step 4. This example shows
---

corresponding octets are left percent-encoded in step 2. This example shows

898c963
< The example also contains a punycode-encoded domain name label (xn--99zt52a),
---

The example also contains an IDNA-encoded domain name label (xn--99zt52a),

909,912c974
<
< <t>Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
< (Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this note.))</t>
<
---

</t>
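
For the reg-name handling in step 2.b, a small Python sketch follows. It is only an approximation: the function name `reg_name_to_unicode` is hypothetical, and Python's built-in "idna" codec implements the IDNA 2003 ToUnicode operation, whereas the proposed text points at the IDNAbis work, so results could differ for some labels.

```python
def reg_name_to_unicode(reg_name):
    """Apply ToUnicode to a reg-name that contains no "%" (step 2.b).

    Hypothetical helper. If the reg-name contains "%", it is returned
    unchanged here, because step 2.c would apply instead.
    """
    if "%" in reg_name:
        return reg_name
    try:
        # The "idna" codec applies IDNA 2003 ToUnicode label by label.
        return reg_name.encode("ascii").decode("idna")
    except UnicodeError:
        # ToUnicode did not succeed; keep the original reg-name.
        return reg_name
```

For the label used in the ticket's example, `reg_name_to_unicode("xn--99zt52a.example.org")` returns a name whose first label is U+7D0D U+8C46, and a plain ASCII name such as `www.example.org` comes back unchanged.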

Change History (1)

comment:1 Changed 8 years ago by masinter@…

  • Owner set to masinter@…

During the editor meeting, everyone had trouble following the ticket. Larry will rewrite and resubmit.
