Timestamp:
Jan 9, 2012, 1:48:38 AM
Author:
duerst@…
Message:

Moved back in the original LEIRI section from draft-duerst-iri-bis-05.
Tweaked for publication as draft-ietf-iri-3987bis-09.

File:
1 edited

  • draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml

    r83 r85  
    8787</author>
    8888
    89 <date year="2011" />
     89<date year="2012" month="January" day="9" />
    9090<area>Applications</area>
    9191<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
     
    10751075</section> <!-- IRIuse -->
    10761076
    1077 <section title="Liberal Handling of Otherwise Invalid IRIs" anchor="LEIRIHREF">
    1078 
    1079 <t>
    1080 Some technical specifications and widely-deployed software have
    1081 allowed additional variations and extensions of IRIs to be used in
    1082 syntactic components. </t>
    1083 <t>Future technical specifications SHOULD NOT allow conforming
    1084 producers to produce, or conforming content to contain, such forms,
    1085 as they are not interoperable with other IRI consuming software.</t>
    1086 
    1087 <section title="LEIRI Processing"  anchor="LEIRIspec">
    1088   <t>This section defines Legacy Extended IRIs (LEIRIs).
    1089     The syntax of Legacy Extended IRIs is the same as that for &lt;IRI-reference>,
    1090     except that the ucschar production is replaced by the leiri-ucschar production:</t>
    1091 <figure>
    1092 
    1093 <artwork>
    1094   leiri-ucschar  = " " / "&lt;" / "&gt;" / DQUOTE / "{" / "}" / "|"
    1095                    / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
    1096                    / %xE000-FFFD / %x10000-10FFFF
    1097 </artwork>
    1098 
    1099 <postamble>
    1100   Among other extensions, processors based on this specification also
    1101   did not enforce the restriction on bidirectional formatting
    1102   characters in <xref target="Bidi"></xref>, and the iprivate
    1103   production becomes redundant.</postamble>
    1104 </figure>
    1105 
    1106 <t>To convert a string allowed as a LEIRI to an IRI, each character
    1107 allowed in leiri-ucschar but not in ucschar must be percent-encoded
    1108 using <xref target="compmapping"/>.</t>
    1109 </section> <!-- leiriproc -->
    1110 </section> <!-- LEIRIHREF -->
    1111 
    1112   <section title="Characters Disallowed or Not Recommended in IRIs" anchor="notAllowed">
    1113 
    1114 <t>This section provides a list of the groups of characters and code
    1115 points that are allowed in some contexts but are not allowed in IRIs or are
    1116 allowed in IRIs only in the query part. For each group of characters,
    1117 advice on the usage of these characters is also given, concentrating
    1118 on the reasons for why they are excluded from IRI use.</t>
    1119 
    1120 <t>
    1121 
    1122 <list><t>Space (U+0020): Some formats and applications use space as a
    1123 delimiter, e.g. for items in a list. Appendix C of <xref
    1124 target="RFC3986"></xref> also mentions that white space may have to be
    1125 added when displaying or printing long URIs; the same applies to long
    1126 IRIs. This means that spaces can disappear, or can cause what is
    1127 intended as a single IRI or IRI reference to be treated as two or more
    1128 separate IRIs.</t>
    1129 
    1130 <t>Delimiters "&lt;" (U+003C), "&gt;" (U+003E), and '"' (U+0022):
    1131 Appendix C of <xref target="RFC3986"></xref> suggests the use of
    1132 double-quotes ("http://example.com/") and angle brackets
    1133 (&lt;http://example.com/&gt;) as delimiters for URIs in plain
    1134 text. These conventions are often used, and also apply to IRIs.  Using
    1135 these characters in strings intended to be IRIs would result in the
    1136 IRIs being cut off at the wrong place.</t>
    1137 
    1138 <t>Unwise characters "\" (U+005C), "^" (U+005E), "`"
    1139 (U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These
    1140 characters originally have been excluded from URIs because the
    1141 respective codepoints are assigned to different graphic characters in
    1142 some 7-bit or 8-bit encoding. Despite the move to Unicode, some of
    1143 these characters are still occasionally displayed differently on some
    1144 systems, e.g. U+005C may appear as a Japanese Yen symbol on some
    1145 systems. Also, the fact that these characters are not used in URIs or
    1146 IRIs has encouraged their use outside URIs or IRIs in contexts that
    1147 may include URIs or IRIs. If a string with such a character were used
    1148 as an IRI in such a context, it would likely be interpreted
    1149 piecemeal.</t>
    1150 
    1151 <t>The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
    1152 #x9F): There is generally no way to transmit these characters reliably
    1153 as text outside of a charset encoding.  Even when in encoded form,
    1154 many software components silently filter out some of these characters,
    1155 or may stop processing altogether when encountering some of
    1156 them. These characters may affect text display in subtle, unnoticeable
    1157 ways or in drastic, global, and irreversible ways depending on the
    1158 hardware and software involved. The use of some of these characters
    1159 would allow malicious users to manipulate the display of an IRI and
    1160 its context in many situations.</t>
    1161 
    1162 <t>Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
    1163 characters affect the display ordering of characters. If IRIs were
    1164 allowed to contain these characters and the resulting visual display
    1165 transcribed, they could not be converted back to electronic form
    1166 (logical order) unambiguously. These characters, if allowed in IRIs,
    1167 might allow malicious users to manipulate the display of an IRI and its
    1168 context.</t>
    1169 
    1170 <t>Specials (U+FFF0-FFFD): These code points provide functionality
    1171 beyond that useful in an IRI, for example byte order identification,
    1172 annotation, and replacements for unknown characters and objects. Their
    1173 use and interpretation in an IRI would serve no purpose and might lead
    1174 to confusing display variations.</t>
    1175 
    1176 <t>Private use code points (U+E000-F8FF, U+F0000-FFFFD,
    1177 U+100000-10FFFD): Display and interpretation of these code points is
    1178 by definition undefined without private agreement. In any case, these
    1179 code points are not suited for use on the Internet. They are not
    1180 interoperable and may have unpredictable effects.</t>
    1181 
    1182 <t>Tags (U+E0000-E0FFF): These characters were intended to provide
    1183   a way to language tag in Unicode plain text. They are now deprecated <xref target="RFC6082" />.
    1184   In any case, they would not be appropriate for IRIs because
    1185 language information in identifiers cannot reliably be input,
    1186 transmitted (e.g. on a visual medium such as paper), or
    1187 recognized.</t>
    1188 
    1189 <t>Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
    1190 U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
    1191 U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
    1192 U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
    1193 U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
    1194 non-characters. Applications may use some of them internally, but are
    1195 not prepared to interchange them.</t>
    1196 
    1197 </list></t>
    1198 
    1199 <t>LEIRI preprocessing disallowed some code points and
    1200 code units:
    1201 
    1202 <list><t>Surrogate code units (D800-DFFF): These do not represent
    1203 Unicode codepoints.</t></list></t>
    1204 </section> <!-- notallowed -->
    1205 
     1077  <section title="Legacy Extended IRIs (LEIRIs)">
     1078    <t>For historic reasons, some formats have allowed variants of IRIs
     1079      that are somewhat less restricted in syntax. This section provides
     1080      a definition and a name (Legacy Extended IRI or LEIRI) for these
     1081      variants for easier reference. These variants have to be used with care;
     1082      they require further processing before being fully interchangeable as IRIs.
     1083      New protocols and formats SHOULD NOT use Legacy Extended IRIs.
     1084      Even where Legacy Extended IRIs are allowed, only IRIs fully conforming
     1085      to the syntax definition in <xref target="abnf"></xref> SHOULD be created,
     1086      generated, and used. The provisions in this section also apply to
     1087      Legacy Extended IRI references.</t>
     1088   
     1089    <section title="Legacy Extended IRI Syntax">
     1090      <figure>
     1091        <preamble>The syntax of Legacy Extended IRIs is the same as that for IRIs, except that ucschar is redefined as follows:</preamble>
     1092        <artwork>
     1093      ucschar        = " " / "&lt;" / "&gt;" / '"' / "{" / "}" / "|"
     1094      / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
     1095      / %xE000-FFFD / %x10000-10FFFF
     1096        </artwork>
     1097        <postamble>The restriction on bidirectional formatting characters in <xref target="Bidi"></xref> is lifted.
     1098        The iprivate production becomes redundant.</postamble>
     1099      </figure>
     1100      <t>Likewise, the syntax for Legacy Extended IRI references (LEIRI references)
     1101      is the same as that for IRI references with the above redefinition of ucschar applied.</t>
     1102      <t>Formats that use Legacy Extended IRIs or Legacy Extended IRI references
     1103        MAY further restrict the characters allowed therein, either implicitly
     1104        by the fact that the format as such does not allow some characters, or explicitly.
     1105        An example of a character not allowed implicitly may be the NUL character (U+0000).
     1106        However, all the characters allowed in IRIs MUST still be allowed.</t>
     1107    </section>
     1108    <section title="Conversion of Legacy Extended IRIs to IRIs" anchor="LEIRIspec">
     1109      <t>To convert a Legacy Extended IRI (reference) to
     1110      an IRI (reference), each character allowed in a Legacy Extended IRI (reference)
     1111      but not allowed in an IRI (reference) (see <xref target="notAllowed"></xref>)  MUST be percent-encoded
     1112      by applying steps 2.1 to 2.3 of <xref target="mapping"></xref>.</t>
     1113    </section>
     1114    <section title="Characters Allowed in Legacy Extended IRIs but not in IRIs" anchor="notAllowed">
     1115      <t>This section provides a list of the groups of characters and code points
     1116        that are allowed in Legacy Extended IRIs, but are not allowed in IRIs
     1117        or are allowed in IRIs only in the query part. For each group of characters,
     1118        advice on the usage of these characters is also given, concentrating on the
     1119        reasons for why not to use them.</t>
     1120      <t>
     1121        <list>
     1122          <t>Space (U+0020): Some formats and applications use space as a delimiter,
     1123            e.g. for items in a list. Appendix C of <xref target="RFC3986"></xref>
     1124            also mentions that white space may have to be added when displaying
     1125            or printing long URIs; the same applies to long IRIs.
     1126            This means that spaces can disappear, or can cause the Legacy Extended IRI
     1127            to be interpreted as two or more separate IRIs.</t>
     1128          <t>Delimiters "&lt;" (U+003C), "&gt;" (U+003E), and '"' (U+0022):
     1129            Appendix C of <xref target="RFC3986"></xref> suggests the use of
     1130            double-quotes ("http://example.com/") and angle brackets
     1131        (&lt;http://example.com/&gt;) as delimiters for URIs in plain text.
     1132        These conventions are often used, and also apply to IRIs.
     1133        Legacy Extended IRIs using these characters will be cut off at the wrong place.</t>
     1134          <t>Unwise characters "\" (U+005C),
     1135          "^" (U+005E), "`" (U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D):
     1136          These characters originally have been excluded from URIs because
     1137          the respective codepoints are assigned to different graphic characters
     1138          in some 7-bit or 8-bit encoding. Despite the move to Unicode,
     1139          some of these characters are still occasionally displayed differently
     1140          on some systems, e.g. U+005C as a Japanese Yen symbol.
     1141          Also, the fact that these characters are not used in URIs or IRIs
     1142          has encouraged their use outside URIs or IRIs in contexts that may
     1143          include URIs or IRIs. In case a Legacy Extended IRI with such a character
     1144          is used in such a context, the Legacy Extended IRI will be interpreted piecemeal.</t>
     1145          <t>The controls (C0 controls, DEL, and C1 controls, #x0  - #x1F  #x7F - #x9F):
     1146            There is no way to transmit these characters reliably except potentially
     1147            in electronic form. Even when in electronic form, some software components
     1148            might silently filter out some of these characters,
     1149            or may stop processing altogether when encountering some of them.
     1150            These characters may affect text display in subtle, unnoticeable ways
     1151            or in drastic, global, and irreversible ways depending
     1152            on the hardware and software involved.
     1153            The use of some of these characters may allow malicious users
     1154            to manipulate the display of a Legacy Extended IRI and its context.</t>
     1155          <t>Bidi formatting characters (U+200E, U+200F, U+202A-202E):
     1156            These characters affect the display ordering of characters.
     1157            Displayed Legacy Extended IRIs containing these characters
     1158            cannot be converted back to electronic form (logical order) unambiguously.
     1159            These characters may allow malicious users to manipulate
     1160            the display of a Legacy Extended IRI and its context.</t>
     1161          <t>Specials (U+FFF0-FFFD): These code points provide functionality
     1162            beyond that useful in a Legacy Extended IRI, for example byte order identification,
     1163            annotation, and replacements for unknown characters and objects.
     1164            Their use and interpretation in a Legacy Extended IRI
     1165            serves no purpose and may lead to confusing display variations.</t>
     1166          <t>Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-10FFFD):
     1167            Display and interpretation of these code points is by definition
     1168            undefined without private agreement. Therefore, these code points
     1169            are not suited for use on the Internet. They are not interoperable and may have
     1170            unpredictable effects.</t>
     1171          <t>Tags (U+E0000-E0FFF): These characters provide a way to language tag in Unicode plain text.
     1172            They are not appropriate for Legacy Extended IRIs because language information
     1173            in identifiers cannot reliably be input, transmitted
     1174            (e.g. on a visual medium such as paper), or recognized.</t>
     1175          <t>Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF, U+3FFFE-3FFFF,
     1176            U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF, U+7FFFE-7FFFF, U+8FFFE-8FFFF,
     1177            U+9FFFE-9FFFF, U+AFFFE-AFFFF, U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF,
     1178            U+EFFFE-EFFFF, U+FFFFE-FFFFF, U+10FFFE-10FFFF):
     1179            These code points are defined as non-characters. Applications may use
     1180            some of them internally, but are not prepared to interchange them.</t>
     1181        </list>
     1182      </t>
     1183      <t>For reference, we here also list the code points and code units
     1184        not even allowed in Legacy Extended IRIs:
     1185        <list>
     1186          <t>Surrogate code units (D800-DFFF):
     1187          These do not represent Unicode codepoints.</t>
     1188        </list>
     1189      </t>
     1190    </section>
     1191  </section>
     1192 
    12061193<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
    12071194
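
The conversion rule in the added "Conversion of Legacy Extended IRIs to IRIs" section can be illustrated with a short Python sketch. It is not part of the changeset and is not a reference implementation. It assumes the ucschar ranges of RFC 3987 (the draft's own ABNF is not shown in this diff), treats the text between the first "?" and the first "#" as the query part, and percent-encodes the UTF-8 octets of each character as uppercase %HH. The names leiri_to_iri, UCSCHAR_RANGES, and the other helpers are hypothetical.

```python
# Sketch only: ranges for ucschar are assumed from RFC 3987, not from this diff.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]  # planes 1 to 13
    + [(0xE1000, 0xEFFFD)]
)
# iprivate: private use code points, allowed in IRIs only in the query part.
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
# Extra characters admitted by the LEIRI redefinition of ucschar in the diff.
LEIRI_ASCII_EXTRAS = set(' <>"{}|\\^`')
LEIRI_RANGES = [(0x00, 0x1F), (0x7F, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
# Bidi formatting characters: inside the ucschar ranges, but listed in the
# "notAllowed" section, so this sketch encodes them as well.
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def needs_encoding(ch, in_query):
    """True if ch is allowed in a LEIRI but not in an IRI at this position."""
    cp = ord(ch)
    if not (ch in LEIRI_ASCII_EXTRAS or _in_ranges(cp, LEIRI_RANGES)):
        return False  # ordinary ASCII, or not a LEIRI character at all
    if cp in BIDI_FORMATTING:
        return True
    if in_query and _in_ranges(cp, IPRIVATE_RANGES):
        return False  # private use code points may stay in the query part
    return not _in_ranges(cp, UCSCHAR_RANGES)


def leiri_to_iri(leiri):
    """Percent-encode every LEIRI-only character as UTF-8 octets (%HH)."""
    hash_pos = leiri.find("#")
    end = len(leiri) if hash_pos < 0 else hash_pos
    q_pos = leiri.find("?", 0, end)
    query_start = end if q_pos < 0 else q_pos
    out = []
    for i, ch in enumerate(leiri):
        if needs_encoding(ch, query_start < i < end):
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(leiri_to_iri("http://example.com/a b<c>"))
```

Characters already valid in IRIs, such as U+00E9, pass through unchanged, as does an existing "%". Surrogate code units are outside the LEIRI ranges, so the sketch leaves them alone rather than rejecting them; a stricter processor would likely report an error.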