Changeset 71


Timestamp:
Oct 20, 2011, 1:46:18 PM
Author:
duerst@…
Message:

moved bidi stuff and comparison stuff to separate drafts

File:
1 edited

  • draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml

    r70 r71  
    11<?xml version="1.0"?>
    22<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    3 <!ENTITY rfc1738 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1738.xml">
    43<!ENTITY rfc2045 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2045.xml">
    54<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
     
    3332<?rfc compact='yes'?>
    3433<?rfc subcompact='no'?>
    35 <rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-06" category="std" xml:lang="en" obsoletes="3987">
     34<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-07" category="std" xml:lang="en" obsoletes="3987">
    3635<front>
    3736<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>
     
    9594<keyword>UTF-8</keyword>
    9695<keyword>URI</keyword>
    97 <keyword>URL</keyword>
    9896<keyword>IDN</keyword>
    9997<keyword>LEIRI</keyword>
     
    123121</abstract>
    124122  <note title='RFC Editor: Please remove the next paragraph before publication.'>
    125     <t>This document is intended to update RFC 3987 and move towards IETF
    126     Draft Standard.  For discussion and comments on this
    127     draft, please join the IETF IRI WG by subscribing to the mailing
    128     list public-iri@w3.org. For a list of open issues, please see
     123    <t>This document and several companion documents are intended to obsolete RFC 3987,
     124    and also move towards IETF Draft Standard.  For discussion and comments on these
     125    drafts, please join the IETF IRI WG by subscribing to the mailing
     126    list public-iri@w3.org, archives at http://lists.w3.org/archives/public/public-iri/.
     127    For a list of open issues, please see
    129128    the issue tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1.
    130129    For a list of individual edits, please see the change history at
     
    178177
    179178<t>Using characters outside of A - Z in IRIs adds a number of
    180 difficulties. <xref target="Bidi"/> discusses the special case of
    181 bidirectional IRIs using characters from scripts written
    182 right-to-left.  <xref target="IRIuse"/> discusses the use
     179difficulties. <xref target="IRIuse"/> discusses the use
    183180of IRIs in different situations.  <xref target="guidelines"/> gives
    184181additional informative guidelines.  <xref target="security"/>
    185182discusses IRI-specific security considerations.</t>
    186183
    187   <t>When originally defining IRIs, several design alternatives were considered.
     184<t>
     185<xref target="Bidi"/> discusses the special case of
     186bidirectional IRIs using characters from scripts written
     187right-to-left.
     188<xref target="Equivalence"/> gives guidelines for applications wishing
     189to determine if two IRIs are equivalent, as well as defining
     190some equivalence methods.
     191<xref target="RFC4395bis"/> updates the URI scheme registration
     192guidelines and procedures to note that every URI scheme is also
     193automatically an IRI scheme and to allow scheme definitions
     194to be directly described in terms of Unicode characters.
     195</t>
     196
     197 <t>When originally defining IRIs, several design alternatives were considered.
    188198    Historically interested readers can find an overview in Appendix A of <xref target="RFC3987"/>.
    189199  For some additional background on the design of URIs and IRIs, please also see
     
    195205
    196206<t>IRIs are designed to allow protocols and software that deal with
    197 URIs to be updated to handle IRIs. A "URI scheme" (as defined by <xref
    198 target="RFC3986"/> and registered through the IANA process defined in
    199 <xref target="RFC4395bis"/> also serves as an "IRI scheme". Processing of
     207URIs to be updated to handle IRIs. Processing of
    200208IRIs is accomplished by extending the URI syntax while retaining (and
    201209not expanding) the set of "reserved" characters, such that the syntax
     
    288296    URIs.</t>
    289297   
    290 <t hangText="URL:">The term "URL" was originally used <xref
    291    target="RFC1738"/> for roughly what is now called a "URI".  Books,
    292    software and documentation often refer to URIs and IRIs using the
    293    "URL" term. Some usages restrict "URL" to those URIs which are not
    294    URNs. Because of the ambiguity of the term, using the term "URL" is
    295    NOT RECOMMENDED in formal documents.</t>
    296 
    297298<t hangText="LEIRI (Legacy Extended IRI) processing:">  This term was used in
    298299   various XML specifications to refer
     
    300301   the processing rules in <xref target="LEIRIspec" />.</t>
    301302
    302 <t hangText="(Web Address, Hypertext Reference, HREF):"> These terms have been
    303    added in this document for convenience, to allow other
    304    specifications to refer to those strings that, although not valid
    305    IRIs, are acceptable input to the processing rules in <xref
    306    target="webaddress"/>. This usage corresponds to the parsing rules
    307    of some popular web browsing applications.
    308    ISSUE: Need to find a good name/abbreviation for these.</t>
    309    
    310303<t hangText="running text:">Human text (paragraphs, sentences,
    311304   phrases) with syntax according to orthographic conventions of a
     
    361354
    362355<t>To represent characters outside US-ASCII in examples, this document
    363 uses two notations: 'XML Notation' and 'Bidi Notation'.</t>
     356uses 'XML Notation'.</t>
    364357
    365358<t>XML Notation uses a leading '&amp;#x', a trailing ';', and the
     
    367360example, &amp;#x44F; stands for CYRILLIC CAPITAL LETTER YA. In this
    368361notation, an actual '&amp;' is denoted by '&amp;amp;'.</t>
    369 
    370 <t>Bidi Notation is used for bidirectional examples: Lower case
    371 letters stand for Latin letters or other letters that are written left
    372 to right, whereas upper case letters represent Arabic or Hebrew
    373 letters that are written right to left.</t>
    374362
    375363<t>To denote actual octets in examples (as opposed to percent-encoded
     
    626614given); the result is a set of parsed IRI components.</t>
    627615
    628 <t>NOTE: The result of parsing into components will correspond
    629 to substrings of the IRI that may be accessible via an API.
    630 For example, in <xref target="HTML5"/>, the protocol
    631 components of interest are SCHEME (scheme), HOST (ireg-name), PORT
    632 (port), the PATH (ipath after the initial "/"), QUERY (iquery),
    633 FRAGMENT (ifragment), and AUTHORITY (iauthority).
    634 </t>
    635 
    636 <t>Subsequent processing rules are sometimes used to define other
    637 syntactic components. For example, <xref target="HTML5"/> defines APIs
    638 for IRI processing; in these APIs:
    639 
    640 <list style="hanging">
    641 <t hangText="HOSTSPECIFIC"> the substring that follows
    642 the substring matched by the iauthority production, or the whole
    643 string if the iauthority production wasn't matched.</t>
    644 <t hangText="HOSTPORT"> if there is a scheme component and a port
    645 component and the port given by the port component is different than
    646 the default port defined for the protocol given by the scheme
    647 component, then HOSTPORT is the substring that starts with the
    648 substring matched by the host production and ends with the substring
    649 matched by the port production, and includes the colon in between the
    650 two. Otherwise, it is the same as the host component.
    651 </t>
    652 </list>
    653 </t>
    654616</section> <!-- parse -->
    655617
     
    699661    a particular registered name lookup technology. For further background,
    700662    see <xref target="RFC6055"/> and <xref target="Gettys"/>.</t>
    701 </section>
     663</section> <!-- dnspercent -->
    702664<section title="Mapping using Punycode" anchor='dnspunycode'>
    703665  <t>The ireg-name component MAY also be converted as follows:</t>
     
    718680  <t>This conversion for ireg-name will be better able to deal with legacy
    719681    infrastructure that cannot handle percent-encoding in domain names.</t>
    720 </section>
     682</section> <!-- dnspunycode -->
    721683  <section title="Additional Considerations">
     684
    722685<t><list style="hanging">
    723 
    724686<t hangText="Note:">Domain Names may appear in parts of an IRI other
    725687than the ireg-name part.  It is the responsibility of scheme-specific
     
    749711
    750712</list></t>
    751 </section>
     713</section> <!-- additional -->
    752714</section> <!-- dnsmapping -->
    753715
    754716<section title="Mapping query components" anchor="querymapping">
    755717
    756 <t>((NOTE: SEE ISSUES LIST))
    757 
    758 For compatibility with existing deployed HTTP infrastructure,
     718<t>For compatibility with existing deployed HTTP infrastructure,
    759719the following special case applies for schemes "http" and "https"
    760720and IRIs whose origin has a document charset other than one which
     
    769729<section title="Mapping IRIs to URIs" anchor="mapping">
    770730
    771 <t>The canonical mapping from a IRI to URI is defined by applying the
     731<t>The mapping from an IRI to URI is accomplished by applying the
    772732mapping above (from IRI to URI components) and then reassembling a URI
    773733from the parsed URI components using the original punctuation that
     
    813773
    814774<t hangText="3.">The conversion may result in a character that is not
    815     appropriate in an IRI. See <xref target="abnf"/>, <xref target="visual"/>,
     775    appropriate in an IRI. See <xref target="abnf"/>,
    816776      and <xref target="limitations"/> for further details.</t>
    817777
     
    839799<t hangText="4.">Re-percent-encode all octets produced in step 3 that
    840800      in UTF-8 represent characters that are not appropriate according
    841       to <xref target="abnf"/>, <xref target="visual"/>, and <xref
     801      to <xref target="abnf"/>  and <xref
    842802      target="limitations"/>.</t>
    843803
     
    909869
    910870<t>The following example contains "%e2%80%ae", which is the percent-encoded<vspace/>UTF-8
    911 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. <xref target="visual"/>
    912 forbids the direct use of this character in an IRI. Therefore, the
     871character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
     872The direct use of this character is forbidden in an IRI. Therefore, the
    913873corresponding octets are re-percent-encoded in step 4. This example shows
    914874that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved.
     
    931891</section> <!-- URItoIRI -->
    932892</section> <!-- processing -->
    933 <section title="Bidirectional IRIs for Right-to-Left Languages" anchor="Bidi">
    934 
    935 <t>Some UCS characters, such as those used in the Arabic and Hebrew
    936 scripts, have an inherent right-to-left (rtl) writing direction. IRIs
    937 containing these characters (called bidirectional IRIs or Bidi IRIs)
    938 require additional attention because of the non-trivial relation
    939 between logical representation (used for digital representation and
    940 for reading/spelling) and visual representation (used for
    941 display/printing).</t>
    942 
    943 <t>Because of the complex interaction between the logical representation,
    944 the visual representation, and the syntax of a Bidi IRI, a balance is
    945 needed between various requirements.
    946 The main requirements are<list style="hanging">
    947 <t hangText="1.">user-predictable conversion between visual and
    948     logical representation;</t>
    949 <t hangText="2.">the ability to include a wide range of characters
    950     in various parts of the IRI; and</t>
    951 <t hangText="3.">minor or no changes or restrictions for
    952       implementations.</t>
    953 </list></t>
    954 
    955 <section title="Logical Storage and Visual Presentation" anchor="visual">
    956 
    957 <t>When stored or transmitted in digital representation, bidirectional
    958 IRIs MUST be in full logical order and MUST conform to the IRI syntax
    959 rules (which includes the rules relevant to their scheme). This
    960 ensures that bidirectional IRIs can be processed in the same way as
    961 other IRIs.</t> <t>Bidirectional IRIs MUST be rendered by using the
    962 Unicode Bidirectional Algorithm <xref target="UNIV6"/>, <xref
    963 target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
    964 as they would be if they were in a left-to-right embedding; i.e., as
    965 if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
    966 followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
    967 embedding direction can also be done in a higher-level protocol (e.g.,
    968 the dir='ltr' attribute in HTML).</t>
    969 
    970 <t>There is no requirement to use the above embedding if the display
    971 is still the same without the embedding. For example, a bidirectional
    972 IRI in a text with left-to-right base directionality (such as used for
    973 English or Cyrillic) that is preceded and followed by whitespace and
    974 strong left-to-right characters does not need an embedding.  Also, a
    975 bidirectional relative IRI reference that only contains strong
    976 right-to-left characters and weak characters and that starts and ends
    977 with a strong right-to-left character and appears in a text with
    978 right-to-left base directionality (such as used for Arabic or Hebrew)
    979 and is preceded and followed by whitespace and strong characters does
    980 not need an embedding.</t>
    981 
    982 <t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
    983 sufficient to force the correct display behavior.  However, the
    984 details of the Unicode Bidirectional algorithm are not always easy to
    985 understand. Implementers are strongly advised to err on the side of
    986 caution and to use embedding in all cases where they are not
    987 completely sure that the display behavior is unaffected without the
    988 embedding.</t>
    989 
    990 <t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
    991 4.3) permits higher-level protocols to influence bidirectional
    992 rendering. Such changes by higher-level protocols MUST NOT be used if
    993 they change the rendering of IRIs.</t>
    994 
    995 <t>The bidirectional formatting characters that may be used before or
    996 after the IRI to ensure correct display are not themselves part of the
    997 IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
    998 RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
    999 the IRI but do not appear themselves. It would therefore not be
    1000 possible to input an IRI with such characters correctly.</t>
    1001 
    1002 </section> <!-- visual -->
    1003 <section title="Bidi IRI Structure" anchor="bidi-structure">
    1004 
    1005 <t>The Unicode Bidirectional Algorithm is designed mainly for running
    1006 text.  To make sure that it does not affect the rendering of
    1007 bidirectional IRIs too much, some restrictions on bidirectional IRIs
    1008 are necessary. These restrictions are given in terms of delimiters
    1009 (structural characters, mostly punctuation such as "@", ".", ":",
    1010 and<vspace/>"/") and components (usually consisting mostly of letters
    1011 and digits).</t>
    1012 
    1013 <t>The following syntax rules from <xref target="abnf"/> correspond to
    1014 components for the purpose of Bidi behavior: iuserinfo, ireg-name,
    1015 isegment, isegment-nz, isegment-nz-nc, ireg-name, iquery, and
    1016 ifragment.</t>
    1017 
    1018 <t>Specifications that define the syntax of any of the above
    1019 components MAY divide them further and define smaller parts to be
    1020 components according to this document. As an example, the restrictions
    1021 of <xref target="RFC3490"/> on bidirectional domain names correspond
    1022 to treating each label of a domain name as a component for schemes
    1023 with ireg-name as a domain name.  Even where the components are not
    1024 defined formally, it may be helpful to think about some syntax in
    1025 terms of components and to apply the relevant restrictions.  For
    1026 example, for the usual name/value syntax in query parts, it is
    1027 convenient to treat each name and each value as a component. As
    1028 another example, the extensions in a resource name can be treated as
    1029 separate components.</t>
    1030 
    1031 <t>For each component, the following restrictions apply:</t>
    1032 <t>
    1033 <list style="hanging">
    1034 
    1035 <t hangText="1.">A component SHOULD NOT use both right-to-left and
    1036   left-to-right characters.</t>
    1037 
    1038 <t hangText="2.">A component using right-to-left characters SHOULD
    1039   start and end with right-to-left characters.</t>
    1040 
    1041 </list></t>
    1042 
    1043 <t>The above restrictions are given as "SHOULD"s, rather than as
    1044 "MUST"s.  For IRIs that are never presented visually, they are not
    1045 relevant.  However, for IRIs in general, they are very important to
    1046 ensure consistent conversion between visual presentation and logical
    1047 representation, in both directions.</t>
    1048 
    1049 <t><list style="hanging">
    1050 
    1051 <t hangText="Note:">In some components, the above restrictions may
    1052   actually be strictly enforced.  For example, <xref
    1053   target="RFC3490"></xref> requires that these restrictions apply to
    1054   the labels of a host name for those schemes where ireg-name is a
    1055   host name.  In some other components (for example, path components)
    1056   following these restrictions may not be too difficult.  For other
    1057   components, such as parts of the query part, it may be very
    1058   difficult to enforce the restrictions because the values of query
    1059   parameters may be arbitrary character sequences.</t>
    1060 
    1061 </list></t>
    1062 
    1063 <t>If the above restrictions cannot be satisfied otherwise, the
    1064 affected component can always be mapped to URI notation as described
    1065 in <xref target="compmapping"/>. Please note that the whole component
    1066 has to be mapped (see also Example 9 below).</t>
    1067 
    1068 </section> <!-- bidi-structure -->
    1069 
    1070 <section title="Input of Bidi IRIs" anchor="bidiInput">
    1071 
    1072 <t>Bidi input methods MUST generate Bidi IRIs in logical order while
    1073 rendering them according to <xref target="visual"/>.  During input,
    1074 rendering SHOULD be updated after every new character is input to
    1075 avoid end-user confusion.</t>
    1076 
    1077 </section> <!-- bidiInput -->
    1078 
    1079 <section title="Examples">
    1080 
    1081 <t>This section gives examples of bidirectional IRIs, in Bidi
    1082 Notation.  It shows legal IRIs with the relationship between logical
    1083 and visual representation and explains how certain phenomena in this
    1084 relationship may look strange to somebody not familiar with
    1085 bidirectional behavior, but familiar to users of Arabic and Hebrew. It
    1086 also shows what happens if the restrictions given in <xref
    1087 target="bidi-structure"/> are not followed. The examples below can be
    1088 seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
    1089 variants.</t>
    1090 
    1091 <t>To read the bidi text in the examples, read the visual
    1092 representation from left to right until you encounter a block of rtl
    1093 text. Read the rtl block (including slashes and other special
    1094 characters) from right to left, then continue at the next unread ltr
    1095 character.</t>
    1096 
    1097 <t>Example 1: A single component with rtl characters is inverted:
    1098 <vspace/>Logical representation:
    1099 "http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
    1100 "http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
    1101 one by one, and each component can be read in its natural
    1102 direction.</t>
    1103 
    1104 <t>Example 2: More than one consecutive component with rtl characters
    1105 is inverted as a whole: <vspace/>Logical representation:
    1106 "http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
    1107 "http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
    1108 components is read rtl, in the same way as a sequence of rtl words is
    1109 read rtl in a bidi text.</t>
    1110 
    1111 <t>Example 3: All components of an IRI (except for the scheme) are
    1112 rtl.  All rtl components are inverted overall: <vspace/>Logical
    1113 representation:
    1114 "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
    1115 representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
    1116 whole IRI (except the scheme) is read rtl. Delimiters between rtl
    1117 components stay between the respective components; delimiters between
    1118 ltr and rtl components don't move.</t>
    1119 
    1120 <t>Example 4: Each of several sequences of rtl components is inverted
    1121 on its own: <vspace/>Logical representation:
    1122 "http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
    1123 "http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
    1124 components is read rtl, in the same way as each sequence of rtl words
    1125 in an ltr text is read rtl.</t>
    1126 
    1127 <t>Example 5: Example 2, applied to components of different kinds:
    1128 <vspace/>Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
    1129 <vspace/>Visual representation:
    1130 "http://ab.cd.HG/FE/ij/kl.html"<vspace/> The inversion of the domain
    1131 name label and the path component may be unexpected, but it is
    1132 consistent with other bidi behavior.  For reassurance that the domain
    1133 component really is "ab.cd.EF", it may be helpful to read aloud the
    1134 visual representation following the bidi algorithm. After
    1135 "http://ab.cd." one reads the RTL block "E-F-slash-G-H", which
    1136 corresponds to the logical representation.
    1137 </t>
    1138 
    1139 <t>Example 6: Same as Example 5, with more rtl components:
    1140 <vspace/>Logical representation:
    1141 "http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
    1142 "http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
    1143 name labels and the path components may be easier to identify because
    1144 the delimiters also move.</t>
    1145 
    1146 <t>Example 7: A single rtl component includes digits: <vspace/>Logical
    1147 representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
    1148 representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
    1149 Numbers are written ltr in all cases but are treated as an additional
    1150 embedding inside a run of rtl characters. This is completely
    1151 consistent with usual bidirectional text.</t>
    1152 
    1153 <t>Example 8 (not allowed): Numbers are at the start or end of an rtl
    1154 component:<vspace/>Logical representation:
    1155 "http://ab.cd.ef/GH1/2IJ/KL.html"<vspace/>Visual representation:
    1156 "http://ab.cd.ef/LK/JI1/2HG.html"<vspace/> The sequence "1/2" is
    1157 interpreted by the bidi algorithm as a fraction, fragmenting the
    1158 components and leading to confusion. There are other characters that
    1159 are interpreted in a special way close to numbers; in particular, "+",
    1160 "-", "#", "$", "%", ",", ".", and ":".</t>
    1161 
    1162 <t>Example 9 (not allowed): The numbers in the previous example are
    1163 percent-encoded: <vspace/>Logical representation:
    1164 "http://ab.cd.ef/GH%31/%32IJ/KL.html",<vspace/>Visual representation:
    1165 "http://ab.cd.ef/LK/JI%32/%31HG.html"</t>
    1166 
    1167 <t>Example 10 (allowed but not recommended): <vspace/>Logical
    1168 representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
    1169 representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
    1170 Components consisting of only numbers are allowed (it would be rather
    1171 difficult to prohibit them), but these may interact with adjacent RTL
    1172 components in ways that are not easy to predict.</t>
    1173 
    1174 <t>Example 11 (allowed but not recommended): <vspace/>Logical
    1175 representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
    1176 representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
    1177 Components consisting of numbers and left-to-right characters are
    1178 allowed, but these may interact with adjacent RTL components in ways
    1179 that are not easy to predict.</t>
    1180 </section><!-- examples -->
    1181 </section><!-- bidi -->
     893
    1182894
    1183895<section title="Use of IRIs" anchor="IRIuse">
     
    1186898
    1187899<t>This section discusses limitations on characters and character
    1188 sequences usable for IRIs beyond those given in <xref target="abnf"/>
    1189 and <xref target="visual"/>. The considerations in this section are
     900sequences usable for IRIs beyond those given in <xref target="abnf"/>.
     901The considerations in this section are
    1190902relevant when IRIs are created and when URIs are converted to
    1191903IRIs.</t>
     
    13551067<section title="Liberal Handling of Otherwise Invalid IRIs" anchor="LEIRIHREF">
    13561068
    1357 <t>(EDITOR NOTE: This Section may move to an appendix.)
    1358  
     1069<t>
    13591070Some technical specifications and widely-deployed software have
    13601071allowed additional variations and extensions of IRIs to be used in
    1361 syntactic components. This section describes two widely-used
    1362 preprocessing agreements. Other technical specifications may wish to
    1363 reference a syntactic component which is "a valid IRI or a string that
    1364 will map to a valid IRI after this preprocessing algorithm". These two
    1365 variants are known as <xref target="LEIRI">Legacy Extended IRI or
    1366 LEIRI</xref>, and <xref target="HTML5">Web Address</xref>).
    1367 </t>
    1368 
     1072syntactic components. </t>
    13691073<t>Future technical specifications SHOULD NOT allow conforming
    13701074producers to produce, or conforming content to contain, such forms,
     
    13861090  Among other extensions, processors based on this specification also
    13871091  did not enforce the restriction on bidirectional formatting
    1388   characters in <xref target="visual"></xref>, and the iprivate
     1092  characters in <xref target="Bidi"></xref>, and the iprivate
    13891093  production becomes redundant.</postamble>
    13901094</figure>
     
    13941098using <xref target="compmapping"/>.</t>
    13951099</section> <!-- leiriproc -->
    1396 
    1397 <section title="Web Address Processing" anchor="webaddress">
    1398 
    1399 <t>Many popular web browsers have taken the approach of being quite
    1400 liberal in what is accepted as a "URL" or its relative
    1401 forms. This section describes their behavior in terms of a preprocessor
    1402 which maps strings into the IRI space for subsequent parsing and
    1403 interpretation as an IRI.</t>
    1404 
    1405 <t>In some situations, it might be appropriate to describe the syntax
    1406 that a liberal consumer implementation might accept as a "Web
    1407 Address" or "Hypertext Reference" or "HREF". However,
    1408 technical specifications SHOULD restrict the syntactic form allowed by compliant producers
    1409 to the IRI or IRI reference syntax defined in this document
    1410 even if they want to mandate this processing.</t>
    1411 
    1412 <t>
    1413 Summary:
    1414 <list style="symbols">
    1415    <t>Leading and trailing whitespace is removed.</t>
    1416    <t>Some additional characters are removed.</t>
    1417    <t>Some additional characters are allowed and escaped (as with LEIRI).</t>
    1418    <t>If interpreting an IRI as a URI, the pct-encoding of the query
    1419    component of the parsed URI component depends on operational
    1420    context.</t>
    1421 </list>
    1422 </t>
    1423 
    1424 <t>Each string provided may have an associated charset (called
    1425 the HREF-charset here); this defaults to UTF-8.
    1426 For web browsers interpreting HTML, the document
    1427 charset of a string is determined:
    1428 
    1429 <list style="hanging">
    1430 <t hangText="If the string came from a script (e.g. as an argument to
    1431  a method)">The HRef-charset is the script's charset.</t>
    1432 
    1433 <t hangText="If the string came from a DOM node (e.g. from an
    1434   element)">The node has a Document, and the HRef-charset is the
    1435   Document's character encoding.</t>
    1436 
    1437 <t hangText="If the string had a HRef-charset defined when the string was
    1438 created or defined">The HRef-charset is as defined.</t>
    1439 
    1440 </list></t>
    1441 
    1442 <t>If the resulting HRef-charset is a Unicode-based character encoding
    1443 (e.g., UTF-16), then use UTF-8 instead.</t>
    1444 
    1445 
    1446 <figure>
    1447 <preamble>The syntax for Web Addresses is obtained by replacing the 'ucschar',
    1448   pct-form, path-sep, and ifragment rules with the href-ucschar, href-pct-form, href-path-sep,
    1449   and href-ifragment
    1450   rules below. In addition, some characters are stripped.</preamble>
    1451 
    1452 <artwork type='abnf'>
    1453   href-ucschar   = " " / "&lt;" / "&gt;" / DQUOTE / "{" / "}" / "|"
    1454                  / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
    1455                  / %xE000-FFFD / %x10000-10FFFF
    1456   href-pct-form  = pct-encoded / "%"
    1457   href-path-sep  = "/" / "\"
    1458   href-ifragment = *( ipchar / "/" / "?" / "#" )  ; adding "#"
    1459   href-strip     = &lt;to be done&gt;
    1460 </artwork>
    1461 
    1462 <postamble>
    1463 (NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT SENTENCE)
    1464 browsers did not enforce the restriction on bidirectional formatting
    1465   characters in <xref target="visual"></xref>, and the iprivate
    1466   production becomes redundant.</postamble>
    1467 </figure>
    1468 
    1469 <t>'Web Address processing' requires the following additional
    1470 preprocessing steps:
    1471 
    1472 <list style="numbers">
    1473 
    1474 <t>Leading and trailing instances of space (U+0020),
    1475 CR (U+000D), LF (U+000A), and TAB (U+0009) characters are removed.</t>
    1476 
    1477 <t>strip all characters in href-strip.</t>
    1478   <t>Percent-encode all characters in href-ucschar not in ucschar.</t>
    1479   <t>Replace occurrences of "%" not followed by two hexadecimal digits by "%25".</t>
    1480   <t>Convert backslashes ('\') matching href-path-sep to forward slashes ('/').</t>
    1481 </list></t>
    1482 </section> <!-- webaddress -->
     1100</section> <!-- LEIRIHREF -->
    14831101
    14841102<section title="Characters Not Allowed in IRIs" anchor="notAllowed">
    14851103
    14861104<t>This section provides a list of the groups of characters and code
    1487 points that are allowed by LEIRI or HREF but are not allowed in IRIs or are
     1105points that are allowed in some contexts but are not allowed in IRIs or are
    14881106allowed in IRIs only in the query part. For each group of characters,
    14891107advice on the usage of these characters is also given, concentrating
     
    15741192Unicode codepoints.</t></list></t>
    15751193</section> <!-- notallowed -->
    1576 </section> <!-- lieirihref -->
    1577  
     1194
    15781195<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
    15791196
     
    16041221perform this inverse conversion unless it is certain this can be done
    16051222correctly.</t>
    1606 </section>
     1223</section><!-- software interfaces -->
    16071224
    16081225<section title="URI/IRI Entry">
     
    16421259
    16431260<t>For the input of IRIs with right-to-left characters, please see
    1644 <xref target="bidiInput"></xref>.</t>
    1645 </section>
     1261<xref target="Bidi"></xref>.</t>
     1262</section><!-- entry -->
    16461263
    16471264<section title="URI/IRI Transfer between Applications">
     
    16681285Correctly used, the clipboard transfers characters, not octets, which
    16691286will do the right thing with IRIs.</t>
    1670 </section>
     1287</section><!-- transfer -->
    16711288
    16721289<section title="URI/IRI Generation">
     
    16931310<t>This recommendation particularly applies to HTTP servers. For FTP
    16941311servers, similar considerations apply; see <xref target="RFC2640"/>.</t>
    1695 </section>
     1312</section><!-- generation -->
    16961313
    16971314<section title="URI/IRI Selection" anchor="selection">
     
    17371354Greek, and Cyrillic, using lowercase letters results in fewer
    17381355ambiguities than using uppercase letters would.</t>
    1739 </section>
     1356</section><!-- selection -->
    17401357
    17411358<section title="Display of URIs/IRIs" anchor="display">
     
    17451362resources, these parts should be percent-encoded before being displayed.</t>
    17461363
    1747 <t>For display of Bidi IRIs, please see <xref target="visual"/>.</t>
    1748 </section>
     1364<t>For display of Bidi IRIs, please see <xref target="Bidi"/>.</t>
     1365</section> <!-- display -->
    17491366
    17501367<section title="Interpretation of URIs and IRIs">
     
    17791396Also, the regularity of UTF-8 (see <xref target="Duerst97"/>) makes the
    17801397potential for collisions lower than it may seem at first.</t>
    1781 </section>
     1398</section> <!-- interpretation -->
    17821399
    17831400<section title="Upgrading Strategy">
     
    18291446<xref target="UTF8use"/>.</t>
    18301447
    1831 </section>
     1448</section> <!-- upgrading -->
    18321449</section> <!-- guidelines -->
    18331450
     
    18771494  confusion include different forms of normalization and different normalization
    18781495  expectations, use of percent-encoding with various legacy encodings,
    1879   and bidirectionality issues. See also <xref target='UTR36'/>.</t>
     1496  and bidirectionality issues. See also <xref target="Bidi"/>.</t>
    18801497
    18811498<t>Confusion can occur in various IRI components, such as the
     
    18881505Details are discussed in <xref target="selection"/>.</t>
    18891506
    1890 <t>Confusion can occur with bidirectional IRIs, if the restrictions
    1891 in <xref target="bidi-structure"/> are not followed. The same visual
    1892 representation may be interpreted as different logical representations,
    1893 and vice versa. It is also very important that a correct Unicode bidirectional
    1894 implementation be used.</t>
    18951507  <t>The characters additionally allowed in Legacy Extended IRIs
    18961508    introduce additional security issues. For details, see <xref target='notAllowed'/>.</t>
     
    19001512<t>This document was derived from <xref target="RFC3987"/>; the acknowledgments from
    19011513that specification still apply.</t>
    1902 <t>We would like to thank Ian Hickson, Michael Sperberg-McQueen,
    1903   and Dan Connolly for their work on HyperText References, and Norman Walsh, Richard Tobin,
    1904   Henry S. Thompson, John Cowan, Paul Grosso, and the XML Core Working Group of the W3C for their work on LEIRIs.</t>
    1905 <t>In addition, this document was influenced by contributions from (in no particular order) Chris
     1514<t>In addition, this document was influenced by contributions from (in no particular order) Norman Walsh, Richard Tobin,
     1515  Henry S. Thompson, John Cowan, Paul Grosso, the XML Core Working Group of the W3C,
     1516 Chris
    19061517  Lilley, Bjoern Hoehrmann,
    19071518Felix Sasaki, Jeremy Carroll, Frank Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche,
     
    19141525Goland, Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas Milo, Murray Sargent,
    19151526Marc Blanchet, and Mykyta Yevstifeyev.</t>
    1916 </section>
     1527</section> <!-- Acknowledgements -->
    19171528
    19181529<section title="Main Changes Since RFC 3987">
    19191530  <t>This section describes the main changes since <xref target="RFC3987"></xref>.</t>
     1531  <section title="Split out Bidi, processing guidelines, comparison sections">
     1532    <t>Moved some components (comparison, bidi, processing) into separate documents.</t>
     1533  </section>
    19201534  <section title="Major restructuring of IRI processing model" anchor="forkChanges">
    19211535    <t>Major restructuring of IRI processing model to make scheme-specific translation
     
    20471661    <t>Added this Change Log Section.</t>
    20481662    <t>Added a section about "IRIs with Spaces/Controls" (converting from a Note in RFC 3987).</t></list></t>
    2049 </section>
     1663</section> <!-- -00 to -01 -->
    20501664<section title="Changes from RFC 3987 to -00 of draft-duerst-iri-bis">
    20511665  <t><list>
    20521666    <t>Fixed errata (see http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).</t></list></t>
    2053 </section>
     1667</section> <!-- from 3987 -->
    20541668</section>
    20551669</middle>
     
    21081722</reference>
    21091723
    2110 <reference anchor="UNI9" target="http://www.unicode.org/reports/tr9/tr9-13.html">
    2111 <front>
    2112 <title>The Bidirectional Algorithm</title>
    2113 <author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
    2114 <date year="2004" month="March"/>
    2115 </front>
    2116 <seriesInfo name="Unicode Standard Annex" value="#9"/>
    2117 </reference>
    2118 
    21191724<reference anchor="UTR15" target="http://www.unicode.org/unicode/reports/tr15/tr15-23.html">
    21201725<front>
     
    21301735
    21311736<references title="Informative References">
    2132 
    2133 <reference anchor="BidiEx" target="http://www.w3.org/International/iri-edit/BidiExamples">
    2134 <front>
    2135 <title>Examples of bidirectional IRIs</title>
    2136 <author><organization/></author>
    2137 <date year="" month=""/>
    2138 </front>
    2139 </reference>
    21401737
    21411738<reference anchor="CharMod" target="http://www.w3.org/TR/charmod-resid">
     
    22031800&rfc2397;
    22041801&rfc2616;
    2205 &rfc1738;
    22061802&rfc2640;
    22071803&rfc3987;
    22081804&rfc6055;
     1805
     1806<reference anchor='Bidi'>
     1807  <front>
     1808    <title>Guidelines for Internationalized Resource Identifiers with Bi-directional Characters (Bidi IRIs)</title>
     1809    <author initials="M." surname="Duerst"/>
     1810    <author initials='L.' surname='Masinter' />
     1811    <date year="2011" month="August" day="14" />
     1812  </front>
     1813  <seriesInfo name="Internet-Draft" value="draft-ietf-iri-bidi-guidelines-00"/>
     1814</reference>
     1815
     1816<reference anchor='Equivalence'>
     1817  <front>
     1818    <title>Equivalence and Canonicalization of Internationalized Resource Identifiers (IRIs)</title>
     1819    <author initials='L.' surname='Masinter' />
     1820    <author initials="M." surname="Duerst"/>
     1821    <date year="2011" month="August" day="13" />
     1822  </front>
     1823  <seriesInfo name="Internet-Draft" value="draft-ietf-iri-comparison-00"/>
     1824</reference>
     1825
    22091826<reference anchor='RFC4395bis'>
    22101827  <front>
     
    22131830    <author initials='T.' surname='Hardie' fullname="Ted Hardie"><organization/></author>
    22141831    <author initials='L.' surname='Masinter' fullname="Larry Masinter"><organization/></author>
    2215     <date year="2010" month='September' day="30"/>
     1832    <date year="2011" month='July' day="29"/>
    22161833    <workgroup>IRI</workgroup>
    22171834  </front>
    2218   <seriesInfo name="Internet-Draft" value="draft-hansen-iri-4395bis-irireg-00"/>
     1835  <seriesInfo name="Internet-Draft" value="draft-ietf-iri-4395bis-irireg-03"/>
    22191836</reference>
    22201837 
     
    23001917</reference>
    23011918
    2302 <reference anchor="HTML5" target="http://www.w3.org/TR/2009/WD-html5-20090423/">
    2303 <front>
    2304 <title>A vocabulary and associated APIs for HTML and XHTML</title>
    2305 <author initials="I." surname="Hickson" fullname="Ian Hickson"><organization>Google, Inc.</organization></author>
    2306 <author initials="D." surname="Hyatt" fullname="David Hyatt"><organization>Apple, Inc.</organization></author>
    2307 <date year="2009"  month="April" day="23"/>
    2308 </front>
    2309 <seriesInfo name="World Wide Web Consortium" value="Working Draft"/>
    2310 </reference>
    2311 
    23121919</references>
    23131920