Changeset 71

Oct 20, 2011, 1:46:18 PM

moved bidi stuff and comparison stuff to separate drafts

1 edited


  • draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml

    r70 r71  
    11<?xml version="1.0"?>
    22<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    3 <!ENTITY rfc1738 SYSTEM "">
    43<!ENTITY rfc2045 SYSTEM "">
    54<!ENTITY rfc2119 SYSTEM "">
    3332<?rfc compact='yes'?>
    3433<?rfc subcompact='no'?>
    35 <rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-06" category="std" xml:lang="en" obsoletes="3987">
     34<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-07" category="std" xml:lang="en" obsoletes="3987">
    3736<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>
    97 <keyword>URL</keyword>
    124122  <note title='RFC Editor: Please remove the next paragraph before publication.'>
    125     <t>This document is intended to update RFC 3987 and move towards IETF
    126     Draft Standard.  For discussion and comments on this
    127     draft, please join the IETF IRI WG by subscribing to the mailing
    128     list For a list of open issues, please see
     123    <t>This (and several companion documents) are intended to obsolete RFC 3987,
     124    and also move towards IETF Draft Standard.  For discussion and comments on these
     125    drafts, please join the IETF IRI WG by subscribing to the mailing
     126    list, archives at
     127    For a list of open issues, please see
    129128    the issue tracker of the WG at
    130129    For a list of individual edits, please see the change history at
    179178<t>Using characters outside of A - Z in IRIs adds a number of
    180 difficulties. <xref target="Bidi"/> discusses the special case of
    181 bidirectional IRIs using characters from scripts written
    182 right-to-left.  <xref target="IRIuse"/> discusses the use
     179difficulties. <xref target="IRIuse"/> discusses the use
    183180of IRIs in different situations.  <xref target="guidelines"/> gives
    184181additional informative guidelines.  <xref target="security"/>
    185182discusses IRI-specific security considerations.</t>
    187   <t>When originally defining IRIs, several design alternatives were considered.
     185<xref target="Bidi"/> discusses the special case of
     186bidirectional IRIs using characters from scripts written
     188<xref target="Equivalence"/> gives guidelines for applications wishing
     189to determine if two IRIs are equivalent, as well as defining
     190some equivalence methods.
     191<xref target="RFC4395bis"/> updates the URI scheme registration
     192guidelines and procedures to note that every URI scheme is also
     193automatically an IRI scheme and to allow scheme definitions
     194to be directly described in terms of Unicode characters.
     197 <t>When originally defining IRIs, several design alternatives were considered.
    188198    Historically interested readers can find an overview in Appendix A of <xref target="RFC3987"/>.
    189199  For some additional background on the design of URIs and IRIs, please also see
    196206<t>IRIs are designed to allow protocols and software that deal with
    197 URIs to be updated to handle IRIs. A "URI scheme" (as defined by <xref
    198 target="RFC3986"/> and registered through the IANA process defined in
    199 <xref target="RFC4395bis"/> also serves as an "IRI scheme". Processing of
     207URIs to be updated to handle IRIs. Processing of
    200208IRIs is accomplished by extending the URI syntax while retaining (and
    201209not expanding) the set of "reserved" characters, such that the syntax
    288296    URIs.</t>
    290 <t hangText="URL:">The term "URL" was originally used <xref
    291    target="RFC1738"/> for roughly what is now called a "URI".  Books,
    292    software and documentation often refer to URIs and IRIs using the
    293    "URL" term. Some usages restrict "URL" to those URIs which are not
    294    URNs. Because of the ambiguity of the term, using the term "URL" is
    295    NOT RECOMMENDED in formal documents.</t>
    297298<t hangText="LEIRI (Legacy Extended IRI) processing:">  This term was used in
    298299   various XML specifications to refer
    300301   the processing rules in <xref target="LEIRIspec" />.</t>
    302 <t hangText="(Web Address, Hypertext Reference, HREF):"> These terms have been
    303    added in this document for convenience, to allow other
    304    specifications to refer to those strings that, although not valid
    305    IRIs, are acceptable input to the processing rules in <xref
    306    target="webaddress"/>. This usage corresponds to the parsing rules
    307    of some popular web browsing applications.
    308    ISSUE: Need to find a good name/abbreviation for these.</t>
    310303<t hangText="running text:">Human text (paragraphs, sentences,
    311304   phrases) with syntax according to orthographic conventions of a
    362355<t>To represent characters outside US-ASCII in examples, this document
    363 uses two notations: 'XML Notation' and 'Bidi Notation'.</t>
     356uses 'XML Notation'.</t>
    365358<t>XML Notation uses a leading '&amp;#x', a trailing ';', and the
    367360example, &amp;#x44F; stands for CYRILLIC CAPITAL LETTER YA. In this
    368361notation, an actual '&amp;' is denoted by '&amp;amp;'.</t>
    370 <t>Bidi Notation is used for bidirectional examples: Lower case
    371 letters stand for Latin letters or other letters that are written left
    372 to right, whereas upper case letters represent Arabic or Hebrew
    373 letters that are written right to left.</t>
    375363<t>To denote actual octets in examples (as opposed to percent-encoded
    626614given); the result is a set of parsed IRI components.</t>
    628 <t>NOTE: The result of parsing into components will correspond
    629 to substrings of the IRI that may be accessible via an API.
    630 For example, in <xref target="HTML5"/>, the protocol
    631 components of interest are SCHEME (scheme), HOST (ireg-name), PORT
    632 (port), the PATH (ipath after the initial "/"), QUERY (iquery),
    633 FRAGMENT (ifragment), and AUTHORITY (iauthority).
    634 </t>
    636 <t>Subsequent processing rules are sometimes used to define other
    637 syntactic components. For example, <xref target="HTML5"/> defines APIs
    638 for IRI processing; in these APIs:
    640 <list style="hanging">
    641 <t hangText="HOSTSPECIFIC"> the substring that follows
    642 the substring matched by the iauthority production, or the whole
    643 string if the iauthority production wasn't matched.</t>
    644 <t hangText="HOSTPORT"> if there is a scheme component and a port
    645 component and the port given by the port component is different than
    646 the default port defined for the protocol given by the scheme
    647 component, then HOSTPORT is the substring that starts with the
    648 substring matched by the host production and ends with the substring
    649 matched by the port production, and includes the colon in between the
    650 two. Otherwise, it is the same as the host component.
    651 </t>
    652 </list>
    653 </t>
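Illustrative sketch of the HOSTPORT rule in the removed lines above, assuming the IRI has already been parsed into components. The function name and the DEFAULT_PORTS table are hypothetical; the draft only speaks of "the default port defined for the protocol given by the scheme component".

```python
from typing import Optional

# Hypothetical default-port table, used only for this illustration.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def hostport(scheme: Optional[str], host: str, port: Optional[int]) -> str:
    """Return HOSTPORT for already-parsed components (sketch)."""
    if scheme is not None and port is not None:
        if DEFAULT_PORTS.get(scheme.lower()) != port:
            # Non-default port: the host, the colon, and the port.
            return f"{host}:{port}"
    # No scheme, no port, or the default port: same as the host component.
    return host
```

A scheme that is missing from the table is treated here as having no default port, so an explicit port is always kept; that choice is an assumption of the sketch.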
    654616</section> <!-- parse -->
    699661    a particular registered name lookup technology. For further background,
    700662    see <xref target="RFC6055"/> and <xref target="Gettys"/>.</t>
    701 </section>
     663</section> <!-- dnspercent -->
    702664<section title="Mapping using Punycode" anchor='dnspunycode'>
    703665  <t>The ireg-name component MAY also be converted as follows:</t>
    718680  <t>This conversion for ireg-name will be better able to deal with legacy
    719681    infrastructure that cannot handle percent-encoding in domain names.</t>
    720 </section>
     682</section> <!-- punicode -->
    721683  <section title="Additional Considerations">
    722685<t><list style="hanging">
    724686<t hangText="Note:">Domain Names may appear in parts of an IRI other
    725687than the ireg-name part.  It is the responsibility of scheme-specific
    751 </section>
     713</section> <!-- additional -->
    752714</section> <!-- dnsmapping -->
    754716<section title="Mapping query components" anchor="querymapping">
    756 <t>((NOTE: SEE ISSUES LIST))
    758 For compatibility with existing deployed HTTP infrastructure,
     718<t>For compatibility with existing deployed HTTP infrastructure,
    759719the following special case applies for schemes "http" and "https"
    760720and IRIs whose origin has a document charset other than one which
    769729<section title="Mapping IRIs to URIs" anchor="mapping">
    771 <t>The canonical mapping from a IRI to URI is defined by applying the
     731<t>The mapping from an IRI to URI is accomplished by applying the
    772732mapping above (from IRI to URI components) and then reassembling a URI
    773733from the parsed URI components using the original punctuation that
    814774<t hangText="3.">The conversion may result in a character that is not
    815     appropriate in an IRI. See <xref target="abnf"/>, <xref target="visual"/>,
     775    appropriate in an IRI. See <xref target="abnf"/>,
    816776      and <xref target="limitations"/> for further details.</t>
    839799<t hangText="4.">Re-percent-encode all octets produced in step 3 that
    840800      in UTF-8 represent characters that are not appropriate according
    841       to <xref target="abnf"/>, <xref target="visual"/>, and <xref
     801      to <xref target="abnf"/>  and <xref
    842802      target="limitations"/>.</t>
    910870<t>The following example contains "%e2%80%ae", which is the percent-encoded<vspace/>UTF-8
    911 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. <xref target="visual"/>
    912 forbids the direct use of this character in an IRI. Therefore, the
     871character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE.
     872The direct use of this character is forbidden in an IRI. Therefore, the
    913873corresponding octets are re-percent-encoded in step 4. This example shows
    914874that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved.
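Illustrative sketch of the behavior described above, under simplifying assumptions: only runs of percent-encoded non-ASCII octets are decoded, a run that is not valid UTF-8 is left untouched as a whole, and the only "not appropriate" characters checked are the bidirectional formatting characters. The function names and the example.org URIs are hypothetical.

```python
import re

# Bidirectional formatting characters: LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_FORMATTING = {
    "\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
}

# A run of percent-encoded octets that all have the high bit set.
_HIGH_OCTETS = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _decode_run(match: re.Match) -> str:
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        text = octets.decode("utf-8")
    except UnicodeDecodeError:
        return run  # simplification: keep undecodable runs as they were
    out = []
    for ch in text:
        if ch in BIDI_FORMATTING:
            # Re-percent-encode; hex digits come out in uppercase.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


def uri_to_iri_sketch(uri: str) -> str:
    """Decode percent-encoded UTF-8, keeping bidi formatting characters encoded."""
    return _HIGH_OCTETS.sub(_decode_run, uri)
```

With this sketch, "http://example.org/%e2%80%ae" comes back as "http://example.org/%E2%80%AE": the octets stay percent-encoded, but the case of the hex digits changes, which is the effect the paragraph points out.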
    931891</section> <!-- URItoIRI -->
    932892</section> <!-- processing -->
    933 <section title="Bidirectional IRIs for Right-to-Left Languages" anchor="Bidi">
    935 <t>Some UCS characters, such as those used in the Arabic and Hebrew
    936 scripts, have an inherent right-to-left (rtl) writing direction. IRIs
    937 containing these characters (called bidirectional IRIs or Bidi IRIs)
    938 require additional attention because of the non-trivial relation
    939 between logical representation (used for digital representation and
    940 for reading/spelling) and visual representation (used for
    941 display/printing).</t>
    943 <t>Because of the complex interaction between the logical representation,
    944 the visual representation, and the syntax of a Bidi IRI, a balance is
    945 needed between various requirements.
    946 The main requirements are<list style="hanging">
    947 <t hangText="1.">user-predictable conversion between visual and
    948     logical representation;</t>
    949 <t hangText="2.">the ability to include a wide range of characters
    950     in various parts of the IRI; and</t>
    951 <t hangText="3.">minor or no changes or restrictions for
    952       implementations.</t>
    953 </list></t>
    955 <section title="Logical Storage and Visual Presentation" anchor="visual">
    957 <t>When stored or transmitted in digital representation, bidirectional
    958 IRIs MUST be in full logical order and MUST conform to the IRI syntax
    959 rules (which includes the rules relevant to their scheme). This
    960 ensures that bidirectional IRIs can be processed in the same way as
    961 other IRIs.</t> <t>Bidirectional IRIs MUST be rendered by using the
    962 Unicode Bidirectional Algorithm <xref target="UNIV6"/>, <xref
    963 target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
    964 as they would be if they were in a left-to-right embedding; i.e., as
    965 if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
    966 followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
    967 embedding direction can also be done in a higher-level protocol (e.g.,
    968 the dir='ltr' attribute in HTML).</t>
    970 <t>There is no requirement to use the above embedding if the display
    971 is still the same without the embedding. For example, a bidirectional
    972 IRI in a text with left-to-right base directionality (such as used for
    973 English or Cyrillic) that is preceded and followed by whitespace and
    974 strong left-to-right characters does not need an embedding.  Also, a
    975 bidirectional relative IRI reference that only contains strong
    976 right-to-left characters and weak characters and that starts and ends
    977 with a strong right-to-left character and appears in a text with
    978 right-to-left base directionality (such as used for Arabic or Hebrew)
    979 and is preceded and followed by whitespace and strong characters does
    980 not need an embedding.</t>
    982 <t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
    983 sufficient to force the correct display behavior.  However, the
    984 details of the Unicode Bidirectional algorithm are not always easy to
    985 understand. Implementers are strongly advised to err on the side of
    986 caution and to use embedding in all cases where they are not
    987 completely sure that the display behavior is unaffected without the
    988 embedding.</t>
    990 <t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
    991 4.3) permits higher-level protocols to influence bidirectional
    992 rendering. Such changes by higher-level protocols MUST NOT be used if
    993 they change the rendering of IRIs.</t>
    995 <t>The bidirectional formatting characters that may be used before or
    996 after the IRI to ensure correct display are not themselves part of the
    997 IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
    998 RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
    999 the IRI but do not appear themselves. It would therefore not be
    1000 possible to input an IRI with such characters correctly.</t>
    1002 </section> <!-- visual -->
    1003 <section title="Bidi IRI Structure" anchor="bidi-structure">
    1005 <t>The Unicode Bidirectional Algorithm is designed mainly for running
    1006 text.  To make sure that it does not affect the rendering of
    1007 bidirectional IRIs too much, some restrictions on bidirectional IRIs
    1008 are necessary. These restrictions are given in terms of delimiters
    1009 (structural characters, mostly punctuation such as "@", ".", ":",
    1010 and<vspace/>"/") and components (usually consisting mostly of letters
    1011 and digits).</t>
    1013 <t>The following syntax rules from <xref target="abnf"/> correspond to
    1014 components for the purpose of Bidi behavior: iuserinfo, ireg-name,
    1015 isegment, isegment-nz, isegment-nz-nc, ireg-name, iquery, and
    1016 ifragment.</t>
    1018 <t>Specifications that define the syntax of any of the above
    1019 components MAY divide them further and define smaller parts to be
    1020 components according to this document. As an example, the restrictions
    1021 of <xref target="RFC3490"/> on bidirectional domain names correspond
    1022 to treating each label of a domain name as a component for schemes
    1023 with ireg-name as a domain name.  Even where the components are not
    1024 defined formally, it may be helpful to think about some syntax in
    1025 terms of components and to apply the relevant restrictions.  For
    1026 example, for the usual name/value syntax in query parts, it is
    1027 convenient to treat each name and each value as a component. As
    1028 another example, the extensions in a resource name can be treated as
    1029 separate components.</t>
    1031 <t>For each component, the following restrictions apply:</t>
    1032 <t>
    1033 <list style="hanging">
    1035 <t hangText="1.">A component SHOULD NOT use both right-to-left and
    1036   left-to-right characters.</t>
    1038 <t hangText="2.">A component using right-to-left characters SHOULD
    1039   start and end with right-to-left characters.</t>
    1041 </list></t>
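Illustrative sketch of the two component restrictions above. It assumes that "right-to-left characters" means characters with Unicode bidi class R or AL and that "left-to-right characters" means class L; the function name is hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes (assumed reading)


def bidi_component_issues(component: str) -> list:
    """Return the restrictions (as text) that a component does not follow."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)
    issues = []
    # Restriction 1: do not mix right-to-left and left-to-right characters.
    if has_rtl and has_ltr:
        issues.append("mixes right-to-left and left-to-right characters")
    # Restriction 2: a right-to-left component starts and ends with rtl characters.
    if has_rtl and not (
        classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
    ):
        issues.append("does not start and end with a right-to-left character")
    return issues
```

Because both restrictions are SHOULDs, the sketch reports issues instead of rejecting the component.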
    1043 <t>The above restrictions are given as "SHOULD"s, rather than as
    1044 "MUST"s.  For IRIs that are never presented visually, they are not
    1045 relevant.  However, for IRIs in general, they are very important to
    1046 ensure consistent conversion between visual presentation and logical
    1047 representation, in both directions.</t>
    1049 <t><list style="hanging">
    1051 <t hangText="Note:">In some components, the above restrictions may
    1052   actually be strictly enforced.  For example, <xref
    1053   target="RFC3490"></xref> requires that these restrictions apply to
    1054   the labels of a host name for those schemes where ireg-name is a
    1055   host name.  In some other components (for example, path components)
    1056   following these restrictions may not be too difficult.  For other
    1057   components, such as parts of the query part, it may be very
    1058   difficult to enforce the restrictions because the values of query
    1059   parameters may be arbitrary character sequences.</t>
    1061 </list></t>
    1063 <t>If the above restrictions cannot be satisfied otherwise, the
    1064 affected component can always be mapped to URI notation as described
    1065 in <xref target="compmapping"/>. Please note that the whole component
    1066 has to be mapped (see also Example 9 below).</t>
    1068 </section> <!-- bidi-structure -->
    1070 <section title="Input of Bidi IRIs" anchor="bidiInput">
    1072 <t>Bidi input methods MUST generate Bidi IRIs in logical order while
    1073 rendering them according to <xref target="visual"/>.  During input,
    1074 rendering SHOULD be updated after every new character is input to
    1075 avoid end-user confusion.</t>
    1077 </section> <!-- bidiInput -->
    1079 <section title="Examples">
    1081 <t>This section gives examples of bidirectional IRIs, in Bidi
    1082 Notation.  It shows legal IRIs with the relationship between logical
    1083 and visual representation and explains how certain phenomena in this
    1084 relationship may look strange to somebody not familiar with
    1085 bidirectional behavior, but familiar to users of Arabic and Hebrew. It
    1086 also shows what happens if the restrictions given in <xref
    1087 target="bidi-structure"/> are not followed. The examples below can be
    1088 seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
    1089 variants.</t>
    1091 <t>To read the bidi text in the examples, read the visual
    1092 representation from left to right until you encounter a block of rtl
    1093 text. Read the rtl block (including slashes and other special
    1094 characters) from right to left, then continue at the next unread ltr
    1095 character.</t>
    1097 <t>Example 1: A single component with rtl characters is inverted:
    1098 <vspace/>Logical representation:
    1099 "http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
    1100 "http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
    1101 one by one, and each component can be read in its natural
    1102 direction.</t>
    1104 <t>Example 2: More than one consecutive component with rtl characters
    1105 is inverted as a whole: <vspace/>Logical representation:
    1106 "http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
    1107 "http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
    1108 components is read rtl, in the same way as a sequence of rtl words is
    1109 read rtl in a bidi text.</t>
    1111 <t>Example 3: All components of an IRI (except for the scheme) are
    1112 rtl.  All rtl components are inverted overall: <vspace/>Logical
    1113 representation:
    1114 "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
    1115 representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
    1116 whole IRI (except the scheme) is read rtl. Delimiters between rtl
    1117 components stay between the respective components; delimiters between
    1118 ltr and rtl components don't move.</t>
    1120 <t>Example 4: Each of several sequences of rtl components is inverted
    1121 on its own: <vspace/>Logical representation:
    1122 "http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
    1123 "http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
    1124 components is read rtl, in the same way as each sequence of rtl words
    1125 in an ltr text is read rtl.</t>
    1127 <t>Example 5: Example 2, applied to components of different kinds:
    1128 <vspace/>Logical representation: ""
    1129 <vspace/>Visual representation:
    1130 ""<vspace/> The inversion of the domain
    1131 name label and the path component may be unexpected, but it is
    1132 consistent with other bidi behavior.  For reassurance that the domain
    1133 component really is "", it may be helpful to read aloud the
    1134 visual representation following the bidi algorithm. After
    1135 "" one reads the RTL block "E-F-slash-G-H", which
    1136 corresponds to the logical representation.
    1137 </t>
    1139 <t>Example 6: Same as Example 5, with more rtl components:
    1140 <vspace/>Logical representation:
    1141 "http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
    1142 "http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
    1143 name labels and the path components may be easier to identify because
    1144 the delimiters also move.</t>
    1146 <t>Example 7: A single rtl component includes digits: <vspace/>Logical
    1147 representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
    1148 representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
    1149 Numbers are written ltr in all cases but are treated as an additional
    1150 embedding inside a run of rtl characters. This is completely
    1151 consistent with usual bidirectional text.</t>
    1153 <t>Example 8 (not allowed): Numbers are at the start or end of an rtl
    1154 component:<vspace/>Logical representation:
    1155 ""<vspace/>Visual representation:
    1156 ""<vspace/> The sequence "1/2" is
    1157 interpreted by the bidi algorithm as a fraction, fragmenting the
    1158 components and leading to confusion. There are other characters that
    1159 are interpreted in a special way close to numbers; in particular, "+",
    1160 "-", "#", "$", "%", ",", ".", and ":".</t>
    1162 <t>Example 9 (not allowed): The numbers in the previous example are
    1163 percent-encoded: <vspace/>Logical representation:
    1164 "",<vspace/>Visual representation:
    1165 ""</t>
    1167 <t>Example 10 (allowed but not recommended): <vspace/>Logical
    1168 representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
    1169 representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
    1170 Components consisting of only numbers are allowed (it would be rather
    1171 difficult to prohibit them), but these may interact with adjacent RTL
    1172 components in ways that are not easy to predict.</t>
    1174 <t>Example 11 (allowed but not recommended): <vspace/>Logical
    1175 representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
    1176 representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
    1177 Components consisting of numbers and left-to-right characters are
    1178 allowed, but these may interact with adjacent RTL components in ways
    1179 that are not easy to predict.</t>
    1180 </section><!-- examples -->
    1181 </section><!-- bidi -->
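Illustrative sketch that reproduces the visual representations of Examples 1 to 4, 6, 7, 10, and 11 from their logical representations in Bidi Notation. It is a heavily reduced approximation of the Unicode Bidirectional Algorithm for a left-to-right embedding, not an implementation of it: uppercase letters are treated as right-to-left, lowercase letters as left-to-right, digits as European numbers, and everything else as a neutral. The special handling of separators and terminators next to numbers is omitted, so cases such as Example 8 are out of scope. The function name is hypothetical.

```python
def visual_order(logical: str) -> str:
    """Approximate display order of a Bidi Notation string (ltr embedding)."""
    n = len(logical)
    types = []
    for ch in logical:
        if "A" <= ch <= "Z":
            types.append("R")    # stands for an rtl letter
        elif "a" <= ch <= "z":
            types.append("L")    # stands for an ltr letter
        elif "0" <= ch <= "9":
            types.append("EN")   # European number
        else:
            types.append("N")    # treated as neutral in this sketch

    # Digits whose preceding strong character is ltr behave like ltr text.
    last_strong = "L"
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    def side(i: int) -> str:
        # Outside the string, the ltr embedding direction applies.
        if i < 0 or i >= n:
            return "L"
        return "R" if types[i] in ("R", "EN") else "L"

    # Neutrals become rtl only when both neighbours are rtl (or numbers).
    i = 0
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        resolved = "R" if side(i - 1) == "R" and side(j) == "R" else "L"
        types[i:j] = [resolved] * (j - i)
        i = j

    # Levels: ltr 0, rtl 1, numbers inside rtl text 2.
    levels = [{"L": 0, "R": 1, "EN": 2}[t] for t in types]
    chars = list(logical)
    # Reverse runs from the highest level down to level 1.
    for level in (2, 1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            levels[i:j] = levels[i:j][::-1]
            i = j
    return "".join(chars)
```

For instance, visual_order("http://ab.CDE123FGH.ij/kl/mn/op.html") yields "http://ab.HGF123EDC.ij/kl/mn/op.html", matching Example 7.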
    1183895<section title="Use of IRIs" anchor="IRIuse">
    1187899<t>This section discusses limitations on characters and character
    1188 sequences usable for IRIs beyond those given in <xref target="abnf"/>
    1189 and <xref target="visual"/>. The considerations in this section are
     900sequences usable for IRIs beyond those given in <xref target="abnf"/>.
     901The considerations in this section are
    1190902relevant when IRIs are created and when URIs are converted to
    13551067<section title="Liberal Handling of Otherwise Invalid IRIs" anchor="LEIRIHREF">
    1357 <t>(EDITOR NOTE: This Section may move to an appendix.)
    13591070Some technical specifications and widely-deployed software have
    13601071allowed additional variations and extensions of IRIs to be used in
    1361 syntactic components. This section describes two widely-used
    1362 preprocessing agreements. Other technical specifications may wish to
    1363 reference a syntactic component which is "a valid IRI or a string that
    1364 will map to a valid IRI after this preprocessing algorithm". These two
    1365 variants are known as <xref target="LEIRI">Legacy Extended IRI or
    1366 LEIRI</xref>, and <xref target="HTML5">Web Address</xref>.
    1367 </t>
     1072syntactic components. </t>
    13691073<t>Future technical specifications SHOULD NOT allow conforming
    13701074producers to produce, or conforming content to contain, such forms,
    13861090  Among other extensions, processors based on this specification also
    13871091  did not enforce the restriction on bidirectional formatting
    1388   characters in <xref target="visual"></xref>, and the iprivate
     1092  characters in <xref target="Bidi"></xref>, and the iprivate
    13891093  production becomes redundant.</postamble>
    13941098using <xref target="compmapping"/>.</t>
    13951099</section> <!-- leiriproc -->
    1397 <section title="Web Address Processing" anchor="webaddress">
    1399 <t>Many popular web browsers have taken the approach of being quite
    1400 liberal in what is accepted as a "URL" or its relative
    1401 forms. This section describes their behavior in terms of a preprocessor
    1402 which maps strings into the IRI space for subsequent parsing and
    1403 interpretation as an IRI.</t>
    1405 <t>In some situations, it might be appropriate to describe the syntax
    1406 that a liberal consumer implementation might accept as a "Web
    1407 Address" or "Hypertext Reference" or "HREF". However,
    1408 technical specifications SHOULD restrict the syntactic form allowed by compliant producers
    1409 to the IRI or IRI reference syntax defined in this document
    1410 even if they want to mandate this processing.</t>
    1412 <t>
    1413 Summary:
    1414 <list style="symbols">
    1415    <t>Leading and trailing whitespace is removed.</t>
    1416    <t>Some additional characters are removed.</t>
    1417    <t>Some additional characters are allowed and escaped (as with LEIRI).</t>
    1418    <t>If interpreting an IRI as a URI, the pct-encoding of the query
    1419    component of the parsed URI component depends on operational
    1420    context.</t>
    1421 </list>
    1422 </t>
    1424 <t>Each string provided may have an associated charset (called
    1425 the HREF-charset here); this defaults to UTF-8.
    1426 For web browsers interpreting HTML, the document
    1427 charset of a string is determined:
    1429 <list style="hanging">
    1430 <t hangText="If the string came from a script (e.g. as an argument to
    1431  a method)">The HRef-charset is the script's charset.</t>
    1433 <t hangText="If the string came from a DOM node (e.g. from an
    1434   element)">The node has a Document, and the HRef-charset is the
    1435   Document's character encoding.</t>
    1437 <t hangText="If the string had a HRef-charset defined when the string was
    1438 created or defined">The HRef-charset is as defined.</t>
    1440 </list></t>
    1442 <t>If the resulting HRef-charset is a Unicode-based character encoding
    1443 (e.g., UTF-16), then use UTF-8 instead.</t>
    1446 <figure>
    1447 <preamble>The syntax for Web Addresses is obtained by replacing the 'ucschar',
    1448   pct-form, path-sep, and ifragment rules with the href-ucschar, href-pct-form, href-path-sep,
    1449   and href-ifragment
    1450   rules below. In addition, some characters are stripped.</preamble>
    1452 <artwork type='abnf'>
    1453   href-ucschar   = " " / "&lt;" / "&gt;" / DQUOTE / "{" / "}" / "|"
    1454                  / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
    1455                  / %xE000-FFFD / %x10000-10FFFF
    1456   href-pct-form  = pct-encoded / "%"
    1457   href-path-sep  = "/" / "\"
    1458   href-ifragment = *( ipchar / "/" / "?" / "#" )  ; adding "#"
    1459   href-strip     = &lt;to be done&gt;
    1460 </artwork>
    1462 <postamble>
    1464 browsers did not enforce the restriction on bidirectional formatting
    1465   characters in <xref target="visual"></xref>, and the iprivate
    1466   production becomes redundant.</postamble>
    1467 </figure>
    1469 <t>'Web Address processing' requires the following additional
    1470 preprocessing steps:
    1472 <list style="numbers">
    1474 <t>Leading and trailing instances of space (U+0020),
    1475 LF (U+000A), CR (U+000D), and TAB (U+0009) characters are removed.</t>
    1477 <t>strip all characters in href-strip.</t>
    1478   <t>Percent-encode all characters in href-ucschar not in ucschar.</t>
    1479   <t>Replace occurrences of "%" not followed by two hexadecimal digits by "%25".</t>
    1480   <t>Convert backslashes ('\') matching href-path-sep to forward slashes ('/').</t>
    1481 </list></t>
    1482 </section> <!-- webaddress -->
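Illustrative sketch of part of the removed 'Web Address processing' steps, under stated assumptions. It covers only the trimming of leading and trailing whitespace, the percent-encoding of the ASCII members of href-ucschar (plus C0 controls and DEL), and the repair of stray "%" signs. The href-strip step is skipped because the draft leaves that rule as "to be done", and backslash handling is left out because it depends on where the path component is. The function name is hypothetical.

```python
import re

# ASCII members of href-ucschar, minus the backslash (not handled here).
_ASCII_EXTRA = set(' <>"{}|^`')

# A "%" that is not followed by two hexadecimal digits.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def preprocess_web_address(text: str) -> str:
    """Apply a subset of the Web Address preprocessing steps (sketch)."""
    # Remove leading and trailing space, CR, LF, and TAB.
    text = text.strip(" \r\n\t")
    # Percent-encode the extra ASCII characters and control characters.
    encoded = []
    for ch in text:
        code = ord(ch)
        if ch in _ASCII_EXTRA or code < 0x20 or code == 0x7F:
            encoded.append(f"%{code:02X}")
        else:
            encoded.append(ch)
    text = "".join(encoded)
    # Replace "%" not followed by two hex digits with "%25".
    return _STRAY_PERCENT.sub("%25", text)
```

The stray-percent repair runs last so that the escapes produced by the encoding step are left alone.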
     1100</section> <!-- LEIRIHREF -->
    14841102<section title="Characters Not Allowed in IRIs" anchor="notAllowed">
    14861104<t>This section provides a list of the groups of characters and code
    1487 points that are allowed by LEIRI or HREF but are not allowed in IRIs or are
     1105points that are allowed in some contexts but are not allowed in IRIs or are
    14881106allowed in IRIs only in the query part. For each group of characters,
    14891107advice on the usage of these characters is also given, concentrating
    15741192Unicode codepoints.</t></list></t>
    15751193</section> <!-- notallowed -->
    1576 </section> <!-- lieirihref -->
    15781195<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
    16041221perform this inverse conversion unless it is certain this can be done
    1606 </section>
     1223</section><!-- software interfaces -->
    16081225<section title="URI/IRI Entry">
    16431260<t>For the input of IRIs with right-to-left characters, please see
    1644 <xref target="bidiInput"></xref>.</t>
    1645 </section>
     1261<xref target="Bidi"></xref>.</t>
     1262</section><!-- entry -->
    16471264<section title="URI/IRI Transfer between Applications">
    16681285Correctly used, the clipboard transfers characters, not octets, which
    16691286will do the right thing with IRIs.</t>
    1670 </section>
     1287</section><!-- transfer -->
    16721289<section title="URI/IRI Generation">
    16931310<t>This recommendation particularly applies to HTTP servers. For FTP
    16941311servers, similar considerations apply; see <xref target="RFC2640"/>.</t>
    1695 </section>
     1312</section><!-- generation -->
    16971314<section title="URI/IRI Selection" anchor="selection">
    17371354Greek, and Cyrillic, using lowercase letters results in fewer
    17381355ambiguities than using uppercase letters would.</t>
    1739 </section>
     1356</section><!-- selection -->
    17411358<section title="Display of URIs/IRIs" anchor="display">
    17451362resources, these parts should be percent-encoded before being displayed.</t>
    1747 <t>For display of Bidi IRIs, please see <xref target="visual"/>.</t>
    1748 </section>
     1364<t>For display of Bidi IRIs, please see <xref target="Bidi"/>.</t>
     1365</section> <!-- display -->
    17501367<section title="Interpretation of URIs and IRIs">
    17791396Also, the regularity of UTF-8 (see <xref target="Duerst97"/>) makes the
    17801397potential for collisions lower than it may seem at first.</t>
    1781 </section>
     1398</section> <!-- interpretation -->
    17831400<section title="Upgrading Strategy">
    18291446<xref target="UTF8use"/>.</t>
    1831 </section>
     1448</section> <!-- upgrading -->
    18321449</section> <!-- guidelines -->
    18771494  confusion include different forms of normalization and different normalization
    18781495  expectations, use of percent-encoding with various legacy encodings,
    1879   and bidirectionality issues. See also <xref target='UTR36'/>.</t>
     1496  and bidirectionality issues. See also <xref target="Bidi"/>.</t>
    18811498<t>Confusion can occur in various IRI components, such as the
    18881505Details are discussed in <xref target="selection"/>.</t>
    1890 <t>Confusion can occur with bidirectional IRIs, if the restrictions
    1891 in <xref target="bidi-structure"/> are not followed. The same visual
    1892 representation may be interpreted as different logical representations,
    1893 and vice versa. It is also very important that a correct Unicode bidirectional
    1894 implementation be used.</t>
    18951507  <t>The characters additionally allowed in Legacy Extended IRIs
    18961508    introduce additional security issues. For details, see <xref target='notAllowed'/>.</t>
    19001512<t>This document was derived from <xref target="RFC3987"/>; the acknowledgments from
    19011513that specification still apply.</t>
    1902 <t>We would like to thank Ian Hickson, Michael Sperberg-McQueen,
    1903   and Dan Connolly for their work on HyperText References, and Norman Walsh, Richard Tobin,
    1904   Henry S. Thomson, John Cowan, Paul Grosso, and the XML Core Working Group of the W3C for their work on LEIRIs.</t>
    1905 <t>In addition, this document was influenced by contributions from (in no particular order) Chris
     1514<t>In addition, this document was influenced by contributions from (in no particular order)Norman Walsh, Richard Tobin,
     1515  Henry S. Thomson, John Cowan, Paul Grosso, the XML Core Working Group of the W3C,
     1516 Chris
    19061517  Lilley, Bjoern Hoehrmann,
    19071518Felix Sasaki, Jeremy Carroll, Frank Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche,
    19141525Goland, Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas Milo, Murray Sargent,
    19151526Marc Blanchet, and Mykyta Yevstifeyev.</t>
    1916 </section>
     1527</section> <!-- Acknowledgements -->
    19181529<section title="Main Changes Since RFC 3987">
    19191530  <t>This section describes the main changes since <xref target="RFC3987"></xref>.</t>
     1531  <section title="Split out Bidi, processing guidelines, comparison sections">
     1532    <t>Move some components (comparison, bidi, processing) into separate documents.</t>
     1533  </section>
    19201534  <section title="Major restructuring of IRI processing model" anchor="forkChanges">
    19211535    <t>Major restructuring of IRI processing model to make scheme-specific translation
    20471661    <t>Added this Change Log Section.</t>
    20481662    <t>Added a section about "IRIs with Spaces/Controls" (converting from a Note in RFC 3987).</t></list></t>
    2049 </section>
     1663</section> <!-- -00 to -01 -->
    20501664<section title="Changes from RFC 3987 to -00 of draft-duerst-iri-bis">
    20511665  <t><list>
    20521666    <t>Fixed errata (see</t></list></t>
    2053 </section>
     1667</section> <!-- from 3987 -->
    2110 <reference anchor="UNI9" target="">
    2111 <front>
    2112 <title>The Bidirectional Algorithm</title>
    2113 <author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
    2114 <date year="2004" month="March"/>
    2115 </front>
    2116 <seriesInfo name="Unicode Standard Annex" value="#9"/>
    2117 </reference>
    21191724<reference anchor="UTR15" target="">
    21311736<references title="Informative References">
    2133 <reference anchor="BidiEx" target="">
    2134 <front>
    2135 <title>Examples of bidirectional IRIs</title>
    2136 <author><organization/></author>
    2137 <date year="" month=""/>
    2138 </front>
    2139 </reference>
    21411738<reference anchor="CharMod" target="">
    2205 &rfc1738;
     1806<reference anchor='Bidi'>
     1807  <front>
     1808    <title>Guidelines for Internationalized Resource Identifiers with Bi-directional Characters (Bidi IRIs)</title>
     1809    <author initials="M." surname="Duerst"/>
     1810    <author initials='L.' surname='Masinter' />
     1811    <date year="2011" month="August" day="14" />
     1812  </front>
     1813  <seriesInfo name="Internet-Draft" value="draft-ietf-iri-bidi-guidelines-00"/>
     1816<reference anchor='Equivalence'>
     1817  <front>
     1818    <title>Equivalence and Canonicalization of Internationalized Resource Identifiers (IRIs)</title>
     1819    <author initials='L.' surname='Masinter' />
     1820    <author initials="M." surname="Duerst"/>
     1821    <date year="2011" month="August" day="13" />
     1822  </front>
     1823  <seriesInfo name="Internet-Draft" value="draft-ietf-iri-comparison-00"/>
    22091826<reference anchor='RFC4395bis'>
    22101827  <front>
    22131830    <author initials='T.' surname='Hardie' fullname="Ted Hardie"><organization/></author>
    22141831    <author initials='L.' surname='Masinter' fullname="Larry Masinter"><organization/></author>
    2215     <date year="2010" month='September' day="30"/>
     1832    <date year="2011" month='July' day="29"/>
    22161833    <workgroup>IRI</workgroup>
    22171834  </front>
    2218   <seriesInfo name="Internet-Draft" value="draft-hansen-iri-4395bis-irireg-00"/>
     1835  <seriesInfo name="Internet-Draft" value="draft-ietf-iri-4395bis-irireg-03"/>
    2302 <reference anchor="HTML5" target="">
    2303 <front>
    2304 <title>A vocabulary and associated APIs for HTML and XHTML</title>
    2305 <author initials="I." surname="Hickson" fullname="Ian Hickson"><organization>Google, Inc.</organization></author>
    2306 <author initials="D." surname="Hyatt" fullname="David Hyatt"><organization>Apple, Inc.</organization></author>
    2307 <date year="2009"  month="April" day="23"/>
    2308 </front>
    2309 <seriesInfo name="World Wide Web Consortium" value="Working Draft"/>
    2310 </reference>