source: draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml @ 5

Last change on this file since 5 was 5, checked in by duerst@…, 9 years ago

editorial changes in last paragraph of abstract:

  • removed saying that this is 'essentially identical' to draft-duerst-iri-bis-07.txt
  • Added a pointer to the open issues list.
  • Property svn:executable set to *
File size: 131.6 KB
[2]1<?xml version="1.0"?>
2<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
3<!ENTITY rfc1738 SYSTEM "">
4<!ENTITY rfc2045 SYSTEM "">
5<!ENTITY rfc2119 SYSTEM "">
6<!ENTITY rfc2130 SYSTEM "">
7<!ENTITY rfc2141 SYSTEM "">
8<!ENTITY rfc2192 SYSTEM "">
9<!ENTITY rfc2277 SYSTEM "">
10<!ENTITY rfc2368 SYSTEM "">
11<!ENTITY rfc2384 SYSTEM "">
12<!ENTITY rfc2396 SYSTEM "">
13<!ENTITY rfc2397 SYSTEM "">
14<!ENTITY rfc2616 SYSTEM "">
15<!ENTITY rfc2640 SYSTEM "">
16<!ENTITY rfc3490 SYSTEM "">
17<!ENTITY rfc3491 SYSTEM "">
18<!ENTITY rfc3629 SYSTEM "">
19<!ENTITY rfc3986 SYSTEM "">
20<!ENTITY rfc4395 SYSTEM "">
22<?rfc strict='yes'?>
23<!--     complains about too long lines (2 cases)
24     and appendix, but otherwise is okay
26<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
27<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
28<?rfc symrefs='yes'?>
29<?rfc sortrefs='yes'?>
30<?rfc iprnotified="no" ?>
31<?rfc toc='yes'?>
32<?rfc compact='yes'?>
33<?rfc subcompact='no'?>
[4]34<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-01" category="std" xml:lang="en" obsoletes="3987">
36<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>
38  <author initials="M.J." surname="Duerst" fullname='Martin Duerst'>
[3]39    <!-- (Note: Please write "Duerst" with u-umlaut wherever
40      possible, for example as "D&#252;rst" in XML and HTML') -->
[2]41  <organization abbrev="Aoyama Gakuin University">Aoyama Gakuin University</organization>
42  <address>
43  <postal>
44  <street>5-10-1 Fuchinobe</street>
45  <city>Sagamihara</city>
46  <region>Kanagawa</region>
47  <code>229-8558</code>
48  <country>Japan</country>
49  </postal>
50  <phone>+81 42 759 6329</phone>
51  <facsimile>+81 42 759 6495</facsimile>
[3]52  <email></email>
[4]53  <uri><!-- (Note: This is the percent-encoded form of an IRI)--></uri>
[2]54  </address>
57<author initials="M.L." surname="Suignard" fullname="Michel Suignard">
58   <organization>Unicode Consortium</organization>
59   <address>
60   <postal>
61   <street></street>
62   <street>P.O. Box 391476</street>
63   <city>Mountain View</city>
64   <region>CA</region>
65   <code>94039-1476</code>
66   <country>U.S.A.</country>
67   </postal>
68   <phone>+1-650-693-3921</phone>
[3]69   <email></email>
[2]70   <uri></uri>
71   </address>
73<author initials="L." surname="Masinter" fullname="Larry Masinter">
74   <organization>Adobe</organization>
75   <address>
76   <postal>
77   <street>345 Park Ave</street>
78   <city>San Jose</city>
79   <region>CA</region>
80   <code>95110</code>
81   <country>U.S.A.</country>
82   </postal>
83   <phone>+1-408-536-3024</phone>
[3]84   <email></email>
[2]85   <uri></uri>
86   </address>
[3]89<date year="2010" month="July" day="26"/>
91<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
93<keyword>Internationalized Resource Identifier</keyword>
101<t>This document defines the Internationalized Resource Identifier
102(IRI) protocol element, as an extension of the Uniform Resource
103Identifier (URI).  An IRI is a sequence of characters from the
104Universal Character Set (Unicode/ISO 10646). Grammar and processing
105rules are given for IRIs and related syntactic forms.</t>
107<t>In addition, this document provides named additional rule sets
108for processing otherwise invalid IRIs, in a way that supports
109other specifications that wish to mandate common behavior for
110'error' handling. In particular, rules used in some XML languages
111(LEIRI) and web applications are given.</t>
113<t>Defining IRI as a new protocol element (rather than updating or
114extending the definition of URI) allows independent orderly
115transitions: other protocols and languages that use URIs must
116explicitly choose to allow IRIs.</t>
118<t>Guidelines are provided for the use and deployment of IRIs and
119related protocol elements when revising protocols, formats, and
120software components that currently deal only with URIs.</t>
122<t>[RFC Editor: Please remove this paragraph before publication.]
123This document is intended to update RFC 3987 and move towards IETF
[5]124Draft Standard.  For discussion and comments on this
[2]125draft, please join the IETF IRI WG by subscribing to the mailing
[5]126list For a list of open issues, please see
127the issue tracker of the WG at</t>
133<section title="Introduction">
135<section title="Overview and Motivation" anchor="overview">
137<t>A Uniform Resource Identifier (URI) is defined in <xref
138target="RFC3986"/> as a sequence of characters chosen from a limited
139subset of the repertoire of US-ASCII <xref target="ASCII"/>
142<t>The characters in URIs are frequently used for representing words
143of natural languages.  This usage has many advantages: Such URIs are
144easier to memorize, easier to interpret, easier to transcribe, easier
145to create, and easier to guess. For most languages other than English,
146however, the natural script uses characters other than A - Z. For many
147people, handling Latin characters is as difficult as handling the
148characters of other scripts is for those who use only the Latin
149alphabet. Many languages with non-Latin scripts are transcribed with
150Latin letters. These transcriptions are now often used in URIs, but
151they introduce additional difficulties.</t>
153<t>The infrastructure for the appropriate handling of characters from
154additional scripts is now widely deployed in operating system and
155application software. Software that can handle a wide variety of
156scripts and languages at the same time is increasingly common. Also,
157an increasing number of protocols and formats can carry a wide range of
160<t>URIs are used both as a protocol element (for transmission and
161processing by software) and also a presentation element (for display
162and handling by people who read, interpret, coin, or guess them). The
163transition between these roles is more difficult and complex when
164dealing with a larger set of characters than is allowed for URIs in
165<xref target="RFC3986"/>. </t>
167<t>This document defines the protocol element called Internationalized
168Resource Identifier (IRI), which allows applications of URIs to be
169extended to use resource identifiers that have a much wider repertoire
170of characters. It also provides corresponding "internationalized"
171versions of other constructs from <xref target="RFC3986"/>, such as
172URI references. The syntax of IRIs is defined in <xref
176<t>Using characters outside of A - Z in IRIs adds a number of
177difficulties. <xref target="Bidi"/> discusses the special case of
178bidirectional IRIs using characters from scripts written
179right-to-left.  <xref target="equivalence"/> discusses various forms
180of equivalence between IRIs. <xref target="IRIuse"/> discusses the use
181of IRIs in different situations.  <xref target="guidelines"/> gives
182additional informative guidelines.  <xref target="security"/>
183discusses IRI-specific security considerations.</t>
184</section> <!-- overview -->
186<section title="Applicability" anchor="Applicability">
188<t>IRIs are designed to allow protocols and software that deal with
189URIs to be updated to handle IRIs. A "URI scheme" (as defined by <xref
190target="RFC3986"/> and registered through the IANA process defined in
191<xref target="RFC4395"/>) also serves as an "IRI scheme". Processing of
192IRIs is accomplished by extending the URI syntax while retaining (and
193not expanding) the set of "reserved" characters, such that the syntax
194for any URI scheme may be uniformly extended to allow non-ASCII
195characters. In addition, following parsing of an IRI, it is possible
196to construct a corresponding URI by first encoding characters outside
197of the allowed URI range and then reassembling the components.
200<t>Practical use of IRI forms in place of URI forms depends on the
201following conditions being met:</t>
203<t><list style="hanging">
205<t hangText="a.">A protocol or format element MUST be explicitly designated to be
206  able to carry IRIs. The intent is to avoid introducing IRIs into
207  contexts that are not defined to accept them.  For example, XML
208  schema <xref target="XMLSchema"/> has an explicit type "anyURI" that
209  includes IRIs and IRI references. Therefore, IRIs and IRI references
210  can be in attributes and elements of type "anyURI".  On the other
211  hand, in the <xref target="RFC2616"/> definition of HTTP/1.1, the
212  Request URI is defined as a URI, which means that direct use of IRIs
213  is not allowed in HTTP requests.</t>
215<t hangText="b.">The protocol or format carrying the IRIs MUST have a
216  mechanism to represent the wide range of characters used in IRIs,
217  either natively or by some protocol- or format-specific escaping
218  mechanism (for example, numeric character references in <xref
219  target="XML1"/>).</t>
221<t hangText="c.">The URI scheme definition, if it explicitly allows a
222  percent sign ("%") in any syntactic component, SHOULD define the
223  interpretation of sequences of percent-encoded octets (using "%XX"
224  hex octets) as octets from sequences of UTF-8 encoded strings; this
225  is recommended in the guidelines for registering new schemes, <xref
226  target="RFC4395"/>.  For example, this is the practice for IMAP URLs
227  <xref target="RFC2192"/>, POP URLs <xref target="RFC2384"/> and the
228  URN syntax <xref target="RFC2141"/>. Note that use of
229  percent-encoding may also be restricted in some situations; for
230  example, URI schemes that disallow percent-encoding might still be
231  used with a fragment identifier which is percent-encoded (e.g.,
232  <xref target="XPointer"/>). See <xref target="UTF8use"/> for further
233  discussion.</t>
236</section> <!-- applicability -->
238<section title="Definitions" anchor="sec-Definitions">
240<t>The following definitions are used in this document; they follow the
241terms in <xref target="RFC2130"/>, <xref target="RFC2277"/>, and
242<xref target="ISO10646"/>.</t>
243<t><list style="hanging">
245<t hangText="character:">A member of a set of elements used for the
246    organization, control, or representation of data. For example,
247    "LATIN CAPITAL LETTER A" names a character.</t>
249<t hangText="octet:">An ordered sequence of eight bits considered as a
250    unit.</t>
252<t hangText="character repertoire:">A set of characters (set in the
253    mathematical sense).</t>
255<t hangText="sequence of characters:">A sequence of characters (one
256    after another).</t>
258<t hangText="sequence of octets:">A sequence of octets (one after
259    another).</t>
261<t hangText="character encoding:">A method of representing a sequence
262    of characters as a sequence of octets (maybe with variants). Also,
263    a method of (unambiguously) converting a sequence of octets into a
264    sequence of characters.</t>
266<t hangText="charset:">The name of a parameter or attribute used to
267    identify a character encoding.</t>
269<t hangText="UCS:">Universal Character Set. The coded character set
270    defined by ISO/IEC 10646 <xref target="ISO10646"/> and the Unicode
271    Standard <xref target="UNIV4"/>.</t>
273<t hangText="IRI reference:">Denotes the common usage of an
274    Internationalized Resource Identifier. An IRI reference may be
275    absolute or relative.  However, the "IRI" that results from such a
276    reference only includes absolute IRIs; any relative IRI references
277    are resolved to their absolute form.  Note that in <xref
278    target="RFC2396"/> URIs did not include fragment identifiers, but
279    in <xref target="RFC3986"/> fragment identifiers are part of
280    URIs.</t>
282<t hangText="URL:">The term "URL" was originally used <xref
283   target="RFC1738"/> for roughly what is now called a "URI".  Books,
284   software and documentation often refer to URIs and IRIs using the
285   "URL" term. Some usages restrict "URL" to those URIs which are not
286   URNs. Because of the ambiguity of the term, using the term "URL" is
287   NOT RECOMMENDED in formal documents.</t>
289<t hangText="LEIRI (Legacy Extended IRI) processing:">  This term was used in
290   various XML specifications to refer
291   to strings that, although not valid IRIs, were acceptable input to
292   the processing rules in <xref target="LEIRIspec" />.</t>
294<t hangText="(Web Address, Hypertext Reference, HREF):"> These terms have been
295   added in this document for convenience, to allow other
296   specifications to refer to those strings that, although not valid
297   IRIs, are acceptable input to the processing rules in <xref
298   target="webaddress"/>. This usage corresponds to the parsing rules
299   of some popular web browsing applications.
300   ISSUE: Need to find a good name/abbreviation for these.</t>
302<t hangText="running text:">Human text (paragraphs, sentences,
303   phrases) with syntax according to orthographic conventions of a
304   natural language, as opposed to syntax defined for ease of
305   processing by machines (e.g., markup, programming languages).</t>
307<t hangText="protocol element:">Any portion of a message that affects
308    processing of that message by the protocol in question.</t>
310<t hangText="presentation element:">A presentation form corresponding
311    to a protocol element; for example, using a wider range of
312    characters.</t>
314<t hangText="create (a URI or IRI):">With respect to URIs and IRIs,
315     the term is used for the initial creation. This may be the
316     initial creation of a resource with a certain identifier, or the
317     initial exposition of a resource under a particular
318     identifier.</t>
320<t hangText="generate (a URI or IRI):">With respect to URIs and IRIs,
321     the term is used when the identifier is generated by derivation
322     from other information.</t>
324<t hangText="parsed URI component:">When a URI processor parses a URI
325   (following the generic syntax or a scheme-specific syntax), the result
326   is a set of parsed URI components, each of which has a type
327   (corresponding to the syntactic definition) and a sequence of URI
328   characters.  </t>
330<t hangText="parsed IRI component:">When an IRI processor parses
331   an IRI directly, following the general syntax or a scheme-specific
332   syntax, the result is a set of parsed IRI components, each of
333   which has a type (corresponding to the syntactic definition)
334   and a sequence of IRI characters. (This definition is analogous
335   to "parsed URI component".)</t>
337<t hangText="IRI scheme:">A URI scheme may also be known as
338   an "IRI scheme" if the scheme's syntax has been extended to
339   allow non-US-ASCII characters according to the rules in this
340   document.</t>
343</section> <!-- definitions -->
344<section title="Notation" anchor="sec-Notation">
346<t>RFCs and Internet Drafts currently do not allow any characters
347outside the US-ASCII repertoire. Therefore, this document uses various
348special notations to denote such characters in examples.</t>
350<t>In text, characters outside US-ASCII are sometimes referenced by
351using a prefix of 'U+', followed by four to six hexadecimal
354<t>To represent characters outside US-ASCII in examples, this document
355uses two notations: 'XML Notation' and 'Bidi Notation'.</t>
357<t>XML Notation uses a leading '&amp;#x', a trailing ';', and the
358hexadecimal number of the character in the UCS in between. For
359example, &amp;#x44F; stands for CYRILLIC SMALL LETTER YA. In this
360notation, an actual '&amp;' is denoted by '&amp;amp;'.</t>
362<t>Bidi Notation is used for bidirectional examples: Lower case
363letters stand for Latin letters or other letters that are written left
364to right, whereas upper case letters represent Arabic or Hebrew
365letters that are written right to left.</t>
367<t>To denote actual octets in examples (as opposed to percent-encoded
368octets), the two hex digits denoting the octet are enclosed in "&lt;"
369and "&gt;".  For example, the octet often denoted as 0xc9 is denoted
370here as &lt;c9&gt;.</t>
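<t>The XML Notation and the octet notation described above can be
sketched in a few lines of Python. This is only an illustration: the
function names to_xml_notation and to_octet_notation are hypothetical
and are not defined by this document.</t>

<figure><artwork><![CDATA[
```python
def to_xml_notation(text):
    """Write non-ASCII characters as &#xHEX; and '&' as '&amp;'."""
    out = []
    for ch in text:
        if ch == "&":
            # An actual ampersand is denoted by '&amp;'.
            out.append("&amp;")
        elif ord(ch) > 0x7F:
            # Hexadecimal number of the character in the UCS.
            out.append(f"&#x{ord(ch):X};")
        else:
            out.append(ch)
    return "".join(out)


def to_octet_notation(octets):
    """Write each octet as two hex digits enclosed in '<' and '>'."""
    return "".join(f"<{octet:02x}>" for octet in octets)
```
]]></artwork></figure>

<t>For instance, to_xml_notation applied to the single character
U+044F yields the seven ASCII characters of the example above, and
to_octet_notation applied to the single octet 0xc9 yields the
four-character form shown above.</t>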
372<t> In this document, the key words "MUST", "MUST NOT", "REQUIRED",
374and "OPTIONAL" are to be interpreted as described in <xref
377</section> <!-- notation -->
378</section> <!-- introduction -->
380<section title="IRI Syntax" anchor="syntax">
381<t>This section defines the syntax of Internationalized Resource
382Identifiers (IRIs).</t>
384<t>As with URIs, an IRI is defined as a sequence of characters, not as
385a sequence of octets. This definition accommodates the fact that IRIs
386may be written on paper or read over the radio as well as stored or
387transmitted digitally.  The same IRI might be represented as different
388sequences of octets in different protocols or documents if these
389protocols or documents use different character encodings (and/or
390transfer encodings).  Using the same character encoding as the
391containing protocol or document ensures that the characters in the IRI
392can be handled (e.g., searched, converted, displayed) in the same way
393as the rest of the protocol or document.</t>
395<section title="Summary of IRI Syntax" anchor="summary">
397<t>IRIs are defined by extending the URI syntax in <xref
398target="RFC3986"/>, but extending the class of unreserved characters
399by adding the characters of the UCS (Universal Character Set, <xref
400target="ISO10646"/>) beyond U+007F, subject to the limitations given
401in the syntax rules below and in <xref target="limitations"/>.</t>
403<t>The syntax and use of components and reserved characters is the
404same as that in <xref target="RFC3986"/>. Each "URI scheme" thus also
405functions as an "IRI scheme", in that scheme-specific parsing rules
406for URIs of a scheme are extended to allow parsing of IRIs using
407the same parsing rules.</t>
409<t>All the operations defined in <xref target="RFC3986"/>, such as the
410resolution of relative references, can be applied to IRIs by
411IRI-processing software in exactly the same way as they are for URIs
412by URI-processing software.</t>
414<t>Characters outside the US-ASCII repertoire MUST NOT be reserved and
415therefore MUST NOT be used for syntactical purposes, such as to
416delimit components in newly defined schemes. For example, U+00A2, CENT
417SIGN, is not allowed as a delimiter in IRIs, because it is in the
418'iunreserved' category. This is similar to the fact that it is not
419possible to use '-' as a delimiter in URIs, because it is in the
420'unreserved' category.</t>
422</section> <!-- summary -->
423<section title="ABNF for IRI References and IRIs" anchor="abnf">
425<t>An ABNF definition for IRI references (which are the most general
426concept and the start of the grammar) and IRIs is given here. The
427syntax of this ABNF is described in <xref target="STD68"/>. Character
428numbers are taken from the UCS, without implying any actual binary
429encoding. Terminals in the ABNF are characters, not octets.</t>
431<t>The following grammar closely follows the URI grammar in <xref
432target="RFC3986"/>, except that the range of unreserved characters is
433expanded to include UCS characters, with the restriction that private
434UCS characters can occur only in query parts. The grammar is split
435into two parts: Rules that differ from <xref target="RFC3986"/>
436because of the above-mentioned expansion, and rules that are the same
437as those in <xref target="RFC3986"/>. For rules that are different
438than those in <xref target="RFC3986"/>, the names of the non-terminals
439have been changed as follows. If the non-terminal contains 'URI', this
440has been changed to 'IRI'. Otherwise, an 'i' has been prefixed.</t>
443for line length measuring in artwork (max 72 chars, three chars at start):
444      1         2         3         4         5         6         7
448<preamble>The following rules are different from those in <xref target="RFC3986"/>:</preamble>
450IRI            = scheme ":" ihier-part [ "?" iquery ]
451                 [ "#" ifragment ]
453ihier-part     = "//" iauthority ipath-abempty
454               / ipath-absolute
455               / ipath-rootless
456               / ipath-empty
458IRI-reference  = IRI / irelative-ref
460absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]
462irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
464irelative-part = "//" iauthority ipath-abempty
465               / ipath-absolute
466               / ipath-noscheme
467               / ipath-empty
469iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
470iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
471ihost          = IP-literal / IPv4address / ireg-name
473pct-form       = pct-encoded
475ireg-name      = *( iunreserved / sub-delims )
477ipath          = ipath-abempty   ; begins with "/" or is empty
478               / ipath-absolute  ; begins with "/" but not "//"
479               / ipath-noscheme  ; begins with a non-colon segment
480               / ipath-rootless  ; begins with a segment
481               / ipath-empty     ; zero characters
483ipath-abempty  = *( path-sep isegment )
484ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
485ipath-noscheme = isegment-nz-nc *( path-sep isegment )
486ipath-rootless = isegment-nz *( path-sep isegment )
487ipath-empty    = 0&lt;ipchar&gt;
488path-sep       = "/"
490isegment       = *ipchar
491isegment-nz    = 1*ipchar
492isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
493                     / "@" )
494               ; non-zero-length segment without any colon ":"                     
496ipchar         = iunreserved / pct-form / sub-delims / ":"
497               / "@"
499iquery         = *( ipchar / iprivate / "/" / "?" )
501ifragment      = *( ipchar / "/" / "?" / "#" )
503iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
505ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
506               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
507               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
508               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
509               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
510               / %xD0000-DFFFD / %xE1000-EFFFD
512iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
513               / %x100000-10FFFD
517<t>Some productions are ambiguous. The "first-match-wins" (a.k.a. "greedy")
518algorithm applies. For details, see <xref target="RFC3986"/>.</t>
521<preamble>The following rules are the same as those in <xref target="RFC3986"/>:</preamble>
523scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
525port           = *DIGIT
527IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"
529IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
531IPv6address    =                            6( h16 ":" ) ls32
532               /                       "::" 5( h16 ":" ) ls32
533               / [               h16 ] "::" 4( h16 ":" ) ls32
534               / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
535               / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
536               / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
537               / [ *4( h16 ":" ) h16 ] "::"              ls32
538               / [ *5( h16 ":" ) h16 ] "::"              h16
539               / [ *6( h16 ":" ) h16 ] "::"
541h16            = 1*4HEXDIG
542ls32           = ( h16 ":" h16 ) / IPv4address
544IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet
546dec-octet      = DIGIT                 ; 0-9
547               / %x31-39 DIGIT         ; 10-99
548               / "1" 2DIGIT            ; 100-199
549               / "2" %x30-34 DIGIT     ; 200-249
550               / "25" %x30-35          ; 250-255
552pct-encoded    = "%" HEXDIG HEXDIG
554unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
555reserved       = gen-delims / sub-delims
556gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
557sub-delims     = "!" / "$" / "&amp;" / "'" / "(" / ")"
558               / "*" / "+" / "," / ";" / "="
561<t>This syntax does not support IPv6 scoped addressing zone identifiers.</t>
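<t>As an informal aid to reading the 'ucschar' and 'iprivate'
productions, the following Python sketch tests whether a single
character falls into either class. The ranges are copied from the
ABNF above; the names UCSCHAR_RANGES, IPRIVATE_RANGES, is_ucschar,
and is_iprivate are hypothetical and chosen only for this
illustration.</t>

<figure><artwork><![CDATA[
```python
# Ranges copied from the 'ucschar' production.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR_RANGES += [
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)
]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# Ranges copied from the 'iprivate' production.
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF),
    (0xE0000, 0xE0FFF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def _in_ranges(ch, ranges):
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch):
    """True if the single character ch matches 'ucschar'."""
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch):
    """True if the single character ch matches 'iprivate'."""
    return _in_ranges(ch, IPRIVATE_RANGES)
```
]]></artwork></figure>

<t>A character for which is_iprivate is true is only permitted by the
grammar in the 'iquery' production; a character for which is_ucschar
is true is permitted wherever 'iunreserved' appears.</t>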
563</section> <!-- abnf -->
565</section> <!-- syntax -->
567<section title="Processing IRIs and related protocol elements" anchor="processing">
569<t>IRIs are meant to replace URIs in identifying resources within new
570versions of protocols, formats, and software components that use a
571UCS-based character repertoire.  Protocols and components may use and
572process IRIs directly. However, there are still numerous systems and
573protocols which only accept URIs or components of parsed URIs; that is,
574they only accept sequences of characters within the subset of US-ASCII
575characters allowed in URIs. </t>
577<t>This section defines specific processing steps for IRI consumers
578which establish the relationship between the string given and the
579interpreted derivatives. These
580processing steps apply to both IRIs and IRI references (i.e., absolute
581or relative forms); for IRIs, some steps are scheme specific. </t>
583<section title="Converting to UCS" anchor="ucsconv"> 
585<t>Input that is already in a Unicode form (i.e., a sequence of Unicode
586 characters or an octet-stream representing a Unicode-based character
587 encoding such as UTF-8 or UTF-16) should be left as is and not
588 normalized (see <xref target="normalization"/>).</t>
590<t>If the IRI or IRI reference is an octet stream in some known
591 non-Unicode character encoding, convert the IRI to a sequence of
592 characters from the UCS; this sequence SHOULD also be normalized
593 according to Unicode Normalization Form C (NFC, <xref
594 target="UTR15"/>). In this case, retain the original character
595 encoding as the "document character encoding". (DESIGN QUESTION:
598<t> In other cases (written on paper, read aloud, or otherwise
599 represented independent of any character encoding) represent the IRI
600 as a sequence of characters from the UCS normalized according to
601 Unicode Normalization Form C (NFC, <xref target="UTR15"/>).</t>
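<t>The three cases above might be sketched in Python as follows.
This is an illustration under stated assumptions, not a normative
procedure: the function names octets_to_ucs and transcribed_to_ucs
are hypothetical, and treating every codec whose canonical Python
name starts with "utf-" as a Unicode-based encoding is a
simplification made for this sketch.</t>

<figure><artwork><![CDATA[
```python
import codecs
import unicodedata


def octets_to_ucs(octets, encoding):
    """Decode an octet stream in a known character encoding."""
    text = octets.decode(encoding)
    # Assumption: canonical codec names starting with "utf-"
    # identify the Unicode-based encodings.
    if codecs.lookup(encoding).name.startswith("utf-"):
        # Already in a Unicode form: leave as is, do not normalize.
        return text
    # Non-Unicode encoding: normalize to NFC after conversion.
    return unicodedata.normalize("NFC", text)


def transcribed_to_ucs(text):
    """Input from paper or speech: represent it in NFC."""
    return unicodedata.normalize("NFC", text)
```
]]></artwork></figure>

<t>A caller that follows the second case would also retain the value
of the encoding argument as the "document character encoding"; the
sketch leaves that bookkeeping to the caller.</t>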
602</section> <!-- ucsconv -->
604<section title="Parse the IRI into IRI components">
606<t>Parse the IRI, either as a relative reference (no scheme)
607or using scheme specific processing (according to the scheme
608given); the result is a set of parsed IRI components.
612 </t>
614<t>NOTE: The result of parsing into components corresponds to
615substrings of the IRI, according to the part
616matched.  For example, in <xref target="HTML5"/>, the protocol
617components of interest are SCHEME (scheme), HOST (ireg-name), PORT
618(port), the PATH (ipath after the initial "/"), QUERY (iquery),
619FRAGMENT (ifragment), and AUTHORITY (iauthority).
622<t>Subsequent processing rules are sometimes used to define other
623syntactic components. For example, <xref target="HTML5"/> defines APIs
624for IRI processing; in these APIs:
626<list style="hanging">
627<t hangText="HOSTSPECIFIC"> the substring that follows
628the substring matched by the iauthority production, or the whole
629string if the iauthority production wasn't matched.</t>
630<t hangText="HOSTPORT"> if there is a scheme component and a port
631component and the port given by the port component is different than
632the default port defined for the protocol given by the scheme
633component, then HOSTPORT is the substring that starts with the
634substring matched by the host production and ends with the substring
635matched by the port production, and includes the colon in between the
636two. Otherwise, it is the same as the host component.
640</section> <!-- parse -->
642<section title="General percent-encoding of IRI components" anchor="compmapping">
644<t>For most IRI components, it is possible to map the IRI component
645to an equivalent URI component by percent-encoding those characters
646not allowed in URIs. Previous processing steps will have removed
647some characters, and the interpretation of reserved characters will
648have already been done (with the syntactic reserved characters outside
649of the IRI component). This mapping is defined for all sequences
650of Unicode characters, whether or not they are valid for the component
651in question. </t>
653<t>For each character which is not allowed in a valid URI (NOTE: WHAT
654IS THE RIGHT REFERENCE HERE), apply the following steps. </t>
656<t><list style="hanging">
658<t hangText="Convert to UTF-8">Convert the character to a sequence of
659  one or more octets using UTF-8 <xref target="RFC3629"/>.</t>
661<t hangText="Percent encode">Convert each octet of this sequence to %HH,
662   where HH is the hexadecimal notation of the octet value. The
663   hexadecimal notation SHOULD use uppercase letters. (This is the
664   general URI percent-encoding mechanism in Section 2.1 of <xref
665   target="RFC3986"/>.)</t>
669<t>Note that the mapping is an identity transformation for parsed URI
670components of valid URIs, and is idempotent: applying the mapping a
671second time will not change anything.</t>
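<t>The two steps above can be sketched in Python as follows. This
is a minimal illustration, not a normative implementation. The names
URI_CHARS and pct_encode_component are hypothetical, and the set of
characters "allowed in a valid URI" is assumed here to be the
'unreserved' and 'reserved' characters of the grammar above plus the
percent sign.</t>

<figure><artwork><![CDATA[
```python
import string

# Assumption: characters allowed in a URI are 'unreserved',
# 'gen-delims', 'sub-delims', and "%" (for existing pct-encoded).
URI_CHARS = frozenset(
    string.ascii_letters + string.digits + "-._~"
    + ":/?#[]@"
    + "!$&'()*+,;="
    + "%"
)


def pct_encode_component(component):
    """Percent-encode every character not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH
            # using uppercase hexadecimal digits.
            for octet in ch.encode("utf-8"):
                out.append(f"%{octet:02X}")
    return "".join(out)
```
]]></artwork></figure>

<t>Because the percent sign is itself left untouched, existing
percent-encoded sequences are not encoded again, which is what makes
the sketch idempotent in the sense described above.</t>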
672</section> <!-- general conversion -->
674<section title="Mapping ireg-name" anchor="dnsmapping">
676<t>Schemes that allow non-ASCII characters
677in the reg-name (ireg-name) position MUST convert the ireg-name
678component of an IRI as follows:</t>
680<t>Replace the ireg-name part of the IRI by the part converted using
681the ToASCII operation specified in Section 4.1 of <xref
682target="RFC3490"/> on each dot-separated label, and by using U+002E
683(FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set
684to FALSE, and with the flag AllowUnassigned set to FALSE.
685The ToASCII operation may
686fail, which means that the IRI cannot be resolved.
687If the domain name conversion fails, then the
688entire IRI conversion fails. Processors that have no mechanism for
689signalling a failure MAY instead substitute an otherwise
690invalid host name, although such processing SHOULD be avoided.
691 </t>
693<t>For example, the IRI
694<vspace/>"http://r&amp;#xE9;sum&amp;#xE9;"<vspace/> MAY be
695converted to <vspace/>""<vspace/>;
696conversion to percent-encoded form, e.g.,
697 <vspace/>"", MUST NOT be performed. </t>
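<t>As a rough illustration of this step, the following Python sketch
uses the "idna" codec of the Python standard library, which
implements the ToASCII operation of RFC 3490. The function name
map_ireg_name is hypothetical. The codec does not expose the
UseSTD3ASCIIRules and AllowUnassigned flags, so whether its behavior
matches the settings required above in every case is an assumption
of this sketch.</t>

<figure><artwork><![CDATA[
```python
def map_ireg_name(ireg_name):
    """Apply ToASCII to each dot-separated label of ireg_name.

    Raises UnicodeError if the conversion fails, in which case
    the entire IRI conversion is considered to have failed.
    """
    # The standard "idna" codec splits the name into labels and
    # applies nameprep and Punycode to each non-ASCII label.
    return ireg_name.encode("idna").decode("ascii")
```
]]></artwork></figure>

<t>With the label from the example above and a hypothetical
"example.org" suffix, the sketch returns
"xn--rsum-bpad.example.org"; a name that is already ASCII is returned
unchanged.</t>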
699<t><list style="hanging"> 
701<t hangText="Note:">Domain Names may appear in parts of an IRI other
702than the ireg-name part.  It is the responsibility of scheme-specific
703implementations (if the Internationalized Domain Name is part of the
704scheme syntax) or of server-side implementations (if the
705Internationalized Domain Name is part of 'iquery') to apply the
706necessary conversions at the appropriate point. Example: Trying to
707validate the Web page at<vspace/>
708http://r&amp;#xE9;sum&amp;#xE9; would lead to an IRI of
710which would convert to a URI
712The server-side implementation is responsible for making the
713necessary conversions to be able to retrieve the Web page.</t>
715<t hangText="Note:">In this process, characters allowed in URI
716references and existing percent-encoded sequences are not encoded further.
717(This mapping is similar to, but different from, the encoding applied
718when arbitrary content is included in some part of a URI.)
720For example, an IRI of
722(in XML notation) is converted to
723<vspace/>"", not to
724something like
726((DESIGN QUESTION: What about e.g. in an IRI? Will that get converted to punycode, or not?))
731</section> <!-- dnsmapping -->
733<section title="Mapping query components" anchor="querymapping">
737For compatibility with existing deployed HTTP infrastructure,
738the following special case applies for schemes "http" and "https"
739and IRIs whose origin has a document charset other than one which
740is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query"
741component of an IRI is mapped into a URI by using the document
742charset rather than UTF-8 as the binary representation before
percent-encoding. This mapping is not applied for any other scheme
744or component.</t>
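<t>The special case above can be sketched in Python as follows. This is an
illustration under stated assumptions, not a normative procedure: the
function name map_query and the sample query are hypothetical, and the
set of characters left untouched is simply the URI "reserved" set plus
"%".</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote

# Characters left untouched: URI "reserved" characters plus "%", so that
# existing percent-encoded sequences are not encoded a second time.
SAFE = ":/?#[]@!$&'()*+,;=%"


def map_query(iquery: str, document_charset: str = "utf-8") -> str:
    """Percent-encode the query of an http/https IRI via the document charset."""
    return quote(iquery, safe=SAFE, encoding=document_charset)


print(map_query("q=caf\u00e9"))                # q=caf%C3%A9
print(map_query("q=caf\u00e9", "iso-8859-1"))  # q=caf%E9
```
]]></artwork></figure>

<t>The same query yields different URIs depending on the document
charset, which is why this mapping is confined to the "query"
component of "http" and "https" IRIs.</t>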
746</section> <!-- querymapping -->
748<section title="Mapping IRIs to URIs" anchor="mapping">
<t>The canonical mapping from an IRI to a URI is defined by applying the
mappings above (from IRI to URI components) and then reassembling a URI
752from the parsed URI components using the original punctuation that
753delimited the IRI components. </t>
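<t>A minimal Python sketch of this mapping is given below. It assumes that
the authority consists of a host name only (no userinfo or port), that
Python's built-in "idna" codec is an acceptable stand-in for the ToASCII
operation, and that all other components use UTF-8. The function name
iri_to_uri and the example host names are hypothetical. Note that
urlunsplit may drop an empty "?" or "#" delimiter, so the original
punctuation is not preserved in that corner case.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote, urlsplit, urlunsplit

# Reserved characters and "%" are left as they are.
SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Map each component, then reassemble with the usual delimiters."""
    parts = urlsplit(iri)
    # Assumption: the authority holds only a host name.
    # The host is converted with ToASCII, never percent-encoded.
    host = parts.netloc.encode("idna").decode("ascii")
    return urlunsplit((
        parts.scheme,
        host,
        quote(parts.path, safe=SAFE),      # UTF-8 octets, percent-encoded
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


print(iri_to_uri("http://r\u00e9sum\u00e9.example.org/ros\u00e9"))
```
]]></artwork></figure>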
755</section> <!-- mapping -->
757<section title="Converting URIs to IRIs" anchor="URItoIRI">
759<t>In some situations, for presentation and further processing,
760it is desirable to convert a URI into an equivalent IRI in which
761natural characters are represented directly rather than
762percent encoded. Of course, every URI is already an IRI in
its own right without any conversion.
This section gives one possible procedure for this conversion.</t>

<t>The conversion described in this section, if given a valid URI, will
769result in an IRI that maps back to the URI used as an input for the
770conversion (except for potential case differences in percent-encoding
771and for potential percent-encoded unreserved characters).
773However, the IRI resulting from this conversion may differ
774from the original IRI (if there ever was one).</t>
776<t>URI-to-IRI conversion removes percent-encodings, but not all
percent-encodings can be eliminated. There are several reasons for
this:</t>
780<t><list style="hanging">
782<t hangText="1.">Some percent-encodings are necessary to distinguish
783    percent-encoded and unencoded uses of reserved characters.</t>
785<t hangText="2.">Some percent-encodings cannot be interpreted as sequences
786    of UTF-8 octets.<vspace blankLines="1"/>
787    (Note: The octet patterns of UTF-8 are highly regular.
788    Therefore, there is a very high probability, but no guarantee,
789    that percent-encodings that can be interpreted as sequences of UTF-8
790    octets actually originated from UTF-8. For a detailed discussion,
791    see <xref target="Duerst97"/>.)</t>
793<t hangText="3.">The conversion may result in a character that is not
794    appropriate in an IRI. See <xref target="abnf"/>, <xref target="visual"/>,
795      and <xref target="limitations"/> for further details.</t>
797<t hangText="4.">IRI to URI conversion has different rules for
    dealing with domain names and query parameters.</t>
</list></t>

<t>Conversion from a URI to an IRI MAY be done by using the following
steps:
805<list style="hanging">
806<t hangText="1.">Represent the URI as a sequence of octets in
807       US-ASCII.</t>
809<t hangText="2.">Convert all percent-encodings ("%" followed by two
810      hexadecimal digits) to the corresponding octets, except those
811      corresponding to "%", characters in "reserved", and characters
812      in US-ASCII not allowed in URIs.</t>
814<t hangText="3.">Re-percent-encode any octet produced in step 2 that
815      is not part of a strictly legal UTF-8 octet sequence.</t>
818<t hangText="4.">Re-percent-encode all octets produced in step 3 that
819      in UTF-8 represent characters that are not appropriate according
820      to <xref target="abnf"/>, <xref target="visual"/>, and <xref
821      target="limitations"/>.</t>
823<t hangText="5.">Interpret the resulting octet sequence as a sequence
824      of characters encoded in UTF-8.</t>
826<t hangText="6.">URIs known to contain domain names in the reg-name
827      component SHOULD convert punycode-encoded domain name labels to
      the corresponding characters using the ToUnicode procedure. </t>
</list></t>

<t>This procedure will convert as many percent-encoded characters as
possible to characters in an IRI. Because there are some choices when
step 4 is applied (see <xref target="limitations"/>), results may
vary.</t>
836<t>Conversions from URIs to IRIs MUST NOT use any character
837encoding other than UTF-8 in steps 3 and 4, even if it might be
838possible to guess from the context that another character encoding
839than UTF-8 was used in the URI.  For example, the URI
840"" might with some guessing be
841interpreted to contain two e-acute characters encoded as
842iso-8859-1. It must not be converted to an IRI containing these
843e-acute characters. Otherwise, in the future the IRI will be mapped to
844"", which is a different
845URI from "".</t>
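<t>Steps 1 through 5 above can be sketched in Python as follows. This is
one possible reading, not a reference implementation. It approximates
"not appropriate" in step 4 by Unicode general category, which is an
assumption of this sketch, and it omits step 6 (ToUnicode). All names
and example URIs are hypothetical.</t>

<figure><artwork><![CDATA[
```python
import re
import unicodedata

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumption: step 4 is approximated by general category (controls,
# format characters such as bidi formatting, surrogates, private use,
# unassigned code points, and separators stay percent-encoded).
NOT_APPROPRIATE = frozenset({"Cc", "Cf", "Cs", "Co", "Cn", "Zs", "Zl", "Zp"})
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(octets: bytes) -> str:
    return "".join(f"%{octet:02X}" for octet in octets)


def _utf8_length(lead: int) -> int:
    """Length announced by a UTF-8 lead octet, or 0 if it is not one."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _convert_run(match: re.Match) -> str:
    octets = bytes.fromhex(match.group(0).replace("%", ""))  # step 2
    out, i = [], 0
    while i < len(octets):
        if octets[i] < 0x80:
            ch = chr(octets[i])
            # "%", reserved, and disallowed ASCII stay percent-encoded.
            out.append(ch if ch in UNRESERVED else _pct(octets[i:i + 1]))
            i += 1
            continue
        n = _utf8_length(octets[i])
        try:
            ch = octets[i:i + n].decode("utf-8") if n else ""  # step 5
        except UnicodeDecodeError:
            ch = ""
        if len(ch) != 1:
            out.append(_pct(octets[i:i + 1]))  # step 3: not legal UTF-8
            i += 1
        elif unicodedata.category(ch) in NOT_APPROPRIATE:
            out.append(_pct(octets[i:i + n]))  # step 4: keep encoded
            i += n
        else:
            out.append(ch)
            i += n
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    uri.encode("ascii")  # step 1: a URI is a sequence of US-ASCII octets
    return PCT_RUN.sub(_convert_run, uri)


print(uri_to_iri("http://www.example.org/D%C3%BCrst"))
print(uri_to_iri("http://www.example.org/D%FCrst"))
```
]]></artwork></figure>

<t>Because the sketch re-percent-encodes with uppercase hexadecimal
digits, the case of letters in percent-encodings is not necessarily
preserved.</t>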
847<section title="Examples">
849<t>This section shows various examples of converting URIs to IRIs.
850Each example shows the result after each of the steps 1 through 6 is
851applied. XML Notation is used for the final result.  Octets are
denoted by "&lt;" followed by two hexadecimal digits followed by
"&gt;".</t>
855<t>The following example contains the sequence "%C3%BC", which is a
856strictly legal UTF-8 sequence, and which is converted into the actual
character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
u-umlaut).
860<list style="hanging">
861<t hangText="1."></t>
862<t hangText="2.">;c3&gt;&lt;bc&gt;rst</t>
863<t hangText="3.">;c3&gt;&lt;bc&gt;rst</t>
864<t hangText="4.">;c3&gt;&lt;bc&gt;rst</t>
865<t hangText="5.">;#xFC;rst</t>
866<t hangText="6.">;#xFC;rst</t>
</list></t>

<t>The following example contains the sequence "%FC", which might
represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in
the<vspace/>iso-8859-1 character encoding.  (It might represent other
873characters in other character encodings. For example, the octet
874&lt;fc&gt; in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER
875KJE.)  Because &lt;fc&gt; is not part of a strictly legal UTF-8
876sequence, it is re-percent-encoded in step 3.
879<list style="hanging">
880<t hangText="1."></t>
881<t hangText="2.">;fc&gt;rst</t>
882<t hangText="3."></t>
883<t hangText="4."></t>
884<t hangText="5."></t>
885<t hangText="6."></t>
</list></t>

<t>The following example contains "%e2%80%ae", which is the percent-encoded<vspace/>UTF-8
890character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. <xref target="visual"/>
891forbids the direct use of this character in an IRI. Therefore, the
892corresponding octets are re-percent-encoded in step 4. This example shows
893that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved.
The example also contains a punycode-encoded domain name label (xn--99zt52a),
which is converted to the corresponding characters in step 6.
<list style="hanging">
<t hangText="1."></t>
<t hangText="2.">;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
<t hangText="3.">;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
<t hangText="4."></t>
<t hangText="5."></t>
<t hangText="6.">http://&amp;#x7D0D;&amp;#x8C46;</t>
</list></t>

<t>Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
(Japanese Natto).</t>
909</section> <!-- examples -->
910</section> <!-- URItoIRI -->
911</section> <!-- processing -->
912<section title="Bidirectional IRIs for Right-to-Left Languages" anchor="Bidi">
914<t>Some UCS characters, such as those used in the Arabic and Hebrew
915scripts, have an inherent right-to-left (rtl) writing direction. IRIs
916containing these characters (called bidirectional IRIs or Bidi IRIs)
917require additional attention because of the non-trivial relation
918between logical representation (used for digital representation and
for reading/spelling) and visual representation (used for
display/printing).</t>
922<t>Because of the complex interaction between the logical representation,
923the visual representation, and the syntax of a Bidi IRI, a balance is
924needed between various requirements.
925The main requirements are<list style="hanging">
926<t hangText="1.">user-predictable conversion between visual and
927    logical representation;</t>
928<t hangText="2.">the ability to include a wide range of characters
929    in various parts of the IRI; and</t>
930<t hangText="3.">minor or no changes or restrictions for
      implementations.</t>
</list></t>
934<section title="Logical Storage and Visual Presentation" anchor="visual">
936<t>When stored or transmitted in digital representation, bidirectional
937IRIs MUST be in full logical order and MUST conform to the IRI syntax
938rules (which includes the rules relevant to their scheme). This
939ensures that bidirectional IRIs can be processed in the same way as
940other IRIs.</t> <t>Bidirectional IRIs MUST be rendered by using the
941Unicode Bidirectional Algorithm <xref target="UNIV4"/>, <xref
942target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
943as they would be if they were in a left-to-right embedding; i.e., as
944if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
945followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
946embedding direction can also be done in a higher-level protocol (e.g.,
947the dir='ltr' attribute in HTML).</t>
949<t>There is no requirement to use the above embedding if the display
950is still the same without the embedding. For example, a bidirectional
951IRI in a text with left-to-right base directionality (such as used for
952English or Cyrillic) that is preceded and followed by whitespace and
953strong left-to-right characters does not need an embedding.  Also, a
954bidirectional relative IRI reference that only contains strong
955right-to-left characters and weak characters and that starts and ends
956with a strong right-to-left character and appears in a text with
957right-to-left base directionality (such as used for Arabic or Hebrew)
958and is preceded and followed by whitespace and strong characters does
959not need an embedding.</t>
961<t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
962sufficient to force the correct display behavior.  However, the
963details of the Unicode Bidirectional algorithm are not always easy to
964understand. Implementers are strongly advised to err on the side of
965caution and to use embedding in all cases where they are not
completely sure that the display behavior is unaffected without the
embedding.</t>
969<t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
9704.3) permits higher-level protocols to influence bidirectional
971rendering. Such changes by higher-level protocols MUST NOT be used if
972they change the rendering of IRIs.</t>
974<t>The bidirectional formatting characters that may be used before or
975after the IRI to ensure correct display are not themselves part of the
976IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
977RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
978the IRI but do not appear themselves. It would therefore not be
979possible to input an IRI with such characters correctly.</t>
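<t>The two rules above (left-to-right embedding for rendering, and no
bidirectional formatting characters inside the IRI itself) can be
sketched in Python as follows. The function name for_display and the
example IRI are hypothetical. The sketch covers only the case where the
embedding is expressed with LRE and PDF rather than by a higher-level
protocol.</t>

<figure><artwork><![CDATA[
```python
BIDI_FORMATTING = {
    "\u200e": "LRM", "\u200f": "RLM", "\u202a": "LRE", "\u202b": "RLE",
    "\u202c": "PDF", "\u202d": "LRO", "\u202e": "RLO",
}
LRE, PDF = "\u202a", "\u202c"


def for_display(iri: str) -> str:
    """Return the IRI wrapped in a left-to-right embedding for rendering."""
    found = sorted({BIDI_FORMATTING[ch] for ch in iri if ch in BIDI_FORMATTING})
    if found:
        raise ValueError("bidi formatting characters in IRI: " + ", ".join(found))
    # LRE and PDF surround the IRI for display; they are not part of it.
    return LRE + iri + PDF


wrapped = for_display("http://example.org/\u05d0\u05d1")
print(len(wrapped))
```
]]></artwork></figure>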
981</section> <!-- visual -->
982<section title="Bidi IRI Structure" anchor="bidi-structure">
984<t>The Unicode Bidirectional Algorithm is designed mainly for running
985text.  To make sure that it does not affect the rendering of
986bidirectional IRIs too much, some restrictions on bidirectional IRIs
987are necessary. These restrictions are given in terms of delimiters
988(structural characters, mostly punctuation such as "@", ".", ":",
989and<vspace/>"/") and components (usually consisting mostly of letters
990and digits).</t>
992<t>The following syntax rules from <xref target="abnf"/> correspond to
components for the purpose of Bidi behavior: iuserinfo, ireg-name,
isegment, isegment-nz, isegment-nz-nc, iquery, and
ifragment.</t>
997<t>Specifications that define the syntax of any of the above
998components MAY divide them further and define smaller parts to be
999components according to this document. As an example, the restrictions
1000of <xref target="RFC3490"/> on bidirectional domain names correspond
1001to treating each label of a domain name as a component for schemes
1002with ireg-name as a domain name.  Even where the components are not
1003defined formally, it may be helpful to think about some syntax in
1004terms of components and to apply the relevant restrictions.  For
1005example, for the usual name/value syntax in query parts, it is
1006convenient to treat each name and each value as a component. As
1007another example, the extensions in a resource name can be treated as
1008separate components.</t>
<t>For each component, the following restrictions apply:</t>

<t><list style="hanging">

<t hangText="1.">A component SHOULD NOT use both right-to-left and
  left-to-right characters.</t>

<t hangText="2.">A component using right-to-left characters SHOULD
  start and end with right-to-left characters.</t>
</list></t>
1022<t>The above restrictions are given as "SHOULD"s, rather than as
1023"MUST"s.  For IRIs that are never presented visually, they are not
1024relevant.  However, for IRIs in general, they are very important to
1025ensure consistent conversion between visual presentation and logical
1026representation, in both directions.</t>
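<t>The two restrictions on components can be checked mechanically. The
Python sketch below is an illustration only. It assumes that
"right-to-left characters" means characters of bidirectional class R or
AL and that "left-to-right characters" means class L, as reported by
the unicodedata module. The function name is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidirectional classes


def bidi_component_problems(component: str) -> list:
    """Return the restrictions that a single component violates."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    rtl = [cls in RTL_CLASSES for cls in classes]
    problems = []
    if any(rtl) and "L" in classes:
        problems.append("restriction 1: mixes rtl and ltr characters")
    if any(rtl) and not (rtl[0] and rtl[-1]):
        problems.append("restriction 2: does not start and end with rtl")
    return problems


print(bidi_component_problems("\u05d0\u05d1\u05d2"))    # []
print(bidi_component_problems("\u05d0\u05d112"))
```
]]></artwork></figure>

<t>The second call reports only restriction 2, because digits are
neither right-to-left nor left-to-right characters in this sense; this
corresponds to the case of numbers at the end of an rtl component.</t>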
1028<t><list style="hanging">
1030<t hangText="Note:">In some components, the above restrictions may
1031  actually be strictly enforced.  For example, <xref
1032  target="RFC3490"></xref> requires that these restrictions apply to
1033  the labels of a host name for those schemes where ireg-name is a
1034  host name.  In some other components (for example, path components)
1035  following these restrictions may not be too difficult.  For other
1036  components, such as parts of the query part, it may be very
1037  difficult to enforce the restrictions because the values of query
  parameters may be arbitrary character sequences.</t>
</list></t>
1042<t>If the above restrictions cannot be satisfied otherwise, the
1043affected component can always be mapped to URI notation as described
1044in <xref target="compmapping"/>. Please note that the whole component
1045has to be mapped (see also Example 9 below).</t>
1047</section> <!-- bidi-structure -->
1049<section title="Input of Bidi IRIs" anchor="bidiInput">
1051<t>Bidi input methods MUST generate Bidi IRIs in logical order while
1052rendering them according to <xref target="visual"/>.  During input,
1053rendering SHOULD be updated after every new character is input to
1054avoid end-user confusion.</t>
1056</section> <!-- bidiInput -->
1058<section title="Examples">
1060<t>This section gives examples of bidirectional IRIs, in Bidi
1061Notation.  It shows legal IRIs with the relationship between logical
1062and visual representation and explains how certain phenomena in this
1063relationship may look strange to somebody not familiar with
1064bidirectional behavior, but familiar to users of Arabic and Hebrew. It
1065also shows what happens if the restrictions given in <xref
1066target="bidi-structure"/> are not followed. The examples below can be
seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
variants.</t>
1070<t>To read the bidi text in the examples, read the visual
1071representation from left to right until you encounter a block of rtl
1072text. Read the rtl block (including slashes and other special
characters) from right to left, then continue at the next unread ltr
character.</t>
1076<t>Example 1: A single component with rtl characters is inverted:
1077<vspace/>Logical representation:
1078"http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
1079"http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
one by one, and each component can be read in its natural
direction.</t>
1083<t>Example 2: More than one consecutive component with rtl characters
1084is inverted as a whole: <vspace/>Logical representation:
1085"http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
1086"http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
1087components is read rtl, in the same way as a sequence of rtl words is
1088read rtl in a bidi text.</t>
1090<t>Example 3: All components of an IRI (except for the scheme) are
rtl.  All rtl components are inverted overall: <vspace/>Logical
representation:
"http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
1095whole IRI (except the scheme) is read rtl. Delimiters between rtl
1096components stay between the respective components; delimiters between
1097ltr and rtl components don't move.</t>
1099<t>Example 4: Each of several sequences of rtl components is inverted
1100on its own: <vspace/>Logical representation:
1101"http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
1102"http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
1103components is read rtl, in the same way as each sequence of rtl words
1104in an ltr text is read rtl.</t>
1106<t>Example 5: Example 2, applied to components of different kinds:
1107<vspace/>Logical representation: ""
1108<vspace/>Visual representation:
1109""<vspace/> The inversion of the domain
1110name label and the path component may be unexpected, but it is
1111consistent with other bidi behavior.  For reassurance that the domain
1112component really is "", it may be helpful to read aloud the
1113visual representation following the bidi algorithm. After
1114"" one reads the RTL block "E-F-slash-G-H", which
corresponds to the logical representation.</t>
1118<t>Example 6: Same as Example 5, with more rtl components:
1119<vspace/>Logical representation:
1120"http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
1121"http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
1122name labels and the path components may be easier to identify because
1123the delimiters also move.</t>
1125<t>Example 7: A single rtl component includes digits: <vspace/>Logical
1126representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
1127representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
1128Numbers are written ltr in all cases but are treated as an additional
1129embedding inside a run of rtl characters. This is completely
1130consistent with usual bidirectional text.</t>
1132<t>Example 8 (not allowed): Numbers are at the start or end of an rtl
1133component:<vspace/>Logical representation:
1134""<vspace/>Visual representation:
1135""<vspace/> The sequence "1/2" is
1136interpreted by the bidi algorithm as a fraction, fragmenting the
1137components and leading to confusion. There are other characters that
1138are interpreted in a special way close to numbers; in particular, "+",
1139"-", "#", "$", "%", ",", ".", and ":".</t>
1141<t>Example 9 (not allowed): The numbers in the previous example are
1142percent-encoded: <vspace/>Logical representation:
1143"",<vspace/>Visual representation:
1146<t>Example 10 (allowed but not recommended): <vspace/>Logical
1147representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
1148representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
1149Components consisting of only numbers are allowed (it would be rather
1150difficult to prohibit them), but these may interact with adjacent RTL
1151components in ways that are not easy to predict.</t>
1153<t>Example 11 (allowed but not recommended): <vspace/>Logical
1154representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
1155representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
1156Components consisting of numbers and left-to-right characters are
1157allowed, but these may interact with adjacent RTL components in ways
1158that are not easy to predict.</t>
1159</section><!-- examples -->
1160</section><!-- bidi -->
1162<section title="Normalization and Comparison" anchor="equivalence">
1164<t><list style="hanging"><t hangText="Note:">The structure and much of
1165  the material for this section is taken from section 6 of <xref
1166  target="RFC3986"></xref>; the differences are due to the specifics
1167  of IRIs.</t></list></t>
1169<t>One of the most common operations on IRIs is simple comparison:
1170Determining whether two IRIs are equivalent, without using the IRIs to
1171access their respective resource(s). A comparison is performed
1172whenever a response cache is accessed, a browser checks its history to
1173color a link, or an XML parser processes tags within a
1174namespace. Extensive normalization prior to comparison of IRIs may be
1175used by spiders and indexing engines to prune a search space or reduce
1176duplication of request actions and response storage.</t>
1178<t>IRI comparison is performed for some particular purpose. Protocols
1179or implementations that compare IRIs for different purposes will often
be subject to differing design trade-offs with regard to how much
effort should be spent in reducing aliased identifiers. This section
describes various methods that may be used to compare IRIs, the
trade-offs between them, and the types of applications that might use
them.</t>
1186<section title="Equivalence">
1188<t>Because IRIs exist to identify resources, presumably they should be
1189considered equivalent when they identify the same resource. However,
1190this definition of equivalence is not of much practical use, as there
1191is no way for an implementation to compare two resources to determine
1192if they are "the same" unless it has full knowledge or control of
1193them. For this reason, determination of equivalence or difference of
1194IRIs is based on string comparison, perhaps augmented by reference to
1195additional rules provided by URI scheme definitions.  We use the terms
1196"different" and "equivalent" to describe the possible outcomes of such
comparisons, but there are many application-dependent versions of
equivalence.</t>
1200<t>Even when it is possible to determine that two IRIs are equivalent,
1201IRI comparison is not sufficient to determine whether two IRIs
1202identify different resources. For example, an owner of two different
1203domain names could decide to serve the same resource from both,
1204resulting in two different IRIs. Therefore, comparison methods are
designed to minimize false negatives while strictly avoiding false
positives.</t>
1208<t>In testing for equivalence, applications should not directly
1209compare relative references; the references should be converted to
1210their respective target IRIs before comparison. When IRIs are compared
1211to select (or avoid) a network action, such as retrieval of a
1212representation, fragment components (if any) should be excluded from
1213the comparison.</t>
1215<t>Applications using IRIs as identity tokens with no relationship to
1216a protocol MUST use the Simple String Comparison (see <xref
1217target="stringcomp"></xref>).  All other applications MUST select one
of the comparison practices from the Comparison Ladder (see <xref
target="ladder"></xref>).</t>
1220</section> <!-- equivalence -->
1223<section title="Preparation for Comparison">
1224<t>Any kind of IRI comparison REQUIRES that any additional contextual
1225processing is first performed, including undoing higher-level
1226escapings or encodings in the protocol or format that carries an
IRI. This preprocessing is usually done when the protocol or format is
parsed.</t>
1230<t>Examples of contextual preprocessing steps are described in <xref
1231target="LEIRIHREF"/>. </t>
1233<t>Examples of such escapings or encodings are entities and
1234numeric character references in <xref target="HTML4"></xref> and <xref
1235target="XML1"></xref>. As an example,
1236";eacute;" (in HTML),
1237";#233;" (in HTML or XML), and
1238<vspace/>";#xE9;" (in HTML or XML) are all
1239resolved into what is denoted in this document (see <xref
1240target="sec-Notation"></xref>) as ";#xE9;"
1241(the "&amp;#xE9;" here standing for the actual e-acute character, to
compensate for the fact that this document cannot contain non-ASCII
characters).</t>
1245<t>Similar considerations apply to encodings such as Transfer Codings
1246in HTTP (see <xref target="RFC2616"></xref>) and Content Transfer
1247Encodings in MIME (<xref target="RFC2045"></xref>), although in these
1248cases, the encoding is based not on characters but on octets, and
1249additional care is required to make sure that characters, and not just
arbitrary octets, are compared (see <xref
target="stringcomp"></xref>).</t>
1253</section> <!-- preparation -->
1255<section title="Comparison Ladder" anchor="ladder">
1257<t>In practice, a variety of methods are used to test IRI
1258equivalence. These methods fall into a range distinguished by the
1259amount of processing required and the degree to which the probability
1260of false negatives is reduced. As noted above, false negatives cannot
1261be eliminated. In practice, their probability can be reduced, but this
reduction requires more processing and is not cost-effective for all
applications.</t>
1266<t>If this range of comparison practices is considered as a ladder,
1267the following discussion will climb the ladder, starting with
1268practices that are cheap but have a relatively higher chance of
1269producing false negatives, and proceeding to those that have higher
1270computational cost and lower risk of false negatives.</t>
1272<section title="Simple String Comparison" anchor="stringcomp">
1274<t>If two IRIs, when considered as character strings, are identical,
1275then it is safe to conclude that they are equivalent.  This type of
1276equivalence test has very low computational cost and is in wide use in
1277a variety of applications, particularly in the domain of parsing. It
1278is also used when a definitive answer to the question of IRI
1279equivalence is needed that is independent of the scheme used and that
1280can be calculated quickly and without accessing a network. An example
1281of such a case is XML Namespaces (<xref
1285<t>Testing strings for equivalence requires some basic precautions.
1286This procedure is often referred to as "bit-for-bit" or
1287"byte-for-byte" comparison, which is potentially misleading. Testing
1288strings for equality is normally based on pair comparison of the
1289characters that make up the strings, starting from the first and
1290proceeding until both strings are exhausted and all characters are
1291found to be equal, until a pair of characters compares unequal, or
1292until one of the strings is exhausted before the other.</t>
1294<t>This character comparison requires that each pair of characters be
1295put in comparable encoding form. For example, should one IRI be stored
1296in a byte array in UTF-8 encoding form and the second in a UTF-16
1297encoding form, bit-for-bit comparisons applied naively will produce
1298errors. It is better to speak of equality on a character-for-character
1299rather than on a byte-for-byte or bit-for-bit basis.  In practical
1300terms, character-by-character comparisons should be done codepoint by
1301codepoint after conversion to a common character encoding form.
1303When comparing character by character, the comparison function MUST
1304NOT map IRIs to URIs, because such a mapping would create additional
1305spurious equivalences. It follows that an IRI SHOULD NOT be modified
1306when being transported if there is any chance that this IRI might be
1307used in a context that uses Simple String Comparison.</t>
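<t>The precautions above can be illustrated with a short Python sketch.
It assumes that each stored IRI comes with a known character encoding;
the function name and the example IRI are hypothetical.</t>

<figure><artwork><![CDATA[
```python
def simple_string_equal(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    """Compare two stored IRIs code point by code point."""
    # Decode both to str (a sequence of code points) first. No IRI-to-URI
    # mapping and no other normalization is applied.
    return a.decode(a_enc) == b.decode(b_enc)


iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16")
as_uri = b"http://example.org/ros%C3%A9"

print(as_utf8 == as_utf16)                                        # False
print(simple_string_equal(as_utf8, "utf-8", as_utf16, "utf-16"))  # True
print(simple_string_equal(as_utf8, "utf-8", as_uri, "ascii"))     # False
```
]]></artwork></figure>

<t>The naive byte comparison reports a difference that does not exist at
the character level, while the percent-encoded form is correctly
reported as different under Simple String Comparison.</t>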
1310<t>False negatives are caused by the production and use of IRI
1311aliases. Unnecessary aliases can be reduced, regardless of the
1312comparison method, by consistently providing IRI references in an
1313already normalized form (i.e., a form identical to what would be
1314produced after normalization is applied, as described below).
1315Protocols and data formats often limit some IRI comparisons to simple
1316string comparison, based on the theory that people and implementations
1317will, in their own best interest, be consistent in providing IRI
1318references, or at least be consistent enough to negate any efficiency
1319that might be obtained from further normalization.</t>
1320</section> <!-- stringcomp -->
1322<section title="Syntax-Based Normalization">
1324<figure><preamble>Implementations may use logic based on the
1325definitions provided by this specification to reduce the probability
1326of false negatives. This processing is moderately higher in cost than
1327character-for-character string comparison. For example, an application
using this approach could reasonably consider the following two IRIs
equivalent:</preamble>
<artwork>
   example://a/b/c/%7Bfoo%7D/ros&amp;#xE9;
   eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
</artwork></figure>
1336<t>Web user agents, such as browsers, typically apply this type of IRI
1337normalization when determining whether a cached response is
1338available. Syntax-based normalization includes such techniques as case
1339normalization, character normalization, percent-encoding
1340normalization, and removal of dot-segments.</t>
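<t>As a rough illustration, the Python sketch below applies case
normalization, percent-encoding normalization, and removal of
dot-segments. It is not a complete normalizer. It assumes an authority
without userinfo, treats every decoded character from U+00A0 upwards as
unreserved, and does not perform character normalization. All function
names are hypothetical.</t>

<figure><artwork><![CDATA[
```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _normalize_pct(match: re.Match) -> str:
    """Decode unreserved characters; keep the rest with uppercase hex."""
    try:
        text = bytes.fromhex(match.group(0).replace("%", "")).decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0).upper()
    out = []
    for ch in text:
        # Assumption: any character from U+00A0 upwards is unreserved.
        if ch in UNRESERVED or ord(ch) >= 0xA0:
            out.append(ch)
        else:
            out.append("".join(f"%{o:02X}" for o in ch.encode("utf-8")))
    return "".join(out)


def remove_dot_segments(path: str) -> str:
    """Dot-segment removal following RFC 3986, Section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./") or path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            end = path.find("/", 1)
            end = len(path) if end == -1 else end
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def normalize(iri: str) -> str:
    parts = urlsplit(iri)  # urlsplit already lowercases the scheme
    netloc = parts.netloc
    if netloc.isascii():   # only a US-ASCII host is lowercased here
        netloc = netloc.lower()
    path = remove_dot_segments(PCT_RUN.sub(_normalize_pct, parts.path))
    query = PCT_RUN.sub(_normalize_pct, parts.query)
    fragment = PCT_RUN.sub(_normalize_pct, parts.fragment)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


a = normalize("example://a/b/c/%7Bfoo%7D/ros\u00e9")
b = normalize("eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9")
print(a == b)  # True
```
]]></artwork></figure>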
1342<section title="Case Normalization">
1344<t>For all IRIs, the hexadecimal digits within a percent-encoding
1345triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1346should be normalized to use uppercase letters for the digits A-F.</t>
1348<t>When an IRI uses components of the generic syntax, the component
1349syntax equivalence rules always apply; namely, that the scheme and
1350US-ASCII only host are case insensitive and therefore should be
1351normalized to lowercase. For example, the URI
1352"HTTP://" is equivalent to
1353"". Case equivalence for non-ASCII characters
in IRI components that are IDNs is discussed in <xref
1355target="schemecomp"></xref>.  The other generic syntax components are
1356assumed to be case sensitive unless specifically defined otherwise by
1357the scheme.</t>
1359<t>Creating schemes that allow case-insensitive syntax components
1360containing non-ASCII characters should be avoided. Case normalization
1361of non-ASCII characters can be culturally dependent and is always a
1362complex operation. The only exception concerns non-ASCII host names
1363for which the character normalization includes a mapping step derived
1364from case folding.</t>
1366</section> <!-- casenorm -->
1368<section title="Character Normalization" anchor="normalization">
1370<t>The Unicode Standard <xref target="UNIV4"></xref> defines various
1371equivalences between sequences of characters for various
1372purposes. Unicode Standard Annex #15 <xref target="UTR15"></xref>
1373defines various Normalization Forms for these equivalences, in
1374particular Normalization Form C (NFC, Canonical Decomposition,
1375followed by Canonical Composition) and Normalization Form KC (NFKC,
1376Compatibility Decomposition, followed by Canonical Composition).</t>
1378<t> IRIs already in Unicode MUST NOT be normalized before parsing or
1379interpreting. In many non-Unicode character encodings, some text
1380cannot be represented directly. For example, the word "Vietnam" is
natively written "Vi&amp;#x1EC7;t Nam" (containing a LATIN SMALL
LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct
transcoding from the windows-1258 character encoding leads to
1384"Vi&amp;#xEA;&amp;#x323;t Nam" (containing a LATIN SMALL LETTER E WITH
1385CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct transcoding of
other 8-bit encodings of Vietnamese may lead to other
representations.</t>
1389<t>Equivalence of IRIs MUST rely on the assumption that IRIs are
1390appropriately pre-character-normalized rather than apply character
1391normalization when comparing two IRIs. The exceptions are conversion
1392from a non-digital form, and conversion from a non-UCS-based character
1393encoding to a UCS-based character encoding. In these cases, NFC or a
1394normalizing transcoder using NFC MUST be used for interoperability. To
1395avoid false negatives and problems with transcoding, IRIs SHOULD be
1396created by using NFC. Using NFKC may avoid even more problems; for
1397example, by choosing half-width Latin letters instead of full-width
1398ones, and full-width instead of half-width Katakana.</t>
1401<t>As an example,
1402";#xE9;sum&amp;#xE9;.html" (in XML
Notation) is in NFC. On the other hand,
";#x301;sume&amp;#x301;.html" is not in
NFC.</t>
1407<t>The former uses precombined e-acute characters, and the latter uses
1408"e" characters followed by combining acute accents. Both usages are
1409defined as canonically equivalent in <xref target="UNIV4"></xref>.</t>
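<t>The difference between the two forms, and the effect of NFC, can be
observed with Python's unicodedata module. The sketch below is an
illustration only; the helper name to_nfc and the sample strings are
hypothetical. Such normalization is meant for the creation or
transcoding of an IRI, not for comparison.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def to_nfc(text: str) -> str:
    """Apply Normalization Form C, as when an IRI is created or transcoded."""
    return unicodedata.normalize("NFC", text)


precomposed = "r\u00e9sum\u00e9.html"   # e-acute as a single character
combining = "re\u0301sume\u0301.html"   # "e" plus COMBINING ACUTE ACCENT

print(precomposed == combining)          # False
print(to_nfc(combining) == precomposed)  # True

# U+00EA U+0323 (as produced by transcoding from windows-1258)
# composes into U+1EC7 under NFC.
print(to_nfc("Vi\u00ea\u0323t Nam") == "Vi\u1ec7t Nam")  # True
```
]]></artwork></figure>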
1411<t><list style="hanging">
1413<t hangText="Note:">
1414Because it is unknown how a particular sequence of characters is being
1415treated with respect to character normalization, it would be
1416inappropriate to allow third parties to normalize an IRI
1417arbitrarily. This does not contradict the recommendation that when a
1418resource is created, its IRI should be as character normalized as
1419possible (i.e., NFC or even NFKC). This is similar to the
1420uppercase/lowercase problems.  Some parts of a URI are case
1421insensitive (for example, the domain name). For others, it is unclear
1422whether they are case sensitive, case insensitive, or something in
1423between (e.g., case sensitive, but with a multiple choice selection if
1424the wrong case is used, instead of a direct negative result).  The
best recipe is that the creator use a reasonable capitalization and,
when transferring the URI, capitalization never be
changed.</t>
</list></t>
1429<t>Various IRI schemes may allow the usage of Internationalized Domain
1430Names (IDN) <xref target="RFC3490"></xref> either in the ireg-name
1431part or elsewhere. Character Normalization also applies to IDNs, as
1432discussed in <xref target="schemecomp"></xref>.</t>
1433</section> <!-- charnorm -->
1435<section title="Percent-Encoding Normalization">
1437<t>The percent-encoding mechanism (Section 2.1 of <xref
1438target="RFC3986"></xref>) is a frequent source of variance among
1439otherwise identical IRIs. In addition to the case normalization issue
1440noted above, some IRI producers percent-encode octets that do not
1441require percent-encoding, resulting in IRIs that are equivalent to
1442their nonencoded counterparts. These IRIs should be normalized by
1443decoding any percent-encoded octet sequence that corresponds to an
unreserved character, as described in Section 2.3 of <xref
target="RFC3986"></xref>.</t>
1447<t>For actual resolution, differences in percent-encoding (except for
1448the percent-encoding of reserved characters) MUST always result in the
1449same resource.  For example, "",
1450"", and "", must
1451resolve to the same resource.</t>
1453<t>If this kind of equivalence is to be tested, the percent-encoding
1454of both IRIs to be compared has to be aligned; for example, by
1455converting both IRIs to URIs (see Section 3.1), eliminating escape
1456differences in the resulting URIs, and making sure that the case of
1457the hexadecimal characters in the percent-encoding is always the same
1458(preferably upper case). If the IRI is to be passed to another
1459application or used further in some other way, its original form MUST
1460be preserved.  The conversion described here should be performed only
1461for local comparison.</t>
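<t>As an illustration of such a local comparison, the following
Python sketch aligns the percent-encoding of a string that is assumed
to have been converted to a URI already. The function name
normalize_pct and the example paths are hypothetical; the set of
unreserved characters is the one given in Section 2.3 of RFC 3986.</t>
<figure><artwork><![CDATA[
```python
import re

# Unreserved characters per RFC 3986, Section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct(uri: str) -> str:
    """Decode percent-encoded unreserved characters and upper-case the
    hexadecimal digits of every other percent-encoded octet."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                       # e.g. %7E becomes "~"
        return "%" + match.group(1).upper()   # e.g. %c3 becomes %C3
    return _PCT.sub(fix, uri)


# Hypothetical paths that differ only in their percent-encoding.
variants = ["/~smith/r%C3%A9sum%C3%A9", "/%7Esmith/r%c3%a9sum%c3%a9"]
print({normalize_pct(v) for v in variants})
```
]]></artwork></figure>
<t>The result is intended for comparison only; the original string is
kept unchanged for any further use.</t>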
1463</section> <!-- pctnorm -->
1465<section title="Path Segment Normalization">
1467<t>The complete path segments "." and ".." are intended only for use
1468within relative references (Section 4.1 of <xref
1469target="RFC3986"></xref>) and are removed as part of the reference
1470resolution process (Section 5.2 of <xref target="RFC3986"></xref>).
1471However, some implementations may incorrectly assume that reference
1472resolution is not necessary when the reference is already an IRI, and
1473thus fail to remove dot-segments when they occur in non-relative
1474paths.  IRI normalizers should remove dot-segments by applying the
1475remove_dot_segments algorithm to the path, as described in Section
14765.2.4 of <xref target="RFC3986"></xref>.</t>
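<t>For reference, a minimal Python sketch of the remove_dot_segments
algorithm of Section 5.2.4 of RFC 3986 is given below. It is one
possible transcription, written for readability rather than speed;
the step letters in the comments refer to the steps of that
section.</t>
<figure><artwork><![CDATA[
```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path (RFC 3986, 5.2.4)."""
    output = []
    while path:
        # Step A: drop a leading "../" or "./".
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: replace a leading "/./" or a final "/." by "/".
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: replace a leading "/../" or a final "/.." by "/"
        # and remove the last segment already moved to the output.
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a path consisting only of "." or ".." is dropped.
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (with its leading "/", if
        # any) to the output.
        else:
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
```
]]></artwork></figure>
<t>Because the algorithm only inspects "/" and "." characters, it
applies unchanged to paths containing non-ASCII characters.</t>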
1478</section> <!-- pathnorm -->
1479</section> <!-- ladder -->
1481<section title="Scheme-Based Normalization" anchor="schemecomp">
1483<t>The syntax and semantics of IRIs vary from scheme to scheme, as
1484described by the defining specification for each
1485scheme. Implementations may use scheme-specific rules, at further
1486processing cost, to reduce the probability of false negatives. For
1487example, because the "http" scheme makes use of an authority
1488component, has a default port of "80", and defines an empty path to be
1489equivalent to "/", the following four IRIs are equivalent:</t>
1497<t>In general, an IRI that uses the generic syntax for authority with
1498an empty path should be normalized to a path of "/". Likewise, an
1499explicit ":port", for which the port is empty or the default for the
1500scheme, is equivalent to one where the port and its ":" delimiter are
1501elided and thus should be removed by scheme-based normalization. For
1502example, the second IRI above is the normal form for the "http"
1505<t>Another case where normalization varies by scheme is in the
1506handling of an empty authority component or empty host
1507subcomponent. For many scheme specifications, an empty authority or
1508host is considered an error; for others, it is considered equivalent
1509to "localhost" or the end-user's host. When a scheme defines a default
1510for authority and an IRI reference to that default is desired, the
1511reference should be normalized to an empty authority for the sake of
1512uniformity, brevity, and internationalization. If, however, either the
1513userinfo or port subcomponents are non-empty, then the host should be
1514given explicitly even if it matches the default.</t>
1516<t>Normalization should not remove delimiters when their associated
1517component is empty unless it is licensed to do so by the scheme
1518specification. For example, the IRI "" cannot be
1519assumed to be equivalent to any of the examples above. Likewise, the
1520presence or absence of delimiters within a userinfo subcomponent is
1521usually significant to its interpretation.  The fragment component is
1522not subject to any scheme-based normalization; thus, two IRIs that
1523differ only by the suffix "#" are considered different regardless of
1524the scheme.</t>
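<t>The "http" rules discussed in this section can be sketched in
Python as follows. The function name normalize_http and the host
"example.com" are assumptions made for illustration. The sketch
handles only the empty path and the empty or default port, and it
deliberately keeps a bare "?" or "#" delimiter in place.</t>
<figure><artwork><![CDATA[
```python
DEFAULT_HTTP_PORT = "80"


def normalize_http(iri: str) -> str:
    """Apply two scheme-based normalizations for "http": an empty path
    becomes "/", and an empty or default port is removed."""
    # Split off fragment and query first so that their delimiters,
    # even when the component is empty, are carried over untouched.
    rest, hash_sep, fragment = iri.partition("#")
    rest, query_sep, query = rest.partition("?")
    scheme, sep, hier = rest.partition("://")
    if not sep or scheme.lower() != "http":
        return iri  # not an "http" IRI with an authority
    authority, _, path = hier.partition("/")
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", DEFAULT_HTTP_PORT):
        authority = host
    return ("http://" + authority + "/" + path
            + query_sep + query + hash_sep + fragment)


# Hypothetical examples: all four normalize to the same string.
for candidate in ("http://example.com", "http://example.com/",
                  "http://example.com:/", "http://example.com:80/"):
    print(normalize_http(candidate))
```
]]></artwork></figure>
<t>A production implementation would also need to treat userinfo and
IP literals with more care than this sketch does.</t>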
1527Some IRI schemes may allow the usage of Internationalized Domain
1528Names (IDN) <xref target="RFC3490"></xref> either in their ireg-name
1529part or elsewhere. When in use in IRIs, those names SHOULD be
1530validated by using the ToASCII operation defined in <xref
1531target="RFC3490"></xref>, with the flags "UseSTD3ASCIIRules" and
1532"AllowUnassigned". An IRI containing an invalid IDN cannot
1533successfully be resolved.  Validated IDN components of IRIs SHOULD be
1534character normalized by using the Nameprep process <xref
1535target="RFC3491"></xref>; however, for legibility purposes, they
1536SHOULD NOT be converted into ASCII Compatible Encoding (ACE).</t>
1538<t>Scheme-based normalization may also consider IDN
1539components and their conversions to punycode as equivalent. As an
1540example, "http://r&amp;#xE9;sum&amp;#xE9;" may be
1541considered equivalent to
1542"".</t><t>Other scheme-specific
1543normalizations are possible.</t>
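<t>As a rough illustration, Python's built-in "idna" codec (which
implements the IDNA 2003 ToASCII and ToUnicode operations) can be
used to compare a host name with its ACE form. The host name below
and the helper names to_ace and hosts_equivalent are assumptions made
for this sketch; the codec's handling of the flags mentioned above is
not examined here.</t>
<figure><artwork><![CDATA[
```python
# Hypothetical host name, assumed here purely for illustration.
unicode_host = "r\u00e9sum\u00e9.example.org"


def to_ace(host: str) -> str:
    """Convert each label of a host name to its ASCII Compatible
    Encoding using Python's built-in "idna" codec."""
    return host.encode("idna").decode("ascii")


def hosts_equivalent(a: str, b: str) -> bool:
    """Compare two host names after converting both to ACE form."""
    return to_ace(a) == to_ace(b)


ace_host = to_ace(unicode_host)
print(ace_host)
print(hosts_equivalent(unicode_host, ace_host))
```
]]></artwork></figure>
<t>The conversion is used for comparison only; for legibility, the
IRI itself keeps the non-ACE form.</t>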
1545</section> <!-- schemenorm -->
1547<section title="Protocol-Based Normalization">
1549<t>Substantial effort to reduce the incidence of false negatives is
1550often cost-effective for web spiders. Consequently, they implement
1551even more aggressive techniques in IRI comparison. For example, if
1552they observe that an IRI such as</t>
1556<t>redirects to an IRI differing only in the trailing slash</t>
1560<t>they will likely regard the two as equivalent in the future.  This
1561kind of technique is only appropriate when equivalence is clearly
1562indicated by both the result of accessing the resources and the common
1563conventions of their scheme's dereference algorithm (in this case, use
1564of redirection by HTTP origin servers to avoid problems with relative
1567</section> <!-- protonorm -->
1568</section> <!-- equivalence -->
1571<section title="Use of IRIs" anchor="IRIuse">
1573<section title="Limitations on UCS Characters Allowed in IRIs" anchor="limitations">
1575<t>This section discusses limitations on characters and character
1576sequences usable for IRIs beyond those given in <xref target="abnf"/>
1577and <xref target="visual"/>. The considerations in this section are
1578relevant when IRIs are created and when URIs are converted to
1583<list style="hanging"><t hangText="a.">The repertoire of characters allowed
1584    in each IRI component is limited by the definition of that component.
1585    For example, the definition of the scheme component does not allow
1586    characters beyond US-ASCII.
1587    <vspace blankLines="1"/>
1588    (Note: In accordance with URI practice, generic IRI
1589    software cannot and should not check for such limitations.)</t>
1591<t hangText="b.">The UCS contains many areas of characters for which
1592    there are strong visual look-alikes. Because of the likelihood of
1593    transcription errors, these also should be avoided. This includes
1594    the full-width equivalents of Latin characters, half-width
1595    Katakana characters for Japanese, and many others. It also
1596    includes many look-alikes of "space", "delims", and "unwise",
1597    characters excluded in <xref target="RFC3491"/>.</t>
1602<t>Additional information is available from <xref target="UNIXML"/>.
1603    <xref target="UNIXML"/> is written in the context of running text
1604    rather than in that of identifiers. Nevertheless, it discusses
1605    many of the categories of characters not appropriate for IRIs.</t>
1606</section> <!-- limitations -->
1608<section title="Software Interfaces and Protocols">
1610<t>Although an IRI is defined as a sequence of characters, software
1611interfaces for URIs typically function on sequences of octets or other
1612kinds of code units. Thus, software interfaces and protocols MUST
1613define which character encoding is used.</t>
1615<t>Intermediate software interfaces between IRI-capable components and
1616URI-only components MUST map the IRIs per <xref target="mapping"/>,
1617when transferring from IRI-capable to URI-only components.
1619This mapping SHOULD be applied as late as possible. It SHOULD NOT be
1620applied between components that are known to be able to handle IRIs.</t>
1621</section> <!-- software -->
1623<section title="Format of URIs and IRIs in Documents and Protocols">
1625<t>Document formats that transport URIs may have to be upgraded to allow
1626the transport of IRIs. In cases where the document as a whole
1627has a native character encoding, IRIs MUST also be encoded in this
1628character encoding and converted accordingly by a parser or interpreter.
1630IRI characters not expressible in the native character encoding SHOULD
1631be escaped by using the escaping conventions of the document format if
1632such conventions are available. Alternatively, they MAY be
1633percent-encoded according to <xref target="mapping"/>. For example, in
1634HTML or XML, numeric character references SHOULD be used. If a
1635document as a whole has a native character encoding and that character
1636encoding is not UTF-8, then IRIs MUST NOT be placed into the document
1637in the UTF-8 character encoding.</t>
1639<t>((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1640although they use different terminology. HTML 4.0 <xref
1641target="HTML4"/> defines the conversion from IRIs to URIs as
1642error-avoiding behavior. XML 1.0 <xref target="XML1"/>, XLink <xref
1643target="XLink"/>, XML Schema <xref target="XMLSchema"/>, and
1644specifications based upon them allow IRIs. Also, it is expected that
1645all relevant new W3C formats and protocols will be required to handle
1646IRIs <xref target="CharMod"/>.</t>
1648</section> <!-- format -->
1650<section title="Use of UTF-8 for Encoding Original Characters" anchor="UTF8use">
1652<t>This section discusses details and gives examples for point c) in
1653<xref target="Applicability"/>. To be able to use IRIs, the URI
1654corresponding to the IRI in question has to encode original characters
1655into octets by using UTF-8.  This can be specified for all URIs of a
1656URI scheme or can apply to individual URIs for schemes that do not
1657specify how to encode original characters.  It can apply to the whole
1658URI, or only to some part. For background information on encoding
characters into URIs, see also Section 2.5 of <xref
target="RFC3986"/>.</t>
1662<t>For new URI schemes, using UTF-8 is recommended in <xref
1663target="RFC4395"/>.  Examples where UTF-8 is already used are the URN
1664syntax <xref target="RFC2141"/>, IMAP URLs <xref target="RFC2192"/>,
1665and POP URLs <xref target="RFC2384"/>.  On the other hand, because the
1666HTTP URI scheme does not specify how to encode original characters,
1667only some HTTP URLs can have corresponding but different IRIs.</t>
1669<t>For example, for a document with a URI
1670of<vspace/>"", it is
1671possible to construct a corresponding IRI (in XML notation, see <xref
1673";#xE9;sum&amp;#xE9;.html" ("&amp;#xE9;"
1674stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded
1675and percent-encoded representation of that character). On the other
1676hand, for a document with a URI of
1677"", the percent-encoding octets
1678cannot be converted to actual characters in an IRI, as the
1679percent-encoding is not based on UTF-8.</t>
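<t>The difference between the two cases can be sketched in Python as
follows. The helper name pct_to_chars and the file names are
hypothetical. The sketch decodes every percent-encoded octet, which
is a simplification: an actual URI-to-IRI conversion leaves some
octets percent-encoded.</t>
<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def pct_to_chars(segment: str):
    """Return the segment with its percent-encoded octets decoded as
    UTF-8, or None if those octets are not valid UTF-8."""
    try:
        return unquote_to_bytes(segment).decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# Percent-encoding based on UTF-8: the characters can be recovered.
print(pct_to_chars("r%C3%A9sum%C3%A9.html"))
# Percent-encoding based on iso-8859-1: no conversion is possible.
print(pct_to_chars("r%E9sum%E9.html"))
```
]]></artwork></figure>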
1681<t>For most URI schemes, there is no need to upgrade their scheme
1682definition in order for them to work with IRIs.  The main case where
1683upgrading makes sense is when a scheme definition, or a particular
1684component of a scheme, is strictly limited to the use of US-ASCII
1685characters with no provision to include non-ASCII characters/octets
1686via percent-encoding, or if a scheme definition currently uses highly
1687scheme-specific provisions for the encoding of non-ASCII characters.
1688An example of this is the mailto: scheme <xref target="RFC2368"/>.</t>
<t>This specification updates the IANA registry of URI schemes to note
their applicability to IRIs; see <xref target="iana"/>.  All IRIs use
1692URI schemes, and all URIs with URI schemes can be used as IRIs, even
1693though in some cases only by using URIs directly as IRIs, without any
1696<t>Scheme definitions can impose restrictions on the syntax of
1697scheme-specific URIs; i.e., URIs that are admissible under the generic
1698URI syntax <xref target="RFC3986"/> may not be admissible due to
1699narrower syntactic constraints imposed by a URI scheme
1700specification. URI scheme definitions cannot broaden the syntactic
1701restrictions of the generic URI syntax; otherwise, it would be
1702possible to generate URIs that satisfied the scheme-specific syntactic
1703constraints without satisfying the syntactic constraints of the
1704generic URI syntax. However, additional syntactic constraints imposed
by URI scheme specifications are applicable to IRIs, as the
1706corresponding URI resulting from the mapping defined in <xref
1707target="mapping"/> MUST be a valid URI under the syntactic
1708restrictions of generic URI syntax and any narrower restrictions
1709imposed by the corresponding URI scheme specification.</t>
1711<t>The requirement for the use of UTF-8 generally applies to all parts
1712of a URI.  However, it is possible that the capability of IRIs to
1713represent a wide range of characters directly is used just in some
1714parts of the IRI (or IRI reference). The other parts of the IRI may
1715only contain US-ASCII characters, or they may not be based on
1716UTF-8. They may be based on another character encoding, or they may
1717directly encode raw binary data (see also <xref
1718target="RFC2397"/>). </t>
1720<t>For example, it is possible to have a URI reference
1722where the document name is encoded in iso-8859-1 based on server
1723settings, but where the fragment identifier is encoded in UTF-8 according
1724to <xref target="XPointer"/>. The IRI corresponding to the above
1725URI would be (in XML notation)<vspace/>";#xE9;sum&amp;#xE9;".</t>
1727<t>Similar considerations apply to query parts. The functionality
1728of IRIs (namely, to be able to include non-ASCII characters) can
1729only be used if the query part is encoded in UTF-8.</t>
1731</section> <!-- utf8 -->
1733<section title="Relative IRI References">
1734<t>Processing of relative IRI references against a base is handled
1735straightforwardly; the algorithms of <xref target="RFC3986"/> can
1736be applied directly, treating the characters additionally allowed
1737in IRI references in the same way that unreserved characters are in URI
1740</section> <!-- relative -->
1741</section> <!-- IRIuse -->
1743<section title="Liberal handling of otherwise invalid IRIs" anchor="LEIRIHREF">
1745<t>(EDITOR NOTE: This Section may move to an appendix.)
1747Some technical specifications and widely-deployed software have
1748allowed additional variations and extensions of IRIs to be used in
1749syntactic components. This section describes two widely-used
1750preprocessing agreements. Other technical specifications may wish to
1751reference a syntactic component which is "a valid IRI or a string that
1752will map to a valid IRI after this preprocessing algorithm". These two
variants are known as <xref target="LEIRI">Legacy Extended IRI or
LEIRI</xref> and <xref target="HTML5">Web Address</xref>.</t>
1757<t>Future technical specifications SHOULD NOT allow conforming
1758producers to produce, or conforming content to contain, such forms,
1759as they are not interoperable with other IRI consuming software.</t>
1761<section title="LEIRI processing"  anchor="LEIRIspec">
1762  <t>This section defines Legacy Extended IRIs (LEIRIs).
1763    The syntax of Legacy Extended IRIs is the same as that for IRIs,
1764    except that the ucschar production is replaced by the leiri-ucschar production:</t>
  leiri-ucschar  = " " / "&lt;" / "&gt;" / '"' / "{" / "}" / "|"
                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
                   / %xE000-FFFD / %x10000-10FFFF
1774  Among other extensions, processors based on this specification also
1775  did not enforce the restriction on bidirectional formatting
1776  characters in <xref target="visual"></xref>, and the iprivate
1777  production becomes redundant.</postamble>
1780<t>To convert a string allowed as a LEIRI to an IRI, each character
1781allowed in leiri-ucschar but not in ucschar must be percent-encoded
1782using <xref target="compmapping"/>.</t>
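<t>A minimal Python sketch of this conversion follows. It assumes
the ucschar ranges of RFC 3987 and percent-encodes the UTF-8 octets
of each affected character; the function and helper names are
hypothetical.</t>
<figure><artwork><![CDATA[
```python
# ASCII characters that leiri-ucschar adds beyond the IRI syntax.
_ASCII_EXTRA = frozenset(' <>"{}|\\^`')


def _in_ucschar(cp: int) -> bool:
    """ucschar as defined in RFC 3987 (assumed unchanged here)."""
    if (0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF
            or 0xFDF0 <= cp <= 0xFFEF):
        return True
    # Planes 1 to 13: all but the last two code points of each plane.
    if 0x10000 <= cp <= 0xDFFFD:
        return (cp & 0xFFFF) <= 0xFFFD
    return 0xE1000 <= cp <= 0xEFFFD


def _in_leiri_ucschar(ch: str) -> bool:
    cp = ord(ch)
    return (ch in _ASCII_EXTRA or cp <= 0x1F or 0x7F <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def leiri_to_iri(leiri: str) -> str:
    """Percent-encode each character that is allowed in leiri-ucschar
    but not in ucschar; leave every other character untouched."""
    out = []
    for ch in leiri:
        if _in_leiri_ucschar(ch) and not _in_ucschar(ord(ch)):
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(leiri_to_iri("my r\u00e9sum\u00e9 {draft}.html"))
```
]]></artwork></figure>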
1783</section> <!-- leiriproc -->
1785<section title="Web Address processing" anchor="webaddress">
1787<t>Many popular web browsers have taken the approach of being quite
1788liberal in what is accepted as a "URL" or its relative
1789forms. This section describes their behavior in terms of a preprocessor
1790which maps strings into the IRI space for subsequent parsing and
1791interpretation as an IRI.</t>
1793<t>In some situations, it might be appropriate to describe the syntax
1794that a liberal consumer implementation might accept as a "Web
1795Address" or "Hypertext Reference" or "HREF". However,
1796technical specifications SHOULD restrict the syntactic form allowed by compliant producers
1797to the IRI or IRI reference syntax defined in this document
1798even if they want to mandate this processing.</t>
1802<list style="symbols">
1803   <t>Leading and trailing whitespace is removed.</t>
1804   <t>Some additional characters are removed.</t>
1805   <t>Some additional characters are allowed and escaped (as with LEIRI).</t>
1806   <t>If interpreting an IRI as a URI, the pct-encoding of the query
1807   component of the parsed URI component depends on operational
1808   context.</t>
<t>Each string provided may have an associated charset (called
the HREF-charset here); this defaults to UTF-8.
For web browsers interpreting HTML, the HREF-charset
of a string is determined as follows:
<list style="hanging">
<t hangText="If the string came from a script (e.g. as an argument to
 a method)">The HREF-charset is the script's charset.</t>
<t hangText="If the string came from a DOM node (e.g. from an
  element)">The node has a Document, and the HREF-charset is the
  Document's character encoding.</t>
<t hangText="If the string had an HREF-charset defined when the string was
created or defined">The HREF-charset is as defined.</t>
</list></t>
<t>If the resulting HREF-charset is a Unicode-based character encoding
(e.g., UTF-16), then use UTF-8 instead.</t>
1835<preamble>The syntax for Web Addresses is obtained by replacing the 'ucschar',
1836  pct-form, and path-sep rules with the href-ucschar, href-pct-form, and href-path-sep
1837  rules below. In addition, some characters are stripped.</preamble>
  href-ucschar  = " " / "&lt;" / "&gt;" / '"' / "{" / "}" / "|"
                  / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
                  / %xE000-FFFD / %x10000-10FFFF
  href-pct-form = pct-encoded / "%"
  href-path-sep = "/" / "\"
  href-strip    = &lt;to be done&gt;
1850browsers did not enforce the restriction on bidirectional formatting
1851  characters in <xref target="visual"></xref>, and the iprivate
1852  production becomes redundant.</postamble>
<t>'Web Address processing' requires the following additional
preprocessing steps:
<list style="numbers">
<t>Leading and trailing instances of space (U+0020),
CR (U+000D), LF (U+000A), and TAB (U+0009) characters are removed.</t>
<t>Strip all characters in href-strip.</t>
<t>Percent-encode all characters in href-ucschar not in ucschar.</t>
<t>Replace occurrences of "%" not followed by two hexadecimal digits by "%25".</t>
<t>Convert backslashes ('\') matching href-path-sep to forward slashes ('/').</t>
</list></t>
1868</section> <!-- webaddress -->
1870<section title="Characters not allowed in IRIs" anchor="notAllowed">
<t>This section provides a list of the groups of characters and code
points that are allowed by LEIRI or HREF but are not allowed in IRIs or are
allowed in IRIs only in the query part. For each group of characters,
advice on the usage of these characters is also given, concentrating
on the reasons why they are excluded from IRI use.</t>
<list><t>Space (U+0020): Some formats and applications use space as a
delimiter, e.g. for items in a list. Appendix C of <xref
target="RFC3986"></xref> also mentions that white space may have to be
added when displaying or printing long URIs; the same applies to long
IRIs. This means that spaces can disappear, or can cause what is
intended as a single IRI or IRI reference to be treated as two or more
separate IRIs.</t>
1888<t>Delimiters "&lt;" (U+003C), "&gt;" (U+003E), and '"' (U+0022):
1889Appendix C of <xref target="RFC3986"></xref> suggests the use of
double-quotes ("") and angle brackets
(&lt;&gt;) as delimiters for URIs in plain
1892text. These conventions are often used, and also apply to IRIs.  Using
1893these characters in strings intended to be IRIs would result in the
1894IRIs being cut off at the wrong place.</t>
1896<t>Unwise characters "\" (U+005C), "^" (U+005E), "`"
1897(U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These
characters were originally excluded from URIs because the
respective code points are assigned to different graphic characters in
some 7-bit or 8-bit encodings. Despite the move to Unicode, some of
1901these characters are still occasionally displayed differently on some
1902systems, e.g. U+005C may appear as a Japanese Yen symbol on some
1903systems. Also, the fact that these characters are not used in URIs or
1904IRIs has encouraged their use outside URIs or IRIs in contexts that
1905may include URIs or IRIs. If a string with such a character were used
1906as an IRI in such a context, it would likely be interpreted
<t>The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F, #x7F -
#x9F): There is generally no way to transmit these characters reliably
as text outside of a charset encoding.  Even when in encoded form,
many software components silently filter out some of these characters,
or may stop processing altogether when encountering some of
them. These characters may affect text display in subtle, unnoticeable
ways or in drastic, global, and irreversible ways depending on the
hardware and software involved. The use of some of these characters
would allow malicious users to manipulate the display of an IRI and
its context in many situations.</t>
<t>Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
characters affect the display ordering of characters. If IRIs were
allowed to contain these characters and the resulting visual display
transcribed, they could not be converted back to electronic form
(logical order) unambiguously. These characters, if allowed in IRIs,
might allow malicious users to manipulate the display of an IRI and its
context.</t>
1928<t>Specials (U+FFF0-FFFD): These code points provide functionality
1929beyond that useful in an IRI, for example byte order identification,
1930annotation, and replacements for unknown characters and objects. Their
1931use and interpretation in an IRI would serve no purpose and might lead
1932to confusing display variations.</t>
1934<t>Private use code points (U+E000-F8FF, U+F0000-FFFFD,
1935U+100000-10FFFD): Display and interpretation of these code points is
1936by definition undefined without private agreement. Therefore, these
1937code points are not suited for use on the Internet. They are not
1938interoperable and may have unpredictable effects.</t>
1940<t>Tags (U+E0000-E0FFF): These characters provide a way to language
1941tag in Unicode plain text. They are not appropriate for IRIs because
1942language information in identifiers cannot reliably be input,
1943transmitted (e.g. on a visual medium such as paper), or
1946<t>Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1950U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1951non-characters. Applications may use some of them internally, but are
1952not prepared to interchange them.</t>
1956<t>LEIRI preprocessing disallowed some code points and
1957code units:
1959<list><t>Surrogate code units (D800-DFFF): These do not represent
1960Unicode codepoints.</t></list></t>
1961</section> <!-- notallowed -->
1962</section> <!-- lieirihref -->
1964<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
1966<t>This informative section provides guidelines for supporting IRIs in
1967the same software components and operations that currently process
1968URIs: Software interfaces that handle URIs, software that allows users
1969to enter URIs, software that creates or generates URIs, software that
1970displays URIs, formats and protocols that transport URIs, and software
1971that interprets URIs. These may all require modification before
1972functioning properly with IRIs. The considerations in this section
1973also apply to URI references and IRI references.</t>
1975<section title="URI/IRI Software Interfaces">
1976<t>Software interfaces that handle URIs, such as URI-handling APIs and
1977protocols transferring URIs, need interfaces and protocol elements
1978that are designed to carry IRIs.</t>
1980<t>In case the current handling in an API or protocol is based on
1981US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
1982it is compatible with US-ASCII, is in accordance with the
1983recommendations of <xref target="RFC2277"/>, and makes converting to
1984URIs easy. In any case, the API or protocol definition must clearly
1985define the character encoding to be used.</t>
1987<t>The transfer from URI-only to IRI-capable components requires no
1988mapping, although the conversion described in <xref
1989target="URItoIRI"/> above may be performed. It is preferable not to
1990perform this inverse conversion unless it is certain this can be done
1994<section title="URI/IRI Entry">
1996<t>Some components allow users to enter URIs into the system
1997by typing or dictation, for example. This software must be updated to allow
1998for IRI entry.</t>
2000<t>A person viewing a visual representation of an IRI (as a sequence
2001of glyphs, in some order, in some visual display) or hearing an IRI
2002will use an entry method for characters in the user's language to
2003input the IRI. Depending on the script and the input method used, this
2004may be a more or less complicated process.</t>
2006<t>The process of IRI entry must ensure, as much as possible, that the
2007restrictions defined in <xref target="abnf"/> are met. This may be
2008done by choosing appropriate input methods or variants/settings
2009thereof, by appropriately converting the characters being input, by
2010eliminating characters that cannot be converted, and/or by issuing a
2011warning or error message to the user.</t>
<t>As an example of variant settings, input method editors for East
Asian languages usually allow the input of Latin letters and related
characters in full-width or half-width versions. For IRI input, the
input method editor should be set so that it produces half-width Latin
letters and punctuation and full-width Katakana.</t>
2019<t>An input field primarily or solely used for the input of URIs/IRIs
2020might allow the user to view an IRI as it is mapped to a URI.  Places
2021where the input of IRIs is frequent may provide the possibility for
2022viewing an IRI as mapped to a URI. This will help users when some of
2023the software they use does not yet accept IRIs.</t>
2025<t>An IRI input component interfacing to components that handle URIs,
2026but not IRIs, must map the IRI to a URI before passing it to these
2029<t>For the input of IRIs with right-to-left characters, please see
2030<xref target="bidiInput"></xref>.</t>
2033<section title="URI/IRI Transfer between Applications">
2035<t>Many applications (for example, mail user agents) try to detect
2036URIs appearing in plain text. For this, they use some heuristics based
2037on URI syntax. They then allow the user to click on such URIs and
2038retrieve the corresponding resource in an appropriate (usually
2039scheme-dependent) application.</t>
2041<t>Such applications would need to be upgraded, in order to use the
2042IRI syntax as a base for heuristics. In particular, a non-ASCII
2043character should not be taken as the indication of the end of an IRI.
2044Such applications also would need to make sure that they correctly
2045convert the detected IRI from the character encoding of the document
2046or application where the IRI appears, to the character encoding used
2047by the system-wide IRI invocation mechanism, or to a URI (according to
2048<xref target="mapping"/>) if the system-wide invocation mechanism only
2049accepts URIs.</t>
2051<t>The clipboard is another frequently used way to transfer URIs and
2052IRIs from one application to another. On most platforms, the clipboard
2053is able to store and transfer text in many languages and scripts.
2054Correctly used, the clipboard transfers characters, not octets, which
2055will do the right thing with IRIs.</t>
2058<section title="URI/IRI Generation">
2060<t>Systems that offer resources through the Internet, where those
2061resources have logical names, sometimes automatically generate URIs
2062for the resources they offer. For example, some HTTP servers can
2063generate a directory listing for a file directory and then respond to
2064the generated URIs with the files.</t>
2066<t>Many legacy character encodings are in use in various file systems.
2067Many currently deployed systems do not transform the local character
2068representation of the underlying system before generating URIs.</t>
2070<t>For maximum interoperability, systems that generate resource
2071identifiers should make the appropriate transformations. For example,
2072if a file system contains a file named
2073"r&amp;#xE9;sum&amp;#xE9;.html", a server should expose this as
2074"r%C3%A9sum%C3%A9.html" in a URI, which allows use of
2075"r&amp;#xE9;sum&amp;#xE9;.html" in an IRI, even if locally the file
2076name is kept in a character encoding other than UTF-8.
2079<t>This recommendation particularly applies to HTTP servers. For FTP
2080servers, similar considerations apply; see <xref target="RFC2640"/>.</t>
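<t>The file name example above might be implemented along the
following lines. This is a sketch, not a definitive implementation:
the function name and the choice of ISO-8859-1 as the local file
system encoding are assumptions made for illustration, and the
normalization to NFC is an added precaution.</t>
<figure><artwork><![CDATA[
```python
import unicodedata
from urllib.parse import quote

def file_name_to_uri_segment(raw_name: bytes, fs_encoding: str) -> str:
    """Convert a locally encoded file name to a URI path segment."""
    # Decode from the local (legacy) character encoding to characters.
    name = raw_name.decode(fs_encoding)
    # Normalize so that equivalent names yield the same octets.
    name = unicodedata.normalize("NFC", name)
    # Encode as UTF-8 and percent-encode everything but unreserved.
    return quote(name, safe="")

# A name stored in ISO-8859-1 (assumed for this example).
print(file_name_to_uri_segment(b"r\xe9sum\xe9.html", "iso-8859-1"))
```
]]></artwork></figure>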
2083<section title="URI/IRI Selection" anchor="selection">
2084<t>In some cases, resource owners and publishers have control over the
2085IRIs used to identify their resources. This control is mostly
executed by controlling the resource names, such as file names,
directly.</t>
2089<t>In these cases, it is recommended to avoid choosing IRIs that are
2090easily confused. For example, for US-ASCII, the lower-case ell ("l") is
2091easily confused with the digit one ("1"), and the upper-case oh ("O") is
2092easily confused with the digit zero ("0"). Publishers should avoid
2093confusing users with "br0ken" or "1ame" identifiers.</t>
2095<t>Outside the US-ASCII repertoire, there are many more opportunities for
2096confusion; a complete set of guidelines is too lengthy to include
2097here. As long as names are limited to characters from a single script,
2098native writers of a given script or language will know best when
2099ambiguities can appear, and how they can be avoided. What may look
2100ambiguous to a stranger may be completely obvious to the average
2101native user. On the other hand, in some cases, the UCS contains
2102variants for compatibility reasons; for example, for typographic purposes.
2103These should be avoided wherever possible. Although there may be exceptions,
2104newly created resource names should generally be in NFKC
2105<xref target="UTR15"></xref> (which means that they are also in NFC).</t>
2107<t>As an example, the UCS contains the "fi" ligature at U+FB01
2108for compatibility reasons.
2109Wherever possible, IRIs should use the two letters "f" and "i" rather
2110than the "fi" ligature. An example where the latter may be used is
2111in the query part of an IRI for an explicit search for a word written
2112containing the "fi" ligature.</t>
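<t>A publisher could check newly created resource names against NFKC
before exposing them. The following sketch uses hypothetical function
names and shows how the "fi" ligature is mapped to the two letters
"f" and "i".</t>
<figure><artwork><![CDATA[
```python
import unicodedata

def to_nfkc(name):
    """Return the NFKC form of a resource name."""
    return unicodedata.normalize("NFKC", name)

def is_nfkc(name):
    """True if the name is already in NFKC."""
    return to_nfkc(name) == name

ligature_name = "\ufb01le.html"   # starts with U+FB01
print(is_nfkc(ligature_name))     # the ligature is a compatibility variant
print(to_nfkc(ligature_name))     # two separate letters "f" and "i"
```
]]></artwork></figure>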
2114<t>In certain cases, there is a chance that characters from different
2115scripts look the same. The best known example is the similarity of the
2116Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid such
2117cases, IRIs should only be created where all the characters in a
2118single component are used together in a given language. This usually
2119means that all of these characters will be from the same script, but
2120there are languages that mix characters from different scripts (such
2121as Japanese).  This is similar to the heuristics used to distinguish
2122between letters and numbers in the examples above. Also, for Latin,
2123Greek, and Cyrillic, using lowercase letters results in fewer
2124ambiguities than using uppercase letters would.</t>
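<t>One rough way to find components that mix scripts is sketched
below. It rests on an assumption that is not part of this
specification: the first word of a character's Unicode name is used as
a stand-in for its script. Because some languages (such as Japanese)
legitimately mix scripts, a result of this kind is a hint for human
review, not grounds for rejection. The function names are
hypothetical.</t>
<figure><artwork><![CDATA[
```python
import unicodedata

def scripts_in(component):
    """Approximate the scripts used by the letters of one component."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            # Assumption: the first word of the character name
            # ("LATIN", "GREEK", "CYRILLIC", ...) stands for the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def needs_review(component):
    """Flag components whose letters come from more than one script."""
    return len(scripts_in(component)) > 1

# Latin "A", Greek "Alpha", and Cyrillic "A" look alike but differ.
print(sorted(scripts_in("A\u0391\u0410")))
```
]]></artwork></figure>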
2127<section title="Display of URIs/IRIs" anchor="display">
<t>In situations where the rendering software is not expected to display
2130non-ASCII parts of the IRI correctly using the available layout and font
2131resources, these parts should be percent-encoded before being displayed.</t>
2133<t>For display of Bidi IRIs, please see <xref target="visual"/>.</t>
2136<section title="Interpretation of URIs and IRIs">
2137<t>Software that interprets IRIs as the names of local resources should
2138accept IRIs in multiple forms and convert and match them with the
2139appropriate local resource names.</t>
2141<t>First, multiple representations include both IRIs in the native
2142character encoding of the protocol and also their URI counterparts.</t>
2144<t>Second, it may include URIs constructed based on character
2145encodings other than UTF-8. These URIs may be produced by user agents that do
2146not conform to this specification and that use legacy character encodings to
2147convert non-ASCII characters to URIs. Whether this is necessary, and what
2148character encodings to cover, depends on a number of factors, such as
2149the legacy character encodings used locally and the distribution of
2150various versions of user agents. For example, software for Japanese
2151may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.</t>
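<t>A server that accepts several encodings might try UTF-8 first and
fall back to legacy encodings only when the octets are not valid
UTF-8. The following sketch assumes a Japanese deployment; the
function name and the order of the legacy encodings are illustrative
assumptions.</t>
<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes

def decode_path(uri_path, legacy_encodings=("shift_jis", "euc_jp")):
    """Return (text, encoding) for a percent-encoded path."""
    raw = unquote_to_bytes(uri_path)
    # UTF-8 is tried first; its regular structure makes it unlikely
    # that text in a legacy encoding is also valid UTF-8.
    for encoding in ("utf-8", *legacy_encodings):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None

print(decode_path("%E6%97%A5%E6%9C%AC"))  # UTF-8 octets
print(decode_path("%93%FA%96%7B"))        # Shift_JIS octets
```
]]></artwork></figure>
<t>The decoded text can then be matched against local resource names,
whatever encoding the file system itself uses.</t>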
2153<t>Third, it may include additional mappings to be more user-friendly
2154and robust against transmission errors. These would be similar to how
2155some servers currently treat URIs as case insensitive or perform
2156additional matching to account for spelling errors. For characters
2157beyond the US-ASCII repertoire, this may, for example, include
2158ignoring the accents on received IRIs or resource names. Please note
that such mappings, including case mappings, are language
dependent.</t>
2162<t>It can be difficult to identify a resource unambiguously if too
2163many mappings are taken into consideration. However, percent-encoded
2164and not percent-encoded parts of IRIs can always be clearly distinguished.
2165Also, the regularity of UTF-8 (see <xref target="Duerst97"/>) makes the
2166potential for collisions lower than it may seem at first.</t>
2169<section title="Upgrading Strategy">
2170<t>Where this recommendation places further constraints on software
2171for which many instances are already deployed, it is important to
introduce upgrades carefully and to be aware of the various
interdependencies.</t>
2175<t>If IRIs cannot be interpreted correctly, they should not be created,
2176generated, or transported. This suggests that upgrading URI interpreting
2177software to accept IRIs should have highest priority.</t>
2179<t>On the other hand, a single IRI is interpreted only by a single or
2180very few interpreters that are known in advance, although it may be
2181entered and transported very widely.</t>
2183<t>Therefore, IRIs benefit most from a broad upgrade of software to be
2184able to enter and transport IRIs. However, before an
2185individual IRI is published, care should be taken to upgrade the corresponding
2186interpreting software in order to cover the forms expected to be
2187received by various versions of entry and transport software.</t>
2189<t>The upgrade of generating software to generate IRIs instead of using a
2190local character encoding should happen only after the service is upgraded
2191to accept IRIs. Similarly, IRIs should only be generated when the service
accepts IRIs and the intervening infrastructure and protocol are known
2193to transport them safely.</t>
2195<t>Software converting from URIs to IRIs for display should be upgraded
2196only after upgraded entry software has been widely deployed to the
2197population that will see the displayed result.</t>
2200<t>Where there is a free choice of character encodings, it is often
2201possible to reduce the effort and dependencies for upgrading to IRIs
2202by using UTF-8 rather than another encoding. For example, when a new
2203file-based Web server is set up, using UTF-8 as the character encoding
2204for file names will make the transition to IRIs easier. Likewise, when
2205a new Web form is set up using UTF-8 as the character encoding of the
2206form page, the returned query URIs will use UTF-8 as the character
2207encoding (unless the user, for whatever reason, changes the character
2208encoding) and will therefore be compatible with IRIs.</t>
2211<t>These recommendations, when taken together, will allow for the
2212extension from URIs to IRIs in order to handle characters other than
2213US-ASCII while minimizing interoperability problems. For
2214considerations regarding the upgrade of URI scheme definitions, see
2215<xref target="UTF8use"/>.</t>
2218</section> <!-- guidelines -->
2220<section title="IANA Considerations" anchor="iana">
<t>RFC Editor and IANA note: Please replace RFC XXXX with the
number of this document when it issues as an RFC. </t>
<t>IANA maintains a registry of "URI schemes". A "URI scheme" also
serves as an "IRI scheme". </t>
2228<t>To clarify that the URI scheme registration process also applies to
2229IRIs, change the description of the "URI schemes" registry
2230header to say "[RFC4395] defines an IANA-maintained registry of URI
2231Schemes. These registries include the Permanent and Provisional URI
2232Schemes.  RFC XXXX updates this registry to designate that schemes may
also indicate their usability as IRI schemes."</t>
<t> Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".</t>
2238</section> <!-- IANA -->
2240<section title="Security Considerations" anchor="security">
2241<t>The security considerations discussed in <xref target="RFC3986"/>
2242also apply to IRIs. In addition, the following issues require
2243particular care for IRIs.</t>
2244<t>Incorrect encoding or decoding can lead to security problems.
2245In particular, some UTF-8 decoders do not check against overlong
2246byte sequences. As an example, a "/" is encoded with the byte 0x2F
2247both in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2248interpret the sequence 0xC0 0xAF as a "/". A sequence such as "%C0%AF.."
2249may pass some security tests and then be interpreted
2250as "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2251and checking are not done in the right order, and/or if reserved
2252characters and unreserved characters are not clearly distinguished.</t>
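<t>The ordering issue can be illustrated as follows: the octets are
decoded with a strict UTF-8 decoder first, and the security check is
applied to the decoded characters afterwards. This is a sketch with a
hypothetical function name; it relies on the fact that Python's
"utf-8" codec rejects overlong sequences such as 0xC0 0xAF.</t>
<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes

def safe_decode_segment(segment):
    """Decode one percent-encoded path segment, then check it."""
    raw = unquote_to_bytes(segment)
    # Strict decoding: raises UnicodeDecodeError on overlong or
    # otherwise malformed UTF-8 instead of guessing.
    text = raw.decode("utf-8")
    # Check AFTER decoding, so an encoded "/" or ".." cannot slip by.
    if text in (".", "..") or "/" in text:
        raise ValueError("unsafe path segment")
    return text

try:
    safe_decode_segment("%C0%AF..")
except UnicodeDecodeError:
    print("rejected overlong sequence")
```
]]></artwork></figure>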
2254<t>There are various ways in which "spoofing" can occur with IRIs.
2255"Spoofing" means that somebody may add a resource name that looks the
2256same or similar to the user, but that points to a different resource.
2257The added resource may pretend to be the real resource by looking
2258very similar but may contain all kinds of changes that may be
2259difficult to spot and that can cause all kinds of problems.
2260Most spoofing possibilities for IRIs are extensions of those for URIs.</t>
2262<t>Spoofing can occur for various reasons. First, a user's normalization expectations or actual normalization
2263when entering an IRI or  transcoding an IRI from a legacy character
2264encoding do not match the normalization used on the
2265server side. Conceptually, this is no different from the problems
2266surrounding the use of case-insensitive web servers. For example,
2267a popular web page with a mixed-case name ("")
2268might be "spoofed" by someone who is able to create "".
2269However, the use of unnormalized character sequences, and of additional
2270mappings for user convenience, may increase the chance for spoofing.
2271Protocols and servers that allow the creation of resources with
2272names that are not normalized are particularly vulnerable to such
2273attacks. This is an inherent
2274security problem of the relevant protocol, server, or resource
2275and is not specific to IRIs, but it is mentioned here for completeness.</t>
2277<t>Spoofing can occur in various IRI components, such as the
2278domain name part or a path part. For considerations specific
2279to the domain name part, see <xref target="RFC3491"/>.
2280For the path part, administrators of sites that allow independent
2281users to create resources in the same sub area may have to be careful
2282to check for spoofing.</t>
2284<t>Spoofing can occur because in the UCS many characters look very similar. Details are discussed in <xref target="selection"/>.
2285Again, this is very similar to spoofing possibilities on US-ASCII,
2286e.g., using "br0ken" or "1ame" URIs.</t>
2288<t>Spoofing can occur when URIs with percent-encodings based on various
2289character encodings are accepted to deal with older user agents. In some
2290cases, particularly for Latin-based resource names, this is usually easy to
2291detect because UTF-8-encoded names, when interpreted and viewed as
2292legacy character encodings, produce mostly garbage.</t><t>When
2293concurrently used character encodings have a similar structure but there
are no characters that have exactly the same encoding, detection is more
difficult.</t>
2297<t>Spoofing can occur with bidirectional IRIs, if the restrictions
2298in <xref target="bidi-structure"/> are not followed. The same visual
2299representation may be interpreted as different logical representations,
2300and vice versa. It is also very important that a correct Unicode bidirectional
2301implementation be used.</t><t>The use of Legacy Extended IRIs introduces additional security issues.</t>
2302</section><!-- security -->
2304<section title="Acknowledgements">
<t>For contributions to this update, we would like to thank Ian Hickson, Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin, Henry S. Thompson, and the XML Core Working Group of the W3C.</t>
2307<t>The discussion on the issue addressed here started a long time
2308ago. There was a thread in the HTML working
2309group in August 1995 (under the topic of "Globalizing URIs") and in the
2310www-international mailing list in July 1996 (under the topic of
2311"Internationalization and URLs"), and there were ad-hoc meetings at the Unicode
2312conferences in September 1995 and September 1997.</t>
2314<t>For contributions to the previous version of this document, RFC 3987, many thanks go to
2315Francois Yergeau, Matitiahu Allouche,
2316Roy Fielding, Tim Berners-Lee, Mark Davis,
2317M.T. Carrasco Benitez, James Clark, Tim Bray, Chris Wendt, Yaron Goland,
2318Andrea Vine, Misha Wolf, Leslie Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman,
2319Russ Housley, Makoto MURATA, Steven Atkin,
2320Ryan Stansifer, Tex Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs,
2321Adam Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown,
2322Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio,
2323Chris Haynes, Walter Underwood, and many others.</t>
<t>A definition of HyperText Reference was initially produced by Ian Hickson,
and further edited by Dan Connolly and C. M. Sperberg-McQueen.</t>
2326<t>Thanks to the Internationalization Working
2327Group (I18N WG) of the World Wide Web Consortium (W3C),
2328and the members of the W3C
2329I18N Working Group and Interest Group for their contributions and their
2330work on <xref target="CharMod"/>. Thanks also go
2331to the members of many other W3C Working Groups for adopting IRIs, and to
2332the members of the Montreal IAB Workshop on Internationalization and
2333Localization for their review.</t>
2337<section title="Open Issues">
2339<t>NOTE: The issues noted in this section should be addressed before the document is submitted as an RFC.
2340These issues are not in any particular order, and do not necessarily form a complete list of all known issues.
2342<list style="hanging">
2344<t hangText="length limits on domain name">See, for example,
2345 discussion on
2346(that discussion is mostly irrelevant now as the "63 octets in UTF-8 per label" restriction was
2349<t hangText="Allow generic scheme-independent IRI to URI translation">Previous drafts of this
2350  specification proposed a generic IRI to URI transformation using pct-encoding,
2351  and allowed domain name translation to be optionally handled by retranslating host names
2352  from pct-encoding back into Unicode and then into punycode.
2353  This draft does not allow that behavior, but this should be fixed to be in line
  with RFC 3986 syntax and to lead implementations towards a uniform and long-term
  URI&lt;-&gt;IRI correspondence. See also <xref target='Gettys'/>.</t>
2357<t hangText="update URI scheme registry?">This document starts the process of making minor
2358  changes to the URI scheme registry. This should be handled as an update to RFC 4395.</t>
<t hangText="utf8 in HTTP">Not really an IRI issue, but some HTTP implementations send UTF-8 paths directly; this needs review. </t>
2362<t hangText="handling of \">Some web applications convert \ to / and others don't.
2363  Make this mandatory or disallowed (but not optional), for Web Addresses.</t>
2365<t hangText="dealing with disallowed IRI characters"> </t>
2367<t hangText="misplaced text">
2368Find a place to note that some older software transcoding to UTF-8 may produce
2369illegal output for some input, in particular for characters outside
2370the BMP (Basic Multilingual Plane). As an example, for the IRI with non-BMP characters (in XML Notation):
2372<vspace/>which contains the first three letters of the Old Italic alphabet,
2373the correct conversion to a URI is
2376<t hangText="Special Query Handling needed?">
2377The percent-encoding handling of query components in the HTTP scheme
2378is really unfortunate. There is no good normative advice to give if
2379the percent-encoding is delayed until the query-IRI is interpreted. Could
2380HTML ask browsers to percent-encode the form data using the document
2381character set BEFORE the query IRI is constructed, and only in the case where
2382the document character set isn't Unicode-based and the query is
2383being added to http: or https: URIs?  This would give more
2384consistent results.  Browsers might have to change their behavior in
2385constructing the IRI-with-query-added, but the results would be more
consistent, with fewer bugs, and it wouldn't affect interpretation of
2387any existing web pages. It would remove the need to have a normative
2388special case for queries in HTML documents, just for http, in a way
2389in which things like transcoding etc. wouldn't work well.  You could
2390tell the difference between a query URI in the address bar and one
2391created via a form because the address bar would always be UTF-8.
2392The browsers might have to change the algorithm for showing the address
in the address bar to know how to undo the encoding.</t>
2395<t hangText="handling illegal characters">
2397<xref target="compmapping" /> used to apply only to characters in
2398either 'ucschar' or 'iprivate', but then later said that systems
2399accepting IRIs MAY also deal with the printable characters in US-ASCII
2400that are not allowed in URIs, namely "&lt;", "&gt;", '"', space, "{",
"}", "|", "\", "^", and "`".  Larry felt that a MAY would result
in non-uniform behavior, because some systems would produce valid URI
components and others wouldn't.  Non-printable US-ASCII characters
should be stripped by most software, so if they are
passed on somewhere as IRI characters, encoding them makes sense.
2407The section also used to say "If these characters are found
2408but are not converted, then the conversion SHOULD fail." but there is
2409no notion of conversion failing -- every string is converted.  Please
2410note that the number sign ("#"), the percent sign ("%"), and the
2411square bracket characters ("[", "]") are not part of the above list
2412and MUST NOT be converted.
2413 </t>
2414<t hangText="adding single % and hash">
2415Changed the BNF to not match the URI document in allowing
2416single % in path but not everywhere, and allowing a # in the
2417fragment part.</t>
2422<section title="Change Log">
2424<t>Note to RFC Editor: Please completely remove this section before publication.</t>
2426<section title='Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00'>
2427     <t>Changed draft name, date, last paragraph of abstract, and titles in change log, and added this section
2428     in moving from draft-duerst-iri-bis-07 (personal submission) to draft-ietf-iri-3987bis-00 (WG document).</t>
2431<section title="Changes from -06 to -07 of draft-duerst-iri-bis" anchor="forkChanges"><t>
2433Major restructuring of IRI processing model to make scheme-specific translation necessary to handle IDNA requirements and for consistency with web implementations. </t>
2434<t>Starting with IRI, you want one of:
2435<list style="hanging">
2436<t hangText="a"> IRI components (IRI parsed into UTF8 pieces)</t>
2437<t hangText="b"> URI components (URI parsed into ASCII pieces, encoded correctly) </t>
2438<t hangText="c"> whole URI  (for passing on to some other system that wants whole URIs) </t>
2441<section title="OLD WAY">
2442<t><list style="numbers">
2444 <t>Pct-encoding on the whole thing to a URI.
2445 (c1) If you want a (maybe broken) whole URI, you might
2446        stop here.</t>
2448 <t>Parsing the URI into URI components.
2449   (b1) If you want (maybe broken) URI components, stop here.</t>
2451 <t> Decode the components (undoing the pct-encoding).
2452   (a) if you want IRI components, stop here.</t>
 <t> Reencode: either using a different encoding for some components
2455   (for domain names, and query components in web pages, which
2456   depends on the component, scheme and context), and otherwise
2457   using pct-encoding.
2458   (b2) if you want (good) URI components, stop here.</t>
2460 <t> reassemble the reencoded components.
2461   (c2) if you want a (*good*) whole URI stop here.</t>
2468<section title="NEW WAY">
2470<list style="numbers">
2472<t> Parse the IRI into IRI components using the generic syntax.
2473   (a) if you want IRI components, stop here.</t>
<t> Encode each component, using pct-encoding, IDN encoding, or
         special query part encoding depending on the component,
         scheme, or context. (b) If you want URI components, stop here.</t>
<t> Reassemble a whole URI from URI components.
2479   (c) if you want a whole URI stop here.</t>
<section title='Changes from -00 to -01'><t><list style="symbols">
2485  <t>Removed 'mailto:' before mail addresses of authors.</t>
2486  <t>Added "&lt;to be done&gt;" as right side of 'href-strip' rule. Fixed '|' to '/' for
2487    alternatives.</t>
2491<section title="Changes from -05 to -06 of draft-duerst-iri-bis-00"><t><list style="symbols">
<t>Add HyperText Reference, change abstract, acks and references for it</t>
2493<t>Add Masinter back as another editor.</t>
2494<t>Masinter integrates HRef material from HTML5 spec.</t>
2495<t>Rewrite introduction sections to modernize.</t>
2499<section title="Changes from -04 to -05 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated references.</t><t>Changed IPR text to pre5378Trust200902.</t></list></t>
2502<section title="Changes from -03 to -04 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added explicit abbreviation for LEIRIs.</t><t>Mentioned LEIRI references.</t><t>Completed text in LEIRI section about tag characters and about specials.</t></list></t>
<section title="Changes from -02 to -03 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated some references.</t><t>Updated Michel Suignard's coordinates.</t></list></t>
2508<section title="Changes from -01 to -02 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added tag range to iprivate (issue private-include-tags-115).</t><t>Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.</t></list></t>
<section title="Changes from -00 to -01 of draft-duerst-iri-bis"><t><list style="symbols"><t>Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" based on input from the W3C XML Core WG. Moved the relevant subsections to the back and promoted them to a section.</t><t>Added some text re. Legacy Extended IRIs to the security section.</t><t>Added an IANA Considerations Section.</t><t>Added this Change Log Section.</t><t>Added a section about "IRIs with Spaces/Controls" (converting from a Note in RFC 3987).</t></list></t>
2512<section title="Changes from RFC 3987 to -00 of draft-duerst-iri-bis"><t><list><t>Fixed errata (see</t></list></t>
2518<references title="Normative References">
2520<reference anchor="ASCII">
2522<title>Coded Character Set -- 7-bit American Standard Code for Information
2525<organization>American National Standards Institute</organization>
2527<date year="1986"/>
2529<seriesInfo name="ANSI" value="X3.4"/>
2532<reference anchor="ISO10646">
2534<title>ISO/IEC 10646:2003: Information Technology -
2535Universal Multiple-Octet Coded Character Set (UCS)</title>
2537<organization>International Organization for Standardization</organization>
2539<date month="December" year="2003"/>
2541<seriesInfo name="ISO" value="Standard 10646"/>
2550<reference anchor="STD68">
2552<title abbrev="ABNF">Augmented BNF for Syntax Specifications: ABNF</title>
2553<author initials="D." surname="Crocker" fullname="Dave Crocker"><organization/></author>
2554<author initials="P." surname="Overell" fullname="Paul Overell"><organization/></author>
2555<date month="January" year="2008"/></front>
2556<seriesInfo name="STD" value="68"/><seriesInfo name="RFC" value="5234"/>
2559<reference anchor="UNIV4">
2561<title>The Unicode Standard, Version 5.1.0, defined by: The Unicode Standard,
2562Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0),
as amended by Unicode 5.1.0 (</title>
2564<author><organization>The Unicode Consortium</organization></author>
2565<date year="2008" month="April"/>
2569<reference anchor="UNI9" target="">
2571<title>The Bidirectional Algorithm</title>
2572<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2573<date year="2004" month="March"/>
2575<seriesInfo name="Unicode Standard Annex" value="#9"/>
2578<reference anchor="UTR15" target="">
2580<title>Unicode Normalization Forms</title>
2581<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2582<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2583<date year="2008" month="March"/>
2585<seriesInfo name="Unicode Standard Annex" value="#15"/>
2590<references title="Informative References">
2592<reference anchor="BidiEx" target="">
2594<title>Examples of bidirectional IRIs</title>
2596<date year="" month=""/>
2600<reference anchor="CharMod" target="">
2602<title>Character Model for the World Wide Web: Resource Identifiers</title>
2603<author initials="M." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2604<author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2605<author initials="R." surname="Ishida" fullname="Richard Ishida"><organization/></author>
2606<author initials="M." surname="Wolf" fullname="Misha Wolf"><organization/></author>
2607<author initials="T." surname="Texin" fullname="Tex Texin"><organization/></author>
2608<date year="2004" month="November" day="25"/>
2610<seriesInfo name="World Wide Web Consortium" value="Candidate Recommendation"/>
2613<reference anchor="Duerst97" target="">
2615<title>The Properties and Promises of UTF-8</title>
2616<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2617<date year="1997" month="September"/>
2619<seriesInfo name="Proc. 11th International Unicode Conference, San Jose" value=""/>
2622<reference anchor="Gettys" target="">
2624<title>URI Model Consequences</title>
2625<author initials="J." surname="Gettys" fullname="Jim Gettys"><organization/></author>
2626<date month="" year=""/>
2630<reference anchor="HTML4" target="">
2632<title>HTML 4.01 Specification</title>
2633<author initials="D." surname="Raggett" fullname="Dave Raggett"><organization/></author>
2634<author initials="A." surname="Le Hors" fullname="Arnaud Le Hors"><organization/></author>
2635<author initials="I." surname="Jacobs" fullname="Ian Jacobs"><organization/></author>
2636<date year="1999" month="December" day="24"/>
2638<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2641<reference anchor="LEIRI" target="">
2643<title>Legacy extended IRIs for XML resource identification</title>
2644<author initials="H." surname="Thompson" fullname="Henry Thompson"><organization/></author>
2645<author initials="R." surname="Tobin"    fullname="Richard Tobin"><organization/></author>
2646<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2647  <date year="2008" month="November" day="3"/>
2650<seriesInfo name="World Wide Web Consortium" value="Note"/>
2668<reference anchor="UNIXML" target="">
2671<title>Unicode in XML and other Markup Languages</title>
2672<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2673<author initials="A." surname="Freytag" fullname="Asmus Freytag"><organization/></author>
2674<date year="2003" month="June" day="18"/>
2676<seriesInfo name="Unicode Technical Report" value="#20"/>
2677<seriesInfo name="World Wide Web Consortium" value="Note"/>
2680<reference anchor="XLink" target="">
2682<title>XML Linking Language (XLink) Version 1.0</title>
2683<author initials="S." surname="DeRose" fullname="Steve DeRose"><organization/></author>
2684<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2685<author initials="D." surname="Orchard" fullname="David Orchard"><organization/></author>
2686<date year="2001" month="June" day="27"/>
2688<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2691<reference anchor="XML1" target="">
2692  <front>
    <title>Extensible Markup Language (XML) 1.0 (Fourth Edition)</title>
2694    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2695    <author initials="J." surname="Paoli" fullname="Jean Paoli"><organization/></author>
2696    <author initials="C.M." surname="Sperberg-McQueen" fullname="C. M. Sperberg-McQueen">
2697      <organization/></author>
2698    <author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2699    <author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2700    <date day="16" month="August" year="2006"/>
2701  </front>
2702  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2705<reference anchor="XMLNamespace" target="">
2706  <front>
2707    <title>Namespaces in XML (Second Edition)</title>
2708    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2709    <author initials="D." surname="Hollander" fullname="Dave Hollander"><organization/></author>
2710    <author initials="A." surname="Layman" fullname="Andrew Layman"><organization/></author>
2711    <author initials="R." surname="Tobin" fullname="Richard Tobin"><organization></organization></author><date day="16" month="August" year="2006"/>
2712  </front>
2713  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2716<reference anchor="XMLSchema" target="">
2718<title>XML Schema Part 2: Datatypes</title>
2719<author initials="P." surname="Biron" fullname="Paul Biron"><organization/></author>
2720<author initials="A." surname="Malhotra" fullname="Ashok Malhotra"><organization/></author>
2721<date year="2001" month="May" day="2"/>
2723<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2726<reference anchor="XPointer" target="">
2728<title>XPointer Framework</title>
2729<author initials="P." surname="Grosso" fullname="Paul Grosso"><organization/></author>
2730<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2731<author initials="J." surname="Marsh" fullname="Jonathan Marsh"><organization/></author>
2732<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2733<date year="2003" month="March" day="25"/>
2735<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2738<reference anchor="HTML5" target="">
2740<title>A vocabulary and associated APIs for HTML and XHTML</title>
2741<author initials="I." surname="Hickson" fullname="Ian Hickson"><organization>Google, Inc.</organization></author>
2742<author initials="D." surname="Hyatt" fullname="David Hyatt"><organization>Apple, Inc.</organization></author>
2743<date year="2009"  month="April" day="23"/>
2745<seriesInfo name="World Wide Web Consortium" value="Working Draft"/>
2750<section title="Design Alternatives">
2751<t>This section briefly summarizes some design alternatives
2752considered earlier and the reasons why they were not chosen.</t>
2753<section title="New Scheme(s)">
2754<t>Introducing new schemes (for example, httpi:, ftpi:,...) or a
2755new metascheme (e.g., i:, leading to URI/IRI prefixes such as
2756i:http:, i:ftp:,...) was proposed to make IRI-to-URI conversion
2757scheme dependent or to distinguish between percent-encodings
2758resulting from IRI-to-URI conversion and percent-encodings from
2759legacy character encodings.</t>
<t>New schemes are not needed to distinguish URIs from true IRIs (i.e.,
  IRIs that contain non-ASCII characters). The benefit of being able
  to detect the origin of percent-encodings is marginal, as UTF-8
  can be detected with very high reliability. Deploying new schemes is
  extremely hard, so not requiring new schemes for IRIs makes
  deployment of IRIs vastly easier. Making conversion scheme-dependent
  is highly inadvisable, yet separate schemes for IRIs would encourage it.
  Using a uniform convention for conversion from IRIs to URIs makes
  IRI implementation orthogonal to the introduction of actual new
  schemes.</t>
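<t>As an informal illustration of this reliability (the character is
  chosen here only as an example), consider U+00E9, LATIN SMALL LETTER
  E WITH ACUTE. Its percent-encoding after IRI-to-URI conversion is a
  well-formed UTF-8 octet sequence, whereas its percent-encoding from
  iso-8859-1 is not, because in UTF-8 the octet E9 has to be followed
  by two continuation octets:</t>
<figure><artwork>
  Percent-encoding   Octets   Well-formed UTF-8?
  ----------------   ------   ------------------
  %C3%A9             C3 A9    yes
  %E9                E9       no (legacy, e.g., iso-8859-1)
</artwork></figure>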
</section>
<section title="Character Encodings Other Than UTF-8">
<t>At an early stage, UTF-7 was considered as an alternative to
UTF-8 when IRIs are converted to URIs. UTF-7 would not have needed
percent-encoding and, in most cases, would have been shorter than
percent-encoded UTF-8.</t>
<t>Using UTF-8 avoids a double layering and overloading of the use of
   the "+" character. UTF-8 is fully compatible with US-ASCII and has
   therefore been recommended by the IETF, and is being used widely.</t>
<t>UTF-7 has never been used much and is now clearly being
   discouraged. Requiring implementations to convert from UTF-8
   to UTF-7 and back would be an additional implementation burden.</t>
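<t>As an informal illustration of the length difference and of the
   role of "+" (the sample characters are chosen here only as examples
   and are not taken from the original discussion), UTF-7 uses "+" to
   start, and typically "-" to end, a run of modified-base64-encoded
   UTF-16:</t>
<figure><artwork>
  Characters       UTF-7       Percent-encoded UTF-8
  -------------    --------    ---------------------
  U+00E9           +AOk-       %C3%A9
  U+65E5 U+672C    +ZeVnLA-    %E6%97%A5%E6%9C%AC
</artwork></figure>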
</section> <!-- notutf8 -->
<section title="New Encoding Convention">
<t>Instead of using the existing percent-encoding convention
of URIs, which is based on octets, the idea was to create a new
encoding convention; for example, to use "%u" to introduce
UCS code points.</t>
<t>Using the existing octet-based percent-encoding mechanism
does not need an upgrade of the URI syntax and does not
need corresponding server upgrades.</t>
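<t>As an informal illustration using U+00E9, LATIN SMALL LETTER E WITH
ACUTE (the exact form of the "%u" notation shown here is an assumption
made for the example, since this document does not define one):</t>
<figure><artwork>
  Convention                        Encoded form
  ------------------------------    ------------
  Hypothetical "%u" + code point    %u00E9
  Octet-based, from UTF-8           %C3%A9
</artwork></figure>
<t>The first form does not match the "%" HEXDIG HEXDIG syntax of
percent-encoding in RFC 3986, which is why it would require an upgrade
of the URI syntax; the second form uses only existing syntax.</t>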
</section> <!-- new encoding -->
<section title="Indicating Character Encodings in the URI/IRI">
<t>Some proposals suggested indicating the character encodings used
in a URI or IRI with some new syntactic convention in the URI itself,
similar to the "charset" parameter for e-mails and Web pages.
As an example, the label in square brackets in
"[iso-8859-1]&amp;#xE9;" indicated that
the following "&amp;#xE9;" had to be interpreted as iso-8859-1.</t>
<t>If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed.
It also avoids the need for potentially multiple labels that would have
to be copied correctly in all cases, even on the side of a bus or on a
napkin, which would lead to usability problems (and be prohibitively
annoying).
Exclusively using UTF-8 also reduces transcoding errors and confusion.</t>
</section> <!-- indicating -->