source: draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml @ 3

Last change on this file since 3 was 3, checked in by duerst@…, 9 years ago

xml version of draft-ietf-iri-3987bis-01 (before final tweaks that have to be done by hand)

1<?xml version="1.0"?>
2<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
3<!ENTITY rfc1738 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1738.xml">
4<!ENTITY rfc2045 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2045.xml">
5<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
6<!ENTITY rfc2130 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2130.xml">
7<!ENTITY rfc2141 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2141.xml">
8<!ENTITY rfc2192 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2192.xml">
9<!ENTITY rfc2277 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2277.xml">
10<!ENTITY rfc2368 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2368.xml">
11<!ENTITY rfc2384 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2384.xml">
12<!ENTITY rfc2396 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2396.xml">
13<!ENTITY rfc2397 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2397.xml">
14<!ENTITY rfc2616 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2616.xml">
15<!ENTITY rfc2640 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2640.xml">
16<!ENTITY rfc3490 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3490.xml">
17<!ENTITY rfc3491 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3491.xml">
18<!ENTITY rfc3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml">
19<!ENTITY rfc3986 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml">
20<!ENTITY rfc4395 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4395.xml">
21]>
22<?rfc strict='yes'?>
23<!--     complains about too long lines (2 cases)
24     and appendix, but otherwise is okay
25-->
26<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
27<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
28<?rfc symrefs='yes'?>
29<?rfc sortrefs='yes'?>
30<?rfc iprnotified="no" ?>
31<?rfc toc='yes'?>
32<?rfc compact='yes'?>
33<?rfc subcompact='no'?>
34<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-01" category="std" xml:lang="en" obsoletes="RFC 3987">
35<front>
36<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>
37
38  <author initials="M.J." surname="Duerst" fullname='Martin Duerst'>
39    <!-- (Note: Please write "Duerst" with u-umlaut wherever
40      possible, for example as "D&#252;rst" in XML and HTML') -->
41  <organization abbrev="Aoyama Gakuin University">Aoyama Gakuin University</organization>
42  <address>
43  <postal>
44  <street>5-10-1 Fuchinobe</street>
45  <city>Sagamihara</city>
46  <region>Kanagawa</region>
47  <code>229-8558</code>
48  <country>Japan</country>
49  </postal>
50  <phone>+81 42 759 6329</phone>
51  <facsimile>+81 42 759 6495</facsimile>
52  <email>duerst@it.aoyama.ac.jp</email>
53  <uri>http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/<!-- (Note: This is the percent-encoded form of an IRI.)--></uri>
54  </address>
55</author>
56
57<author initials="M.L." surname="Suignard" fullname="Michel Suignard">
58   <organization>Unicode Consortium</organization>
59   <address>
60   <postal>
61   <street></street>
62   <street>P.O. Box 391476</street>
63   <city>Mountain View</city>
64   <region>CA</region>
65   <code>94039-1476</code>
66   <country>U.S.A.</country>
67   </postal>
68   <phone>+1-650-693-3921</phone>
69   <email>michel@unicode.org</email>
70   <uri>http://www.suignard.com</uri>
71   </address>
72</author>
73<author initials="L." surname="Masinter" fullname="Larry Masinter">
74   <organization>Adobe</organization>
75   <address>
76   <postal>
77   <street>345 Park Ave</street>
78   <city>San Jose</city>
79   <region>CA</region>
80   <code>95110</code>
81   <country>U.S.A.</country>
82   </postal>
83   <phone>+1-408-536-3024</phone>
84   <email>masinter@adobe.com</email>
85   <uri>http://larry.masinter.net</uri>
86   </address>
87</author>
88
89<date year="2010" month="July" day="26"/>
90<area>Applications</area>
91<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
92<keyword>IRI</keyword>
93<keyword>Internationalized Resource Identifier</keyword>
94<keyword>UTF-8</keyword>
95<keyword>URI</keyword>
96<keyword>URL</keyword>
97<keyword>IDN</keyword>
98<keyword>LEIRI</keyword>
99
100<abstract>
101<t>This document defines the Internationalized Resource Identifier
102(IRI) protocol element, as an extension of the Uniform Resource
103Identifier (URI).  An IRI is a sequence of characters from the
104Universal Character Set (Unicode/ISO 10646). Grammar and processing
105rules are given for IRIs and related syntactic forms.</t>
106
107<t>In addition, this document provides named additional rule sets
108for processing otherwise invalid IRIs, in a way that supports
109other specifications that wish to mandate common behavior for
110'error' handling. In particular, rules used in some XML languages
111(LEIRI) and web applications are given.</t>
112
113<t>Defining IRI as a new protocol element (rather than updating or
114extending the definition of URI) allows independent orderly
115transitions: other protocols and languages that use URIs must
116explicitly choose to allow IRIs.</t>
117
118<t>Guidelines are provided for the use and deployment of IRIs and
119related protocol elements when revising protocols, formats, and
120software components that currently deal only with URIs.</t>
121
122<t>[RFC Editor: Please remove this paragraph before publication.]
123This document is intended to update RFC 3987 and move towards IETF
124Draft Standard.  This version is essentially identical to
125draft-duerst-iri-bis-07.txt, and is submitted as an initial draft to
126start WG discussions. For discussion and comments on this
127draft, please join the IETF IRI WG by subscribing to the mailing
128list public-iri@w3.org.</t>
129
130</abstract>
131</front>
132<middle>
133
134<section title="Introduction">
135
136<section title="Overview and Motivation" anchor="overview">
137
138<t>A Uniform Resource Identifier (URI) is defined in <xref
139target="RFC3986"/> as a sequence of characters chosen from a limited
140subset of the repertoire of US-ASCII <xref target="ASCII"/>
141characters.</t>
142
143<t>The characters in URIs are frequently used for representing words
144of natural languages.  This usage has many advantages: Such URIs are
145easier to memorize, easier to interpret, easier to transcribe, easier
146to create, and easier to guess. For most languages other than English,
147however, the natural script uses characters other than A - Z. For many
148people, handling Latin characters is as difficult as handling the
149characters of other scripts is for those who use only the Latin
150alphabet. Many languages with non-Latin scripts are transcribed with
151Latin letters. These transcriptions are now often used in URIs, but
152they introduce additional difficulties.</t>
153
154<t>The infrastructure for the appropriate handling of characters from
155additional scripts is now widely deployed in operating system and
156application software. Software that can handle a wide variety of
157scripts and languages at the same time is increasingly common. Also,
158an increasing number of protocols and formats can carry a wide range of
159characters.</t>
160
161<t>URIs are used both as a protocol element (for transmission and
162processing by software) and also a presentation element (for display
163and handling by people who read, interpret, coin, or guess them). The
164transition between these roles is more difficult and complex when
165dealing with a larger set of characters than is allowed for URIs in
166<xref target="RFC3986"/>. </t>
167
168<t>This document defines the protocol element called Internationalized
169Resource Identifier (IRI), which allows applications of URIs to be
170extended to use resource identifiers that have a much wider repertoire
171of characters. It also provides corresponding "internationalized"
172versions of other constructs from <xref target="RFC3986"/>, such as
173URI references. The syntax of IRIs is defined in <xref
174target="syntax"/>.
175</t>
176
177<t>Using characters outside of A - Z in IRIs adds a number of
178difficulties. <xref target="Bidi"/> discusses the special case of
179bidirectional IRIs using characters from scripts written
180right-to-left.  <xref target="equivalence"/> discusses various forms
181of equivalence between IRIs. <xref target="IRIuse"/> discusses the use
182of IRIs in different situations.  <xref target="guidelines"/> gives
183additional informative guidelines.  <xref target="security"/>
184discusses IRI-specific security considerations.</t>
185</section> <!-- overview -->
186
187<section title="Applicability" anchor="Applicability">
188
189<t>IRIs are designed to allow protocols and software that deal with
190URIs to be updated to handle IRIs. A "URI scheme" (as defined by <xref
191target="RFC3986"/> and registered through the IANA process defined in
192<xref target="RFC4395"/>) also serves as an "IRI scheme". Processing of
193IRIs is accomplished by extending the URI syntax while retaining (and
194not expanding) the set of "reserved" characters, such that the syntax
195for any URI scheme may be uniformly extended to allow non-ASCII
196characters. In addition, following parsing of an IRI, it is possible
197to construct a corresponding URI by first encoding characters outside
198of the allowed URI range and then reassembling the components.
199</t>
200
201<t>Practical use of IRI forms in place of URI forms depends on the
202following conditions being met:</t>
203
204<t><list style="hanging">
205   
206<t hangText="a.">A protocol or format element MUST be explicitly designated to be
207  able to carry IRIs. The intent is to avoid introducing IRIs into
208  contexts that are not defined to accept them.  For example, XML
209  schema <xref target="XMLSchema"/> has an explicit type "anyURI" that
210  includes IRIs and IRI references. Therefore, IRIs and IRI references
211  can be in attributes and elements of type "anyURI".  On the other
212  hand, in the <xref target="RFC2616"/> definition of HTTP/1.1, the
213  Request URI is defined as a URI, which means that direct use of IRIs
214  is not allowed in HTTP requests.</t>
215
216<t hangText="b.">The protocol or format carrying the IRIs MUST have a
217  mechanism to represent the wide range of characters used in IRIs,
218  either natively or by some protocol- or format-specific escaping
219  mechanism (for example, numeric character references in <xref
220  target="XML1"/>).</t>
221
222<t hangText="c.">The URI scheme definition, if it explicitly allows a
223  percent sign ("%") in any syntactic component, SHOULD define the
224  interpretation of sequences of percent-encoded octets (using "%XX"
225  hex octets) as octets from sequences of UTF-8 encoded strings; this
226  is recommended in the guidelines for registering new schemes, <xref
227  target="RFC4395"/>.  For example, this is the practice for IMAP URLs
228  <xref target="RFC2192"/>, POP URLs <xref target="RFC2384"/>, and the
229  URN syntax <xref target="RFC2141"/>. Note that use of
230  percent-encoding may also be restricted in some situations; for
231  example, URI schemes that disallow percent-encoding might still be
232  used with a fragment identifier which is percent-encoded (e.g.,
233  <xref target="XPointer"/>). See <xref target="UTF8use"/> for further
234  discussion.</t>
235</list></t>
236
237</section> <!-- applicability -->
238
239<section title="Definitions" anchor="sec-Definitions">
240 
241<t>The following definitions are used in this document; they follow the
242terms in <xref target="RFC2130"/>, <xref target="RFC2277"/>, and
243<xref target="ISO10646"/>.</t>
244<t><list style="hanging">
245   
246<t hangText="character:">A member of a set of elements used for the
247    organization, control, or representation of data. For example,
248    "LATIN CAPITAL LETTER A" names a character.</t>
249   
250<t hangText="octet:">An ordered sequence of eight bits considered as a
251    unit.</t>
252   
253<t hangText="character repertoire:">A set of characters (set in the
254    mathematical sense).</t>
255   
256<t hangText="sequence of characters:">A sequence of characters (one
257    after another).</t>
258   
259<t hangText="sequence of octets:">A sequence of octets (one after
260    another).</t>
261   
262<t hangText="character encoding:">A method of representing a sequence
263    of characters as a sequence of octets (maybe with variants). Also,
264    a method of (unambiguously) converting a sequence of octets into a
265    sequence of characters.</t>
266   
267<t hangText="charset:">The name of a parameter or attribute used to
268    identify a character encoding.</t>
269   
270<t hangText="UCS:">Universal Character Set. The coded character set
271    defined by ISO/IEC 10646 <xref target="ISO10646"/> and the Unicode
272    Standard <xref target="UNIV4"/>.</t>
273   
274<t hangText="IRI reference:">Denotes the common usage of an
275    Internationalized Resource Identifier. An IRI reference may be
276    absolute or relative.  However, the "IRI" that results from such a
277    reference only includes absolute IRIs; any relative IRI references
278    are resolved to their absolute form.  Note that in <xref
279    target="RFC2396"/> URIs did not include fragment identifiers, but
280    in <xref target="RFC3986"/> fragment identifiers are part of
281    URIs.</t>
282   
283<t hangText="URL:">The term "URL" was originally used <xref
284   target="RFC1738"/> for roughly what is now called a "URI".  Books,
285   software, and documentation often refer to URIs and IRIs using the
286   "URL" term. Some usages restrict "URL" to those URIs which are not
287   URNs. Because of this ambiguity, using the term "URL" is
288   NOT RECOMMENDED in formal documents.</t>
289
290<t hangText="LEIRI (Legacy Extended IRI) processing:">  This term was used in
291   various XML specifications to refer
292   to strings that, although not valid IRIs, were acceptable input to
293   the processing rules in <xref target="LEIRIspec" />.</t>
294
295<t hangText="(Web Address, Hypertext Reference, HREF):"> These terms have been
296   added in this document for convenience, to allow other
297   specifications to refer to those strings that, although not valid
298   IRIs, are acceptable input to the processing rules in <xref
299   target="webaddress"/>. This usage corresponds to the parsing rules
300   of some popular web browsing applications.
301   ISSUE: Need to find a good name/abbreviation for these.</t>
302   
303<t hangText="running text:">Human text (paragraphs, sentences,
304   phrases) with syntax according to orthographic conventions of a
305   natural language, as opposed to syntax defined for ease of
306   processing by machines (e.g., markup, programming languages).</t>
307   
308<t hangText="protocol element:">Any portion of a message that affects
309    processing of that message by the protocol in question.</t>
310   
311<t hangText="presentation element:">A presentation form corresponding
312    to a protocol element; for example, using a wider range of
313    characters.</t>
314   
315<t hangText="create (a URI or IRI):">With respect to URIs and IRIs,
316     the term is used for the initial creation. This may be the
317     initial creation of a resource with a certain identifier, or the
318     initial exposition of a resource under a particular
319     identifier.</t>
320   
321<t hangText="generate (a URI or IRI):">With respect to URIs and IRIs,
322     the term is used when the identifier is generated by derivation
323     from other information.</t>
324
325<t hangText="parsed URI component:">When a URI processor parses a URI
326   (following the generic syntax or a scheme-specific syntax), the result
327   is a set of parsed URI components, each of which has a type
328   (corresponding to the syntactic definition) and a sequence of URI
329   characters.  </t>
330
331<t hangText="parsed IRI component:">When an IRI processor parses
332   an IRI directly, following the general syntax or a scheme-specific
333   syntax, the result is a set of parsed IRI components, each of
334   which has a type (corresponding to the syntactic definition)
335   and a sequence of IRI characters. (This definition is analogous
336   to "parsed URI component".)</t>
337
338<t hangText="IRI scheme:">A URI scheme may also be known as
339   an "IRI scheme" if the scheme's syntax has been extended to
340   allow non-US-ASCII characters according to the rules in this
341   document.</t>
342
343</list></t>
344</section> <!-- definitions -->
345<section title="Notation" anchor="sec-Notation">
346     
347<t>RFCs and Internet Drafts currently do not allow any characters
348outside the US-ASCII repertoire. Therefore, this document uses various
349special notations to denote such characters in examples.</t>
350     
351<t>In text, characters outside US-ASCII are sometimes referenced by
352using a prefix of 'U+', followed by four to six hexadecimal
353digits.</t>
354
355<t>To represent characters outside US-ASCII in examples, this document
356uses two notations: 'XML Notation' and 'Bidi Notation'.</t>
357
358<t>XML Notation uses a leading '&amp;#x', a trailing ';', and the
359hexadecimal number of the character in the UCS in between. For
360example, &amp;#x44F; stands for CYRILLIC SMALL LETTER YA. In this
361notation, an actual '&amp;' is denoted by '&amp;amp;'.</t>
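<figure>
<preamble>As an informal illustration only (not part of this specification), XML Notation can be expanded mechanically. The Python sketch below is one possible reading: the helper name expand_xml_notation is hypothetical, and it assumes that only the hexadecimal '&amp;#x...;' form and '&amp;amp;' need to be handled.</preamble>
<artwork><![CDATA[
```python
import re

# Hypothetical helper; the name is not defined by this document.
_HEX_REF = re.compile(r"&#x([0-9A-Fa-f]{1,6});")

def expand_xml_notation(text: str) -> str:
    """Replace each &#xH...; with the UCS character it denotes."""
    # chr() raises ValueError for numbers above 0x10FFFF.
    expanded = _HEX_REF.sub(
        lambda m: chr(int(m.group(1), 16)), text)
    # An actual '&' is written '&amp;' in this notation; it is
    # restored last so that '&amp;#x...;' stays literal text.
    return expanded.replace("&amp;", "&")
```
]]></artwork>
<postamble>Restoring '&amp;amp;' only after the numeric forms have been expanded keeps an escaped ampersand from being mistaken for the start of a character number.</postamble>
</figure>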
362
363<t>Bidi Notation is used for bidirectional examples: Lower case
364letters stand for Latin letters or other letters that are written left
365to right, whereas upper case letters represent Arabic or Hebrew
366letters that are written right to left.</t>
367
368<t>To denote actual octets in examples (as opposed to percent-encoded
369octets), the two hex digits denoting the octet are enclosed in "&lt;"
370and "&gt;".  For example, the octet often denoted as 0xc9 is denoted
371here as &lt;c9&gt;.</t>
372
373<t> In this document, the key words "MUST", "MUST NOT", "REQUIRED",
374"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
375and "OPTIONAL" are to be interpreted as described in <xref
376target="RFC2119"/>.</t>
377
378</section> <!-- notation -->
379</section> <!-- introduction -->
380
381<section title="IRI Syntax" anchor="syntax">
382<t>This section defines the syntax of Internationalized Resource
383Identifiers (IRIs).</t>
384
385<t>As with URIs, an IRI is defined as a sequence of characters, not as
386a sequence of octets. This definition accommodates the fact that IRIs
387may be written on paper or read over the radio as well as stored or
388transmitted digitally.  The same IRI might be represented as different
389sequences of octets in different protocols or documents if these
390protocols or documents use different character encodings (and/or
391transfer encodings).  Using the same character encoding as the
392containing protocol or document ensures that the characters in the IRI
393can be handled (e.g., searched, converted, displayed) in the same way
394as the rest of the protocol or document.</t>
395
396<section title="Summary of IRI Syntax" anchor="summary">
397
398<t>IRIs are defined by extending the URI syntax in <xref
399target="RFC3986"/>, but extending the class of unreserved characters
400by adding the characters of the UCS (Universal Character Set, <xref
401target="ISO10646"/>) beyond U+007F, subject to the limitations given
402in the syntax rules below and in <xref target="limitations"/>.</t>
403
404<t>The syntax and use of components and reserved characters is the
405same as that in <xref target="RFC3986"/>. Each "URI scheme" thus also
406functions as an "IRI scheme", in that scheme-specific parsing rules
407for URIs of a scheme are extended to allow parsing of IRIs using
408the same parsing rules.</t>
409
410<t>All the operations defined in <xref target="RFC3986"/>, such as the
411resolution of relative references, can be applied to IRIs by
412IRI-processing software in exactly the same way as they are for URIs
413by URI-processing software.</t>
414
415<t>Characters outside the US-ASCII repertoire MUST NOT be reserved and
416therefore MUST NOT be used for syntactical purposes, such as to
417delimit components in newly defined schemes. For example, U+00A2, CENT
418SIGN, is not allowed as a delimiter in IRIs, because it is in the
419'iunreserved' category. This is similar to the fact that it is not
420possible to use '-' as a delimiter in URIs, because it is in the
421'unreserved' category.</t>
422
423</section> <!-- summary -->
424<section title="ABNF for IRI References and IRIs" anchor="abnf">
425
426<t>An ABNF definition for IRI references (which are the most general
427concept and the start of the grammar) and IRIs is given here. The
428syntax of this ABNF is described in <xref target="STD68"/>. Character
429numbers are taken from the UCS, without implying any actual binary
430encoding. Terminals in the ABNF are characters, not octets.</t>
431
432<t>The following grammar closely follows the URI grammar in <xref
433target="RFC3986"/>, except that the range of unreserved characters is
434expanded to include UCS characters, with the restriction that private
435UCS characters can occur only in query parts. The grammar is split
436into two parts: Rules that differ from <xref target="RFC3986"/>
437because of the above-mentioned expansion, and rules that are the same
438as those in <xref target="RFC3986"/>. For rules that are different
439than those in <xref target="RFC3986"/>, the names of the non-terminals
440have been changed as follows. If the non-terminal contains 'URI', this
441has been changed to 'IRI'. Otherwise, an 'i' has been prefixed.</t>
442
443<!--
444for line length measuring in artwork (max 72 chars, three chars at start):
445      1         2         3         4         5         6         7
446456789012345678901234567890123456789012345678901234567890123456789012
447-->
448<figure>
449<preamble>The following rules are different from those in <xref target="RFC3986"/>:</preamble>
450<artwork>
451IRI            = scheme ":" ihier-part [ "?" iquery ]
452                 [ "#" ifragment ]
453
454ihier-part     = "//" iauthority ipath-abempty
455               / ipath-absolute
456               / ipath-rootless
457               / ipath-empty
458
459IRI-reference  = IRI / irelative-ref
460
461absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]
462
463irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
464
465irelative-part = "//" iauthority ipath-abempty
466               / ipath-absolute
467               / ipath-noscheme
468               / ipath-empty
469
470iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
471iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
472ihost          = IP-literal / IPv4address / ireg-name
473
474pct-form       = pct-encoded
475
476ireg-name      = *( iunreserved / sub-delims )
477
478ipath          = ipath-abempty   ; begins with "/" or is empty
479               / ipath-absolute  ; begins with "/" but not "//"
480               / ipath-noscheme  ; begins with a non-colon segment
481               / ipath-rootless  ; begins with a segment
482               / ipath-empty     ; zero characters
483
484ipath-abempty  = *( path-sep isegment )
485ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
486ipath-noscheme = isegment-nz-nc *( path-sep isegment )
487ipath-rootless = isegment-nz *( path-sep isegment )
488ipath-empty    = 0&lt;ipchar&gt;
489path-sep       = "/"
490
491isegment       = *ipchar
492isegment-nz    = 1*ipchar
493isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
494                     / "@" )
495               ; non-zero-length segment without any colon ":"                     
496
497ipchar         = iunreserved / pct-form / sub-delims / ":"
498               / "@"
499 
500iquery         = *( ipchar / iprivate / "/" / "?" )
501
502ifragment      = *( ipchar / "/" / "?" / "#" )
503
504iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
505
506ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
507               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
508               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
509               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
510               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
511               / %xD0000-DFFFD / %xE1000-EFFFD
512
513iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
514               / %x100000-10FFFD
515</artwork>
516</figure>
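<figure>
<preamble>As an informal aid (not part of the grammar), the 'ucschar' and 'iprivate' ranges above can be transcribed into a membership test. The Python sketch below assumes the hypothetical helper names is_ucschar and is_iprivate; the range values themselves are copied from the rules above.</preamble>
<artwork><![CDATA[
```python
# Ranges copied from the 'ucschar' and 'iprivate' rules.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13 each contribute xx0000-xxFFFD.
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 14)]
UCSCHAR.append((0xE1000, 0xEFFFD))

IPRIVATE = [(0xE000, 0xF8FF), (0xE0000, 0xE0FFF),
            (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

def is_ucschar(ch: str) -> bool:
    """True if the single character ch matches 'ucschar'."""
    return _in_ranges(ch, UCSCHAR)

def is_iprivate(ch: str) -> bool:
    """True if the single character ch matches 'iprivate'."""
    return _in_ranges(ch, IPRIVATE)
```
]]></artwork>
<postamble>For instance, U+00E9 satisfies is_ucschar, whereas U+E000 satisfies only is_iprivate and is therefore usable only in the query part.</postamble>
</figure>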
517
518<t>Some productions are ambiguous. The "first-match-wins" (a.k.a. "greedy")
519algorithm applies. For details, see <xref target="RFC3986"/>.</t>
520
521<figure>
522<preamble>The following rules are the same as those in <xref target="RFC3986"/>:</preamble>
523<artwork>
524scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
525 
526port           = *DIGIT
527 
528IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"
529 
530IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
531 
532IPv6address    =                            6( h16 ":" ) ls32
533               /                       "::" 5( h16 ":" ) ls32
534               / [               h16 ] "::" 4( h16 ":" ) ls32
535               / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
536               / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
537               / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
538               / [ *4( h16 ":" ) h16 ] "::"              ls32
539               / [ *5( h16 ":" ) h16 ] "::"              h16
540               / [ *6( h16 ":" ) h16 ] "::"
541               
542h16            = 1*4HEXDIG
543ls32           = ( h16 ":" h16 ) / IPv4address
544
545IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet
546
547dec-octet      = DIGIT                 ; 0-9
548               / %x31-39 DIGIT         ; 10-99
549               / "1" 2DIGIT            ; 100-199
550               / "2" %x30-34 DIGIT     ; 200-249
551               / "25" %x30-35          ; 250-255
552           
553pct-encoded    = "%" HEXDIG HEXDIG
554
555unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
556reserved       = gen-delims / sub-delims
557gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
558sub-delims     = "!" / "$" / "&amp;" / "'" / "(" / ")"
559               / "*" / "+" / "," / ";" / "="
560</artwork></figure>
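<figure>
<preamble>As an informal illustration, the 'dec-octet' and 'IPv4address' rules above can be transcribed into a regular expression. The Python sketch below is one such transcription; the name is_ipv4address is hypothetical.</preamble>
<artwork><![CDATA[
```python
import re

# One alternative per dec-octet branch: 250-255, 200-249,
# 100-199, 10-99, 0-9.
_DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
_IPV4 = re.compile(r"{0}\.{0}\.{0}\.{0}".format(_DEC_OCTET))

def is_ipv4address(text: str) -> bool:
    """True if text as a whole matches the IPv4address rule."""
    return _IPV4.fullmatch(text) is not None
```
]]></artwork>
<postamble>Explicit [0-9] classes are used instead of a generic digit class so that only US-ASCII digits match, as DIGIT requires.</postamble>
</figure>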
561
562<t>This syntax does not support IPv6 scoped addressing zone identifiers.</t>
563
564</section> <!-- abnf -->
565
566</section> <!-- syntax -->
567
568<section title="Processing IRIs and related protocol elements" anchor="processing">
569
570<t>IRIs are meant to replace URIs in identifying resources within new
571versions of protocols, formats, and software components that use a
572UCS-based character repertoire.  Protocols and components may use and
573process IRIs directly. However, there are still numerous systems and
574protocols which only accept URIs or components of parsed URIs; that is,
575they only accept sequences of characters within the subset of US-ASCII
576characters allowed in URIs. </t>
577
578<t>This section defines specific processing steps for IRI consumers
579which establish the relationship between the string given and the
580interpreted derivatives. These
581processing steps apply to both IRIs and IRI references (i.e., absolute
582or relative forms); for IRIs, some steps are scheme specific. </t>
583
584<section title="Converting to UCS" anchor="ucsconv"> 
585 
586<t>Input that is already in a Unicode form (i.e., a sequence of Unicode
587 characters or an octet-stream representing a Unicode-based character
588 encoding such as UTF-8 or UTF-16) should be left as is and not
589 normalized (see <xref target="normalization"/>).</t>
590 
591<t>If the IRI or IRI reference is an octet stream in some known
592 non-Unicode character encoding, convert the IRI to a sequence of
593 characters from the UCS; this sequence SHOULD also be normalized
594 according to Unicode Normalization Form C (NFC, <xref
595 target="UTR15"/>). In this case, retain the original character
596 encoding as the "document character encoding". (DESIGN QUESTION:
597 NOT WHAT MOST IMPLEMENTATIONS DO, CHANGE? ) </t>
598
599<t> In other cases (written on paper, read aloud, or otherwise
600 represented independent of any character encoding) represent the IRI
601 as a sequence of characters from the UCS normalized according to
602 Unicode Normalization Form C (NFC, <xref target="UTR15"/>).</t>
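<figure>
<preamble>The first two cases above can be sketched as follows. This Python fragment is illustrative only: the function name to_ucs is hypothetical, and treating every codec whose canonical name starts with "utf" as a Unicode-based encoding is an assumption of the sketch.</preamble>
<artwork><![CDATA[
```python
import codecs
import unicodedata

def to_ucs(data, encoding=None):
    """Return the IRI as a str of UCS characters.

    data is a str (already Unicode) or bytes in 'encoding'.
    """
    if isinstance(data, str):
        return data                  # Unicode input: leave as is
    text = data.decode(encoding)
    if codecs.lookup(encoding).name.startswith("utf"):
        return text                  # Unicode-based: no NFC
    # Known non-Unicode encoding: normalize to NFC.
    return unicodedata.normalize("NFC", text)
```
]]></artwork>
<postamble>An IRI taken from a non-digital source would, once typed in, be passed through unicodedata.normalize("NFC", ...) directly.</postamble>
</figure>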
603</section> <!-- ucsconv -->
604
605<section title="Parse the IRI into IRI components">
606
607<t>Parse the IRI, either as a relative reference (no scheme)
608or using scheme specific processing (according to the scheme
609given); the result is a set of parsed IRI components.
610(NOTE: FIX BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES
611THAT USE GENERIC SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN
612ONLY USE AUTHORITY FOR NAMES THAT FOLLOW PUNICODE.)
613 </t>
614
615<t>NOTE: Parsing into components establishes a correspondence
616between substrings of the IRI and the parts of the grammar that they
617match.  For example, in <xref target="HTML5"/>, the protocol
618components of interest are SCHEME (scheme), HOST (ireg-name), PORT
619(port), the PATH (ipath after the initial "/"), QUERY (iquery),
620FRAGMENT (ifragment), and AUTHORITY (iauthority).
621</t>
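<figure>
<preamble>For schemes that follow the generic syntax, the split into components can be sketched with the regular expression given in Appendix B of <xref target="RFC3986"/>. The Python fragment below is illustrative only; it does not perform scheme-specific parsing, and the name split_iri is hypothetical.</preamble>
<artwork><![CDATA[
```python
import re

# Generic-syntax regular expression from RFC 3986, Appendix B.
_GENERIC = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_iri(iri: str) -> dict:
    """Split an IRI reference into its five generic components.

    A component that is absent is reported as None.
    """
    m = _GENERIC.match(iri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```
]]></artwork>
<postamble>Because the expression only tests for the delimiters ":", "/", "?", and "#", it applies unchanged to strings containing non-ASCII characters.</postamble>
</figure>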
622
623<t>Subsequent processing rules are sometimes used to define other
624syntactic components. For example, <xref target="HTML5"/> defines APIs
625for IRI processing; in these APIs:
626
627<list style="hanging">
628<t hangText="HOSTSPECIFIC"> the substring that follows
629the substring matched by the iauthority production, or the whole
630string if the iauthority production wasn't matched.</t>
631<t hangText="HOSTPORT"> if there is a scheme component and a port
632component and the port given by the port component is different than
633the default port defined for the protocol given by the scheme
634component, then HOSTPORT is the substring that starts with the
635substring matched by the host production and ends with the substring
636matched by the port production, and includes the colon in between the
637two. Otherwise, it is the same as the host component.
638</t>
639</list>
640</t>
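<figure>
<preamble>The HOSTPORT rule can be sketched as follows. This Python fragment is illustrative only: the function name hostport and the table DEFAULT_PORTS are hypothetical, and real default ports come from each scheme's definition.</preamble>
<artwork><![CDATA[
```python
# Hypothetical default-port table for the sketch.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def hostport(scheme, host, port):
    """Return HOSTPORT for already-parsed components.

    scheme and port are None (or empty) when the component is
    absent; port is a string of digits otherwise.
    """
    if scheme and port:
        default = DEFAULT_PORTS.get(scheme.lower())
        if int(port) != default:
            return host + ":" + port
    return host
```
]]></artwork>
<postamble>Ports are compared numerically here, which is a choice of the sketch rather than something the text above prescribes.</postamble>
</figure>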
641</section> <!-- parse -->
642
643<section title="General percent-encoding of IRI components" anchor="compmapping">
644   
645<t>For most IRI components, it is possible to map the IRI component
646to an equivalent URI component by percent-encoding those characters
647not allowed in URIs. Previous processing steps will have removed
648some characters, and the interpretation of reserved characters will
649have already been done (with the syntactic reserved characters outside
650of the IRI component). This mapping is defined for all sequences
651of Unicode characters, whether or not they are valid for the component
652in question. </t>
653   
654<t>For each character which is not allowed in a valid URI (NOTE: WHAT
655IS THE RIGHT REFERENCE HERE), apply the following steps. </t>
656
657<t><list style="hanging">
658
659<t hangText="Convert to UTF-8">Convert the character to a sequence of
660  one or more octets using UTF-8 <xref target="RFC3629"/>.</t>
661
662<t hangText="Percent encode">Convert each octet of this sequence to %HH,
663   where HH is the hexadecimal notation of the octet value. The
664   hexadecimal notation SHOULD use uppercase letters. (This is the
665   general URI percent-encoding mechanism in Section 2.1 of <xref
666   target="RFC3986"/>.)</t>
667   
668</list></t>
669
670<t>Note that the mapping is an identity transformation for parsed URI
671components of valid URIs, and is idempotent: applying the mapping a
672second time will not change anything.</t>
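
<figure><preamble>As a non-normative illustration, the two steps above
can be sketched in Python as follows. The helper name
percent_encode_component and the set URI_ALLOWED are hypothetical. Because
the reference for "allowed in a valid URI" is still open, the sketch
assumes the unreserved and reserved characters of <xref target="RFC3986"/>
plus "%".</preamble>
<artwork><![CDATA[
```python
# Illustrative only. ASSUMPTION: "allowed in a valid URI" is taken to
# mean the unreserved and reserved characters of RFC 3986, plus "%".
URI_ALLOWED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"           # unreserved
    ":/?#[]@"        # gen-delims
    "!$&'()*+,;="    # sub-delims
    "%"              # start of an existing percent-encoding
)


def percent_encode_component(component: str) -> str:
    """Map an IRI component to a URI component (hypothetical helper)."""
    out = []
    for ch in component:
        if ch in URI_ALLOWED:
            out.append(ch)  # identity for characters already allowed
        else:
            # Convert to UTF-8, then write each octet as uppercase %HH.
            for octet in ch.encode("utf-8"):
                out.append("%{:02X}".format(octet))
    return "".join(out)
```
]]></artwork>
<postamble>Because "%" is in the allowed set, existing percent-encodings
pass through unchanged, which is what makes the sketch idempotent.</postamble></figure>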
673</section> <!-- general conversion -->
674
675<section title="Mapping ireg-name" anchor="dnsmapping">
676
<t>Schemes that allow non-ASCII characters
in the reg-name (ireg-name) position MUST convert the ireg-name
component of an IRI as follows:</t>

<t>Replace the ireg-name part of the IRI by the part converted using
the ToASCII operation specified in Section 4.1 of <xref
target="RFC3490"/> on each dot-separated label, and by using U+002E
(FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set
to FALSE, and with the flag AllowUnassigned set to FALSE.
The ToASCII operation may fail, which means that the IRI cannot be
resolved. If the domain name conversion fails, then the
entire IRI conversion fails. Processors that have no mechanism for
signalling a failure MAY instead substitute an otherwise
invalid host name, although such processing SHOULD be avoided.
 </t>
693
<t>For example, the IRI
<vspace/>"http://r&amp;#xE9;sum&amp;#xE9;.example.org"<vspace/> MAY be
converted to <vspace/>"http://xn--rsum-bpad.example.org"<vspace/>;
conversion to percent-encoded form, e.g.,
 <vspace/>"http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed. </t>
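
<figure><preamble>The ireg-name conversion can be sketched, non-normatively,
in Python. The helper name map_ireg_name is hypothetical, and the sketch
assumes that Python's built-in "idna" codec, an implementation of
<xref target="RFC3490"/>, is an acceptable stand-in for ToASCII. Whether
the codec's treatment of UseSTD3ASCIIRules and AllowUnassigned matches the
flag settings given above has not been verified here.</preamble>
<artwork><![CDATA[
```python
def map_ireg_name(ireg_name: str) -> str:
    """Apply ToASCII to each dot-separated label (hypothetical helper).

    ASSUMPTION: Python's built-in "idna" codec stands in for the
    ToASCII operation; its flag handling is not verified here.
    """
    try:
        return ireg_name.encode("idna").decode("ascii")
    except UnicodeError as exc:
        # A failed domain name conversion fails the whole conversion.
        raise ValueError("ireg-name cannot be converted") from exc
```
]]></artwork>
<postamble>The sketch reports failure by raising an exception rather than
by substituting an invalid host name.</postamble></figure>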
699
700<t><list style="hanging"> 
701
702<t hangText="Note:">Domain Names may appear in parts of an IRI other
703than the ireg-name part.  It is the responsibility of scheme-specific
704implementations (if the Internationalized Domain Name is part of the
705scheme syntax) or of server-side implementations (if the
706Internationalized Domain Name is part of 'iquery') to apply the
707necessary conversions at the appropriate point. Example: Trying to
708validate the Web page at<vspace/>
709http://r&amp;#xE9;sum&amp;#xE9;.example.org would lead to an IRI of
710<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr&amp;#xE9;sum&amp;#xE9;.<vspace/>example.org,
711which would convert to a URI
712of<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.<vspace/>example.org.
713The server-side implementation is responsible for making the
714necessary conversions to be able to retrieve the Web page.</t>
715
716<t hangText="Note:">In this process, characters allowed in URI
717references and existing percent-encoded sequences are not encoded further.
718(This mapping is similar to, but different from, the encoding applied
719when arbitrary content is included in some part of a URI.)
720
721For example, an IRI of
722<vspace/>"http://www.example.org/red%09ros&amp;#xE9;#red"
723(in XML notation) is converted to
724<vspace/>"http://www.example.org/red%09ros%C3%A9#red", not to
725something like
726<vspace/>"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
727((DESIGN QUESTION: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get converted to punycode, or not?))
728
729</t>
730
731</list></t>
732</section> <!-- dnsmapping -->
733
734<section title="Mapping query components" anchor="querymapping">
735
736<t>((NOTE: SEE ISSUES LIST))
737
For compatibility with existing deployed HTTP infrastructure,
the following special case applies to IRIs with the scheme "http" or
"https" whose origin has a document charset other than one which
is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query"
component of an IRI is mapped into a URI by using the document
charset rather than UTF-8 as the binary representation before
percent-encoding. This mapping is not applied to any other scheme
or component.</t>
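
<figure><preamble>A non-normative Python sketch of this special case
follows. The helper name map_query and its parameters are hypothetical.
The sketch assumes that the characters allowed in a URI are the unreserved
and reserved characters of <xref target="RFC3986"/> plus "%", and it
approximates "UCS-based" as "any UTF-* codec". The text above does not say
what to do with a character that the document charset cannot represent;
the sketch simply lets the encoding error propagate.</preamble>
<artwork><![CDATA[
```python
import codecs

# ASSUMPTION: the same "allowed in a URI" set as RFC 3986
# (unreserved, reserved, and "%").
URI_ALLOWED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~:/?#[]@!$&'()*+,;=%"
)


def map_query(query: str, scheme: str, document_charset: str) -> str:
    """Percent-encode an iquery component (hypothetical helper)."""
    charset = codecs.lookup(document_charset).name
    special = scheme.lower() in ("http", "https")
    # ASSUMPTION: "UCS-based" is approximated as "a UTF-* codec".
    if not special or charset.startswith("utf"):
        charset = "utf-8"
    out = []
    for ch in query:
        if ch in URI_ALLOWED:
            out.append(ch)
        else:
            # Raises UnicodeEncodeError if the charset lacks ch.
            out.extend("%{:02X}".format(o) for o in ch.encode(charset))
    return "".join(out)
```
]]></artwork></figure>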
746
747</section> <!-- querymapping -->
748
749<section title="Mapping IRIs to URIs" anchor="mapping">
750
<t>The canonical mapping from an IRI to a URI is defined by applying the
mapping above (from IRI to URI components) and then reassembling a URI
from the parsed URI components using the original punctuation that
delimited the IRI components. </t>
755
756</section> <!-- mapping -->
757
758<section title="Converting URIs to IRIs" anchor="URItoIRI">
759
760<t>In some situations, for presentation and further processing,
761it is desirable to convert a URI into an equivalent IRI in which
762natural characters are represented directly rather than
percent encoded. Of course, every URI is already an IRI in
its own right without any conversion.
This section gives one possible procedure for this conversion.
766</t>
767
768<t>
769The conversion described in this section, if given a valid URI, will
770result in an IRI that maps back to the URI used as an input for the
771conversion (except for potential case differences in percent-encoding
772and for potential percent-encoded unreserved characters).
773
774However, the IRI resulting from this conversion may differ
775from the original IRI (if there ever was one).</t> 
776
777<t>URI-to-IRI conversion removes percent-encodings, but not all
778percent-encodings can be eliminated. There are several reasons for
779this:</t>
780
781<t><list style="hanging">
782
783<t hangText="1.">Some percent-encodings are necessary to distinguish
784    percent-encoded and unencoded uses of reserved characters.</t>
785
786<t hangText="2.">Some percent-encodings cannot be interpreted as sequences
787    of UTF-8 octets.<vspace blankLines="1"/>
788    (Note: The octet patterns of UTF-8 are highly regular.
789    Therefore, there is a very high probability, but no guarantee,
790    that percent-encodings that can be interpreted as sequences of UTF-8
791    octets actually originated from UTF-8. For a detailed discussion,
792    see <xref target="Duerst97"/>.)</t>
793
794<t hangText="3.">The conversion may result in a character that is not
795    appropriate in an IRI. See <xref target="abnf"/>, <xref target="visual"/>,
796      and <xref target="limitations"/> for further details.</t>
797
798<t hangText="4.">IRI to URI conversion has different rules for
799    dealing with domain names and query parameters.</t>
800
801</list></t>
802
803<t>Conversion from a URI to an IRI MAY be done by using the following
804steps:
805
806<list style="hanging">
807<t hangText="1.">Represent the URI as a sequence of octets in
808       US-ASCII.</t>
809
810<t hangText="2.">Convert all percent-encodings ("%" followed by two
811      hexadecimal digits) to the corresponding octets, except those
812      corresponding to "%", characters in "reserved", and characters
813      in US-ASCII not allowed in URIs.</t> 
814
815<t hangText="3.">Re-percent-encode any octet produced in step 2 that
816      is not part of a strictly legal UTF-8 octet sequence.</t>
817
818
819<t hangText="4.">Re-percent-encode all octets produced in step 3 that
820      in UTF-8 represent characters that are not appropriate according
821      to <xref target="abnf"/>, <xref target="visual"/>, and <xref
822      target="limitations"/>.</t> 
823
824<t hangText="5.">Interpret the resulting octet sequence as a sequence
825      of characters encoded in UTF-8.</t>
826
827<t hangText="6.">URIs known to contain domain names in the reg-name
828      component SHOULD convert punycode-encoded domain name labels to
829      the corresponding characters using the ToUnicode procedure. </t>
830</list></t>
831
832<t>This procedure will convert as many percent-encoded characters as
833possible to characters in an IRI. Because there are some choices when
834step 4 is applied (see <xref target="limitations"/>), results may
835vary.</t>
836
837<t>Conversions from URIs to IRIs MUST NOT use any character
838encoding other than UTF-8 in steps 3 and 4, even if it might be
839possible to guess from the context that another character encoding
840than UTF-8 was used in the URI.  For example, the URI
841"http://www.example.org/r%E9sum%E9.html" might with some guessing be
842interpreted to contain two e-acute characters encoded as
843iso-8859-1. It must not be converted to an IRI containing these
844e-acute characters. Otherwise, in the future the IRI will be mapped to
845"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
846URI from "http://www.example.org/r%E9sum%E9.html".</t>
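
<figure><preamble>Steps 1 through 5 of the conversion above can be
sketched, non-normatively, in Python. All names in the sketch are
hypothetical. Step 4 is only approximated: the sketch treats just the
bidirectional formatting characters listed in <xref target="visual"/> as
not appropriate, and does not check the further restrictions of
<xref target="abnf"/> and <xref target="limitations"/>. Step 6 (ToUnicode)
is not performed. Only UTF-8 is ever used to interpret octets.</preamble>
<artwork><![CDATA[
```python
import re

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~"
)
# Step 4 is approximated by the bidirectional formatting characters
# LRM, RLM, LRE, RLE, PDF, LRO, and RLO only (an ASSUMPTION).
NOT_APPROPRIATE = frozenset(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
)

_PCT = re.compile(
    r"((?:%[89A-Fa-f][0-9A-Fa-f])+)"   # run of non-ASCII octets
    r"|%([0-7][0-9A-Fa-f])"            # or one ASCII octet
)


def _pct(octets: bytes) -> str:
    """Write octets as uppercase percent-encodings."""
    return "".join("%{:02X}".format(o) for o in octets)


def _convert(match) -> str:
    if match.group(2) is not None:
        ch = chr(int(match.group(2), 16))
        # Step 2: "%", reserved, and disallowed ASCII stay encoded.
        return ch if ch in UNRESERVED else match.group(0)
    octets = bytes.fromhex(match.group(1).replace("%", ""))
    out = []
    # Octets that are not legal UTF-8 decode to U+DC80..U+DCFF.
    for ch in octets.decode("utf-8", errors="surrogateescape"):
        if 0xDC80 <= ord(ch) <= 0xDCFF:
            out.append(_pct(bytes([ord(ch) - 0xDC00])))  # step 3
        elif ch in NOT_APPROPRIATE:
            out.append(_pct(ch.encode("utf-8")))         # step 4
        else:
            out.append(ch)                               # step 5
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Steps 1-5 only; step 6 (ToUnicode) is not performed."""
    uri.encode("ascii")  # step 1: raises if the URI is not US-ASCII
    return _PCT.sub(_convert, uri)
```
]]></artwork>
<postamble>Octets that are re-percent-encoded in steps 3 and 4 come out in
uppercase, whereas percent-encodings that step 2 never touches keep their
original case.</postamble></figure>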
847
848<section title="Examples">
849
850<t>This section shows various examples of converting URIs to IRIs.
851Each example shows the result after each of the steps 1 through 6 is
852applied. XML Notation is used for the final result.  Octets are
853denoted by "&lt;" followed by two hexadecimal digits followed by
854"&gt;".</t>
855
856<t>The following example contains the sequence "%C3%BC", which is a
857strictly legal UTF-8 sequence, and which is converted into the actual
858character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
859u-umlaut).
860
861<list style="hanging">
862<t hangText="1.">http://www.example.org/D%C3%BCrst</t>
863<t hangText="2.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
864<t hangText="3.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
865<t hangText="4.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
866<t hangText="5.">http://www.example.org/D&amp;#xFC;rst</t>
867<t hangText="6.">http://www.example.org/D&amp;#xFC;rst</t>
868</list>
869</t>
870
871<t>The following example contains the sequence "%FC", which might
872represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in
873the<vspace/>iso-8859-1 character encoding.  (It might represent other
874characters in other character encodings. For example, the octet
875&lt;fc&gt; in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER
876KJE.)  Because &lt;fc&gt; is not part of a strictly legal UTF-8
877sequence, it is re-percent-encoded in step 3.
878
879
880<list style="hanging">
881<t hangText="1.">http://www.example.org/D%FCrst</t>
882<t hangText="2.">http://www.example.org/D&lt;fc&gt;rst</t>
883<t hangText="3.">http://www.example.org/D%FCrst</t>
884<t hangText="4.">http://www.example.org/D%FCrst</t>
885<t hangText="5.">http://www.example.org/D%FCrst</t>
886<t hangText="6.">http://www.example.org/D%FCrst</t>
887</list>
888</t>
889
890<t>The following example contains "%e2%80%ae", which is the percent-encoded<vspace/>UTF-8
891character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. <xref target="visual"/>
892forbids the direct use of this character in an IRI. Therefore, the
893corresponding octets are re-percent-encoded in step 4. This example shows
that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved.
The example also contains a punycode-encoded domain name label (xn--99zt52a),
which is converted to the corresponding characters in step 6.
897
898<list style="hanging">
899<t hangText="1.">http://xn--99zt52a.example.org/%e2%80%ae</t>
900<t hangText="2.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
901<t hangText="3.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
902<t hangText="4.">http://xn--99zt52a.example.org/%E2%80%AE</t>
903<t hangText="5.">http://xn--99zt52a.example.org/%E2%80%AE</t>
904<t hangText="6.">http://&amp;#x7D0D;&amp;#x8C46;.example.org/%E2%80%AE</t>
905</list></t>
906
907<t>Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
908(Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this note.))</t>
909
910</section> <!-- examples -->
911</section> <!-- URItoIRI -->
912</section> <!-- processing -->
913<section title="Bidirectional IRIs for Right-to-Left Languages" anchor="Bidi">
914
915<t>Some UCS characters, such as those used in the Arabic and Hebrew
916scripts, have an inherent right-to-left (rtl) writing direction. IRIs
917containing these characters (called bidirectional IRIs or Bidi IRIs)
918require additional attention because of the non-trivial relation
919between logical representation (used for digital representation and
920for reading/spelling) and visual representation (used for
921display/printing).</t>
922
923<t>Because of the complex interaction between the logical representation,
924the visual representation, and the syntax of a Bidi IRI, a balance is
925needed between various requirements.
926The main requirements are<list style="hanging">
927<t hangText="1.">user-predictable conversion between visual and
928    logical representation;</t>
929<t hangText="2.">the ability to include a wide range of characters
930    in various parts of the IRI; and</t>
931<t hangText="3.">minor or no changes or restrictions for
932      implementations.</t>
933</list></t>
934
935<section title="Logical Storage and Visual Presentation" anchor="visual">
936
937<t>When stored or transmitted in digital representation, bidirectional
938IRIs MUST be in full logical order and MUST conform to the IRI syntax
rules (including the rules relevant to their scheme). This
940ensures that bidirectional IRIs can be processed in the same way as
941other IRIs.</t> <t>Bidirectional IRIs MUST be rendered by using the
942Unicode Bidirectional Algorithm <xref target="UNIV4"/>, <xref
943target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
944as they would be if they were in a left-to-right embedding; i.e., as
945if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
946followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
947embedding direction can also be done in a higher-level protocol (e.g.,
948the dir='ltr' attribute in HTML).</t> 
949
950<t>There is no requirement to use the above embedding if the display
951is still the same without the embedding. For example, a bidirectional
952IRI in a text with left-to-right base directionality (such as used for
953English or Cyrillic) that is preceded and followed by whitespace and
954strong left-to-right characters does not need an embedding.  Also, a
955bidirectional relative IRI reference that only contains strong
956right-to-left characters and weak characters and that starts and ends
957with a strong right-to-left character and appears in a text with
958right-to-left base directionality (such as used for Arabic or Hebrew)
959and is preceded and followed by whitespace and strong characters does
960not need an embedding.</t>
961
962<t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
963sufficient to force the correct display behavior.  However, the
964details of the Unicode Bidirectional algorithm are not always easy to
965understand. Implementers are strongly advised to err on the side of
966caution and to use embedding in all cases where they are not
967completely sure that the display behavior is unaffected without the
968embedding.</t>
969
970<t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
9714.3) permits higher-level protocols to influence bidirectional
972rendering. Such changes by higher-level protocols MUST NOT be used if
973they change the rendering of IRIs.</t> 
974
975<t>The bidirectional formatting characters that may be used before or
976after the IRI to ensure correct display are not themselves part of the
977IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
RLM, LRE, RLE, LRO, RLO, and PDF). These characters affect the visual
rendering of the IRI but are not themselves visible. It would therefore
not be possible to input an IRI with such characters correctly.</t>
981
982</section> <!-- visual -->
983<section title="Bidi IRI Structure" anchor="bidi-structure">
984
985<t>The Unicode Bidirectional Algorithm is designed mainly for running
986text.  To make sure that it does not affect the rendering of
987bidirectional IRIs too much, some restrictions on bidirectional IRIs
988are necessary. These restrictions are given in terms of delimiters
989(structural characters, mostly punctuation such as "@", ".", ":",
990and<vspace/>"/") and components (usually consisting mostly of letters
991and digits).</t>
992
993<t>The following syntax rules from <xref target="abnf"/> correspond to
components for the purpose of Bidi behavior: iuserinfo, ireg-name,
isegment, isegment-nz, isegment-nz-nc, iquery, and
ifragment.</t>
997
998<t>Specifications that define the syntax of any of the above
999components MAY divide them further and define smaller parts to be
1000components according to this document. As an example, the restrictions
1001of <xref target="RFC3490"/> on bidirectional domain names correspond
1002to treating each label of a domain name as a component for schemes
1003with ireg-name as a domain name.  Even where the components are not
1004defined formally, it may be helpful to think about some syntax in
1005terms of components and to apply the relevant restrictions.  For
1006example, for the usual name/value syntax in query parts, it is
1007convenient to treat each name and each value as a component. As
1008another example, the extensions in a resource name can be treated as
1009separate components.</t>
1010
1011<t>For each component, the following restrictions apply:</t>
1012<t>
1013<list style="hanging">
1014
1015<t hangText="1.">A component SHOULD NOT use both right-to-left and
1016  left-to-right characters.</t>
1017
1018<t hangText="2.">A component using right-to-left characters SHOULD
1019  start and end with right-to-left characters.</t>
1020
1021</list></t>
1022
1023<t>The above restrictions are given as "SHOULD"s, rather than as
1024"MUST"s.  For IRIs that are never presented visually, they are not
1025relevant.  However, for IRIs in general, they are very important to
1026ensure consistent conversion between visual presentation and logical
1027representation, in both directions.</t>
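
<figure><preamble>The two restrictions above can be checked mechanically.
The following non-normative Python sketch uses the hypothetical helper
check_bidi_component. It assumes that "right-to-left characters" are those
with Unicode bidirectional class R or AL, and that "left-to-right
characters" are those with class L; digits and other weak or neutral
characters count as neither.</preamble>
<artwork><![CDATA[
```python
import unicodedata

RTL_CLASSES = ("R", "AL")  # strong right-to-left bidi classes


def check_bidi_component(component: str) -> list:
    """Return the restrictions a component violates (hypothetical)."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)
    problems = []
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right")
    if has_rtl and not (
        classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
    ):
        problems.append("does not start and end with right-to-left")
    return problems
```
]]></artwork>
<postamble>An empty result means that neither restriction is violated
under these assumptions.</postamble></figure>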
1028
1029<t><list style="hanging">
1030
1031<t hangText="Note:">In some components, the above restrictions may
1032  actually be strictly enforced.  For example, <xref
1033  target="RFC3490"></xref> requires that these restrictions apply to
1034  the labels of a host name for those schemes where ireg-name is a
1035  host name.  In some other components (for example, path components)
1036  following these restrictions may not be too difficult.  For other
1037  components, such as parts of the query part, it may be very
1038  difficult to enforce the restrictions because the values of query
1039  parameters may be arbitrary character sequences.</t>
1040
1041</list></t>
1042
1043<t>If the above restrictions cannot be satisfied otherwise, the
1044affected component can always be mapped to URI notation as described
1045in <xref target="compmapping"/>. Please note that the whole component
1046has to be mapped (see also Example 9 below).</t>
1047
1048</section> <!-- bidi-structure -->
1049
1050<section title="Input of Bidi IRIs" anchor="bidiInput">
1051
1052<t>Bidi input methods MUST generate Bidi IRIs in logical order while
1053rendering them according to <xref target="visual"/>.  During input,
1054rendering SHOULD be updated after every new character is input to
1055avoid end-user confusion.</t>
1056
1057</section> <!-- bidiInput -->
1058
1059<section title="Examples">
1060
1061<t>This section gives examples of bidirectional IRIs, in Bidi
1062Notation.  It shows legal IRIs with the relationship between logical
1063and visual representation and explains how certain phenomena in this
1064relationship may look strange to somebody not familiar with
1065bidirectional behavior, but familiar to users of Arabic and Hebrew. It
1066also shows what happens if the restrictions given in <xref
1067target="bidi-structure"/> are not followed. The examples below can be
1068seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
1069variants.</t>
1070
1071<t>To read the bidi text in the examples, read the visual
1072representation from left to right until you encounter a block of rtl
1073text. Read the rtl block (including slashes and other special
1074characters) from right to left, then continue at the next unread ltr
1075character.</t>
1076
1077<t>Example 1: A single component with rtl characters is inverted:
1078<vspace/>Logical representation:
1079"http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
1080"http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
1081one by one, and each component can be read in its natural
1082direction.</t>
1083
1084<t>Example 2: More than one consecutive component with rtl characters
1085is inverted as a whole: <vspace/>Logical representation:
1086"http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
1087"http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
1088components is read rtl, in the same way as a sequence of rtl words is
1089read rtl in a bidi text.</t>
1090
1091<t>Example 3: All components of an IRI (except for the scheme) are
1092rtl.  All rtl components are inverted overall: <vspace/>Logical
1093representation:
1094"http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
1095representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
1096whole IRI (except the scheme) is read rtl. Delimiters between rtl
1097components stay between the respective components; delimiters between
1098ltr and rtl components don't move.</t>
1099
1100<t>Example 4: Each of several sequences of rtl components is inverted
1101on its own: <vspace/>Logical representation:
1102"http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
1103"http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
1104components is read rtl, in the same way as each sequence of rtl words
1105in an ltr text is read rtl.</t>
1106
1107<t>Example 5: Example 2, applied to components of different kinds:
1108<vspace/>Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
1109<vspace/>Visual representation:
1110"http://ab.cd.HG/FE/ij/kl.html"<vspace/> The inversion of the domain
1111name label and the path component may be unexpected, but it is
1112consistent with other bidi behavior.  For reassurance that the domain
1113component really is "ab.cd.EF", it may be helpful to read aloud the
1114visual representation following the bidi algorithm. After
1115"http://ab.cd." one reads the RTL block "E-F-slash-G-H", which
1116corresponds to the logical representation.
1117</t>
1118
1119<t>Example 6: Same as Example 5, with more rtl components:
1120<vspace/>Logical representation:
1121"http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
1122"http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
1123name labels and the path components may be easier to identify because
1124the delimiters also move.</t>
1125
1126<t>Example 7: A single rtl component includes digits: <vspace/>Logical
1127representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
1128representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
1129Numbers are written ltr in all cases but are treated as an additional
1130embedding inside a run of rtl characters. This is completely
1131consistent with usual bidirectional text.</t>
1132
1133<t>Example 8 (not allowed): Numbers are at the start or end of an rtl
1134component:<vspace/>Logical representation:
1135"http://ab.cd.ef/GH1/2IJ/KL.html"<vspace/>Visual representation:
1136"http://ab.cd.ef/LK/JI1/2HG.html"<vspace/> The sequence "1/2" is
1137interpreted by the bidi algorithm as a fraction, fragmenting the
1138components and leading to confusion. There are other characters that
1139are interpreted in a special way close to numbers; in particular, "+",
1140"-", "#", "$", "%", ",", ".", and ":".</t>
1141
1142<t>Example 9 (not allowed): The numbers in the previous example are
1143percent-encoded: <vspace/>Logical representation:
1144"http://ab.cd.ef/GH%31/%32IJ/KL.html",<vspace/>Visual representation:
1145"http://ab.cd.ef/LK/JI%32/%31HG.html"</t>
1146
1147<t>Example 10 (allowed but not recommended): <vspace/>Logical
1148representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
1149representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
1150Components consisting of only numbers are allowed (it would be rather
1151difficult to prohibit them), but these may interact with adjacent RTL
1152components in ways that are not easy to predict.</t>
1153
1154<t>Example 11 (allowed but not recommended): <vspace/>Logical
1155representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
1156representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
1157Components consisting of numbers and left-to-right characters are
1158allowed, but these may interact with adjacent RTL components in ways
1159that are not easy to predict.</t>
1160</section><!-- examples -->
1161</section><!-- bidi -->
1162
1163<section title="Normalization and Comparison" anchor="equivalence">
1164
1165<t><list style="hanging"><t hangText="Note:">The structure and much of
1166  the material for this section is taken from section 6 of <xref
1167  target="RFC3986"></xref>; the differences are due to the specifics
1168  of IRIs.</t></list></t>
1169
1170<t>One of the most common operations on IRIs is simple comparison:
1171Determining whether two IRIs are equivalent, without using the IRIs to
1172access their respective resource(s). A comparison is performed
1173whenever a response cache is accessed, a browser checks its history to
1174color a link, or an XML parser processes tags within a
1175namespace. Extensive normalization prior to comparison of IRIs may be
1176used by spiders and indexing engines to prune a search space or reduce
1177duplication of request actions and response storage.</t>
1178
1179<t>IRI comparison is performed for some particular purpose. Protocols
1180or implementations that compare IRIs for different purposes will often
be subject to differing design trade-offs with regard to how much
1182effort should be spent in reducing aliased identifiers. This section
1183describes various methods that may be used to compare IRIs, the
1184trade-offs between them, and the types of applications that might use
1185them.</t>
1186
1187<section title="Equivalence">
1188
1189<t>Because IRIs exist to identify resources, presumably they should be
1190considered equivalent when they identify the same resource. However,
1191this definition of equivalence is not of much practical use, as there
1192is no way for an implementation to compare two resources to determine
1193if they are "the same" unless it has full knowledge or control of
1194them. For this reason, determination of equivalence or difference of
1195IRIs is based on string comparison, perhaps augmented by reference to
1196additional rules provided by URI scheme definitions.  We use the terms
1197"different" and "equivalent" to describe the possible outcomes of such
1198comparisons, but there are many application-dependent versions of
1199equivalence.</t>
1200
1201<t>Even when it is possible to determine that two IRIs are equivalent,
1202IRI comparison is not sufficient to determine whether two IRIs
1203identify different resources. For example, an owner of two different
1204domain names could decide to serve the same resource from both,
1205resulting in two different IRIs. Therefore, comparison methods are
1206designed to minimize false negatives while strictly avoiding false
1207positives.</t>
1208
1209<t>In testing for equivalence, applications should not directly
1210compare relative references; the references should be converted to
1211their respective target IRIs before comparison. When IRIs are compared
1212to select (or avoid) a network action, such as retrieval of a
1213representation, fragment components (if any) should be excluded from
1214the comparison.</t>
1215
1216<t>Applications using IRIs as identity tokens with no relationship to
1217a protocol MUST use the Simple String Comparison (see <xref
1218target="stringcomp"></xref>).  All other applications MUST select one
of the comparison practices from the Comparison Ladder (see <xref
target="ladder"></xref>).</t>
1221</section> <!-- equivalence -->
1222
1223
1224<section title="Preparation for Comparison">
<t>Any kind of IRI comparison REQUIRES that any additional contextual
processing be performed first, including undoing higher-level
escapings or encodings in the protocol or format that carries an
IRI. This preprocessing is usually done when the protocol or format is
parsed.</t>
1230
1231<t>Examples of contextual preprocessing steps are described in <xref
1232target="LEIRIHREF"/>. </t>
1233
1234<t>Examples of such escapings or encodings are entities and
1235numeric character references in <xref target="HTML4"></xref> and <xref
1236target="XML1"></xref>. As an example,
1237"http://example.org/ros&amp;eacute;" (in HTML),
1238"http://example.org/ros&amp;#233;" (in HTML or XML), and
1239<vspace/>"http://example.org/ros&amp;#xE9;" (in HTML or XML) are all
1240resolved into what is denoted in this document (see <xref
1241target="sec-Notation"></xref>) as "http://example.org/ros&amp;#xE9;"
1242(the "&amp;#xE9;" here standing for the actual e-acute character, to
1243compensate for the fact that this document cannot contain non-ASCII
1244characters).</t>
1245
1246<t>Similar considerations apply to encodings such as Transfer Codings
1247in HTTP (see <xref target="RFC2616"></xref>) and Content Transfer
1248Encodings in MIME (<xref target="RFC2045"></xref>), although in these
1249cases, the encoding is based not on characters but on octets, and
1250additional care is required to make sure that characters, and not just
1251arbitrary octets, are compared (see <xref
1252target="stringcomp"></xref>).</t>
1253
1254</section> <!-- preparation -->
1255
1256<section title="Comparison Ladder" anchor="ladder">
1257
1258<t>In practice, a variety of methods are used to test IRI
1259equivalence. These methods fall into a range distinguished by the
1260amount of processing required and the degree to which the probability
1261of false negatives is reduced. As noted above, false negatives cannot
1262be eliminated. In practice, their probability can be reduced, but this
1263reduction requires more processing and is not cost-effective for all
1264applications.</t>
1265
1266
1267<t>If this range of comparison practices is considered as a ladder,
1268the following discussion will climb the ladder, starting with
1269practices that are cheap but have a relatively higher chance of
1270producing false negatives, and proceeding to those that have higher
1271computational cost and lower risk of false negatives.</t>
1272
1273<section title="Simple String Comparison" anchor="stringcomp">
1274
1275<t>If two IRIs, when considered as character strings, are identical,
1276then it is safe to conclude that they are equivalent.  This type of
1277equivalence test has very low computational cost and is in wide use in
1278a variety of applications, particularly in the domain of parsing. It
1279is also used when a definitive answer to the question of IRI
1280equivalence is needed that is independent of the scheme used and that
1281can be calculated quickly and without accessing a network. An example
1282of such a case is XML Namespaces (<xref
1283target="XMLNamespace"></xref>).</t>
1284
1285
1286<t>Testing strings for equivalence requires some basic precautions.
1287This procedure is often referred to as "bit-for-bit" or
1288"byte-for-byte" comparison, which is potentially misleading. Testing
1289strings for equality is normally based on pair comparison of the
1290characters that make up the strings, starting from the first and
1291proceeding until both strings are exhausted and all characters are
1292found to be equal, until a pair of characters compares unequal, or
1293until one of the strings is exhausted before the other.</t>
1294
1295<t>This character comparison requires that each pair of characters be
1296put in comparable encoding form. For example, should one IRI be stored
1297in a byte array in UTF-8 encoding form and the second in a UTF-16
1298encoding form, bit-for-bit comparisons applied naively will produce
1299errors. It is better to speak of equality on a character-for-character
1300rather than on a byte-for-byte or bit-for-bit basis.  In practical
1301terms, character-by-character comparisons should be done codepoint by
1302codepoint after conversion to a common character encoding form.
1303
1304When comparing character by character, the comparison function MUST
1305NOT map IRIs to URIs, because such a mapping would create additional
1306spurious equivalences. It follows that an IRI SHOULD NOT be modified
1307when being transported if there is any chance that this IRI might be
1308used in a context that uses Simple String Comparison.</t>
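
<figure><preamble>The point about encoding forms can be illustrated
with a minimal Python sketch. The function name iris_equal and its
parameters are hypothetical and not part of this specification; the
sketch assumes that the character encoding of each stored IRI is
known to the caller.</preamble>
<artwork><![CDATA[
```python
def iris_equal(a_bytes, a_encoding, b_bytes, b_encoding):
    """Compare two stored IRIs code point by code point."""
    # Decode both byte sequences into sequences of code points.
    a = a_bytes.decode(a_encoding)
    b = b_bytes.decode(b_encoding)
    # No IRI-to-URI mapping and no normalization is applied.
    return a == b

iri = "http://www.example.org/r\u00e9sum\u00e9.html"
utf8 = iri.encode("utf-8")
utf16 = iri.encode("utf-16-le")
print(utf8 == utf16)   # byte-for-byte comparison: False
print(iris_equal(utf8, "utf-8", utf16, "utf-16-le"))  # True
```
]]></artwork>
<postamble>Comparing the raw octets reports a difference, although
both byte arrays hold the same sequence of characters.</postamble>
</figure>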
1309
1310
1311<t>False negatives are caused by the production and use of IRI
1312aliases. Unnecessary aliases can be reduced, regardless of the
1313comparison method, by consistently providing IRI references in an
1314already normalized form (i.e., a form identical to what would be
1315produced after normalization is applied, as described below).
1316Protocols and data formats often limit some IRI comparisons to simple
1317string comparison, based on the theory that people and implementations
1318will, in their own best interest, be consistent in providing IRI
1319references, or at least be consistent enough to negate any efficiency
1320that might be obtained from further normalization.</t>
1321</section> <!-- stringcomp -->
1322
1323<section title="Syntax-Based Normalization">
1324
1325<figure><preamble>Implementations may use logic based on the
1326definitions provided by this specification to reduce the probability
1327of false negatives. This processing is moderately higher in cost than
1328character-for-character string comparison. For example, an application
1329using this approach could reasonably consider the following two IRIs
1330equivalent:</preamble>
1331
1332<artwork>
1333   example://a/b/c/%7Bfoo%7D/ros&amp;#xE9;
1334   eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
1335</artwork></figure>
1336
1337<t>Web user agents, such as browsers, typically apply this type of IRI
1338normalization when determining whether a cached response is
1339available. Syntax-based normalization includes such techniques as case
1340normalization, character normalization, percent-encoding
1341normalization, and removal of dot-segments.</t>
1342
1343<section title="Case Normalization">
1344
1345<t>For all IRIs, the hexadecimal digits within a percent-encoding
1346triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1347should be normalized to use uppercase letters for the digits A-F.</t>
1348
1349<t>When an IRI uses components of the generic syntax, the component
1350syntax equivalence rules always apply; namely, that the scheme and
1351US-ASCII only host are case insensitive and therefore should be
1352normalized to lowercase. For example, the URI
1353"HTTP://www.EXAMPLE.com/" is equivalent to
1354"http://www.example.com/". Case equivalence for non-ASCII characters
in IRI components that are IDNs is discussed in <xref
1356target="schemecomp"></xref>.  The other generic syntax components are
1357assumed to be case sensitive unless specifically defined otherwise by
1358the scheme.</t>
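
<figure><preamble>As an illustration only, the two rules above could
be sketched in Python as follows. The name normalize_case is
hypothetical. The sketch assumes the generic syntax, splits
components with the regular expression from Appendix B of RFC 3986,
and lowercases the host only when it consists of US-ASCII
characters.</preamble>
<artwork><![CDATA[
```python
import re

# Component-splitting expression from RFC 3986, Appendix B.
PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")

def normalize_case(iri):
    m = PARTS.match(iri)
    scheme, authority = m.group(2), m.group(4)
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        # Only a US-ASCII host is lowercased here.
        if hostport.isascii():
            hostport = hostport.lower()
        out += "//" + userinfo + at + hostport
    out += m.group(5) + (m.group(6) or "") + (m.group(8) or "")
    # Uppercase the hex digits of every percent-encoding triplet.
    return TRIPLET.sub(lambda t: t.group(0).upper(), out)

print(normalize_case("HTTP://www.EXAMPLE.com/"))
```
]]></artwork>
<postamble>Path, query, and fragment are left untouched apart from
the hexadecimal digits, because they are assumed to be case
sensitive.</postamble>
</figure>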
1359
1360<t>Creating schemes that allow case-insensitive syntax components
1361containing non-ASCII characters should be avoided. Case normalization
1362of non-ASCII characters can be culturally dependent and is always a
1363complex operation. The only exception concerns non-ASCII host names
1364for which the character normalization includes a mapping step derived
1365from case folding.</t>
1366
1367</section> <!-- casenorm -->
1368
1369<section title="Character Normalization" anchor="normalization">
1370
1371<t>The Unicode Standard <xref target="UNIV4"></xref> defines various
1372equivalences between sequences of characters for various
1373purposes. Unicode Standard Annex #15 <xref target="UTR15"></xref>
1374defines various Normalization Forms for these equivalences, in
1375particular Normalization Form C (NFC, Canonical Decomposition,
1376followed by Canonical Composition) and Normalization Form KC (NFKC,
1377Compatibility Decomposition, followed by Canonical Composition).</t>
1378
1379<t> IRIs already in Unicode MUST NOT be normalized before parsing or
1380interpreting. In many non-Unicode character encodings, some text
1381cannot be represented directly. For example, the word "Vietnam" is
1382natively written "Vi&amp;#x1EC7;t Nam" (containing a LATIN SMALL
1383LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct
1384transcoding from the windows-1258 character encoding leads to
1385"Vi&amp;#xEA;&amp;#x323;t Nam" (containing a LATIN SMALL LETTER E WITH
1386CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct transcoding of
1387other 8-bit encodings of Vietnamese may lead to other
1388representations.</t>
1389
1390<t>Equivalence of IRIs MUST rely on the assumption that IRIs are
1391appropriately pre-character-normalized rather than apply character
1392normalization when comparing two IRIs. The exceptions are conversion
1393from a non-digital form, and conversion from a non-UCS-based character
1394encoding to a UCS-based character encoding. In these cases, NFC or a
1395normalizing transcoder using NFC MUST be used for interoperability. To
1396avoid false negatives and problems with transcoding, IRIs SHOULD be
1397created by using NFC. Using NFKC may avoid even more problems; for
1398example, by choosing half-width Latin letters instead of full-width
1399ones, and full-width instead of half-width Katakana.</t>
1400
1401
1402<t>As an example,
1403"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" (in XML
1404Notation) is in NFC. On the other hand,
1405"http://www.example.org/re&amp;#x301;sume&amp;#x301;.html" is not in
1406NFC.</t>
1407
1408<t>The former uses precombined e-acute characters, and the latter uses
1409"e" characters followed by combining acute accents. Both usages are
1410defined as canonically equivalent in <xref target="UNIV4"></xref>.</t>
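
<figure><preamble>The difference between the two forms can be checked
with Python's standard unicodedata module. This is a sketch for
illustration; the helper name is_nfc is hypothetical. In line with
the text above, such a check belongs at IRI creation time, not in a
comparison function.</preamble>
<artwork><![CDATA[
```python
import unicodedata

precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

def is_nfc(text):
    # A string is in NFC if normalizing it changes nothing.
    return unicodedata.normalize("NFC", text) == text

print(is_nfc(precomposed))        # True
print(is_nfc(decomposed))         # False
print(precomposed == decomposed)  # False
```
]]></artwork>
<postamble>Applying NFC to the second string yields the first one,
which is why creating IRIs in NFC avoids this class of false
negatives.</postamble>
</figure>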
1411
1412<t><list style="hanging">
1413
1414<t hangText="Note:">
1415Because it is unknown how a particular sequence of characters is being
1416treated with respect to character normalization, it would be
1417inappropriate to allow third parties to normalize an IRI
1418arbitrarily. This does not contradict the recommendation that when a
1419resource is created, its IRI should be as character normalized as
1420possible (i.e., NFC or even NFKC). This is similar to the
1421uppercase/lowercase problems.  Some parts of a URI are case
1422insensitive (for example, the domain name). For others, it is unclear
1423whether they are case sensitive, case insensitive, or something in
1424between (e.g., case sensitive, but with a multiple choice selection if
1425the wrong case is used, instead of a direct negative result).  The
1426best recipe is that the creator use a reasonable capitalization and,
1427when transferring the URI, capitalization never be
1428changed.</t></list></t>
1429
1430<t>Various IRI schemes may allow the usage of Internationalized Domain
1431Names (IDN) <xref target="RFC3490"></xref> either in the ireg-name
1432part or elsewhere. Character Normalization also applies to IDNs, as
1433discussed in <xref target="schemecomp"></xref>.</t>
1434</section> <!-- charnorm -->
1435
1436<section title="Percent-Encoding Normalization">
1437
1438<t>The percent-encoding mechanism (Section 2.1 of <xref
1439target="RFC3986"></xref>) is a frequent source of variance among
1440otherwise identical IRIs. In addition to the case normalization issue
1441noted above, some IRI producers percent-encode octets that do not
1442require percent-encoding, resulting in IRIs that are equivalent to
1443their nonencoded counterparts. These IRIs should be normalized by
1444decoding any percent-encoded octet sequence that corresponds to an
1445unreserved character, as described in section 2.3 of <xref
1446target="RFC3986"></xref>.</t>
1447
1448<t>For actual resolution, differences in percent-encoding (except for
1449the percent-encoding of reserved characters) MUST always result in the
1450same resource.  For example, "http://example.org/~user",
"http://example.org/%7euser", and "http://example.org/%7Euser" must
1452resolve to the same resource.</t>
1453
1454<t>If this kind of equivalence is to be tested, the percent-encoding
1455of both IRIs to be compared has to be aligned; for example, by
1456converting both IRIs to URIs (see Section 3.1), eliminating escape
1457differences in the resulting URIs, and making sure that the case of
1458the hexadecimal characters in the percent-encoding is always the same
1459(preferably upper case). If the IRI is to be passed to another
1460application or used further in some other way, its original form MUST
1461be preserved.  The conversion described here should be performed only
1462for local comparison.</t>
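
<figure><preamble>A minimal sketch of this alignment, applied to a
string that has already been converted to a URI, might look as
follows in Python. The name normalize_pct is hypothetical; the set of
unreserved characters is the one given in Section 2.3 of RFC 3986.
The result is meant for local comparison only.</preamble>
<artwork><![CDATA[
```python
import re

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~")

def normalize_pct(uri):
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char               # decode unreserved characters
        return match.group(0).upper() # otherwise uppercase the hex
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)

for u in ("http://example.org/~user",
          "http://example.org/%7euser",
          "http://example.org/%7Euser"):
    print(normalize_pct(u))
```
]]></artwork>
<postamble>All three inputs produce the same string, while the
original IRIs remain available unchanged to the caller.</postamble>
</figure>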
1463
1464</section> <!-- pctnorm -->
1465
1466<section title="Path Segment Normalization">
1467
1468<t>The complete path segments "." and ".." are intended only for use
1469within relative references (Section 4.1 of <xref
1470target="RFC3986"></xref>) and are removed as part of the reference
1471resolution process (Section 5.2 of <xref target="RFC3986"></xref>).
1472However, some implementations may incorrectly assume that reference
1473resolution is not necessary when the reference is already an IRI, and
1474thus fail to remove dot-segments when they occur in non-relative
1475paths.  IRI normalizers should remove dot-segments by applying the
1476remove_dot_segments algorithm to the path, as described in Section
14775.2.4 of <xref target="RFC3986"></xref>.</t>
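
<figure><preamble>For illustration, the remove_dot_segments algorithm
of Section 5.2.4 of RFC 3986 can be transcribed into Python roughly
as follows. This is a sketch that follows the steps of that
algorithm; it operates on the path component only.</preamble>
<artwork><![CDATA[
```python
def remove_dot_segments(path):
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if out:
                out.pop()
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, including any leading "/",
            # from the input to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            out.append(path[:end])
            path = path[end:]
    return "".join(out)

print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g
```
]]></artwork>
</figure>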
1478
1479</section> <!-- pathnorm -->
1480</section> <!-- ladder -->
1481
1482<section title="Scheme-Based Normalization" anchor="schemecomp">
1483
1484<t>The syntax and semantics of IRIs vary from scheme to scheme, as
1485described by the defining specification for each
1486scheme. Implementations may use scheme-specific rules, at further
1487processing cost, to reduce the probability of false negatives. For
1488example, because the "http" scheme makes use of an authority
1489component, has a default port of "80", and defines an empty path to be
1490equivalent to "/", the following four IRIs are equivalent:</t>
1491
1492<figure><artwork>
1493   http://example.com
1494   http://example.com/
1495   http://example.com:/
1496   http://example.com:80/</artwork></figure>
1497
1498<t>In general, an IRI that uses the generic syntax for authority with
1499an empty path should be normalized to a path of "/". Likewise, an
1500explicit ":port", for which the port is empty or the default for the
1501scheme, is equivalent to one where the port and its ":" delimiter are
1502elided and thus should be removed by scheme-based normalization. For
1503example, the second IRI above is the normal form for the "http"
1504scheme.</t>
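
<figure><preamble>The two rules above could be sketched in Python as
follows. The name scheme_normalize and the DEFAULT_PORTS table are
hypothetical; the table contains only the "http" default mentioned in
the text, and the component split again uses the regular expression
from Appendix B of RFC 3986.</preamble>
<artwork><![CDATA[
```python
import re

PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Illustrative table; only "http" is taken from the text.
DEFAULT_PORTS = {"http": "80"}

def scheme_normalize(iri):
    m = PARTS.match(iri)
    scheme, authority, path = m.group(2), m.group(4), m.group(5)
    if scheme is None or authority is None:
        return iri
    host, colon, port = authority.rpartition(":")
    if colon and (port == "" or port.isdigit()):
        # Drop an empty port or the scheme's default port.
        if port in ("", DEFAULT_PORTS.get(scheme.lower())):
            authority = host
    if path == "":
        path = "/"
    return (scheme + "://" + authority + path
            + (m.group(6) or "") + (m.group(8) or ""))

for iri in ("http://example.com", "http://example.com/",
            "http://example.com:/", "http://example.com:80/"):
    print(scheme_normalize(iri))
```
]]></artwork>
<postamble>Each of the four IRIs is mapped to "http://example.com/".
Delimiters of other empty components, such as a trailing "?", are
deliberately kept.</postamble>
</figure>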
1505
1506<t>Another case where normalization varies by scheme is in the
1507handling of an empty authority component or empty host
1508subcomponent. For many scheme specifications, an empty authority or
1509host is considered an error; for others, it is considered equivalent
1510to "localhost" or the end-user's host. When a scheme defines a default
1511for authority and an IRI reference to that default is desired, the
1512reference should be normalized to an empty authority for the sake of
1513uniformity, brevity, and internationalization. If, however, either the
1514userinfo or port subcomponents are non-empty, then the host should be
1515given explicitly even if it matches the default.</t>
1516
1517<t>Normalization should not remove delimiters when their associated
1518component is empty unless it is licensed to do so by the scheme
1519specification. For example, the IRI "http://example.com/?" cannot be
1520assumed to be equivalent to any of the examples above. Likewise, the
1521presence or absence of delimiters within a userinfo subcomponent is
1522usually significant to its interpretation.  The fragment component is
1523not subject to any scheme-based normalization; thus, two IRIs that
1524differ only by the suffix "#" are considered different regardless of
1525the scheme.</t>
1526
1527<t>((NOTE: THIS NEEDS TO BE UPDATED TO DEAL WITH IDNA8))
1528Some IRI schemes may allow the usage of Internationalized Domain
1529Names (IDN) <xref target="RFC3490"></xref> either in their ireg-name
1530part or elsewhere. When in use in IRIs, those names SHOULD be
1531validated by using the ToASCII operation defined in <xref
1532target="RFC3490"></xref>, with the flags "UseSTD3ASCIIRules" and
1533"AllowUnassigned". An IRI containing an invalid IDN cannot
1534successfully be resolved.  Validated IDN components of IRIs SHOULD be
1535character normalized by using the Nameprep process <xref
1536target="RFC3491"></xref>; however, for legibility purposes, they
1537SHOULD NOT be converted into ASCII Compatible Encoding (ACE).</t>
1538
1539<t>Scheme-based normalization may also consider IDN
components and their conversions to Punycode as equivalent. As an
1541example, "http://r&amp;#xE9;sum&amp;#xE9;.example.org" may be
1542considered equivalent to
1543"http://xn--rsum-bpad.example.org".</t><t>Other scheme-specific
1544normalizations are possible.</t>
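
<figure><preamble>As a sketch of such a comparison, Python's built-in
"idna" codec can be used to obtain the ACE form of a host name. That
codec implements the RFC 3490 (IDNA 2003) operations including
Nameprep and none of the later IDNA revisions. The helper names
below are hypothetical, and the conversion is used for comparison
only; the IRI itself is not rewritten.</preamble>
<artwork><![CDATA[
```python
def host_to_ace(host):
    # ToASCII per label via the standard "idna" codec; the ACE
    # result is US-ASCII, so lowercasing it is safe.
    return host.encode("idna").decode("ascii").lower()

def hosts_equivalent(a, b):
    return host_to_ace(a) == host_to_ace(b)

print(host_to_ace("r\u00e9sum\u00e9.example.org"))
print(hosts_equivalent("r\u00e9sum\u00e9.example.org",
                       "xn--rsum-bpad.example.org"))
```
]]></artwork>
</figure>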
1545
1546</section> <!-- schemenorm -->
1547
1548<section title="Protocol-Based Normalization">
1549
1550<t>Substantial effort to reduce the incidence of false negatives is
1551often cost-effective for web spiders. Consequently, they implement
1552even more aggressive techniques in IRI comparison. For example, if
1553they observe that an IRI such as</t>
1554
1555<figure><artwork>
1556   http://example.com/data</artwork></figure>
1557<t>redirects to an IRI differing only in the trailing slash</t>
1558<figure><artwork>
1559   http://example.com/data/</artwork></figure>
1560
1561<t>they will likely regard the two as equivalent in the future.  This
1562kind of technique is only appropriate when equivalence is clearly
1563indicated by both the result of accessing the resources and the common
1564conventions of their scheme's dereference algorithm (in this case, use
1565of redirection by HTTP origin servers to avoid problems with relative
1566references).</t>
1567
1568</section> <!-- protonorm -->
1569</section> <!-- equivalence -->
1570</section> 
1571
1572<section title="Use of IRIs" anchor="IRIuse">
1573
1574<section title="Limitations on UCS Characters Allowed in IRIs" anchor="limitations">
1575
1576<t>This section discusses limitations on characters and character
1577sequences usable for IRIs beyond those given in <xref target="abnf"/>
1578and <xref target="visual"/>. The considerations in this section are
1579relevant when IRIs are created and when URIs are converted to
1580IRIs.</t>
1581
1582<t>
1583
1584<list style="hanging"><t hangText="a.">The repertoire of characters allowed
1585    in each IRI component is limited by the definition of that component.
1586    For example, the definition of the scheme component does not allow
1587    characters beyond US-ASCII.
1588    <vspace blankLines="1"/>
1589    (Note: In accordance with URI practice, generic IRI
1590    software cannot and should not check for such limitations.)</t>
1591
1592<t hangText="b.">The UCS contains many areas of characters for which
1593    there are strong visual look-alikes. Because of the likelihood of
1594    transcription errors, these also should be avoided. This includes
1595    the full-width equivalents of Latin characters, half-width
1596    Katakana characters for Japanese, and many others. It also
1597    includes many look-alikes of "space", "delims", and "unwise",
1598    characters excluded in <xref target="RFC3491"/>.</t>
1599   
1600</list>
1601</t>
1602
1603<t>Additional information is available from <xref target="UNIXML"/>.
1604    <xref target="UNIXML"/> is written in the context of running text
1605    rather than in that of identifiers. Nevertheless, it discusses
1606    many of the categories of characters not appropriate for IRIs.</t>
1607</section> <!-- limitations -->
1608
1609<section title="Software Interfaces and Protocols">
1610
1611<t>Although an IRI is defined as a sequence of characters, software
1612interfaces for URIs typically function on sequences of octets or other
1613kinds of code units. Thus, software interfaces and protocols MUST
1614define which character encoding is used.</t>
1615
1616<t>Intermediate software interfaces between IRI-capable components and
1617URI-only components MUST map the IRIs per <xref target="mapping"/>,
1618when transferring from IRI-capable to URI-only components.
1619
1620This mapping SHOULD be applied as late as possible. It SHOULD NOT be
1621applied between components that are known to be able to handle IRIs.</t>
1622</section> <!-- software -->
1623
1624<section title="Format of URIs and IRIs in Documents and Protocols">
1625
1626<t>Document formats that transport URIs may have to be upgraded to allow
1627the transport of IRIs. In cases where the document as a whole
1628has a native character encoding, IRIs MUST also be encoded in this
1629character encoding and converted accordingly by a parser or interpreter.
1630
1631IRI characters not expressible in the native character encoding SHOULD
1632be escaped by using the escaping conventions of the document format if
1633such conventions are available. Alternatively, they MAY be
1634percent-encoded according to <xref target="mapping"/>. For example, in
1635HTML or XML, numeric character references SHOULD be used. If a
1636document as a whole has a native character encoding and that character
1637encoding is not UTF-8, then IRIs MUST NOT be placed into the document
1638in the UTF-8 character encoding.</t>
1639
1640<t>((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1641although they use different terminology. HTML 4.0 <xref
1642target="HTML4"/> defines the conversion from IRIs to URIs as
1643error-avoiding behavior. XML 1.0 <xref target="XML1"/>, XLink <xref
1644target="XLink"/>, XML Schema <xref target="XMLSchema"/>, and
1645specifications based upon them allow IRIs. Also, it is expected that
1646all relevant new W3C formats and protocols will be required to handle
1647IRIs <xref target="CharMod"/>.</t>
1648
1649</section> <!-- format -->
1650
1651<section title="Use of UTF-8 for Encoding Original Characters" anchor="UTF8use">
1652
1653<t>This section discusses details and gives examples for point c) in
1654<xref target="Applicability"/>. To be able to use IRIs, the URI
1655corresponding to the IRI in question has to encode original characters
1656into octets by using UTF-8.  This can be specified for all URIs of a
1657URI scheme or can apply to individual URIs for schemes that do not
1658specify how to encode original characters.  It can apply to the whole
1659URI, or only to some part. For background information on encoding
1660characters into URIs, see also Section 2.5 of <xref
1661target="RFC3986"/>.</t>
1662
1663<t>For new URI schemes, using UTF-8 is recommended in <xref
1664target="RFC4395"/>.  Examples where UTF-8 is already used are the URN
1665syntax <xref target="RFC2141"/>, IMAP URLs <xref target="RFC2192"/>,
1666and POP URLs <xref target="RFC2384"/>.  On the other hand, because the
1667HTTP URI scheme does not specify how to encode original characters,
1668only some HTTP URLs can have corresponding but different IRIs.</t>
1669
1670<t>For example, for a document with a URI
1671of<vspace/>"http://www.example.org/r%C3%A9sum%C3%A9.html", it is
1672possible to construct a corresponding IRI (in XML notation, see <xref
1673target="sec-Notation"/>):
1674"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" ("&amp;#xE9;"
1675stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded
1676and percent-encoded representation of that character). On the other
1677hand, for a document with a URI of
1678"http://www.example.org/r%E9sum%E9.html", the percent-encoding octets
1679cannot be converted to actual characters in an IRI, as the
1680percent-encoding is not based on UTF-8.</t>
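
<figure><preamble>The distinction between the two URIs can be sketched
in Python as follows. The name uri_to_iri is hypothetical. The sketch
is a simplification: it converts a run of percent-encoded octets only
if the whole run is valid UTF-8, and it does not check whether the
resulting characters are otherwise appropriate in an IRI.</preamble>
<artwork><![CDATA[
```python
import re

RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def uri_to_iri(uri):
    def convert(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            return run  # not UTF-8: leave the run untouched
        # Keep US-ASCII characters percent-encoded so that
        # delimiters such as "%2F" are not reintroduced.
        return "".join(
            ch if ord(ch) > 0x7F else "%%%02X" % ord(ch)
            for ch in text)
    return RUN.sub(convert, uri)

print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
```
]]></artwork>
<postamble>The first URI yields an IRI containing two e-acute
characters; the second is returned unchanged because "%E9" is not a
valid UTF-8 sequence.</postamble>
</figure>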
1681
1682<t>For most URI schemes, there is no need to upgrade their scheme
1683definition in order for them to work with IRIs.  The main case where
1684upgrading makes sense is when a scheme definition, or a particular
1685component of a scheme, is strictly limited to the use of US-ASCII
1686characters with no provision to include non-ASCII characters/octets
1687via percent-encoding, or if a scheme definition currently uses highly
1688scheme-specific provisions for the encoding of non-ASCII characters.
1689An example of this is the mailto: scheme <xref target="RFC2368"/>.</t>
1690
1691<t>This specification updates the IANA registry of URI schemes to note
1692their applicability to IRIs, see <xref target="iana"/>.  All IRIs use
1693URI schemes, and all URIs with URI schemes can be used as IRIs, even
1694though in some cases only by using URIs directly as IRIs, without any
1695conversion.</t>
1696
1697<t>Scheme definitions can impose restrictions on the syntax of
1698scheme-specific URIs; i.e., URIs that are admissible under the generic
1699URI syntax <xref target="RFC3986"/> may not be admissible due to
1700narrower syntactic constraints imposed by a URI scheme
1701specification. URI scheme definitions cannot broaden the syntactic
1702restrictions of the generic URI syntax; otherwise, it would be
1703possible to generate URIs that satisfied the scheme-specific syntactic
1704constraints without satisfying the syntactic constraints of the
1705generic URI syntax. However, additional syntactic constraints imposed
1706by URI scheme specifications are applicable to IRI, as the
1707corresponding URI resulting from the mapping defined in <xref
1708target="mapping"/> MUST be a valid URI under the syntactic
1709restrictions of generic URI syntax and any narrower restrictions
1710imposed by the corresponding URI scheme specification.</t>
1711
1712<t>The requirement for the use of UTF-8 generally applies to all parts
1713of a URI.  However, it is possible that the capability of IRIs to
1714represent a wide range of characters directly is used just in some
1715parts of the IRI (or IRI reference). The other parts of the IRI may
1716only contain US-ASCII characters, or they may not be based on
1717UTF-8. They may be based on another character encoding, or they may
1718directly encode raw binary data (see also <xref
1719target="RFC2397"/>). </t>
1720
1721<t>For example, it is possible to have a URI reference
1722of<vspace/>"http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9",
1723where the document name is encoded in iso-8859-1 based on server
1724settings, but where the fragment identifier is encoded in UTF-8 according
1725to <xref target="XPointer"/>. The IRI corresponding to the above
1726URI would be (in XML notation)<vspace/>"http://www.example.org/r%E9sum%E9.xml#r&amp;#xE9;sum&amp;#xE9;".</t>
1727
1728<t>Similar considerations apply to query parts. The functionality
1729of IRIs (namely, to be able to include non-ASCII characters) can
1730only be used if the query part is encoded in UTF-8.</t>
1731
1732</section> <!-- utf8 -->
1733
1734<section title="Relative IRI References">
1735<t>Processing of relative IRI references against a base is handled
1736straightforwardly; the algorithms of <xref target="RFC3986"/> can
1737be applied directly, treating the characters additionally allowed
1738in IRI references in the same way that unreserved characters are in URI
1739references.</t>
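
<figure><preamble>As an illustration, Python's urllib.parse.urljoin,
which is documented as following RFC 3986 reference resolution,
accepts non-ASCII characters in both the base and the reference. The
base and the relative references below are made-up
examples.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import urljoin

base = "http://www.example.org/docs/r\u00e9sum\u00e9.html"

# Dot-segments are resolved against the base path as usual.
print(urljoin(base, "../caf\u00e9/menu.html"))
# A fragment-only reference keeps the base path.
print(urljoin(base, "#r\u00e9sum\u00e9"))
```
]]></artwork>
</figure>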
1740
1741</section> <!-- relative -->
1742</section> <!-- IRIuse -->
1743
1744<section title="Liberal handling of otherwise invalid IRIs" anchor="LEIRIHREF">
1745
1746<t>(EDITOR NOTE: This Section may move to an appendix.)
1747 
1748Some technical specifications and widely-deployed software have
1749allowed additional variations and extensions of IRIs to be used in
1750syntactic components. This section describes two widely-used
1751preprocessing agreements. Other technical specifications may wish to
1752reference a syntactic component which is "a valid IRI or a string that
1753will map to a valid IRI after this preprocessing algorithm". These two
1754variants are known as <xref target="LEIRI">Legacy Extended IRI or
LEIRI</xref>, and <xref target="HTML5">Web Address</xref>.
1756</t>
1757
1758<t>Future technical specifications SHOULD NOT allow conforming
1759producers to produce, or conforming content to contain, such forms,
1760as they are not interoperable with other IRI consuming software.</t>
1761
1762<section title="LEIRI processing"  anchor="LEIRIspec">
1763  <t>This section defines Legacy Extended IRIs (LEIRIs).
1764    The syntax of Legacy Extended IRIs is the same as that for IRIs,
1765    except that the ucschar production is replaced by the leiri-ucschar production:</t>
1766<figure>
1767
1768<artwork>
1769  leiri-ucschar  = " " / "&lt;" / "&gt;" / '"' / "{" / "}" / "|"
1770                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1771                   / %xE000-FFFD / %x10000-10FFFF
1772</artwork>
1773
1774<postamble>
1775  Among other extensions, processors based on this specification also
1776  did not enforce the restriction on bidirectional formatting
1777  characters in <xref target="visual"></xref>, and the iprivate
1778  production becomes redundant.</postamble>
1779</figure>
1780
1781<t>To convert a string allowed as a LEIRI to an IRI, each character
1782allowed in leiri-ucschar but not in ucschar must be percent-encoded
1783using <xref target="compmapping"/>.</t>
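
<figure><preamble>A rough Python sketch of this conversion follows.
The names leiri_to_iri and pct_encode are hypothetical. The ucschar
ranges are those published in RFC 3987, which is an assumption here
because the production is defined elsewhere in this specification.
Percent-encoding is done on the UTF-8 octets of each affected
character.</preamble>
<artwork><![CDATA[
```python
# ucschar ranges as published in RFC 3987 (an assumption here).
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 14)]
UCSCHAR.append((0xE1000, 0xEFFFD))

# US-ASCII characters added by leiri-ucschar.
EXTRA_ASCII = set(' <>"{}|\\^`')

def pct_encode(ch):
    return "".join("%%%02X" % b for b in ch.encode("utf-8"))

def leiri_to_iri(leiri):
    out = []
    for ch in leiri:
        cp = ord(ch)
        if cp <= 0x1F or ch in EXTRA_ASCII:
            out.append(pct_encode(ch))
        elif cp >= 0x7F and not any(
                lo <= cp <= hi for lo, hi in UCSCHAR):
            out.append(pct_encode(ch))
        else:
            out.append(ch)
    return "".join(out)

print(leiri_to_iri("http://example.org/a b|c"))
```
]]></artwork>
<postamble>Characters that are already allowed by ucschar, such as
e-acute, pass through unchanged.</postamble>
</figure>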
1784</section> <!-- leiriproc -->
1785
1786<section title="Web Address processing" anchor="webaddress">
1787
1788<t>Many popular web browsers have taken the approach of being quite
1789liberal in what is accepted as a "URL" or its relative
1790forms. This section describes their behavior in terms of a preprocessor
1791which maps strings into the IRI space for subsequent parsing and
1792interpretation as an IRI.</t>
1793
1794<t>In some situations, it might be appropriate to describe the syntax
1795that a liberal consumer implementation might accept as a "Web
1796Address" or "Hypertext Reference" or "HREF". However,
1797technical specifications SHOULD restrict the syntactic form allowed by compliant producers
1798to the IRI or IRI reference syntax defined in this document
1799even if they want to mandate this processing.</t>
1800
1801<t>
1802Summary:
1803<list style="symbols">
1804   <t>Leading and trailing whitespace is removed.</t>
1805   <t>Some additional characters are removed.</t>
1806   <t>Some additional characters are allowed and escaped (as with LEIRI).</t>
   <t>If interpreting an IRI as a URI, the percent-encoding of the
   query component of the parsed URI depends on operational
   context.</t>
1810</list>
1811</t>
1812
<t>Each string provided may have an associated charset (called
the HRef-charset here); this defaults to UTF-8.
For web browsers interpreting HTML, the HRef-charset
of a string is determined as follows:
1817
1818<list style="hanging">
1819<t hangText="If the string came from a script (e.g. as an argument to
1820 a method)">The HRef-charset is the script's charset.</t>
1821
1822<t hangText="If the string came from a DOM node (e.g. from an
1823  element)">The node has a Document, and the HRef-charset is the
1824  Document's character encoding.</t>
1825
1826<t hangText="If the string had a HRef-charset defined when the string was
1827created or defined">The HRef-charset is as defined.</t>
1828
1829</list></t>
1830
<t>If the resulting HRef-charset is a Unicode-based character encoding
1832(e.g., UTF-16), then use UTF-8 instead.</t>
1833
1834
1835<figure>
1836<preamble>The syntax for Web Addresses is obtained by replacing the 'ucschar',
1837  pct-form, and path-sep rules with the href-ucschar, href-pct-form, and href-path-sep
1838  rules below. In addition, some characters are stripped.</preamble>
1839
1840<artwork>
1841  href-ucschar  = " " / "&lt;" / "&gt;" / '"' / "{" / "}" / "|"
1842                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1843                   / %xE000-FFFD / %x10000-10FFFF
1844  href-pct-form = pct-encoded / "%"
1845  href-path-sep = "/" / "\"
1846  href-strip    = &lt;to be done&gt;
1847</artwork>
1848
1849<postamble>
1850(NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT SENTENCE)
1851browsers did not enforce the restriction on bidirectional formatting
1852  characters in <xref target="visual"></xref>, and the iprivate
1853  production becomes redundant.</postamble>
1854</figure>
1855
1856<t>'Web Address processing' requires the following additional
1857preprocessing steps:
1858
1859<list style="numbers">
1860
1861<t>Leading and trailing instances of space (U+0020),
CR (U+000D), LF (U+000A), and TAB (U+0009) characters are removed.</t>
1863
<t>Strip all characters in href-strip.</t>
1865  <t>Percent-encode all characters in href-ucschar not in ucschar.</t>
1866  <t>Replace occurrences of "%" not followed by two hexadecimal digits by "%25".</t>
1867  <t>Convert backslashes ('\') matching href-path-sep to forward slashes ('/').</t>
1868</list></t>
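
<figure><preamble>Some of these steps can be sketched in Python as
follows. The name preprocess_web_address is hypothetical. Only the
first, fourth, and fifth steps are shown: href-strip is not yet
defined, and the percent-encoding step is omitted for brevity. The
sketch also assumes that every backslash is treated as a path
separator, which the text does not state explicitly.</preamble>
<artwork><![CDATA[
```python
import re

def preprocess_web_address(text):
    # Step 1: trim leading/trailing space, CR, LF, and TAB.
    text = text.strip(" \r\n\t")
    # Step 4: escape "%" not followed by two hexadecimal digits.
    text = re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", text)
    # Step 5: convert backslashes to forward slashes (assumed to
    # apply to every backslash in the string).
    return text.replace("\\", "/")

print(preprocess_web_address("  http:\\\\example.com\\a%zz%41\t\n"))
```
]]></artwork>
</figure>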
1869</section> <!-- webaddress -->
1870
1871<section title="Characters not allowed in IRIs" anchor="notAllowed">
1872
1873<t>This section provides a list of the groups of characters and code
1874points that are allowed by LEIRI or HREF but are not allowed in IRIs or are
1875allowed in IRIs only in the query part. For each group of characters,
1876advice on the usage of these characters is also given, concentrating
1877on the reasons for why they are excluded from IRI use.</t>
1878
1879<t>
1880
1881<list><t>Space (U+0020): Some formats and applications use space as a
1882delimiter, e.g. for items in a list. Appendix C of <xref
1883target="RFC3986"></xref> also mentions that white space may have to be
1884added when displaying or printing long URIs; the same applies to long
1885IRIs. This means that spaces can disappear, or can make the what is
1886intended as a single IRI or IRI reference to be treated as two or more
1887separate IRIs.</t>
1888
1889<t>Delimiters "&lt;" (U+003C), "&gt;" (U+003E), and '"' (U+0022):
1890Appendix C of <xref target="RFC3986"></xref> suggests the use of
1891double-quotes ("http://example.com/") and angle brackets
1892(&lt;http://example.com/&gt;) as delimiters for URIs in plain
1893text. These conventions are often used, and also apply to IRIs.  Using
1894these characters in strings intended to be IRIs would result in the
1895IRIs being cut off at the wrong place.</t>
1896
1897<t>Unwise characters "\" (U+005C), "^" (U+005E), "`"
1898(U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These
characters were originally excluded from URIs because the
respective codepoints are assigned to different graphic characters in
some 7-bit or 8-bit encodings. Despite the move to Unicode, some of
1902these characters are still occasionally displayed differently on some
1903systems, e.g. U+005C may appear as a Japanese Yen symbol on some
1904systems. Also, the fact that these characters are not used in URIs or
1905IRIs has encouraged their use outside URIs or IRIs in contexts that
1906may include URIs or IRIs. If a string with such a character were used
1907as an IRI in such a context, it would likely be interpreted
1908piecemeal.</t>
1909
1910<t>The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
1911#x9F): There is generally no way to transmit these characters reliably
1912as text outside of a charset encoding.  Even when in encoded form,
1913many software components silently filter out some of these characters,
or may stop processing altogether when encountering some of
them. These characters may affect text display in subtle, unnoticeable
1916ways or in drastic, global, and irreversible ways depending on the
1917hardware and software involved. The use of some of these characters
1918would allow malicious users to manipulate the display of an IRI and
1919its context in many situations.</t>
1920
1921<t>Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1922characters affect the display ordering of characters. If IRIs were
1923allowed to contain these characters and the resulting visual display
transcribed, they could not be converted back to electronic form
(logical order) unambiguously. These characters, if allowed in IRIs,
might allow malicious users to manipulate the display of an IRI and its
context.</t>
1928
1929<t>Specials (U+FFF0-FFFD): These code points provide functionality
1930beyond that useful in an IRI, for example byte order identification,
1931annotation, and replacements for unknown characters and objects. Their
1932use and interpretation in an IRI would serve no purpose and might lead
1933to confusing display variations.</t>
1934
1935<t>Private use code points (U+E000-F8FF, U+F0000-FFFFD,
1936U+100000-10FFFD): Display and interpretation of these code points is
1937by definition undefined without private agreement. Therefore, these
1938code points are not suited for use on the Internet. They are not
1939interoperable and may have unpredictable effects.</t>
1940
1941<t>Tags (U+E0000-E0FFF): These characters provide a way to language
1942tag in Unicode plain text. They are not appropriate for IRIs because
1943language information in identifiers cannot reliably be input,
1944transmitted (e.g. on a visual medium such as paper), or
1945recognized.</t>
1946
1947<t>Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1948U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1949U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1950U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1951U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1952non-characters. Applications may use some of them internally, but are
1953not prepared to interchange them.</t>
1954
1955</list></t>
1956
1957<t>LEIRI preprocessing disallowed some code points and
1958code units:
1959
1960<list><t>Surrogate code units (D800-DFFF): These do not represent
1961Unicode codepoints.</t></list></t>
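
<t>The ranges listed above can be turned into a simple classifier. The
following Python sketch is illustrative only: the category labels and
the function names classify and report are assumptions of this example,
not part of this specification, and the ranges are copied from the list
above.</t>

<figure><artwork><![CDATA[
```python
# Ranges copied from the list above; labels are illustrative.
RANGES = [
    ("control", 0x0000, 0x001F),
    ("control", 0x007F, 0x009F),
    ("bidi-formatting", 0x200E, 0x200F),
    ("bidi-formatting", 0x202A, 0x202E),
    ("surrogate", 0xD800, 0xDFFF),
    ("private-use", 0xE000, 0xF8FF),
    ("non-character", 0xFDD0, 0xFDEF),
    ("special", 0xFFF0, 0xFFFD),
    ("tag", 0xE0000, 0xE0FFF),
    ("private-use", 0xF0000, 0xFFFFD),
    ("private-use", 0x100000, 0x10FFFD),
]
DELIMITERS = set('<>"')
UNWISE = set("\\^`{|}")


def classify(ch):
    """Return the category from the list above, or None."""
    cp = ord(ch)
    if ch in DELIMITERS:
        return "delimiter"
    if ch in UNWISE:
        return "unwise"
    # U+nFFFE and U+nFFFF for planes 1 through 16.
    if cp >= 0x1FFFE and (cp & 0xFFFE) == 0xFFFE:
        return "non-character"
    for label, lo, hi in RANGES:
        if lo <= cp <= hi:
            return label
    return None


def report(text):
    """List (index, code point, category) for each flagged character."""
    return [(i, "U+%04X" % ord(c), classify(c))
            for i, c in enumerate(text) if classify(c)]
```
]]></artwork></figure>

<t>A caller might use report() to produce a diagnostic for a string
before deciding how to process it; what to do with a flagged character
is left to the application.</t>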
1962</section> <!-- notallowed -->
1963</section> <!-- lieirihref -->
1964 
1965<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
1966
1967<t>This informative section provides guidelines for supporting IRIs in
1968the same software components and operations that currently process
1969URIs: Software interfaces that handle URIs, software that allows users
1970to enter URIs, software that creates or generates URIs, software that
1971displays URIs, formats and protocols that transport URIs, and software
1972that interprets URIs. These may all require modification before
1973functioning properly with IRIs. The considerations in this section
1974also apply to URI references and IRI references.</t>
1975
1976<section title="URI/IRI Software Interfaces">
1977<t>Software interfaces that handle URIs, such as URI-handling APIs and
1978protocols transferring URIs, need interfaces and protocol elements
1979that are designed to carry IRIs.</t>
1980
1981<t>In case the current handling in an API or protocol is based on
1982US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
1983it is compatible with US-ASCII, is in accordance with the
1984recommendations of <xref target="RFC2277"/>, and makes converting to
1985URIs easy. In any case, the API or protocol definition must clearly
1986define the character encoding to be used.</t>
1987
1988<t>The transfer from URI-only to IRI-capable components requires no
1989mapping, although the conversion described in <xref
1990target="URItoIRI"/> above may be performed. It is preferable not to
1991perform this inverse conversion unless it is certain this can be done
1992correctly.</t>
1993</section>
1994
1995<section title="URI/IRI Entry">
1996
1997<t>Some components allow users to enter URIs into the system
1998by typing or dictation, for example. This software must be updated to allow
1999for IRI entry.</t>
2000
2001<t>A person viewing a visual representation of an IRI (as a sequence
2002of glyphs, in some order, in some visual display) or hearing an IRI
2003will use an entry method for characters in the user's language to
2004input the IRI. Depending on the script and the input method used, this
2005may be a more or less complicated process.</t>
2006
2007<t>The process of IRI entry must ensure, as much as possible, that the
2008restrictions defined in <xref target="abnf"/> are met. This may be
2009done by choosing appropriate input methods or variants/settings
2010thereof, by appropriately converting the characters being input, by
2011eliminating characters that cannot be converted, and/or by issuing a
2012warning or error message to the user.</t>
2013
2014<t>As an example of variant settings, input method editors for East
2015Asian Languages usually allow the input of Latin letters and related
2016characters in full-width or half-width versions. For IRI input, the
2017input method editor should be set so that it produces half-width Latin
2018letters and punctuation and full-width Katakana.</t>
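
<t>Where the input method cannot be configured, a similar result can be
approximated after input. The following Python sketch assumes that
applying NFKC is acceptable for the text being entered; NFKC maps
full-width Latin letters and punctuation to their US-ASCII
counterparts and half-width Katakana to full-width Katakana, but it
also applies other compatibility mappings. The function name fold_width
is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def fold_width(text):
    """Apply NFKC, which (among other compatibility mappings) maps
    full-width Latin to ASCII and half-width Katakana to full-width."""
    return unicodedata.normalize("NFKC", text)
```
]]></artwork></figure>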
2019
<t>An input field primarily or solely used for the input of URIs/IRIs,
or any place where the input of IRIs is frequent, might allow the user
to view an IRI as it is mapped to a URI. This will help users when some
of the software they use does not yet accept IRIs.</t>
2025
2026<t>An IRI input component interfacing to components that handle URIs,
2027but not IRIs, must map the IRI to a URI before passing it to these
2028components.</t>
2029
2030<t>For the input of IRIs with right-to-left characters, please see
2031<xref target="bidiInput"></xref>.</t>
2032</section>
2033
2034<section title="URI/IRI Transfer between Applications">
2035
2036<t>Many applications (for example, mail user agents) try to detect
2037URIs appearing in plain text. For this, they use some heuristics based
2038on URI syntax. They then allow the user to click on such URIs and
2039retrieve the corresponding resource in an appropriate (usually
2040scheme-dependent) application.</t>
2041
2042<t>Such applications would need to be upgraded, in order to use the
2043IRI syntax as a base for heuristics. In particular, a non-ASCII
2044character should not be taken as the indication of the end of an IRI.
2045Such applications also would need to make sure that they correctly
2046convert the detected IRI from the character encoding of the document
2047or application where the IRI appears, to the character encoding used
2048by the system-wide IRI invocation mechanism, or to a URI (according to
2049<xref target="mapping"/>) if the system-wide invocation mechanism only
2050accepts URIs.</t>
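
<t>As an illustration of such heuristics, the following Python sketch
contrasts a pattern that accepts non-ASCII characters with one that
stops at the first non-ASCII character. The scheme list, the set of
terminating characters, and the names IRI_CANDIDATE, ASCII_ONLY, and
find_iris are assumptions of this example; real detection heuristics
differ between applications.</t>

<figure><artwork><![CDATA[
```python
import re

# A candidate starts at a known scheme and ends at whitespace or at
# one of the plain-text delimiters '<', '>', '"'.
IRI_CANDIDATE = re.compile(r"\b(?:https?|ftp)://[^\s<>\"]+")
# For comparison: printable US-ASCII only, minus the same delimiters.
ASCII_ONLY = re.compile(r"\b(?:https?|ftp)://[!#-;=?-~]+")


def find_iris(text, pattern=IRI_CANDIDATE):
    """Return candidates, minus trailing sentence punctuation."""
    return [m.rstrip(".,;:!?") for m in pattern.findall(text)]
```
]]></artwork></figure>

<t>With the ASCII-only pattern, an identifier such as
"http://example.com/r&amp;#xE9;sum&amp;#xE9;.html" (in XML Notation) is
cut off after the first "r", which is the behavior this section advises
against.</t>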
2051
2052<t>The clipboard is another frequently used way to transfer URIs and
2053IRIs from one application to another. On most platforms, the clipboard
2054is able to store and transfer text in many languages and scripts.
2055Correctly used, the clipboard transfers characters, not octets, which
2056will do the right thing with IRIs.</t>
2057</section>
2058
2059<section title="URI/IRI Generation">
2060
2061<t>Systems that offer resources through the Internet, where those
2062resources have logical names, sometimes automatically generate URIs
2063for the resources they offer. For example, some HTTP servers can
2064generate a directory listing for a file directory and then respond to
2065the generated URIs with the files.</t>
2066
2067<t>Many legacy character encodings are in use in various file systems.
2068Many currently deployed systems do not transform the local character
2069representation of the underlying system before generating URIs.</t>
2070
2071<t>For maximum interoperability, systems that generate resource
2072identifiers should make the appropriate transformations. For example,
2073if a file system contains a file named
2074"r&amp;#xE9;sum&amp;#xE9;.html", a server should expose this as
2075"r%C3%A9sum%C3%A9.html" in a URI, which allows use of
2076"r&amp;#xE9;sum&amp;#xE9;.html" in an IRI, even if locally the file
2077name is kept in a character encoding other than UTF-8.
2078</t>
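
<t>The transformation in this example can be sketched in Python as
follows. The function name uri_path_segment and the idea of passing the
file system encoding as a parameter are assumptions of this sketch; how
a server learns the local encoding is outside its scope.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote


def uri_path_segment(raw_name, fs_encoding):
    """Decode a file name from the local character encoding, then
    percent-encode the UTF-8 form of the resulting characters."""
    name = raw_name.decode(fs_encoding)
    # quote() encodes non-ASCII characters as UTF-8 by default.
    return quote(name, safe="")
```
]]></artwork></figure>

<t>For a file name stored as the ISO-8859-1 octets "r", 0xE9, "sum",
0xE9, ".html", this yields "r%C3%A9sum%C3%A9.html".</t>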
2079
2080<t>This recommendation particularly applies to HTTP servers. For FTP
2081servers, similar considerations apply; see <xref target="RFC2640"/>.</t>
2082</section>
2083
2084<section title="URI/IRI Selection" anchor="selection">
2085<t>In some cases, resource owners and publishers have control over the
IRIs used to identify their resources. This control is mostly
exercised by controlling the resource names, such as file names,
2088directly.</t>
2089
2090<t>In these cases, it is recommended to avoid choosing IRIs that are
2091easily confused. For example, for US-ASCII, the lower-case ell ("l") is
2092easily confused with the digit one ("1"), and the upper-case oh ("O") is
2093easily confused with the digit zero ("0"). Publishers should avoid
2094confusing users with "br0ken" or "1ame" identifiers.</t>
2095
2096<t>Outside the US-ASCII repertoire, there are many more opportunities for
2097confusion; a complete set of guidelines is too lengthy to include
2098here. As long as names are limited to characters from a single script,
2099native writers of a given script or language will know best when
2100ambiguities can appear, and how they can be avoided. What may look
2101ambiguous to a stranger may be completely obvious to the average
2102native user. On the other hand, in some cases, the UCS contains
2103variants for compatibility reasons; for example, for typographic purposes.
2104These should be avoided wherever possible. Although there may be exceptions,
2105newly created resource names should generally be in NFKC
2106<xref target="UTR15"></xref> (which means that they are also in NFC).</t>
2107
2108<t>As an example, the UCS contains the "fi" ligature at U+FB01
2109for compatibility reasons.
2110Wherever possible, IRIs should use the two letters "f" and "i" rather
2111than the "fi" ligature. An example where the latter may be used is
2112in the query part of an IRI for an explicit search for a word written
2113containing the "fi" ligature.</t>
2114
2115<t>In certain cases, there is a chance that characters from different
2116scripts look the same. The best known example is the similarity of the
2117Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid such
2118cases, IRIs should only be created where all the characters in a
2119single component are used together in a given language. This usually
2120means that all of these characters will be from the same script, but
2121there are languages that mix characters from different scripts (such
2122as Japanese).  This is similar to the heuristics used to distinguish
2123between letters and numbers in the examples above. Also, for Latin,
2124Greek, and Cyrillic, using lowercase letters results in fewer
2125ambiguities than using uppercase letters would.</t>
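
<t>A rough check for components that mix scripts can be sketched as
follows. Python's standard library does not expose the Unicode Script
property, so this sketch approximates the script of a letter by the
first word of its Unicode character name; that approximation and the
function name scripts_used are assumptions of this example.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def scripts_used(component):
    """Approximate the script of each letter by the first word of
    its Unicode character name (LATIN, GREEK, CYRILLIC, ...)."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts
```
]]></artwork></figure>

<t>More than one script in a component is only a hint that closer
inspection may be useful, because some languages, such as Japanese,
legitimately mix scripts.</t>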
2126</section>
2127
2128<section title="Display of URIs/IRIs" anchor="display">
2129<t>
2130In situations where the rendering software is not expected to display
2131non-ASCII parts of the IRI correctly using the available layout and font
2132resources, these parts should be percent-encoded before being displayed.</t>
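
<t>One possible approximation is sketched below in Python. It treats
"can be encoded in the output character encoding" as a stand-in for
"can be displayed correctly", which is an assumption of this sketch;
font and layout support cannot be determined this way. The function
name display_form is hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote


def display_form(iri, charset):
    """Percent-encode each non-ASCII character that cannot be
    represented in the given charset; leave the rest unchanged."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(quote(ch))  # UTF-8 based percent-encoding
    return "".join(out)
```
]]></artwork></figure>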
2133
2134<t>For display of Bidi IRIs, please see <xref target="visual"/>.</t>
2135</section>
2136
2137<section title="Interpretation of URIs and IRIs">
2138<t>Software that interprets IRIs as the names of local resources should
2139accept IRIs in multiple forms and convert and match them with the
2140appropriate local resource names.</t>
2141
2142<t>First, multiple representations include both IRIs in the native
2143character encoding of the protocol and also their URI counterparts.</t>
2144
<t>Second, the accepted forms may include URIs constructed based on character
2146encodings other than UTF-8. These URIs may be produced by user agents that do
2147not conform to this specification and that use legacy character encodings to
2148convert non-ASCII characters to URIs. Whether this is necessary, and what
2149character encodings to cover, depends on a number of factors, such as
2150the legacy character encodings used locally and the distribution of
2151various versions of user agents. For example, software for Japanese
2152may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.</t>
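
<t>The following Python sketch shows one way such software might try
several character encodings in turn. The candidate list, its order
(UTF-8 first), and the function name decode_path are assumptions of
this example; which encodings to cover depends on the factors described
above.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes

# Assumed candidate list; UTF-8 is tried first.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp")


def decode_path(uri_path, encodings=CANDIDATES):
    """Return (text, encoding) for the first encoding that decodes
    the percent-decoded octets without error."""
    raw = unquote_to_bytes(uri_path)
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits: %r" % uri_path)
```
]]></artwork></figure>

<t>Because the first encoding that decodes without error wins, octet
sequences that are valid in more than one candidate encoding are
resolved by list order, not by certainty.</t>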
2153
<t>Third, the accepted forms may include additional mappings to be more user-friendly
2155and robust against transmission errors. These would be similar to how
2156some servers currently treat URIs as case insensitive or perform
2157additional matching to account for spelling errors. For characters
2158beyond the US-ASCII repertoire, this may, for example, include
2159ignoring the accents on received IRIs or resource names. Please note
2160that such mappings, including case mappings, are language
2161dependent.</t>
2162
2163<t>It can be difficult to identify a resource unambiguously if too
2164many mappings are taken into consideration. However, percent-encoded
2165and not percent-encoded parts of IRIs can always be clearly distinguished.
2166Also, the regularity of UTF-8 (see <xref target="Duerst97"/>) makes the
2167potential for collisions lower than it may seem at first.</t>
2168</section>
2169
2170<section title="Upgrading Strategy">
2171<t>Where this recommendation places further constraints on software
2172for which many instances are already deployed, it is important to
2173introduce upgrades carefully and to be aware of the various
2174interdependencies.</t>
2175
2176<t>If IRIs cannot be interpreted correctly, they should not be created,
2177generated, or transported. This suggests that upgrading URI interpreting
2178software to accept IRIs should have highest priority.</t>
2179
2180<t>On the other hand, a single IRI is interpreted only by a single or
2181very few interpreters that are known in advance, although it may be
2182entered and transported very widely.</t>
2183
2184<t>Therefore, IRIs benefit most from a broad upgrade of software to be
2185able to enter and transport IRIs. However, before an
2186individual IRI is published, care should be taken to upgrade the corresponding
2187interpreting software in order to cover the forms expected to be
2188received by various versions of entry and transport software.</t>
2189
2190<t>The upgrade of generating software to generate IRIs instead of using a
2191local character encoding should happen only after the service is upgraded
2192to accept IRIs. Similarly, IRIs should only be generated when the service
2193accepts IRIs and the intervening infrastructure and protocol is known
2194to transport them safely.</t>
2195
2196<t>Software converting from URIs to IRIs for display should be upgraded
2197only after upgraded entry software has been widely deployed to the
2198population that will see the displayed result.</t>
2199
2200
2201<t>Where there is a free choice of character encodings, it is often
2202possible to reduce the effort and dependencies for upgrading to IRIs
2203by using UTF-8 rather than another encoding. For example, when a new
2204file-based Web server is set up, using UTF-8 as the character encoding
2205for file names will make the transition to IRIs easier. Likewise, when
2206a new Web form is set up using UTF-8 as the character encoding of the
2207form page, the returned query URIs will use UTF-8 as the character
2208encoding (unless the user, for whatever reason, changes the character
2209encoding) and will therefore be compatible with IRIs.</t>
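
<t>The effect of the form page's character encoding on the returned
query can be illustrated with the following Python sketch. The function
name query_for and the field name "q" are hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import urlencode


def query_for(form_data, page_encoding):
    """Build a query string the way a form submission would, using
    the character encoding of the form page."""
    return urlencode(form_data, encoding=page_encoding)
```
]]></artwork></figure>

<t>For the value "r&amp;#xE9;sum&amp;#xE9;" (in XML Notation), a UTF-8
page yields "q=r%C3%A9sum%C3%A9", which corresponds to the IRI form,
whereas an ISO-8859-1 page yields "q=r%E9sum%E9", which does not.</t>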
2210
2211
2212<t>These recommendations, when taken together, will allow for the
2213extension from URIs to IRIs in order to handle characters other than
2214US-ASCII while minimizing interoperability problems. For
2215considerations regarding the upgrade of URI scheme definitions, see
2216<xref target="UTF8use"/>.</t>
2217
2218</section>
2219</section> <!-- guidelines -->
2220
2221<section title="IANA Considerations" anchor="iana">
2222
<t>RFC Editor and IANA note: Please replace RFC XXXX with the
number of this document when it is published as an RFC. </t>

<t>IANA maintains a registry of "URI schemes". A "URI scheme" also
serves as an "IRI scheme". </t>
2228
2229<t>To clarify that the URI scheme registration process also applies to
2230IRIs, change the description of the "URI schemes" registry
2231header to say "[RFC4395] defines an IANA-maintained registry of URI
2232Schemes. These registries include the Permanent and Provisional URI
2233Schemes.  RFC XXXX updates this registry to designate that schemes may
also indicate their usability as IRI schemes."</t>
2235
2236<t> Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".
2237</t>
2238
2239</section> <!-- IANA -->
2240   
2241<section title="Security Considerations" anchor="security">
2242<t>The security considerations discussed in <xref target="RFC3986"/>
2243also apply to IRIs. In addition, the following issues require
2244particular care for IRIs.</t>
2245<t>Incorrect encoding or decoding can lead to security problems.
2246In particular, some UTF-8 decoders do not check against overlong
2247byte sequences. As an example, a "/" is encoded with the byte 0x2F
2248both in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2249interpret the sequence 0xC0 0xAF as a "/". A sequence such as "%C0%AF.."
2250may pass some security tests and then be interpreted
2251as "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2252and checking are not done in the right order, and/or if reserved
2253characters and unreserved characters are not clearly distinguished.</t>
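
<t>The ordering issue can be sketched in Python as follows: the octets
are percent-decoded first, then decoded with a strict UTF-8 decoder
that rejects overlong sequences, and only then checked for dot-dot
segments. The function name safe_path is hypothetical, and this sketch
does not keep percent-encoded reserved characters distinct from literal
ones, which a real implementation would also need to consider.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def safe_path(raw_path):
    """Percent-decode, strictly decode as UTF-8, then check for
    dot-dot segments. Raises ValueError on any failure."""
    octets = unquote_to_bytes(raw_path)
    try:
        # The strict decoder rejects overlong forms such as C0 AF.
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("ill-formed UTF-8 in path") from exc
    if ".." in text.split("/"):
        raise ValueError("dot-dot segment in path")
    return text
```
]]></artwork></figure>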
2254
2255<t>There are various ways in which "spoofing" can occur with IRIs.
2256"Spoofing" means that somebody may add a resource name that looks the
2257same or similar to the user, but that points to a different resource.
2258The added resource may pretend to be the real resource by looking
2259very similar but may contain all kinds of changes that may be
2260difficult to spot and that can cause all kinds of problems.
2261Most spoofing possibilities for IRIs are extensions of those for URIs.</t>
2262
<t>Spoofing can occur for various reasons. First, a user's normalization
expectations, or the normalization actually applied when entering an IRI
or transcoding an IRI from a legacy character encoding, may not match
the normalization used on the
server side. Conceptually, this is no different from the problems
2267surrounding the use of case-insensitive web servers. For example,
2268a popular web page with a mixed-case name ("http://big.example.com/PopularPage.html")
2269might be "spoofed" by someone who is able to create "http://big.example.com/popularpage.html".
2270However, the use of unnormalized character sequences, and of additional
2271mappings for user convenience, may increase the chance for spoofing.
2272Protocols and servers that allow the creation of resources with
2273names that are not normalized are particularly vulnerable to such
2274attacks. This is an inherent
2275security problem of the relevant protocol, server, or resource
2276and is not specific to IRIs, but it is mentioned here for completeness.</t>
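
<t>A server that allows users to create resources could look for names
that collide under its matching rules. The following Python sketch
assumes a matching key of NFC followed by case folding; that key and
the names match_key and collisions are assumptions of this example, and
a real server would use whatever matching it actually applies.</t>

<figure><artwork><![CDATA[
```python
import unicodedata
from collections import defaultdict


def match_key(name):
    """Assumed matching key: NFC followed by case folding."""
    return unicodedata.normalize("NFC", name).casefold()


def collisions(names):
    """Group names that differ as strings but share a matching key."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]
```
]]></artwork></figure>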
2277
2278<t>Spoofing can occur in various IRI components, such as the
2279domain name part or a path part. For considerations specific
2280to the domain name part, see <xref target="RFC3491"/>.
2281For the path part, administrators of sites that allow independent
2282users to create resources in the same sub area may have to be careful
2283to check for spoofing.</t>
2284
2285<t>Spoofing can occur because in the UCS many characters look very similar. Details are discussed in <xref target="selection"/>.
2286Again, this is very similar to spoofing possibilities on US-ASCII,
2287e.g., using "br0ken" or "1ame" URIs.</t>
2288
2289<t>Spoofing can occur when URIs with percent-encodings based on various
2290character encodings are accepted to deal with older user agents. In some
2291cases, particularly for Latin-based resource names, this is usually easy to
2292detect because UTF-8-encoded names, when interpreted and viewed as
legacy character encodings, produce mostly garbage.</t>

<t>When
2294concurrently used character encodings have a similar structure but there
2295are no characters that have exactly the same encoding, detection is more
2296difficult.</t>
2297
2298<t>Spoofing can occur with bidirectional IRIs, if the restrictions
2299in <xref target="bidi-structure"/> are not followed. The same visual
2300representation may be interpreted as different logical representations,
2301and vice versa. It is also very important that a correct Unicode bidirectional
implementation be used.</t>

<t>The use of Legacy Extended IRIs introduces additional security issues.</t>
2303</section><!-- security -->
2304
2305<section title="Acknowledgements">
<t>For contributions to this update, we would like to thank Ian Hickson, Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin, Henry S. Thompson, and the XML Core Working Group of the W3C.</t>
2307
2308<t>The discussion on the issue addressed here started a long time
2309ago. There was a thread in the HTML working
2310group in August 1995 (under the topic of "Globalizing URIs") and in the
2311www-international mailing list in July 1996 (under the topic of
2312"Internationalization and URLs"), and there were ad-hoc meetings at the Unicode
2313conferences in September 1995 and September 1997.</t>
2314
2315<t>For contributions to the previous version of this document, RFC 3987, many thanks go to
2316Francois Yergeau, Matitiahu Allouche,
2317Roy Fielding, Tim Berners-Lee, Mark Davis,
2318M.T. Carrasco Benitez, James Clark, Tim Bray, Chris Wendt, Yaron Goland,
2319Andrea Vine, Misha Wolf, Leslie Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman,
2320Russ Housley, Makoto MURATA, Steven Atkin,
2321Ryan Stansifer, Tex Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs,
2322Adam Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown,
2323Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio,
2324Chris Haynes, Walter Underwood, and many others.</t>
<t>A definition of HyperText Reference was initially produced by Ian Hickson,
and further edited by Dan Connolly and C. M. Sperberg-McQueen.</t>
2327<t>Thanks to the Internationalization Working
2328Group (I18N WG) of the World Wide Web Consortium (W3C),
2329and the members of the W3C
2330I18N Working Group and Interest Group for their contributions and their
2331work on <xref target="CharMod"/>. Thanks also go
2332to the members of many other W3C Working Groups for adopting IRIs, and to
2333the members of the Montreal IAB Workshop on Internationalization and
2334Localization for their review.</t>
2335</section>
2336
2337
2338<section title="Open Issues">
2339
2340<t>NOTE: The issues noted in this section should be addressed before the document is submitted as an RFC.
2341These issues are not in any particular order, and do not necessarily form a complete list of all known issues.
2342 
2343<list style="hanging">
2344
2345<t hangText="length limits on domain name">See, for example,
2346  http://lists.w3.org/Archives/Public/public-iri/2009Sep/0064.html discussion on public-iri@w3.org
2347(that discussion is mostly irrelevant now as the "63 octets in UTF-8 per label" restriction was
2348dropped)</t>
2349
2350<t hangText="Allow generic scheme-independent IRI to URI translation">Previous drafts of this
2351  specification proposed a generic IRI to URI transformation using pct-encoding,
2352  and allowed domain name translation to be optionally handled by retranslating host names
2353  from pct-encoding back into Unicode and then into punycode.
2354  This draft does not allow that behavior, but this should be fixed to be in line
  with RFC 3986 syntax and to lead implementations toward a uniform and long-term
2356  URI&lt;-&gt;IRI correspondence. See also <xref target='Gettys'/></t>
2357
2358<t hangText="update URI scheme registry?">This document starts the process of making minor
2359  changes to the URI scheme registry. This should be handled as an update to RFC 4395.</t>
2360
<t hangText="utf8 in HTTP">Not really an IRI issue, but some HTTP implementations send UTF-8 paths directly; this needs review.</t>
2362
2363<t hangText="handling of \">Some web applications convert \ to / and others don't.
2364  Make this mandatory or disallowed (but not optional), for Web Addresses.</t>
2365
2366<t hangText="dealing with disallowed IRI characters"> </t>
2367
2368<t hangText="misplaced text">
2369Find a place to note that some older software transcoding to UTF-8 may produce
2370illegal output for some input, in particular for characters outside
2371the BMP (Basic Multilingual Plane). As an example, for the IRI with non-BMP characters (in XML Notation):
<vspace/>"http://example.com/&amp;#x10300;&amp;#x10301;&amp;#x10302;",
2373<vspace/>which contains the first three letters of the Old Italic alphabet,
2374the correct conversion to a URI is
2375<vspace/>"http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"</t>
2376
2377<t hangText="Special Query Handling needed?">
2378The percent-encoding handling of query components in the HTTP scheme
2379is really unfortunate. There is no good normative advice to give if
2380the percent-encoding is delayed until the query-IRI is interpreted. Could
2381HTML ask browsers to percent-encode the form data using the document
2382character set BEFORE the query IRI is constructed, and only in the case where
2383the document character set isn't Unicode-based and the query is
2384being added to http: or https: URIs?  This would give more
2385consistent results.  Browsers might have to change their behavior in
constructing the IRI-with-query-added, but the results would be more
consistent, with fewer bugs, and it wouldn't affect interpretation of
2388any existing web pages. It would remove the need to have a normative
2389special case for queries in HTML documents, just for http, in a way
2390in which things like transcoding etc. wouldn't work well.  You could
2391tell the difference between a query URI in the address bar and one
2392created via a form because the address bar would always be UTF-8.
2393The browsers might have to change the algorithm for showing the address
in the address bar to know how to undo the encoding.</t>
2395
2396<t hangText="handling illegal characters">
2397
2398<xref target="compmapping" /> used to apply only to characters in
2399either 'ucschar' or 'iprivate', but then later said that systems
2400accepting IRIs MAY also deal with the printable characters in US-ASCII
that are not allowed in URIs, namely "&lt;", "&gt;", '"', space, "{",
"}", "|", "\", "^", and "`".  Larry felt that a MAY would result
in non-uniform behavior, because some systems would produce valid URI
components and others wouldn't.  Non-printable US-ASCII characters
should be stripped by most software, so if they are
passed on somewhere as IRI characters, encoding them makes sense.
2407
2408The section also used to say "If these characters are found
2409but are not converted, then the conversion SHOULD fail." but there is
2410no notion of conversion failing -- every string is converted.  Please
2411note that the number sign ("#"), the percent sign ("%"), and the
2412square bracket characters ("[", "]") are not part of the above list
2413and MUST NOT be converted.
2414 </t>
2415<t hangText="adding single % and hash">
2416Changed the BNF to not match the URI document in allowing
2417single % in path but not everywhere, and allowing a # in the
2418fragment part.</t>
2419
2420</list></t>
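
<t>For the non-BMP example in the "misplaced text" item above, the
correct conversion and a check for ill-formed output can be sketched in
Python as follows. The function names to_uri_path and escapes_are_utf8
are hypothetical. The sketch assumes that one form of illegal output is
the separate encoding of each UTF-16 surrogate; the text above does not
say which form is meant.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote, unquote_to_bytes


def to_uri_path(text):
    """Percent-encode the UTF-8 form of the characters in a path."""
    return quote(text, safe="/")


def escapes_are_utf8(uri_path):
    """True if the percent-decoded octets are well-formed UTF-8."""
    try:
        unquote_to_bytes(uri_path).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```
]]></artwork></figure>

<t>A strict UTF-8 decoder rejects octet sequences that encode
surrogates individually, so escapes_are_utf8() returns False for a path
such as "/%ED%A0%80%ED%BC%80".</t>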
2421</section>
2422
2423<section title="Change Log">
2424
2425<t>Note to RFC Editor: Please completely remove this section before publication.</t>
2426
2427<section title='Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00'>
2428     <t>Changed draft name, date, last paragraph of abstract, and titles in change log, and added this section
2429     in moving from draft-duerst-iri-bis-07 (personal submission) to draft-ietf-iri-3987bis-00 (WG document).</t>
2430</section>
2431
2432<section title="Changes from -06 to -07 of draft-duerst-iri-bis" anchor="forkChanges"><t>
2433
2434Major restructuring of IRI processing model to make scheme-specific translation necessary to handle IDNA requirements and for consistency with web implementations. </t>
2435<t>Starting with IRI, you want one of:
2436<list style="hanging">
2437<t hangText="a"> IRI components (IRI parsed into UTF8 pieces)</t>
2438<t hangText="b"> URI components (URI parsed into ASCII pieces, encoded correctly) </t>
2439<t hangText="c"> whole URI  (for passing on to some other system that wants whole URIs) </t>
2440</list></t>
2441
2442<section title="OLD WAY">
2443<t><list style="numbers">
2444
2445 <t>Pct-encoding on the whole thing to a URI.
2446 (c1) If you want a (maybe broken) whole URI, you might
2447        stop here.</t>
2448
2449 <t>Parsing the URI into URI components.
2450   (b1) If you want (maybe broken) URI components, stop here.</t>
2451
2452 <t> Decode the components (undoing the pct-encoding).
2453   (a) if you want IRI components, stop here.</t>
2454
 <t> Reencode: either using a different encoding for some components
2456   (for domain names, and query components in web pages, which
2457   depends on the component, scheme and context), and otherwise
2458   using pct-encoding.
2459   (b2) if you want (good) URI components, stop here.</t>
2460
2461 <t> reassemble the reencoded components.
2462   (c2) if you want a (*good*) whole URI stop here.</t>
2463</list>
2464
2465</t>
2466
2467</section>
2468
2469<section title="NEW WAY">
2470<t>
2471<list style="numbers">
2472
2473<t> Parse the IRI into IRI components using the generic syntax.
2474   (a) if you want IRI components, stop here.</t>
2475
<t> Encode each component, using pct-encoding, IDN encoding, or
         special query part encoding depending on the component,
         scheme, or context. (b) If you want URI components, stop here.</t>
<t> Reassemble a whole URI from URI components.
2480   (c) if you want a whole URI stop here.</t>
2481</list></t>
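
<t>The three steps above can be sketched in Python as follows. This is
a simplified illustration, not the normative mapping: it uses Python's
built-in "idna" codec (which implements IDNA as defined in RFC 3490)
for the host, uses UTF-8 based pct-encoding for all other components,
and ignores userinfo, IP literals, and any special query encoding. The
names iri_to_uri and SAFE are hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched by pct-encoding. "%" is kept so that
# existing escapes are not encoded a second time.
SAFE = "/:@!$&'()*+,;=%"


def iri_to_uri(iri):
    # Step 1: parse the IRI into components (generic syntax).
    parts = urlsplit(iri)
    # Step 2: encode each component according to its kind.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, host, path, query, fragment))
```
]]></artwork></figure>

<t>Stopping after step 1 gives IRI components (a), and stopping after
step 2 gives URI components (b), as in the list above.</t>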
2482</section>
2483</section>
2484
2485<section title='Changes from -00 to -01'><t><list style="symbols">
2486  <t>Removed 'mailto:' before mail addresses of authors.</t>
2487  <t>Added "&lt;to be done&gt;" as right side of 'href-strip' rule. Fixed '|' to '/' for
2488    alternatives.</t>
2489</list></t>
2490</section>
2491
<section title="Changes from -05 to -06 of draft-duerst-iri-bis"><t><list style="symbols">
2493<t>Add HyperText Reference, change abstract, acks and references for it</t>
2494<t>Add Masinter back as another editor.</t>
2495<t>Masinter integrates HRef material from HTML5 spec.</t>
2496<t>Rewrite introduction sections to modernize.</t>
2497</list></t>
2498</section>
2499
2500<section title="Changes from -04 to -05 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated references.</t><t>Changed IPR text to pre5378Trust200902.</t></list></t>
2501</section>
2502
2503<section title="Changes from -03 to -04 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added explicit abbreviation for LEIRIs.</t><t>Mentioned LEIRI references.</t><t>Completed text in LEIRI section about tag characters and about specials.</t></list></t>
2504</section>
2505
<section title="Changes from -02 to -03 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated some references.</t><t>Updated Michel Suignard's coordinates.</t></list></t>
2507</section>
2508
2509<section title="Changes from -01 to -02 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added tag range to iprivate (issue private-include-tags-115).</t><t>Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.</t></list></t>
2510</section>
<section title="Changes from -00 to -01 of draft-duerst-iri-bis"><t><list style="symbols"><t>Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" based on input from the W3C XML Core WG. Moved the relevant subsections to the back and promoted them to a section.</t><t>Added some text re. Legacy Extended IRIs to the security section.</t><t>Added an IANA Considerations Section.</t><t>Added this Change Log Section.</t><t>Added a section about "IRIs with Spaces/Controls" (converting from a Note in RFC 3987).</t></list></t>
2512</section>
2513<section title="Changes from RFC 3987 to -00 of draft-duerst-iri-bis"><t><list><t>Fixed errata (see http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).</t></list></t>
2514</section>
2515</section>
2516</middle>
2517
2518<back>
2519<references title="Normative References">
2520
2521<reference anchor="ASCII">
2522<front>
2523<title>Coded Character Set -- 7-bit American Standard Code for Information
2524Interchange</title>
2525<author>
2526<organization>American National Standards Institute</organization>
2527</author>
2528<date year="1986"/>
2529</front>
2530<seriesInfo name="ANSI" value="X3.4"/>
2531</reference>
2532
2533<reference anchor="ISO10646">
2534<front>
2535<title>ISO/IEC 10646:2003: Information Technology -
2536Universal Multiple-Octet Coded Character Set (UCS)</title>
2537<author>
2538<organization>International Organization for Standardization</organization>
2539</author>
2540<date month="December" year="2003"/>
2541</front>
2542<seriesInfo name="ISO" value="Standard 10646"/>
2543</reference>
2544
2545&rfc2119;
2546&rfc3490;
2547&rfc3491;
2548&rfc3629;
2549&rfc3986;
2550
2551<reference anchor="STD68">
2552<front>
2553<title abbrev="ABNF">Augmented BNF for Syntax Specifications: ABNF</title>
2554<author initials="D." surname="Crocker" fullname="Dave Crocker"><organization/></author>
2555<author initials="P." surname="Overell" fullname="Paul Overell"><organization/></author>
2556<date month="January" year="2008"/></front>
2557<seriesInfo name="STD" value="68"/><seriesInfo name="RFC" value="5234"/>
2558</reference>
2559
2560<reference anchor="UNIV4">
2561<front>
2562<title>The Unicode Standard, Version 5.1.0, defined by: The Unicode Standard,
2563Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0),
2564as amended by Unicode 5.1.0 (http://www.unicode.org/versions/Unicode5.1.0/)</title>
2565<author><organization>The Unicode Consortium</organization></author>
2566<date year="2008" month="April"/>
2567</front>
2568</reference>
2569
2570<reference anchor="UNI9" target="http://www.unicode.org/reports/tr9/tr9-13.html">
2571<front>
2572<title>The Bidirectional Algorithm</title>
2573<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2574<date year="2004" month="March"/>
2575</front>
2576<seriesInfo name="Unicode Standard Annex" value="#9"/>
2577</reference>
2578
2579<reference anchor="UTR15" target="http://www.unicode.org/unicode/reports/tr15/tr15-23.html">
2580<front>
2581<title>Unicode Normalization Forms</title>
2582<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2583<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2584<date year="2008" month="March"/>
2585</front>
2586<seriesInfo name="Unicode Standard Annex" value="#15"/>
2587</reference>
2588
2589</references>
2590
2591<references title="Informative References">
2592
2593<reference anchor="BidiEx" target="http://www.w3.org/International/iri-edit/BidiExamples">
2594<front>
2595<title>Examples of bidirectional IRIs</title>
2596<author><organization/></author>
2597<date year="" month=""/>
2598</front>
2599</reference>
2600
2601<reference anchor="CharMod" target="http://www.w3.org/TR/charmod-resid">
2602<front>
2603<title>Character Model for the World Wide Web: Resource Identifiers</title>
2604<author initials="M." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2605<author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2606<author initials="R." surname="Ishida" fullname="Richard Ishida"><organization/></author>
2607<author initials="M." surname="Wolf" fullname="Misha Wolf"><organization/></author>
2608<author initials="T." surname="Texin" fullname="Tex Texin"><organization/></author>
2609<date year="2004" month="November" day="25"/>
2610</front>
2611<seriesInfo name="World Wide Web Consortium" value="Candidate Recommendation"/>
2612</reference>
2613
2614<reference anchor="Duerst97" target="http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf">
2615<front>
2616<title>The Properties and Promises of UTF-8</title>
2617<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2618<date year="1997" month="September"/>
2619</front>
2620<seriesInfo name="Proc. 11th International Unicode Conference, San Jose" value=""/>
2621</reference>
2622
2623<reference anchor="Gettys" target="http://www.w3.org/DesignIssues/ModelConsequences">
2624<front>
2625<title>URI Model Consequences</title>
2626<author initials="J." surname="Gettys" fullname="Jim Gettys"><organization/></author>
2627<date month="" year=""/>
2628</front>
2629</reference>
2630
2631<reference anchor="HTML4" target="http://www.w3.org/TR/html401/appendix/notes.html#h-B.2">
2632<front>
2633<title>HTML 4.01 Specification</title>
2634<author initials="D." surname="Raggett" fullname="Dave Raggett"><organization/></author>
2635<author initials="A." surname="Le Hors" fullname="Arnaud Le Hors"><organization/></author>
2636<author initials="I." surname="Jacobs" fullname="Ian Jacobs"><organization/></author>
2637<date year="1999" month="December" day="24"/>
2638</front>
2639<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2640</reference>
2641
2642<reference anchor="LEIRI" target="http://www.w3.org/TR/leiri/">
2643<front>
2644<title>Legacy extended IRIs for XML resource identification</title>
2645<author initials="H." surname="Thompson" fullname="Henry Thompson"><organization/></author>
2646<author initials="R." surname="Tobin"    fullname="Richard Tobin"><organization/></author>
2647<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2648  <date year="2008" month="November" day="3"/>
2649
2650</front>
2651<seriesInfo name="World Wide Web Consortium" value="Note"/>
2652</reference>
2653
2654
2655&rfc2045;
2656&rfc2130;
2657&rfc2141;
2658&rfc2192;
2659&rfc2277;
2660&rfc2368;
2661&rfc2384;
2662&rfc2396;
2663&rfc2397;
2664&rfc2616;
2665&rfc1738;
2666&rfc2640;
2667&rfc4395;
2668
2669<reference anchor="UNIXML" target="http://www.w3.org/TR/unicode-xml/">
2670
2671<front>
2672<title>Unicode in XML and other Markup Languages</title>
2673<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2674<author initials="A." surname="Freytag" fullname="Asmus Freytag"><organization/></author>
2675<date year="2003" month="June" day="18"/>
2676</front>
2677<seriesInfo name="Unicode Technical Report" value="#20"/>
2678<seriesInfo name="World Wide Web Consortium" value="Note"/>
2679</reference>
2680
2681<reference anchor="XLink" target="http://www.w3.org/TR/xlink/#link-locators">
2682<front>
2683<title>XML Linking Language (XLink) Version 1.0</title>
2684<author initials="S." surname="DeRose" fullname="Steve DeRose"><organization/></author>
2685<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2686<author initials="D." surname="Orchard" fullname="David Orchard"><organization/></author>
2687<date year="2001" month="June" day="27"/>
2688</front>
2689<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2690</reference>
2691
2692<reference anchor="XML1" target="http://www.w3.org/TR/REC-xml">
2693  <front>
2694    <title>Extensible Markup Language (XML) 1.0 (Fourth Edition)</title>
2695    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2696    <author initials="J." surname="Paoli" fullname="Jean Paoli"><organization/></author>
2697    <author initials="C.M." surname="Sperberg-McQueen" fullname="C. M. Sperberg-McQueen">
2698      <organization/></author>
2699    <author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2700    <author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2701    <date day="16" month="August" year="2006"/>
2702  </front>
2703  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2704</reference>
2705
2706<reference anchor="XMLNamespace" target="http://www.w3.org/TR/REC-xml-names">
2707  <front>
2708    <title>Namespaces in XML (Second Edition)</title>
2709    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2710    <author initials="D." surname="Hollander" fullname="Dave Hollander"><organization/></author>
2711    <author initials="A." surname="Layman" fullname="Andrew Layman"><organization/></author>
2712    <author initials="R." surname="Tobin" fullname="Richard Tobin"><organization></organization></author><date day="16" month="August" year="2006"/>
2713  </front>
2714  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2715</reference>
2716
2717<reference anchor="XMLSchema" target="http://www.w3.org/TR/xmlschema-2/#anyURI">
2718<front>
2719<title>XML Schema Part 2: Datatypes</title>
2720<author initials="P." surname="Biron" fullname="Paul Biron"><organization/></author>
2721<author initials="A." surname="Malhotra" fullname="Ashok Malhotra"><organization/></author>
2722<date year="2001" month="May" day="2"/>
2723</front>
2724<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2725</reference>
2726
2727<reference anchor="XPointer" target="http://www.w3.org/TR/xptr-framework/#escaping">
2728<front>
2729<title>XPointer Framework</title>
2730<author initials="P." surname="Grosso" fullname="Paul Grosso"><organization/></author>
2731<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2732<author initials="J." surname="Marsh" fullname="Jonathan Marsh"><organization/></author>
2733<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2734<date year="2003" month="March" day="25"/>
2735</front>
2736<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2737</reference>
2738
2739<reference anchor="HTML5" target="http://www.w3.org/TR/2009/WD-html5-20090423/">
2740<front>
2741<title>A vocabulary and associated APIs for HTML and XHTML</title>
2742<author initials="I." surname="Hickson" fullname="Ian Hickson"><organization>Google, Inc.</organization></author>
2743<author initials="D." surname="Hyatt" fullname="David Hyatt"><organization>Apple, Inc.</organization></author>
2744<date year="2009"  month="April" day="23"/>
2745</front>
2746<seriesInfo name="World Wide Web Consortium" value="Working Draft"/>
2747</reference>
2748
2749</references>
2750
2751<section title="Design Alternatives">
2752<t>This section briefly summarizes some design alternatives
2753considered earlier and the reasons why they were not chosen.</t>
2754<section title="New Scheme(s)">
2755<t>Introducing new schemes (for example, httpi:, ftpi:,...) or a
2756new metascheme (e.g., i:, leading to URI/IRI prefixes such as
2757i:http:, i:ftp:,...) was proposed to make IRI-to-URI conversion
2758scheme-dependent or to distinguish between percent-encodings
2759resulting from IRI-to-URI conversion and percent-encodings from
2760legacy character encodings.</t>
2761
2762<t>New schemes are not needed to distinguish URIs from true IRIs (i.e.,
2763  IRIs that contain non-ASCII characters). The benefit of being able
2764  to detect the origin of percent-encodings is marginal, as UTF-8
2765  can be detected with very high reliability. Deploying new schemes is
2766  extremely hard, so not requiring new schemes for IRIs makes
2767  deployment of IRIs vastly easier. Making conversion scheme-dependent
2768  is highly inadvisable, yet separate schemes for IRIs would encourage it.
2769  Using a uniform convention for conversion from IRIs to URIs makes
2770  IRI implementation orthogonal to the introduction of actual new
2771  schemes.</t>
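<t>As a rough illustration of why UTF-8 can be detected with high
  reliability, the following Python sketch checks whether the
  percent-encoded octets of a URI component form a well-formed UTF-8
  sequence. The function name percent_octets_are_utf8 is hypothetical
  and not defined by this specification. Pure US-ASCII input also
  passes the check, because US-ASCII is a subset of UTF-8.</t>
<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def percent_octets_are_utf8(component: str) -> bool:
    """Return True if the octets of `component`, after undoing
    percent-encoding, form a well-formed UTF-8 sequence."""
    octets = unquote_to_bytes(component)
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# e-acute percent-encoded from UTF-8 (two octets) ...
print(percent_octets_are_utf8("ros%C3%A9"))  # True
# ... and from ISO-8859-1 (one octet, not well-formed UTF-8).
print(percent_octets_are_utf8("ros%E9"))     # False
```
]]></artwork></figure>
<t>A passing check is strong evidence rather than proof: octet
  sequences from a legacy encoding can, in rare cases, happen to be
  well-formed UTF-8.</t>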
2772</section>
2773<section title="Character Encodings Other Than UTF-8">
2774<t>At an early stage, UTF-7 was considered as an alternative to
2775UTF-8 when IRIs are converted to URIs. UTF-7 would not have needed
2776percent-encoding and in most cases would have been shorter than
2777percent-encoded UTF-8.</t>
2778<t>Using UTF-8 avoids a double layering and overloading of the use of
2779   the "+" character. UTF-8 is fully compatible with US-ASCII and has
2780   therefore been recommended by the IETF, and is being used widely.</t>
2781 
2782  <t>UTF-7 has never been used much and is now clearly being
2783   discouraged. Requiring implementations to convert from UTF-8
2784   to UTF-7 and back would be an additional implementation burden.</t>
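<t>The trade-off can be observed with a small Python sketch that
   encodes the same short string both ways. The sample string is an
   assumption chosen for illustration; the sketch relies only on
   Python's built-in "utf-7" codec and on urllib.parse.quote.</t>
<figure><artwork><![CDATA[
```python
from urllib.parse import quote

# "rose" with an acute accent on the final letter (U+00E9).
text = "ros\u00e9"

# UTF-7 yields ASCII-only output, so no percent-encoding is needed.
utf7_form = text.encode("utf-7").decode("ascii")

# Percent-encoded UTF-8: one "%HH" triplet per non-ASCII octet.
utf8_form = quote(text, safe="")

print(utf7_form, len(utf7_form))
print(utf8_form, len(utf8_form))

# UTF-7 introduces its shifted sequences with "+".
print("+" in utf7_form)
```
]]></artwork></figure>
<t>The UTF-7 form is shorter for this input, but it depends on the
   "+" character, which is the overloading that using UTF-8 avoids.</t>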
2785</section> <!-- notutf8 -->
2786<section title="New Encoding Convention">
2787<t>Instead of using the existing percent-encoding convention
2788of URIs, which is based on octets, the idea was to create a new
2789encoding convention; for example, to use "%u" to introduce
2790UCS code points.</t>
2791<t>Using the existing octet-based percent-encoding mechanism
2792does not need an upgrade of the URI syntax and does not
2793need corresponding server upgrades.</t>
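<t>The difference can be sketched in Python as follows. The function
names are hypothetical, the "%u" output format (four hexadecimal
digits per code point) is only one assumed form of such a
convention, and the regular expression is a simplification that
covers just the unreserved characters and percent-encoded triplets
of the URI syntax.</t>
<figure><artwork><![CDATA[
```python
import re
from urllib.parse import quote

# Simplified check: unreserved characters and "%HH" triplets only.
URI_CHARS = re.compile(r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})*")


def percent_encode_utf8(text: str) -> str:
    """Existing convention: percent-encode each UTF-8 octet."""
    return quote(text, safe="")


def percent_u_encode(text: str) -> str:
    """Hypothetical "%u" convention: one escape per code point."""
    return "".join(
        ch if ch.isascii() and ch.isalnum()
        else "%u{:04X}".format(ord(ch))
        for ch in text
    )


text = "ros\u00e9"
octet_form = percent_encode_utf8(text)
code_point_form = percent_u_encode(text)

print(octet_form)       # ros%C3%A9
print(code_point_form)  # ros%u00E9

# Only the octet-based form fits the existing URI syntax.
print(URI_CHARS.fullmatch(octet_form) is not None)       # True
print(URI_CHARS.fullmatch(code_point_form) is not None)  # False
```
]]></artwork></figure>
<t>Because "%" must be followed by two hexadecimal digits in the
existing URI syntax, the "%u" form would be rejected by software
that follows that syntax, whereas the octet-based form passes
through unchanged.</t>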
2794</section> <!-- new encoding -->
2795<section title="Indicating Character Encodings in the URI/IRI">
2796<t>Some proposals suggested indicating the character encodings used
2797in a URI or IRI with some new syntactic convention in the URI itself,
2798similar to the "charset" parameter for e-mails and Web pages.
2799As an example, the label in square brackets in
2800"http://www.example.org/ros[iso-8859-1]&amp;#xE9;" indicated that
2801the following "&amp;#xE9;" had to be interpreted as iso-8859-1.</t>
2802<t>If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed.
2803Exclusive use of UTF-8 avoids potentially multiple labels that have to be copied correctly
2804in all cases, even on the side of a bus or on a napkin, leading to
2805usability problems (and being prohibitively annoying).
2806Exclusively using UTF-8 also reduces transcoding errors and confusion.</t>
2807</section> <!-- indicating -->
2808</section>
2809</back>
2810</rfc>