source: draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml @ 21

Last change on this file since 21 was 21, checked in by duerst@…, 9 years ago

draft number and date change for submission as -02

1<?xml version="1.0"?>
2<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
3<!ENTITY rfc1738 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1738.xml">
4<!ENTITY rfc2045 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2045.xml">
5<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
6<!ENTITY rfc2130 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2130.xml">
7<!ENTITY rfc2141 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2141.xml">
8<!ENTITY rfc2192 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2192.xml">
9<!ENTITY rfc2277 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2277.xml">
10<!ENTITY rfc2368 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2368.xml">
11<!ENTITY rfc2384 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2384.xml">
12<!ENTITY rfc2396 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2396.xml">
13<!ENTITY rfc2397 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2397.xml">
14<!ENTITY rfc2616 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2616.xml">
15<!ENTITY rfc2640 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2640.xml">
16<!ENTITY rfc3490 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3490.xml">
17<!ENTITY rfc3491 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3491.xml">
18<!ENTITY rfc3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml">
19<!ENTITY rfc3986 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml">
20<!ENTITY rfc4395 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4395.xml">
21<!ENTITY rfc5890 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5890.xml">
22<!ENTITY rfc5891 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5891.xml">
23]>
24<?rfc strict='yes'?>
25<!--     complains about too long lines (2 cases)
26     and appendix, but otherwise is okay
27-->
28<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
29<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
30<?rfc symrefs='yes'?>
31<?rfc sortrefs='yes'?>
32<?rfc iprnotified="no" ?>
33<?rfc toc='yes'?>
34<?rfc compact='yes'?>
35<?rfc subcompact='no'?>
36<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-02" category="std" xml:lang="en" obsoletes="3987">
37<front>
38<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>
39
40  <author initials="M.J." surname="Duerst" fullname='Martin Duerst'>
41    <!-- (Note: Please write "Duerst" with u-umlaut wherever
42      possible, for example as "D&#252;rst" in XML and HTML) -->
43  <organization abbrev="Aoyama Gakuin University">Aoyama Gakuin University</organization>
44  <address>
45  <postal>
46  <street>5-10-1 Fuchinobe</street>
47  <city>Sagamihara</city>
48  <region>Kanagawa</region>
49  <code>229-8558</code>
50  <country>Japan</country>
51  </postal>
52  <phone>+81 42 759 6329</phone>
53  <facsimile>+81 42 759 6495</facsimile>
54  <email>duerst@it.aoyama.ac.jp</email>
55  <uri>http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/<!-- (Note: This is the percent-encoded form of an IRI)--></uri>
56  </address>
57</author>
58
59<author initials="M.L." surname="Suignard" fullname="Michel Suignard">
60   <organization>Unicode Consortium</organization>
61   <address>
62   <postal>
63   <street></street>
64   <street>P.O. Box 391476</street>
65   <city>Mountain View</city>
66   <region>CA</region>
67   <code>94039-1476</code>
68   <country>U.S.A.</country>
69   </postal>
70   <phone>+1-650-693-3921</phone>
71   <email>michel@unicode.org</email>
72   <uri>http://www.suignard.com</uri>
73   </address>
74</author>
75<author initials="L." surname="Masinter" fullname="Larry Masinter">
76   <organization>Adobe</organization>
77   <address>
78   <postal>
79   <street>345 Park Ave</street>
80   <city>San Jose</city>
81   <region>CA</region>
82   <code>95110</code>
83   <country>U.S.A.</country>
84   </postal>
85   <phone>+1-408-536-3024</phone>
86   <email>masinter@adobe.com</email>
87   <uri>http://larry.masinter.net</uri>
88   </address>
89</author>
90
91<date year="2010" month="October" day="17"/>
92<area>Applications</area>
93<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
94<keyword>IRI</keyword>
95<keyword>Internationalized Resource Identifier</keyword>
96<keyword>UTF-8</keyword>
97<keyword>URI</keyword>
98<keyword>URL</keyword>
99<keyword>IDN</keyword>
100<keyword>LEIRI</keyword>
101
102<abstract>
103<t>This document defines the Internationalized Resource Identifier
104(IRI) protocol element, as an extension of the Uniform Resource
105Identifier (URI).  An IRI is a sequence of characters from the
106Universal Character Set (Unicode/ISO 10646). Grammar and processing
107rules are given for IRIs and related syntactic forms.</t>
108
109<t>In addition, this document provides named additional rule sets
110for processing otherwise invalid IRIs, in a way that supports
111other specifications that wish to mandate common behavior for
112'error' handling. In particular, rules used in some XML languages
113(LEIRI) and web applications are given.</t>
114
115<t>Defining IRI as new protocol element (rather than updating or
116extending the definition of URI) allows independent orderly
117transitions: other protocols and languages that use URIs must
118explicitly choose to allow IRIs.</t>
119
120<t>Guidelines are provided for the use and deployment of IRIs and
121related protocol elements when revising protocols, formats, and
122software components that currently deal only with URIs.</t>
123
124</abstract>
125  <note title='RFC Editor: Please remove the next paragraph before publication.'>
126    <t>This document is intended to update RFC 3987 and move towards IETF
127    Draft Standard.  For discussion and comments on this
128    draft, please join the IETF IRI WG by subscribing to the mailing
129    list public-iri@w3.org. For a list of open issues, please see
130    the issue tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1.</t>
131</note>
132</front>
133<middle>
134
135<section title="Introduction">
136
137<section title="Overview and Motivation" anchor="overview">
138
139<t>A Uniform Resource Identifier (URI) is defined in <xref
140target="RFC3986"/> as a sequence of characters chosen from a limited
141subset of the repertoire of US-ASCII <xref target="ASCII"/>
142characters.</t>
143
144<t>The characters in URIs are frequently used for representing words
145of natural languages.  This usage has many advantages: Such URIs are
146easier to memorize, easier to interpret, easier to transcribe, easier
147to create, and easier to guess. For most languages other than English,
148however, the natural script uses characters other than A - Z. For many
149people, handling Latin characters is as difficult as handling the
150characters of other scripts is for those who use only the Latin
151alphabet. Many languages with non-Latin scripts are transcribed with
152Latin letters. These transcriptions are now often used in URIs, but
153they introduce additional difficulties.</t>
154
155<t>The infrastructure for the appropriate handling of characters from
156additional scripts is now widely deployed in operating system and
157application software. Software that can handle a wide variety of
158scripts and languages at the same time is increasingly common. Also,
159an increasing number of protocols and formats can carry a wide range of
160characters.</t>
161
162<t>URIs are used both as a protocol element (for transmission and
163processing by software) and also a presentation element (for display
164and handling by people who read, interpret, coin, or guess them). The
165transition between these roles is more difficult and complex when
166dealing with the larger set of characters than allowed for URIs in
167<xref target="RFC3986"/>. </t>
168
169<t>This document defines the protocol element called Internationalized
Resource Identifier (IRI), which allows applications of URIs to be
171extended to use resource identifiers that have a much wider repertoire
172of characters. It also provides corresponding "internationalized"
173versions of other constructs from <xref target="RFC3986"/>, such as
174URI references. The syntax of IRIs is defined in <xref
175target="syntax"/>.
176</t>
177
178<t>Using characters outside of A - Z in IRIs adds a number of
179difficulties. <xref target="Bidi"/> discusses the special case of
180bidirectional IRIs using characters from scripts written
181right-to-left.  <xref target="equivalence"/> discusses various forms
182of equivalence between IRIs. <xref target="IRIuse"/> discusses the use
183of IRIs in different situations.  <xref target="guidelines"/> gives
184additional informative guidelines.  <xref target="security"/>
185discusses IRI-specific security considerations.</t>
186</section> <!-- overview -->
187
188<section title="Applicability" anchor="Applicability">
189
190<t>IRIs are designed to allow protocols and software that deal with
191URIs to be updated to handle IRIs. A "URI scheme" (as defined by <xref
192target="RFC3986"/> and registered through the IANA process defined in
<xref target="RFC4395"/>) also serves as an "IRI scheme". Processing of
194IRIs is accomplished by extending the URI syntax while retaining (and
195not expanding) the set of "reserved" characters, such that the syntax
196for any URI scheme may be uniformly extended to allow non-ASCII
197characters. In addition, following parsing of an IRI, it is possible
198to construct a corresponding URI by first encoding characters outside
199of the allowed URI range and then reassembling the components.
200</t>
201
<t>Practical use of IRI forms in place of URI forms depends on the
203following conditions being met:</t>
204
205<t><list style="hanging">
206   
207<t hangText="a.">A protocol or format element MUST be explicitly designated to be
208  able to carry IRIs. The intent is to avoid introducing IRIs into
209  contexts that are not defined to accept them.  For example, XML
210  schema <xref target="XMLSchema"/> has an explicit type "anyURI" that
211  includes IRIs and IRI references. Therefore, IRIs and IRI references
212  can be in attributes and elements of type "anyURI".  On the other
213  hand, in the <xref target="RFC2616"/> definition of HTTP/1.1, the
214  Request URI is defined as a URI, which means that direct use of IRIs
215  is not allowed in HTTP requests.</t>
216
217<t hangText="b.">The protocol or format carrying the IRIs MUST have a
218  mechanism to represent the wide range of characters used in IRIs,
219  either natively or by some protocol- or format-specific escaping
220  mechanism (for example, numeric character references in <xref
221  target="XML1"/>).</t>
222
223<t hangText="c.">The URI scheme definition, if it explicitly allows a
224  percent sign ("%") in any syntactic component, SHOULD define the
225  interpretation of sequences of percent-encoded octets (using "%XX"
  hex octets) as octets from sequences of UTF-8 encoded strings; this
  is recommended in the guidelines for registering new schemes, <xref
  target="RFC4395"/>.  For example, this is the practice for IMAP URLs
  <xref target="RFC2192"/>, POP URLs <xref target="RFC2384"/>, and the
  URN syntax <xref target="RFC2141"/>. Note that use of
  percent-encoding may also be restricted in some situations; for
232  example, URI schemes that disallow percent-encoding might still be
233  used with a fragment identifier which is percent-encoded (e.g.,
234  <xref target="XPointer"/>). See <xref target="UTF8use"/> for further
235  discussion.</t>
236</list></t>
237
238</section> <!-- applicability -->
239
240<section title="Definitions" anchor="sec-Definitions">
241 
242<t>The following definitions are used in this document; they follow the
243terms in <xref target="RFC2130"/>, <xref target="RFC2277"/>, and
244<xref target="ISO10646"/>.</t>
245<t><list style="hanging">
246   
247<t hangText="character:">A member of a set of elements used for the
248    organization, control, or representation of data. For example,
249    "LATIN CAPITAL LETTER A" names a character.</t>
250   
251<t hangText="octet:">An ordered sequence of eight bits considered as a
252    unit.</t>
253   
254<t hangText="character repertoire:">A set of characters (set in the
255    mathematical sense).</t>
256   
257<t hangText="sequence of characters:">A sequence of characters (one
258    after another).</t>
259   
260<t hangText="sequence of octets:">A sequence of octets (one after
261    another).</t>
262   
263<t hangText="character encoding:">A method of representing a sequence
264    of characters as a sequence of octets (maybe with variants). Also,
265    a method of (unambiguously) converting a sequence of octets into a
266    sequence of characters.</t>
267   
268<t hangText="charset:">The name of a parameter or attribute used to
269    identify a character encoding.</t>
270   
271<t hangText="UCS:">Universal Character Set. The coded character set
272    defined by ISO/IEC 10646 <xref target="ISO10646"/> and the Unicode
273    Standard <xref target="UNIV4"/>.</t>
274   
275<t hangText="IRI reference:">Denotes the common usage of an
276    Internationalized Resource Identifier. An IRI reference may be
277    absolute or relative.  However, the "IRI" that results from such a
278    reference only includes absolute IRIs; any relative IRI references
279    are resolved to their absolute form.  Note that in <xref
280    target="RFC2396"/> URIs did not include fragment identifiers, but
281    in <xref target="RFC3986"/> fragment identifiers are part of
282    URIs.</t>
283   
284<t hangText="URL:">The term "URL" was originally used <xref
285   target="RFC1738"/> for roughly what is now called a "URI".  Books,
   software and documentation often refer to URIs and IRIs using the
   "URL" term. Some usages restrict "URL" to those URIs which are not
   URNs. Because of the ambiguity of the term, using the term "URL" is
289   NOT RECOMMENDED in formal documents.</t>
290
291<t hangText="LEIRI (Legacy Extended IRI) processing:">  This term was used in
292   various XML specifications to refer
293   to strings that, although not valid IRIs, were acceptable input to
294   the processing rules in <xref target="LEIRIspec" />.</t>
295
296<t hangText="(Web Address, Hypertext Reference, HREF):"> These terms have been
297   added in this document for convenience, to allow other
298   specifications to refer to those strings that, although not valid
299   IRIs, are acceptable input to the processing rules in <xref
300   target="webaddress"/>. This usage corresponds to the parsing rules
301   of some popular web browsing applications.
302   ISSUE: Need to find a good name/abbreviation for these.</t>
303   
304<t hangText="running text:">Human text (paragraphs, sentences,
305   phrases) with syntax according to orthographic conventions of a
306   natural language, as opposed to syntax defined for ease of
307   processing by machines (e.g., markup, programming languages).</t>
308   
309<t hangText="protocol element:">Any portion of a message that affects
310    processing of that message by the protocol in question.</t>
311   
312<t hangText="presentation element:">A presentation form corresponding
313    to a protocol element; for example, using a wider range of
314    characters.</t>
315   
316<t hangText="create (a URI or IRI):">With respect to URIs and IRIs,
317     the term is used for the initial creation. This may be the
318     initial creation of a resource with a certain identifier, or the
319     initial exposition of a resource under a particular
320     identifier.</t>
321   
322<t hangText="generate (a URI or IRI):">With respect to URIs and IRIs,
323     the term is used when the identifier is generated by derivation
324     from other information.</t>
325
326<t hangText="parsed URI component:">When a URI processor parses a URI
   (following the generic syntax or a scheme-specific syntax), the result
328   is a set of parsed URI components, each of which has a type
329   (corresponding to the syntactic definition) and a sequence of URI
330   characters.  </t>
331
332<t hangText="parsed IRI component:">When an IRI processor parses
333   an IRI directly, following the general syntax or a scheme-specific
334   syntax, the result is a set of parsed IRI components, each of
   which has a type (corresponding to the syntactic definition)
336   and a sequence of IRI characters. (This definition is analogous
337   to "parsed URI component".)</t>
338
339<t hangText="IRI scheme:">A URI scheme may also be known as
340   an "IRI scheme" if the scheme's syntax has been extended to
341   allow non-US-ASCII characters according to the rules in this
342   document.</t>
343
344</list></t>
345</section> <!-- definitions -->
346<section title="Notation" anchor="sec-Notation">
347     
348<t>RFCs and Internet Drafts currently do not allow any characters
349outside the US-ASCII repertoire. Therefore, this document uses various
350special notations to denote such characters in examples.</t>
351     
352<t>In text, characters outside US-ASCII are sometimes referenced by
353using a prefix of 'U+', followed by four to six hexadecimal
354digits.</t>
355
356<t>To represent characters outside US-ASCII in examples, this document
357uses two notations: 'XML Notation' and 'Bidi Notation'.</t>
358
359<t>XML Notation uses a leading '&amp;#x', a trailing ';', and the
360hexadecimal number of the character in the UCS in between. For
example, &amp;#x44F; stands for CYRILLIC SMALL LETTER YA. In this
362notation, an actual '&amp;' is denoted by '&amp;amp;'.</t>
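<figure>
<preamble>As an illustration only (not part of this specification),
XML Notation can be turned back into actual characters with a few
lines of Python. The sketch assumes that the standard-library function
html.unescape is an acceptable decoder; it also accepts decimal and
named references, which XML Notation itself does not use. The function
name decode_xml_notation is hypothetical.</preamble>
<artwork><![CDATA[
```python
import html


def decode_xml_notation(text: str) -> str:
    """Replace '&#x...;' references and '&amp;' by characters."""
    # html.unescape is a superset of what XML Notation needs.
    return html.unescape(text)


ya = decode_xml_notation("&#x44F;")
print(hex(ord(ya)))  # 0x44f
```
]]></artwork>
<postamble>The code point is printed in hexadecimal rather than as a
character so that the output stays within US-ASCII.</postamble>
</figure>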
363
364<t>Bidi Notation is used for bidirectional examples: Lower case
365letters stand for Latin letters or other letters that are written left
366to right, whereas upper case letters represent Arabic or Hebrew
367letters that are written right to left.</t>
368
369<t>To denote actual octets in examples (as opposed to percent-encoded
370octets), the two hex digits denoting the octet are enclosed in "&lt;"
371and "&gt;".  For example, the octet often denoted as 0xc9 is denoted
372here as &lt;c9&gt;.</t>
373
374<t> In this document, the key words "MUST", "MUST NOT", "REQUIRED",
375"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
376and "OPTIONAL" are to be interpreted as described in <xref
377target="RFC2119"/>.</t>
378
379</section> <!-- notation -->
380</section> <!-- introduction -->
381
382<section title="IRI Syntax" anchor="syntax">
383<t>This section defines the syntax of Internationalized Resource
384Identifiers (IRIs).</t>
385
386<t>As with URIs, an IRI is defined as a sequence of characters, not as
387a sequence of octets. This definition accommodates the fact that IRIs
388may be written on paper or read over the radio as well as stored or
389transmitted digitally.  The same IRI might be represented as different
390sequences of octets in different protocols or documents if these
391protocols or documents use different character encodings (and/or
392transfer encodings).  Using the same character encoding as the
393containing protocol or document ensures that the characters in the IRI
394can be handled (e.g., searched, converted, displayed) in the same way
395as the rest of the protocol or document.</t>
396
397<section title="Summary of IRI Syntax" anchor="summary">
398
399<t>IRIs are defined by extending the URI syntax in <xref
400target="RFC3986"/>, but extending the class of unreserved characters
401by adding the characters of the UCS (Universal Character Set, <xref
402target="ISO10646"/>) beyond U+007F, subject to the limitations given
403in the syntax rules below and in <xref target="limitations"/>.</t>
404
405<t>The syntax and use of components and reserved characters is the
406same as that in <xref target="RFC3986"/>. Each "URI scheme" thus also
407functions as an "IRI scheme", in that scheme-specific parsing rules
for URIs of a scheme are extended to allow parsing of IRIs using
409the same parsing rules.</t>
410
411<t>All the operations defined in <xref target="RFC3986"/>, such as the
412resolution of relative references, can be applied to IRIs by
413IRI-processing software in exactly the same way as they are for URIs
414by URI-processing software.</t>
415
416<t>Characters outside the US-ASCII repertoire MUST NOT be reserved and
417therefore MUST NOT be used for syntactical purposes, such as to
418delimit components in newly defined schemes. For example, U+00A2, CENT
419SIGN, is not allowed as a delimiter in IRIs, because it is in the
420'iunreserved' category. This is similar to the fact that it is not
421possible to use '-' as a delimiter in URIs, because it is in the
422'unreserved' category.</t>
423
424</section> <!-- summary -->
425<section title="ABNF for IRI References and IRIs" anchor="abnf">
426
427<t>An ABNF definition for IRI references (which are the most general
428concept and the start of the grammar) and IRIs is given here. The
429syntax of this ABNF is described in <xref target="STD68"/>. Character
430numbers are taken from the UCS, without implying any actual binary
431encoding. Terminals in the ABNF are characters, not octets.</t>
432
433<t>The following grammar closely follows the URI grammar in <xref
434target="RFC3986"/>, except that the range of unreserved characters is
435expanded to include UCS characters, with the restriction that private
436UCS characters can occur only in query parts. The grammar is split
437into two parts: Rules that differ from <xref target="RFC3986"/>
438because of the above-mentioned expansion, and rules that are the same
439as those in <xref target="RFC3986"/>. For rules that are different
440than those in <xref target="RFC3986"/>, the names of the non-terminals
441have been changed as follows. If the non-terminal contains 'URI', this
442has been changed to 'IRI'. Otherwise, an 'i' has been prefixed.</t>
443
444<!--
445for line length measuring in artwork (max 72 chars, three chars at start):
446      1         2         3         4         5         6         7
447456789012345678901234567890123456789012345678901234567890123456789012
448-->
449<figure>
450<preamble>The following rules are different from those in <xref target="RFC3986"/>:</preamble>
451<artwork>
452IRI            = scheme ":" ihier-part [ "?" iquery ]
453                 [ "#" ifragment ]
454
455ihier-part     = "//" iauthority ipath-abempty
456               / ipath-absolute
457               / ipath-rootless
458               / ipath-empty
459
460IRI-reference  = IRI / irelative-ref
461
462absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]
463
464irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
465
466irelative-part = "//" iauthority ipath-abempty
467               / ipath-absolute
468               / ipath-noscheme
469               / ipath-empty
470
471iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
472iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
473ihost          = IP-literal / IPv4address / ireg-name
474
475pct-form       = pct-encoded
476
477ireg-name      = *( iunreserved / sub-delims )
478
479ipath          = ipath-abempty   ; begins with "/" or is empty
480               / ipath-absolute  ; begins with "/" but not "//"
481               / ipath-noscheme  ; begins with a non-colon segment
482               / ipath-rootless  ; begins with a segment
483               / ipath-empty     ; zero characters
484
485ipath-abempty  = *( path-sep isegment )
486ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
487ipath-noscheme = isegment-nz-nc *( path-sep isegment )
488ipath-rootless = isegment-nz *( path-sep isegment )
489ipath-empty    = 0&lt;ipchar&gt;
490path-sep       = "/"
491
492isegment       = *ipchar
493isegment-nz    = 1*ipchar
494isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
495                     / "@" )
496               ; non-zero-length segment without any colon ":"                     
497
498ipchar         = iunreserved / pct-form / sub-delims / ":"
499               / "@"
500 
501iquery         = *( ipchar / iprivate / "/" / "?" )
502
503ifragment      = *( ipchar / "/" / "?" / "#" )
504
505iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
506
507ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
508               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
509               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
510               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
511               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
512               / %xD0000-DFFFD / %xE1000-EFFFD
513
514iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
515               / %x100000-10FFFD
516</artwork>
517</figure>
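<figure>
<preamble>The 'ucschar' and 'iprivate' ranges above can be checked
mechanically. The following Python sketch is illustrative only; the
names UCSCHAR, IPRIVATE, is_ucschar, and is_iprivate are hypothetical
and simply transcribe the two productions as tables of inclusive code
point ranges.</preamble>
<artwork><![CDATA[
```python
# Inclusive code point ranges copied from the 'ucschar' production.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
    + [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# Inclusive code point ranges copied from the 'iprivate' production.
IPRIVATE = [
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF),
    (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),
]


def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_ucschar(ch: str) -> bool:
    """True if ch matches 'ucschar' (allowed in 'iunreserved')."""
    return _in_ranges(ch, UCSCHAR)


def is_iprivate(ch: str) -> bool:
    """True if ch matches 'iprivate' (allowed only in 'iquery')."""
    return _in_ranges(ch, IPRIVATE)


print(is_ucschar("\u00e9"), is_iprivate("\u00e9"))  # True False
print(is_ucschar("\ue000"), is_iprivate("\ue000"))  # False True
```
]]></artwork>
</figure>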
518
519<t>Some productions are ambiguous. The "first-match-wins" (a.k.a. "greedy")
520algorithm applies. For details, see <xref target="RFC3986"/>.</t>
521
522<figure>
523<preamble>The following rules are the same as those in <xref target="RFC3986"/>:</preamble>
524<artwork>
525scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
526 
527port           = *DIGIT
528 
529IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"
530 
531IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
532 
533IPv6address    =                            6( h16 ":" ) ls32
534               /                       "::" 5( h16 ":" ) ls32
535               / [               h16 ] "::" 4( h16 ":" ) ls32
536               / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
537               / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
538               / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
539               / [ *4( h16 ":" ) h16 ] "::"              ls32
540               / [ *5( h16 ":" ) h16 ] "::"              h16
541               / [ *6( h16 ":" ) h16 ] "::"
542               
543h16            = 1*4HEXDIG
544ls32           = ( h16 ":" h16 ) / IPv4address
545
546IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet
547
548dec-octet      = DIGIT                 ; 0-9
549               / %x31-39 DIGIT         ; 10-99
550               / "1" 2DIGIT            ; 100-199
551               / "2" %x30-34 DIGIT     ; 200-249
552               / "25" %x30-35          ; 250-255
553           
554pct-encoded    = "%" HEXDIG HEXDIG
555
556unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
557reserved       = gen-delims / sub-delims
558gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
559sub-delims     = "!" / "$" / "&amp;" / "'" / "(" / ")"
560               / "*" / "+" / "," / ";" / "="
561</artwork></figure>
562
563<t>This syntax does not support IPv6 scoped addressing zone identifiers.</t>
564
565</section> <!-- abnf -->
566
567</section> <!-- syntax -->
568
569<section title="Processing IRIs and related protocol elements" anchor="processing">
570
571<t>IRIs are meant to replace URIs in identifying resources within new
572versions of protocols, formats, and software components that use a
573UCS-based character repertoire.  Protocols and components may use and
574process IRIs directly. However, there are still numerous systems and
575protocols which only accept URIs or components of parsed URIs; that is,
576they only accept sequences of characters within the subset of US-ASCII
577characters allowed in URIs. </t>
578
579<t>This section defines specific processing steps for IRI consumers
580which establish the relationship between the string given and the
581interpreted derivatives. These
582processing steps apply to both IRIs and IRI references (i.e., absolute
583or relative forms); for IRIs, some steps are scheme specific. </t>
584
585<section title="Converting to UCS" anchor="ucsconv"> 
586 
587<t>Input that is already in a Unicode form (i.e., a sequence of Unicode
588 characters or an octet-stream representing a Unicode-based character
589 encoding such as UTF-8 or UTF-16) should be left as is and not
 normalized (see <xref target="normalization"/>).</t>
591 
592<t>If the IRI or IRI reference is an octet stream in some known
593 non-Unicode character encoding, convert the IRI to a sequence of
594 characters from the UCS; this sequence SHOULD also be normalized
595 according to Unicode Normalization Form C (NFC, <xref
596 target="UTR15"/>). In this case, retain the original character
597 encoding as the "document character encoding". (DESIGN QUESTION:
598 NOT WHAT MOST IMPLEMENTATIONS DO, CHANGE? ) </t>
599
600<t> In other cases (written on paper, read aloud, or otherwise
601 represented independent of any character encoding) represent the IRI
602 as a sequence of characters from the UCS normalized according to
603 Unicode Normalization Form C (NFC, <xref target="UTR15"/>).</t>
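<figure>
<preamble>For input in a known non-Unicode character encoding, the
conversion described above might be sketched in Python as follows.
This is an illustration under assumptions, not a normative algorithm:
the function name legacy_octets_to_ucs is hypothetical, and the
encoding label is assumed to be one that Python's codec registry
recognizes.</preamble>
<artwork><![CDATA[
```python
import unicodedata


def legacy_octets_to_ucs(octets: bytes, encoding: str) -> str:
    """Decode octets from a non-Unicode encoding, then apply NFC."""
    chars = octets.decode(encoding)
    # Normalization is applied here because the input was NOT
    # already in a Unicode form; Unicode input is left as is.
    return unicodedata.normalize("NFC", chars)


iri = legacy_octets_to_ucs(b"r\xe9sum\xe9", "iso-8859-1")
print(iri == "r\u00e9sum\u00e9")  # True
```
]]></artwork>
</figure>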
604</section> <!-- ucsconv -->
605
606<section title="Parse the IRI into IRI components">
607
608<t>Parse the IRI, either as a relative reference (no scheme)
or using scheme-specific processing (according to the scheme
given), resulting in a set of parsed IRI components.
611(NOTE: FIX BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES
612THAT USE GENERIC SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN
613ONLY USE AUTHORITY FOR NAMES THAT FOLLOW PUNICODE.)
614 </t>
615
<t>NOTE: The result of parsing into components will be a
correspondence of substrings of the IRI according to the part
618matched.  For example, in <xref target="HTML5"/>, the protocol
619components of interest are SCHEME (scheme), HOST (ireg-name), PORT
620(port), the PATH (ipath after the initial "/"), QUERY (iquery),
621FRAGMENT (ifragment), and AUTHORITY (iauthority).
622</t>
623
624<t>Subsequent processing rules are sometimes used to define other
625syntactic components. For example, <xref target="HTML5"/> defines APIs
626for IRI processing; in these APIs:
627
628<list style="hanging">
629<t hangText="HOSTSPECIFIC"> the substring that follows
630the substring matched by the iauthority production, or the whole
631string if the iauthority production wasn't matched.</t>
632<t hangText="HOSTPORT"> if there is a scheme component and a port
633component and the port given by the port component is different than
634the default port defined for the protocol given by the scheme
635component, then HOSTPORT is the substring that starts with the
636substring matched by the host production and ends with the substring
637matched by the port production, and includes the colon in between the
638two. Otherwise, it is the same as the host component.
639</t>
640</list>
641</t>
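<figure>
<preamble>As a rough illustration of splitting an IRI into the
components named above, the following Python sketch uses the
standard-library function urllib.parse.urlsplit. This is an assumption
made for brevity: urlsplit follows generic URI-style splitting and is
not an implementation of the IRI grammar in this document. The
function name split_iri and the dictionary keys are
hypothetical.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import urlsplit


def split_iri(iri: str) -> dict:
    """Split an IRI into generic components (illustrative only)."""
    parts = urlsplit(iri)
    return {
        "SCHEME": parts.scheme,
        "AUTHORITY": parts.netloc,
        "HOST": parts.hostname,   # lowercased by urlsplit
        "PORT": parts.port,       # int, or None if absent
        "PATH": parts.path,       # still includes the initial "/"
        "QUERY": parts.query,
        "FRAGMENT": parts.fragment,
    }


components = split_iri(
    "http://r\u00e9sum\u00e9.example.org:8080/p\u00e4th?q=1#frag"
)
print(components["SCHEME"], components["PORT"])  # http 8080
```
]]></artwork>
</figure>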
642</section> <!-- parse -->
643
644<section title="General percent-encoding of IRI components" anchor="compmapping">
645   
646<t>For most IRI components, it is possible to map the IRI component
647to an equivalent URI component by percent-encoding those characters
648not allowed in URIs. Previous processing steps will have removed
649some characters, and the interpretation of reserved characters will
650have already been done (with the syntactic reserved characters outside
651of the IRI component). This mapping is defined for all sequences
652of Unicode characters, whether or not they are valid for the component
653in question. </t>
654   
655<t>For each character which is not allowed in a valid URI (NOTE: WHAT
656IS THE RIGHT REFERENCE HERE), apply the following steps. </t>
657
658<t><list style="hanging">
659
660<t hangText="Convert to UTF-8">Convert the character to a sequence of
661  one or more octets using UTF-8 <xref target="RFC3629"/>.</t>
662
663<t hangText="Percent encode">Convert each octet of this sequence to %HH,
664   where HH is the hexadecimal notation of the octet value. The
665   hexadecimal notation SHOULD use uppercase letters. (This is the
666   general URI percent-encoding mechanism in Section 2.1 of <xref
667   target="RFC3986"/>.)</t>
668   
669</list></t>
670
671<t>Note that the mapping is an identity transformation for parsed URI
672components of valid URIs, and is idempotent: applying the mapping a
673second time will not change anything.</t>
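
<figure><preamble>The two steps above can be sketched, non-normatively,
in Python. The set URI_CHARS (unreserved and reserved characters of
RFC 3986 plus "%") is an assumption standing in for "characters
allowed in a valid URI", which this section leaves as an open
reference; the function name map_component is likewise
illustrative.</preamble>
<artwork><![CDATA[
```python
# Assumed set of characters allowed in a URI: unreserved,
# reserved (gen-delims and sub-delims), and "%".
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~:/?#[]@!$&'()*+,;=%"
)


def map_component(component: str) -> str:
    """Percent-encode every character not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)          # already allowed: keep as is
        else:
            # Convert to UTF-8, then write each octet as %HH
            # with uppercase hexadecimal digits.
            out.extend(f"%{o:02X}" for o in ch.encode("utf-8"))
    return "".join(out)
```
]]></artwork></figure>

<t>Under these assumptions, map_component applied to "ros&amp;#xE9;"
yields "ros%C3%A9", and applying it again to that result leaves it
unchanged.</t>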
674</section> <!-- general conversion -->
675
676<section title="Mapping ireg-name" anchor="dnsmapping">
677
<t>Schemes that allow non-ASCII characters
in the reg-name (ireg-name) position MUST convert the ireg-name
component of an IRI as follows:</t>
681
682<t>Replace the ireg-name part of the IRI by the part converted using
683the ToASCII operation specified in Section 4.1 of <xref
684target="RFC3490"/> on each dot-separated label, and by using U+002E
685(FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set
686to FALSE, and with the flag AllowUnassigned set to FALSE.
The ToASCII operation may fail, in which case the IRI cannot be
resolved. If the domain name conversion fails, then the
entire IRI conversion fails. Processors that have no mechanism for
signalling a failure MAY instead substitute an otherwise
invalid host name, although such processing SHOULD be avoided.
693 </t>
694
<t>For example, the IRI
<vspace/>"http://r&amp;#xE9;sum&amp;#xE9;.example.org"<vspace/> MAY be
converted to <vspace/>"http://xn--rsum-bpad.example.org"<vspace/>;
conversion to percent-encoded form, e.g.,
 <vspace/>"http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed. </t>
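
<figure><preamble>A non-normative Python sketch of the ireg-name
conversion follows. It relies on the "idna" codec of the Python
standard library, which applies the ToASCII operation of RFC 3490 to
each dot-separated label; whether that codec matches the
UseSTD3ASCIIRules and AllowUnassigned settings required above in
every detail is an assumption that implementers would need to verify.
The function name map_ireg_name is illustrative.</preamble>
<artwork><![CDATA[
```python
def map_ireg_name(ireg_name: str) -> str:
    """Convert an ireg-name to its ASCII form, label by label.

    Raises UnicodeError when ToASCII fails for any label, in
    which case the whole IRI conversion is treated as failed.
    """
    return ireg_name.encode("idna").decode("ascii")
```
]]></artwork></figure>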
700
701<t><list style="hanging"> 
702
703<t hangText="Note:">Domain Names may appear in parts of an IRI other
704than the ireg-name part.  It is the responsibility of scheme-specific
705implementations (if the Internationalized Domain Name is part of the
706scheme syntax) or of server-side implementations (if the
707Internationalized Domain Name is part of 'iquery') to apply the
708necessary conversions at the appropriate point. Example: Trying to
709validate the Web page at<vspace/>
710http://r&amp;#xE9;sum&amp;#xE9;.example.org would lead to an IRI of
711<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr&amp;#xE9;sum&amp;#xE9;.<vspace/>example.org,
712which would convert to a URI
713of<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.<vspace/>example.org.
714The server-side implementation is responsible for making the
715necessary conversions to be able to retrieve the Web page.</t>
716
717<t hangText="Note:">In this process, characters allowed in URI
718references and existing percent-encoded sequences are not encoded further.
719(This mapping is similar to, but different from, the encoding applied
720when arbitrary content is included in some part of a URI.)
721
722For example, an IRI of
723<vspace/>"http://www.example.org/red%09ros&amp;#xE9;#red"
724(in XML notation) is converted to
725<vspace/>"http://www.example.org/red%09ros%C3%A9#red", not to
726something like
727<vspace/>"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
728((DESIGN QUESTION: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get converted to punycode, or not?))
729
730</t>
731
732</list></t>
733</section> <!-- dnsmapping -->
734
735<section title="Mapping query components" anchor="querymapping">
736
737<t>((NOTE: SEE ISSUES LIST))
738
For compatibility with existing deployed HTTP infrastructure,
the following special case applies for schemes "http" and "https"
and IRIs whose origin has a document charset other than one that
is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query"
component of an IRI is mapped into a URI by using the document
charset rather than UTF-8 as the binary representation before
percent-encoding. This mapping is not applied for any other scheme
or component.</t>
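
<figure><preamble>The special case above might look as follows in
Python. This is a non-normative sketch: the function name map_query,
the URI_CHARS set, and the UCS_BASED list of charset names are
assumptions made for the example, and characters that cannot be
represented in the document charset are not handled (the encoding
step simply raises an error).</preamble>
<artwork><![CDATA[
```python
import codecs

# Assumed set of characters allowed in a URI.
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~:/?#[]@!$&'()*+,;=%"
)
# Assumed list of UCS-based charsets, by Python codec name.
UCS_BASED = {"utf-8", "utf-16", "utf-16-le", "utf-16-be",
             "utf-32", "utf-32-le", "utf-32-be"}


def map_query(query: str, scheme: str, document_charset: str) -> str:
    """Percent-encode a query component, using the document
    charset only for http/https and non-UCS-based charsets."""
    charset = codecs.lookup(document_charset).name
    legacy = (scheme.lower() in ("http", "https")
              and charset not in UCS_BASED)
    encoding = charset if legacy else "utf-8"
    out = []
    for ch in query:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            out.extend(f"%{o:02X}" for o in ch.encode(encoding))
    return "".join(out)
```
]]></artwork></figure>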
747
748</section> <!-- querymapping -->
749
750<section title="Mapping IRIs to URIs" anchor="mapping">
751
<t>The canonical mapping from an IRI to a URI is defined by applying the
mappings above (from IRI components to URI components) and then
reassembling a URI from the parsed URI components using the original
punctuation that delimited the IRI components. </t>
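
<figure><preamble>The overall mapping can be sketched, non-normatively,
as "split, map each component, reassemble". The sketch below assumes
the generic regular expression from Appendix B of RFC 3986 for
splitting, the "idna" codec of the Python standard library for the
ireg-name, and UTF-8 percent-encoding for everything else; the special
query rule for legacy document charsets is deliberately left out. All
function and variable names are illustrative.</preamble>
<artwork><![CDATA[
```python
import re

# Assumed set of characters allowed in a URI.
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~:/?#[]@!$&'()*+,;=%"
)
# Generic split, after the expression in RFC 3986, Appendix B.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$",
    re.DOTALL,
)


def _pct(text: str) -> str:
    """Percent-encode characters not allowed in a URI (UTF-8)."""
    return "".join(
        ch if ch in URI_CHARS
        else "".join(f"%{o:02X}" for o in ch.encode("utf-8"))
        for ch in text
    )


def _map_authority(authority: str) -> str:
    userinfo, at, hostport = authority.rpartition("@")
    if hostport.startswith("["):
        mapped = hostport              # IP literal: left untouched
    else:
        host, colon, port = hostport.partition(":")
        mapped = host.encode("idna").decode("ascii") + colon + port
    return _pct(userinfo) + at + mapped


def iri_to_uri(iri: str) -> str:
    """Map each component, keeping the original delimiters."""
    m = URI_SPLIT.match(iri)
    out = [m.group(1) or ""]           # scheme and ":" unchanged
    if m.group(3) is not None:
        out.append("//" + _map_authority(m.group(4)))
    out.append(_pct(m.group(5)))
    if m.group(6) is not None:
        out.append("?" + _pct(m.group(7)))
    if m.group(8) is not None:
        out.append("#" + _pct(m.group(9)))
    return "".join(out)
```
]]></artwork></figure>

<t>Reassembling from the matched groups, rather than from a parsed
structure, keeps delimiters such as an empty "?" that would otherwise
be lost.</t>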
756
757</section> <!-- mapping -->
758
759<section title="Converting URIs to IRIs" anchor="URItoIRI">
760
<t>In some situations, for presentation and further processing,
it is desirable to convert a URI into an equivalent IRI in which
natural characters are represented directly rather than
percent-encoded. Of course, every URI is already an IRI in
its own right without any conversion.
This section gives one procedure for this conversion.
767</t>
768
769<t>
770The conversion described in this section, if given a valid URI, will
771result in an IRI that maps back to the URI used as an input for the
772conversion (except for potential case differences in percent-encoding
773and for potential percent-encoded unreserved characters).
774
775However, the IRI resulting from this conversion may differ
776from the original IRI (if there ever was one).</t> 
777
778<t>URI-to-IRI conversion removes percent-encodings, but not all
779percent-encodings can be eliminated. There are several reasons for
780this:</t>
781
782<t><list style="hanging">
783
784<t hangText="1.">Some percent-encodings are necessary to distinguish
785    percent-encoded and unencoded uses of reserved characters.</t>
786
787<t hangText="2.">Some percent-encodings cannot be interpreted as sequences
788    of UTF-8 octets.<vspace blankLines="1"/>
789    (Note: The octet patterns of UTF-8 are highly regular.
790    Therefore, there is a very high probability, but no guarantee,
791    that percent-encodings that can be interpreted as sequences of UTF-8
792    octets actually originated from UTF-8. For a detailed discussion,
793    see <xref target="Duerst97"/>.)</t>
794
795<t hangText="3.">The conversion may result in a character that is not
796    appropriate in an IRI. See <xref target="abnf"/>, <xref target="visual"/>,
797      and <xref target="limitations"/> for further details.</t>
798
799<t hangText="4.">IRI to URI conversion has different rules for
800    dealing with domain names and query parameters.</t>
801
802</list></t>
803
804<t>Conversion from a URI to an IRI MAY be done by using the following
805steps:
806
807<list style="hanging">
808<t hangText="1.">Represent the URI as a sequence of octets in
809       US-ASCII.</t>
810
811<t hangText="2.">Convert all percent-encodings ("%" followed by two
812      hexadecimal digits) to the corresponding octets, except those
813      corresponding to "%", characters in "reserved", and characters
814      in US-ASCII not allowed in URIs.</t> 
815
816<t hangText="3.">Re-percent-encode any octet produced in step 2 that
817      is not part of a strictly legal UTF-8 octet sequence.</t>
818
819
820<t hangText="4.">Re-percent-encode all octets produced in step 3 that
821      in UTF-8 represent characters that are not appropriate according
822      to <xref target="abnf"/>, <xref target="visual"/>, and <xref
823      target="limitations"/>.</t> 
824
825<t hangText="5.">Interpret the resulting octet sequence as a sequence
826      of characters encoded in UTF-8.</t>
827
828<t hangText="6.">URIs known to contain domain names in the reg-name
829      component SHOULD convert punycode-encoded domain name labels to
830      the corresponding characters using the ToUnicode procedure. </t>
831</list></t>
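
<figure><preamble>Steps 1 through 5 above can be sketched in Python as
shown below. This is a non-normative illustration under several
assumptions: FORBIDDEN is only a small illustrative subset (the
bidirectional formatting characters) of the characters that step 4
would have to consider, the names uri_to_iri and "iri-repercent" are
invented for the example, and step 6 (ToUnicode on the reg-name) is
not covered. Steps 3 and 5 are carried out together by a strict UTF-8
decoder whose error handler re-percent-encodes illegal octets, and
step 4 is then applied to the decoded characters, which gives the
same result as the order listed above.</preamble>
<artwork><![CDATA[
```python
import codecs
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
# Illustrative subset only: LRM, RLM, LRE, RLE, PDF, LRO, RLO.
FORBIDDEN = set("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def _repercent(exc):
    """Step 3: re-percent-encode octets that are not part of a
    strictly legal UTF-8 sequence."""
    bad = exc.object[exc.start:exc.end]
    return "".join(f"%{o:02X}" for o in bad), exc.end


codecs.register_error("iri-repercent", _repercent)


def _decode_escape(match):
    """Step 2: turn %HH into an octet, except for "%", reserved
    characters, and ASCII characters not allowed in URIs."""
    octet = int(match.group(1), 16)
    if octet < 0x80 and chr(octet) not in UNRESERVED:
        return match.group(0)          # keep the escape as written
    return bytes([octet])


def uri_to_iri(uri: str) -> str:
    octets = uri.encode("ascii")                          # step 1
    octets = re.sub(rb"%([0-9A-Fa-f]{2})",
                    _decode_escape, octets)               # step 2
    text = octets.decode("utf-8", "iri-repercent")        # steps 3, 5
    return "".join(                                       # step 4
        "".join(f"%{o:02X}" for o in ch.encode("utf-8"))
        if ch in FORBIDDEN else ch
        for ch in text
    )
```
]]></artwork></figure>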
832
833<t>This procedure will convert as many percent-encoded characters as
834possible to characters in an IRI. Because there are some choices when
835step 4 is applied (see <xref target="limitations"/>), results may
836vary.</t>
837
838<t>Conversions from URIs to IRIs MUST NOT use any character
839encoding other than UTF-8 in steps 3 and 4, even if it might be
840possible to guess from the context that another character encoding
841than UTF-8 was used in the URI.  For example, the URI
842"http://www.example.org/r%E9sum%E9.html" might with some guessing be
843interpreted to contain two e-acute characters encoded as
844iso-8859-1. It must not be converted to an IRI containing these
845e-acute characters. Otherwise, in the future the IRI will be mapped to
846"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
847URI from "http://www.example.org/r%E9sum%E9.html".</t>
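
<figure><preamble>The problem described above can be reproduced with a
short, non-normative Python sketch. It deliberately does what this
section forbids: it guesses a legacy encoding when removing the
percent-encodings, and then maps the resulting IRI back to a URI with
UTF-8. The function name remap_after_guess is illustrative; unquote
and quote are the functions of the Python standard library.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import quote, unquote


def remap_after_guess(uri: str, guessed_encoding: str) -> str:
    """Decode percent-encodings with a guessed encoding (the
    forbidden step), then re-encode the result using UTF-8."""
    guessed_iri = unquote(uri, encoding=guessed_encoding)
    return quote(guessed_iri, safe=":/")   # quote() uses UTF-8
```
]]></artwork></figure>

<t>Applied to "http://www.example.org/r%E9sum%E9.html" with a guess
of iso-8859-1, the sketch returns
"http://www.example.org/r%C3%A9sum%C3%A9.html", which no longer
identifies the same resource as the input.</t>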
848
849<section title="Examples">
850
851<t>This section shows various examples of converting URIs to IRIs.
852Each example shows the result after each of the steps 1 through 6 is
853applied. XML Notation is used for the final result.  Octets are
854denoted by "&lt;" followed by two hexadecimal digits followed by
855"&gt;".</t>
856
857<t>The following example contains the sequence "%C3%BC", which is a
858strictly legal UTF-8 sequence, and which is converted into the actual
859character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
860u-umlaut).
861
862<list style="hanging">
863<t hangText="1.">http://www.example.org/D%C3%BCrst</t>
864<t hangText="2.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
865<t hangText="3.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
866<t hangText="4.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
867<t hangText="5.">http://www.example.org/D&amp;#xFC;rst</t>
868<t hangText="6.">http://www.example.org/D&amp;#xFC;rst</t>
869</list>
870</t>
871
872<t>The following example contains the sequence "%FC", which might
873represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in
874the<vspace/>iso-8859-1 character encoding.  (It might represent other
875characters in other character encodings. For example, the octet
876&lt;fc&gt; in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER
877KJE.)  Because &lt;fc&gt; is not part of a strictly legal UTF-8
878sequence, it is re-percent-encoded in step 3.
879
880
881<list style="hanging">
882<t hangText="1.">http://www.example.org/D%FCrst</t>
883<t hangText="2.">http://www.example.org/D&lt;fc&gt;rst</t>
884<t hangText="3.">http://www.example.org/D%FCrst</t>
885<t hangText="4.">http://www.example.org/D%FCrst</t>
886<t hangText="5.">http://www.example.org/D%FCrst</t>
887<t hangText="6.">http://www.example.org/D%FCrst</t>
888</list>
889</t>
890
891<t>The following example contains "%e2%80%ae", which is the percent-encoded<vspace/>UTF-8
892character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. <xref target="visual"/>
893forbids the direct use of this character in an IRI. Therefore, the
894corresponding octets are re-percent-encoded in step 4. This example shows
895that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved.
The example also contains a punycode-encoded domain name label (xn--99zt52a),
which is converted to the corresponding characters in step 6.

<list style="hanging">
<t hangText="1.">http://xn--99zt52a.example.org/%e2%80%ae</t>
<t hangText="2.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
<t hangText="3.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
<t hangText="4.">http://xn--99zt52a.example.org/%E2%80%AE</t>
<t hangText="5.">http://xn--99zt52a.example.org/%E2%80%AE</t>
<t hangText="6.">http://&amp;#x7D0D;&amp;#x8C46;.example.org/%E2%80%AE</t>
</list></t>

<t>Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
(Japanese Natto).</t>
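
<figure><preamble>Step 6 of the conversion, as applied to the host in
the last example, can be sketched non-normatively with the "idna"
codec of the Python standard library, which applies the ToUnicode
operation of RFC 3490 to each label. The function name
host_to_unicode is illustrative, and extracting the host from the
URI is assumed to have been done already.</preamble>
<artwork><![CDATA[
```python
def host_to_unicode(host: str) -> str:
    """Convert punycode-encoded labels of a host name to the
    corresponding characters; other labels are left as they are."""
    return host.encode("ascii").decode("idna")
```
]]></artwork></figure>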
910
911</section> <!-- examples -->
912</section> <!-- URItoIRI -->
913</section> <!-- processing -->
914<section title="Bidirectional IRIs for Right-to-Left Languages" anchor="Bidi">
915
916<t>Some UCS characters, such as those used in the Arabic and Hebrew
917scripts, have an inherent right-to-left (rtl) writing direction. IRIs
918containing these characters (called bidirectional IRIs or Bidi IRIs)
919require additional attention because of the non-trivial relation
920between logical representation (used for digital representation and
921for reading/spelling) and visual representation (used for
922display/printing).</t>
923
924<t>Because of the complex interaction between the logical representation,
925the visual representation, and the syntax of a Bidi IRI, a balance is
926needed between various requirements.
927The main requirements are<list style="hanging">
928<t hangText="1.">user-predictable conversion between visual and
929    logical representation;</t>
930<t hangText="2.">the ability to include a wide range of characters
931    in various parts of the IRI; and</t>
932<t hangText="3.">minor or no changes or restrictions for
933      implementations.</t>
934</list></t>
935
936<section title="Logical Storage and Visual Presentation" anchor="visual">
937
938<t>When stored or transmitted in digital representation, bidirectional
939IRIs MUST be in full logical order and MUST conform to the IRI syntax
940rules (which includes the rules relevant to their scheme). This
941ensures that bidirectional IRIs can be processed in the same way as
other IRIs.</t>

<t>Bidirectional IRIs MUST be rendered by using the
943Unicode Bidirectional Algorithm <xref target="UNIV4"/>, <xref
944target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
945as they would be if they were in a left-to-right embedding; i.e., as
946if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
947followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
948embedding direction can also be done in a higher-level protocol (e.g.,
949the dir='ltr' attribute in HTML).</t> 
950
951<t>There is no requirement to use the above embedding if the display
952is still the same without the embedding. For example, a bidirectional
953IRI in a text with left-to-right base directionality (such as used for
954English or Cyrillic) that is preceded and followed by whitespace and
955strong left-to-right characters does not need an embedding.  Also, a
956bidirectional relative IRI reference that only contains strong
957right-to-left characters and weak characters and that starts and ends
958with a strong right-to-left character and appears in a text with
959right-to-left base directionality (such as used for Arabic or Hebrew)
960and is preceded and followed by whitespace and strong characters does
961not need an embedding.</t>
962
963<t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
964sufficient to force the correct display behavior.  However, the
965details of the Unicode Bidirectional algorithm are not always easy to
966understand. Implementers are strongly advised to err on the side of
967caution and to use embedding in all cases where they are not
968completely sure that the display behavior is unaffected without the
969embedding.</t>
970
971<t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
9724.3) permits higher-level protocols to influence bidirectional
973rendering. Such changes by higher-level protocols MUST NOT be used if
974they change the rendering of IRIs.</t> 
975
976<t>The bidirectional formatting characters that may be used before or
977after the IRI to ensure correct display are not themselves part of the
978IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
979RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
980the IRI but do not appear themselves. It would therefore not be
981possible to input an IRI with such characters correctly.</t>
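
<figure><preamble>The two rules of this section that lend themselves
to a mechanical check, namely that an IRI does not itself contain
bidirectional formatting characters and that it is displayed as if
wrapped in LRE and PDF, can be sketched non-normatively in Python.
The function names are illustrative, and the actual rendering is
assumed to be left to a display engine that implements the Unicode
Bidirectional Algorithm.</preamble>
<artwork><![CDATA[
```python
LRE = "\u202a"   # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"   # POP DIRECTIONAL FORMATTING

BIDI_FORMATTING = {
    "\u200e": "LRM", "\u200f": "RLM", "\u202a": "LRE",
    "\u202b": "RLE", "\u202d": "LRO", "\u202e": "RLO",
    "\u202c": "PDF",
}


def find_bidi_formatting(iri: str) -> list:
    """Return the names of formatting characters found in iri."""
    return [BIDI_FORMATTING[ch] for ch in iri
            if ch in BIDI_FORMATTING]


def embed_for_display(iri: str) -> str:
    """Wrap iri in a left-to-right embedding for display. The
    added characters are not part of the IRI itself."""
    found = find_bidi_formatting(iri)
    if found:
        raise ValueError(f"IRI contains formatting characters: {found}")
    return LRE + iri + PDF
```
]]></artwork></figure>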
982
983</section> <!-- visual -->
984<section title="Bidi IRI Structure" anchor="bidi-structure">
985
986<t>The Unicode Bidirectional Algorithm is designed mainly for running
987text.  To make sure that it does not affect the rendering of
988bidirectional IRIs too much, some restrictions on bidirectional IRIs
989are necessary. These restrictions are given in terms of delimiters
990(structural characters, mostly punctuation such as "@", ".", ":",
991and<vspace/>"/") and components (usually consisting mostly of letters
992and digits).</t>
993
<t>The following syntax rules from <xref target="abnf"/> correspond to
components for the purpose of Bidi behavior: iuserinfo, ireg-name,
isegment, isegment-nz, isegment-nz-nc, iquery, and
ifragment.</t>
998
999<t>Specifications that define the syntax of any of the above
1000components MAY divide them further and define smaller parts to be
1001components according to this document. As an example, the restrictions
1002of <xref target="RFC3490"/> on bidirectional domain names correspond
1003to treating each label of a domain name as a component for schemes
1004with ireg-name as a domain name.  Even where the components are not
1005defined formally, it may be helpful to think about some syntax in
1006terms of components and to apply the relevant restrictions.  For
1007example, for the usual name/value syntax in query parts, it is
1008convenient to treat each name and each value as a component. As
1009another example, the extensions in a resource name can be treated as
1010separate components.</t>
1011
1012<t>For each component, the following restrictions apply:</t>
1013<t>
1014<list style="hanging">
1015
1016<t hangText="1.">A component SHOULD NOT use both right-to-left and
1017  left-to-right characters.</t>
1018
1019<t hangText="2.">A component using right-to-left characters SHOULD
1020  start and end with right-to-left characters.</t>
1021
1022</list></t>
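
<figure><preamble>A non-normative Python sketch of a check for the two
restrictions above follows. It assumes that "right-to-left
characters" are those with Unicode bidirectional class R or AL and
that "left-to-right characters" are those with class L, as reported
by the unicodedata module of the Python standard library; the
function name check_component is illustrative.</preamble>
<artwork><![CDATA[
```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def check_component(component: str) -> list:
    """Return a list of violated restrictions (empty if none)."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)
    problems = []
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right")
    if has_rtl and not (classes[0] in RTL_CLASSES
                        and classes[-1] in RTL_CLASSES):
        problems.append("does not start and end right-to-left")
    return problems
```
]]></artwork></figure>

<t>A component such as the "GH1" of Example 8 (with G and H standing
for right-to-left letters) would be reported by the second check,
because it ends with a digit.</t>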
1023
1024<t>The above restrictions are given as "SHOULD"s, rather than as
1025"MUST"s.  For IRIs that are never presented visually, they are not
1026relevant.  However, for IRIs in general, they are very important to
1027ensure consistent conversion between visual presentation and logical
1028representation, in both directions.</t>
1029
1030<t><list style="hanging">
1031
1032<t hangText="Note:">In some components, the above restrictions may
1033  actually be strictly enforced.  For example, <xref
1034  target="RFC3490"></xref> requires that these restrictions apply to
1035  the labels of a host name for those schemes where ireg-name is a
1036  host name.  In some other components (for example, path components)
1037  following these restrictions may not be too difficult.  For other
1038  components, such as parts of the query part, it may be very
1039  difficult to enforce the restrictions because the values of query
1040  parameters may be arbitrary character sequences.</t>
1041
1042</list></t>
1043
1044<t>If the above restrictions cannot be satisfied otherwise, the
1045affected component can always be mapped to URI notation as described
1046in <xref target="compmapping"/>. Please note that the whole component
1047has to be mapped (see also Example 9 below).</t>
1048
1049</section> <!-- bidi-structure -->
1050
1051<section title="Input of Bidi IRIs" anchor="bidiInput">
1052
1053<t>Bidi input methods MUST generate Bidi IRIs in logical order while
1054rendering them according to <xref target="visual"/>.  During input,
1055rendering SHOULD be updated after every new character is input to
1056avoid end-user confusion.</t>
1057
1058</section> <!-- bidiInput -->
1059
1060<section title="Examples">
1061
1062<t>This section gives examples of bidirectional IRIs, in Bidi
1063Notation.  It shows legal IRIs with the relationship between logical
1064and visual representation and explains how certain phenomena in this
1065relationship may look strange to somebody not familiar with
1066bidirectional behavior, but familiar to users of Arabic and Hebrew. It
1067also shows what happens if the restrictions given in <xref
1068target="bidi-structure"/> are not followed. The examples below can be
1069seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
1070variants.</t>
1071
1072<t>To read the bidi text in the examples, read the visual
1073representation from left to right until you encounter a block of rtl
1074text. Read the rtl block (including slashes and other special
1075characters) from right to left, then continue at the next unread ltr
1076character.</t>
1077
1078<t>Example 1: A single component with rtl characters is inverted:
1079<vspace/>Logical representation:
1080"http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
1081"http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
1082one by one, and each component can be read in its natural
1083direction.</t>
1084
1085<t>Example 2: More than one consecutive component with rtl characters
1086is inverted as a whole: <vspace/>Logical representation:
1087"http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
1088"http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
1089components is read rtl, in the same way as a sequence of rtl words is
1090read rtl in a bidi text.</t>
1091
1092<t>Example 3: All components of an IRI (except for the scheme) are
1093rtl.  All rtl components are inverted overall: <vspace/>Logical
1094representation:
1095"http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
1096representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
1097whole IRI (except the scheme) is read rtl. Delimiters between rtl
1098components stay between the respective components; delimiters between
1099ltr and rtl components don't move.</t>
1100
1101<t>Example 4: Each of several sequences of rtl components is inverted
1102on its own: <vspace/>Logical representation:
1103"http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
1104"http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
1105components is read rtl, in the same way as each sequence of rtl words
1106in an ltr text is read rtl.</t>
1107
1108<t>Example 5: Example 2, applied to components of different kinds:
1109<vspace/>Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
1110<vspace/>Visual representation:
1111"http://ab.cd.HG/FE/ij/kl.html"<vspace/> The inversion of the domain
1112name label and the path component may be unexpected, but it is
1113consistent with other bidi behavior.  For reassurance that the domain
1114component really is "ab.cd.EF", it may be helpful to read aloud the
1115visual representation following the bidi algorithm. After
1116"http://ab.cd." one reads the RTL block "E-F-slash-G-H", which
1117corresponds to the logical representation.
1118</t>
1119
1120<t>Example 6: Same as Example 5, with more rtl components:
1121<vspace/>Logical representation:
1122"http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
1123"http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
1124name labels and the path components may be easier to identify because
1125the delimiters also move.</t>
1126
1127<t>Example 7: A single rtl component includes digits: <vspace/>Logical
1128representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
1129representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
1130Numbers are written ltr in all cases but are treated as an additional
1131embedding inside a run of rtl characters. This is completely
1132consistent with usual bidirectional text.</t>
1133
1134<t>Example 8 (not allowed): Numbers are at the start or end of an rtl
1135component:<vspace/>Logical representation:
1136"http://ab.cd.ef/GH1/2IJ/KL.html"<vspace/>Visual representation:
1137"http://ab.cd.ef/LK/JI1/2HG.html"<vspace/> The sequence "1/2" is
1138interpreted by the bidi algorithm as a fraction, fragmenting the
1139components and leading to confusion. There are other characters that
1140are interpreted in a special way close to numbers; in particular, "+",
1141"-", "#", "$", "%", ",", ".", and ":".</t>
1142
1143<t>Example 9 (not allowed): The numbers in the previous example are
1144percent-encoded: <vspace/>Logical representation:
1145"http://ab.cd.ef/GH%31/%32IJ/KL.html",<vspace/>Visual representation:
1146"http://ab.cd.ef/LK/JI%32/%31HG.html"</t>
1147
1148<t>Example 10 (allowed but not recommended): <vspace/>Logical
1149representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
1150representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
1151Components consisting of only numbers are allowed (it would be rather
1152difficult to prohibit them), but these may interact with adjacent RTL
1153components in ways that are not easy to predict.</t>
1154
1155<t>Example 11 (allowed but not recommended): <vspace/>Logical
1156representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
1157representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
1158Components consisting of numbers and left-to-right characters are
1159allowed, but these may interact with adjacent RTL components in ways
1160that are not easy to predict.</t>
1161</section><!-- examples -->
1162</section><!-- bidi -->
1163
1164<section title="Normalization and Comparison" anchor="equivalence">
1165
1166<t><list style="hanging"><t hangText="Note:">The structure and much of
1167  the material for this section is taken from section 6 of <xref
1168  target="RFC3986"></xref>; the differences are due to the specifics
1169  of IRIs.</t></list></t>
1170
1171<t>One of the most common operations on IRIs is simple comparison:
1172Determining whether two IRIs are equivalent, without using the IRIs to
1173access their respective resource(s). A comparison is performed
1174whenever a response cache is accessed, a browser checks its history to
1175color a link, or an XML parser processes tags within a
1176namespace. Extensive normalization prior to comparison of IRIs may be
1177used by spiders and indexing engines to prune a search space or reduce
1178duplication of request actions and response storage.</t>
1179
1180<t>IRI comparison is performed for some particular purpose. Protocols
1181or implementations that compare IRIs for different purposes will often
be subject to differing design trade-offs with regard to how much
effort should be spent in reducing aliased identifiers. This section
1184describes various methods that may be used to compare IRIs, the
1185trade-offs between them, and the types of applications that might use
1186them.</t>
1187
1188<section title="Equivalence">
1189
1190<t>Because IRIs exist to identify resources, presumably they should be
1191considered equivalent when they identify the same resource. However,
1192this definition of equivalence is not of much practical use, as there
1193is no way for an implementation to compare two resources to determine
1194if they are "the same" unless it has full knowledge or control of
1195them. For this reason, determination of equivalence or difference of
1196IRIs is based on string comparison, perhaps augmented by reference to
1197additional rules provided by URI scheme definitions.  We use the terms
1198"different" and "equivalent" to describe the possible outcomes of such
1199comparisons, but there are many application-dependent versions of
1200equivalence.</t>
1201
1202<t>Even when it is possible to determine that two IRIs are equivalent,
1203IRI comparison is not sufficient to determine whether two IRIs
1204identify different resources. For example, an owner of two different
1205domain names could decide to serve the same resource from both,
1206resulting in two different IRIs. Therefore, comparison methods are
1207designed to minimize false negatives while strictly avoiding false
1208positives.</t>
1209
1210<t>In testing for equivalence, applications should not directly
1211compare relative references; the references should be converted to
1212their respective target IRIs before comparison. When IRIs are compared
1213to select (or avoid) a network action, such as retrieval of a
1214representation, fragment components (if any) should be excluded from
1215the comparison.</t>
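
<figure><preamble>The two preparatory steps named above, resolving
relative references against a base and leaving out fragment
components when a network action is being selected, might be sketched
as follows. This is non-normative; the function name
same_network_target is illustrative, and urljoin and urldefrag from
the Python standard library stand in for reference resolution and
fragment removal.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import urldefrag, urljoin


def same_network_target(base: str, ref_a: str, ref_b: str) -> bool:
    """Compare two references after resolving them against base
    and dropping their fragment components."""
    target_a = urldefrag(urljoin(base, ref_a)).url
    target_b = urldefrag(urljoin(base, ref_b)).url
    return target_a == target_b
```
]]></artwork></figure>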
1216
1217<t>Applications using IRIs as identity tokens with no relationship to
1218a protocol MUST use the Simple String Comparison (see <xref
1219target="stringcomp"></xref>).  All other applications MUST select one
of the comparison practices from the Comparison Ladder (see <xref
target="ladder"></xref>).</t>
1222</section> <!-- equivalence -->
1223
1224
1225<section title="Preparation for Comparison">
1226<t>Any kind of IRI comparison REQUIRES that any additional contextual
1227processing is first performed, including undoing higher-level
1228escapings or encodings in the protocol or format that carries an
1229IRI. This preprocessing is usually done when the protocol or format is
1230parsed.</t>
1231
1232<t>Examples of contextual preprocessing steps are described in <xref
1233target="LEIRIHREF"/>. </t>
1234
1235<t>Examples of such escapings or encodings are entities and
1236numeric character references in <xref target="HTML4"></xref> and <xref
1237target="XML1"></xref>. As an example,
1238"http://example.org/ros&amp;eacute;" (in HTML),
1239"http://example.org/ros&amp;#233;" (in HTML or XML), and
1240<vspace/>"http://example.org/ros&amp;#xE9;" (in HTML or XML) are all
1241resolved into what is denoted in this document (see <xref
1242target="sec-Notation"></xref>) as "http://example.org/ros&amp;#xE9;"
1243(the "&amp;#xE9;" here standing for the actual e-acute character, to
1244compensate for the fact that this document cannot contain non-ASCII
1245characters).</t>
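
<figure><preamble>For the HTML case, undoing the higher-level escaping
before comparison can be sketched non-normatively with the
html.unescape function of the Python standard library. The function
name prepare_html_iri is illustrative; a real HTML or XML parser
would normally perform this step while parsing the attribute
value.</preamble>
<artwork><![CDATA[
```python
import html


def prepare_html_iri(attribute_value: str) -> str:
    """Resolve entities and numeric character references so that
    the IRI consists of the actual characters."""
    return html.unescape(attribute_value)
```
]]></artwork></figure>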
1246
1247<t>Similar considerations apply to encodings such as Transfer Codings
1248in HTTP (see <xref target="RFC2616"></xref>) and Content Transfer
1249Encodings in MIME (<xref target="RFC2045"></xref>), although in these
1250cases, the encoding is based not on characters but on octets, and
1251additional care is required to make sure that characters, and not just
1252arbitrary octets, are compared (see <xref
1253target="stringcomp"></xref>).</t>
1254
1255</section> <!-- preparation -->
1256
1257<section title="Comparison Ladder" anchor="ladder">
1258
1259<t>In practice, a variety of methods are used to test IRI
1260equivalence. These methods fall into a range distinguished by the
1261amount of processing required and the degree to which the probability
1262of false negatives is reduced. As noted above, false negatives cannot
1263be eliminated. In practice, their probability can be reduced, but this
1264reduction requires more processing and is not cost-effective for all
1265applications.</t>
1266
1267
1268<t>If this range of comparison practices is considered as a ladder,
1269the following discussion will climb the ladder, starting with
1270practices that are cheap but have a relatively higher chance of
1271producing false negatives, and proceeding to those that have higher
1272computational cost and lower risk of false negatives.</t>
1273
1274<section title="Simple String Comparison" anchor="stringcomp">
1275
1276<t>If two IRIs, when considered as character strings, are identical,
1277then it is safe to conclude that they are equivalent.  This type of
1278equivalence test has very low computational cost and is in wide use in
1279a variety of applications, particularly in the domain of parsing. It
1280is also used when a definitive answer to the question of IRI
1281equivalence is needed that is independent of the scheme used and that
1282can be calculated quickly and without accessing a network. An example
1283of such a case is XML Namespaces (<xref
1284target="XMLNamespace"></xref>).</t>
1285
1286
1287<t>Testing strings for equivalence requires some basic precautions.
1288This procedure is often referred to as "bit-for-bit" or
1289"byte-for-byte" comparison, which is potentially misleading. Testing
1290strings for equality is normally based on pair comparison of the
1291characters that make up the strings, starting from the first and
1292proceeding until both strings are exhausted and all characters are
1293found to be equal, until a pair of characters compares unequal, or
1294until one of the strings is exhausted before the other.</t>
1295
1296<t>This character comparison requires that each pair of characters be
1297put in comparable encoding form. For example, should one IRI be stored
1298in a byte array in UTF-8 encoding form and the second in a UTF-16
1299encoding form, bit-for-bit comparisons applied naively will produce
1300errors. It is better to speak of equality on a character-for-character
1301rather than on a byte-for-byte or bit-for-bit basis.  In practical
1302terms, character-by-character comparisons should be done codepoint by
1303codepoint after conversion to a common character encoding form.
1304
1305When comparing character by character, the comparison function MUST
1306NOT map IRIs to URIs, because such a mapping would create additional
1307spurious equivalences. It follows that an IRI SHOULD NOT be modified
1308when being transported if there is any chance that this IRI might be
1309used in a context that uses Simple String Comparison.</t>
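<figure><preamble>As an illustration of character-for-character comparison, the following Python sketch decodes two stored byte sequences into code points before comparing them. The function name iri_equal_simple and its parameters are hypothetical and are not defined by this specification; the sketch assumes that the character encoding of each byte sequence is known.</preamble>
<artwork><![CDATA[
```python
def iri_equal_simple(a: bytes, a_encoding: str,
                     b: bytes, b_encoding: str) -> bool:
    """Simple String Comparison on code points, not on bytes."""
    # Decode both inputs to sequences of Unicode code points first.
    # No IRI-to-URI mapping and no normalization is applied.
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://www.example.org/r\u00e9sum\u00e9.html"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

# The byte sequences differ, but the character sequences are identical.
bytes_differ = as_utf8 != as_utf16
chars_equal = iri_equal_simple(as_utf8, "utf-8", as_utf16, "utf-16-le")
```
]]></artwork>
<postamble>Because the sketch deliberately performs no mapping to URIs, an IRI and its percent-encoded URI form compare as different strings.</postamble></figure>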
1310
1311
1312<t>False negatives are caused by the production and use of IRI
1313aliases. Unnecessary aliases can be reduced, regardless of the
1314comparison method, by consistently providing IRI references in an
1315already normalized form (i.e., a form identical to what would be
1316produced after normalization is applied, as described below).
1317Protocols and data formats often limit some IRI comparisons to simple
1318string comparison, based on the theory that people and implementations
1319will, in their own best interest, be consistent in providing IRI
1320references, or at least be consistent enough to negate any efficiency
1321that might be obtained from further normalization.</t>
1322</section> <!-- stringcomp -->
1323
1324<section title="Syntax-Based Normalization">
1325
1326<figure><preamble>Implementations may use logic based on the
1327definitions provided by this specification to reduce the probability
1328of false negatives. This processing is moderately higher in cost than
1329character-for-character string comparison. For example, an application
1330using this approach could reasonably consider the following two IRIs
1331equivalent:</preamble>
1332
1333<artwork>
1334   example://a/b/c/%7Bfoo%7D/ros&amp;#xE9;
1335   eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
1336</artwork></figure>
1337
1338<t>Web user agents, such as browsers, typically apply this type of IRI
1339normalization when determining whether a cached response is
1340available. Syntax-based normalization includes such techniques as case
1341normalization, character normalization, percent-encoding
1342normalization, and removal of dot-segments.</t>
1343
1344<section title="Case Normalization">
1345
1346<t>For all IRIs, the hexadecimal digits within a percent-encoding
1347triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1348should be normalized to use uppercase letters for the digits A-F.</t>
1349
1350<t>When an IRI uses components of the generic syntax, the component
1351syntax equivalence rules always apply; namely, that the scheme and
1352US-ASCII only host are case insensitive and therefore should be
1353normalized to lowercase. For example, the URI
1354"HTTP://www.EXAMPLE.com/" is equivalent to
1355"http://www.example.com/". Case equivalence for non-ASCII characters
in IRI components that are IDNs is discussed in <xref
1357target="schemecomp"></xref>.  The other generic syntax components are
1358assumed to be case sensitive unless specifically defined otherwise by
1359the scheme.</t>
1360
1361<t>Creating schemes that allow case-insensitive syntax components
1362containing non-ASCII characters should be avoided. Case normalization
1363of non-ASCII characters can be culturally dependent and is always a
1364complex operation. The only exception concerns non-ASCII host names
1365for which the character normalization includes a mapping step derived
1366from case folding.</t>
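<figure><preamble>The case normalization rules above might be sketched in Python as follows. The function name case_normalize and the regular expressions are illustrative assumptions, not part of this specification. The sketch lowercases the scheme and a host that consists only of US-ASCII characters, uppercases the hexadecimal digits of percent-encoding triplets, and leaves all other components untouched.</preamble>
<artwork><![CDATA[
```python
import re

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")
_PREFIX = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):(//([^/?#]*))?")


def case_normalize(iri: str) -> str:
    """Lowercase scheme and ASCII-only host; uppercase hex digits."""
    m = _PREFIX.match(iri)
    if m:
        scheme = m.group(1).lower()
        rest = iri[m.end():]
        if m.group(2) is None:
            # No authority component: only the scheme is touched.
            iri = scheme + ":" + rest
        else:
            userinfo, at, hostport = m.group(3).rpartition("@")
            # A host with non-ASCII characters is left alone here.
            if hostport.isascii():
                hostport = hostport.lower()
            iri = scheme + "://" + userinfo + at + hostport + rest
    # Done last, so that lowercasing the host cannot undo it.
    return _PCT.sub(lambda t: t.group(0).upper(), iri)
```
]]></artwork>
<postamble>With this sketch, "HTTP://www.EXAMPLE.com/" becomes "http://www.example.com/", while the case of path, query, and fragment characters is preserved.</postamble></figure>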
1367
1368</section> <!-- casenorm -->
1369
1370<section title="Character Normalization" anchor="normalization">
1371
1372<t>The Unicode Standard <xref target="UNIV4"></xref> defines various
1373equivalences between sequences of characters for various
1374purposes. Unicode Standard Annex #15 <xref target="UTR15"></xref>
1375defines various Normalization Forms for these equivalences, in
1376particular Normalization Form C (NFC, Canonical Decomposition,
1377followed by Canonical Composition) and Normalization Form KC (NFKC,
1378Compatibility Decomposition, followed by Canonical Composition).</t>
1379
1380<t> IRIs already in Unicode MUST NOT be normalized before parsing or
1381interpreting. In many non-Unicode character encodings, some text
1382cannot be represented directly. For example, the word "Vietnam" is
1383natively written "Vi&amp;#x1EC7;t Nam" (containing a LATIN SMALL
1384LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct
1385transcoding from the windows-1258 character encoding leads to
1386"Vi&amp;#xEA;&amp;#x323;t Nam" (containing a LATIN SMALL LETTER E WITH
1387CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct transcoding of
1388other 8-bit encodings of Vietnamese may lead to other
1389representations.</t>
1390
1391<t>Equivalence of IRIs MUST rely on the assumption that IRIs are
1392appropriately pre-character-normalized rather than apply character
1393normalization when comparing two IRIs. The exceptions are conversion
1394from a non-digital form, and conversion from a non-UCS-based character
1395encoding to a UCS-based character encoding. In these cases, NFC or a
1396normalizing transcoder using NFC MUST be used for interoperability. To
1397avoid false negatives and problems with transcoding, IRIs SHOULD be
1398created by using NFC. Using NFKC may avoid even more problems; for
1399example, by choosing half-width Latin letters instead of full-width
1400ones, and full-width instead of half-width Katakana.</t>
1401
1402
1403<t>As an example,
1404"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" (in XML
1405Notation) is in NFC. On the other hand,
1406"http://www.example.org/re&amp;#x301;sume&amp;#x301;.html" is not in
1407NFC.</t>
1408
1409<t>The former uses precombined e-acute characters, and the latter uses
1410"e" characters followed by combining acute accents. Both usages are
1411defined as canonically equivalent in <xref target="UNIV4"></xref>.</t>
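<figure><preamble>The difference between the two forms can be checked with the Python standard library module unicodedata, as in the following sketch. The helper name is_nfc is hypothetical. The check is meant for use when an IRI is created or transcoded, not as a step applied when comparing two IRIs.</preamble>
<artwork><![CDATA[
```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

# Canonically equivalent, yet different as code point sequences, so
# Simple String Comparison reports them as different.
same_string = precomposed == decomposed
same_after_nfc = (
    unicodedata.normalize("NFC", decomposed) == precomposed)
```
]]></artwork></figure>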
1412
1413<t><list style="hanging">
1414
1415<t hangText="Note:">
1416Because it is unknown how a particular sequence of characters is being
1417treated with respect to character normalization, it would be
1418inappropriate to allow third parties to normalize an IRI
1419arbitrarily. This does not contradict the recommendation that when a
1420resource is created, its IRI should be as character normalized as
1421possible (i.e., NFC or even NFKC). This is similar to the
1422uppercase/lowercase problems.  Some parts of a URI are case
1423insensitive (for example, the domain name). For others, it is unclear
1424whether they are case sensitive, case insensitive, or something in
1425between (e.g., case sensitive, but with a multiple choice selection if
1426the wrong case is used, instead of a direct negative result).  The
1427best recipe is that the creator use a reasonable capitalization and,
1428when transferring the URI, capitalization never be
1429changed.</t></list></t>
1430
1431<t>Various IRI schemes may allow the usage of Internationalized Domain
1432Names (IDN) <xref target="RFC3490"></xref> either in the ireg-name
1433part or elsewhere. Character Normalization also applies to IDNs, as
1434discussed in <xref target="schemecomp"></xref>.</t>
1435</section> <!-- charnorm -->
1436
1437<section title="Percent-Encoding Normalization">
1438
1439<t>The percent-encoding mechanism (Section 2.1 of <xref
1440target="RFC3986"></xref>) is a frequent source of variance among
1441otherwise identical IRIs. In addition to the case normalization issue
1442noted above, some IRI producers percent-encode octets that do not
1443require percent-encoding, resulting in IRIs that are equivalent to
1444their nonencoded counterparts. These IRIs should be normalized by
1445decoding any percent-encoded octet sequence that corresponds to an
1446unreserved character, as described in section 2.3 of <xref
1447target="RFC3986"></xref>.</t>
1448
1449<t>For actual resolution, differences in percent-encoding (except for
1450the percent-encoding of reserved characters) MUST always result in the
1451same resource.  For example, "http://example.org/~user",
1452"http://example.org/%7euser", and "http://example.org/%7Euser", must
1453resolve to the same resource.</t>
1454
1455<t>If this kind of equivalence is to be tested, the percent-encoding
1456of both IRIs to be compared has to be aligned; for example, by
1457converting both IRIs to URIs (see Section 3.1), eliminating escape
1458differences in the resulting URIs, and making sure that the case of
1459the hexadecimal characters in the percent-encoding is always the same
1460(preferably upper case). If the IRI is to be passed to another
1461application or used further in some other way, its original form MUST
1462be preserved.  The conversion described here should be performed only
1463for local comparison.</t>
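<figure><preamble>A minimal Python sketch of percent-encoding normalization for local comparison is given below. The function name normalize_pct is an assumption made for illustration. It decodes percent-encoded octets that correspond to unreserved characters and uppercases the hexadecimal digits of all remaining triplets; it does not convert the IRI to a URI.</preamble>
<artwork><![CDATA[
```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986, 2.3)
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct(text: str) -> str:
    """Return a comparison key; the original IRI is not modified."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            return char                       # needless escape: decode
        return "%" + match.group(1).upper()   # keep, uppercase digits
    return _PCT.sub(fix, text)
```
]]></artwork>
<postamble>The result is intended only as a key for comparison. The original form of the IRI is what should be stored or passed on.</postamble></figure>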
1464
1465</section> <!-- pctnorm -->
1466
1467<section title="Path Segment Normalization">
1468
1469<t>The complete path segments "." and ".." are intended only for use
1470within relative references (Section 4.1 of <xref
1471target="RFC3986"></xref>) and are removed as part of the reference
1472resolution process (Section 5.2 of <xref target="RFC3986"></xref>).
1473However, some implementations may incorrectly assume that reference
1474resolution is not necessary when the reference is already an IRI, and
1475thus fail to remove dot-segments when they occur in non-relative
1476paths.  IRI normalizers should remove dot-segments by applying the
1477remove_dot_segments algorithm to the path, as described in Section
14785.2.4 of <xref target="RFC3986"></xref>.</t>
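<figure><preamble>The following Python sketch transcribes the steps of the remove_dot_segments algorithm. It is an illustration written for this section, not normative text, and should be checked against Section 5.2.4 of RFC 3986 before use.</preamble>
<artwork><![CDATA[
```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path (RFC 3986, 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # "/../" becomes "/"
            if output:
                output.pop()           # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, including a leading "/",
            # from the input to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)
```
]]></artwork>
<postamble>Applied to the path "/a/./b/../b/%63/x", the sketch yields "/a/b/%63/x". Percent-encoded octets are not decoded by this step.</postamble></figure>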
1479
1480</section> <!-- pathnorm -->
1481</section> <!-- ladder -->
1482
1483<section title="Scheme-Based Normalization" anchor="schemecomp">
1484
1485<t>The syntax and semantics of IRIs vary from scheme to scheme, as
1486described by the defining specification for each
1487scheme. Implementations may use scheme-specific rules, at further
1488processing cost, to reduce the probability of false negatives. For
1489example, because the "http" scheme makes use of an authority
1490component, has a default port of "80", and defines an empty path to be
1491equivalent to "/", the following four IRIs are equivalent:</t>
1492
1493<figure><artwork>
1494   http://example.com
1495   http://example.com/
1496   http://example.com:/
1497   http://example.com:80/</artwork></figure>
1498
1499<t>In general, an IRI that uses the generic syntax for authority with
1500an empty path should be normalized to a path of "/". Likewise, an
1501explicit ":port", for which the port is empty or the default for the
1502scheme, is equivalent to one where the port and its ":" delimiter are
1503elided and thus should be removed by scheme-based normalization. For
1504example, the second IRI above is the normal form for the "http"
1505scheme.</t>
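<figure><preamble>For the "http" scheme, the two rules above could be sketched in Python as follows. The function name normalize_http is hypothetical. The component-splitting regular expression is the one given in Appendix B of RFC 3986; it is used instead of a higher-level parser so that an empty query or fragment delimiter is preserved rather than silently dropped.</preamble>
<artwork><![CDATA[
```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
_PARTS = re.compile(
    r"(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL)


def normalize_http(iri: str) -> str:
    """Drop an empty or default port and normalize an empty path."""
    m = _PARTS.fullmatch(iri)
    scheme, authority, path = m.group(2), m.group(4), m.group(5)
    if (scheme is None or scheme.lower() != "http"
            or authority is None):
        return iri  # not an "http" IRI with an authority
    authority = re.sub(r":(80)?$", "", authority)
    query = m.group(6) or ""      # keeps "?" even if the query is empty
    fragment = m.group(8) or ""   # keeps "#" even if it is empty
    return "http://" + authority + (path or "/") + query + fragment
```
]]></artwork>
<postamble>All four example IRIs above map to "http://example.com/" under this sketch, while "http://example.com/?" keeps its trailing "?" and therefore stays distinct.</postamble></figure>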
1506
1507<t>Another case where normalization varies by scheme is in the
1508handling of an empty authority component or empty host
1509subcomponent. For many scheme specifications, an empty authority or
1510host is considered an error; for others, it is considered equivalent
1511to "localhost" or the end-user's host. When a scheme defines a default
1512for authority and an IRI reference to that default is desired, the
1513reference should be normalized to an empty authority for the sake of
1514uniformity, brevity, and internationalization. If, however, either the
1515userinfo or port subcomponents are non-empty, then the host should be
1516given explicitly even if it matches the default.</t>
1517
1518<t>Normalization should not remove delimiters when their associated
1519component is empty unless it is licensed to do so by the scheme
1520specification. For example, the IRI "http://example.com/?" cannot be
1521assumed to be equivalent to any of the examples above. Likewise, the
1522presence or absence of delimiters within a userinfo subcomponent is
1523usually significant to its interpretation.  The fragment component is
1524not subject to any scheme-based normalization; thus, two IRIs that
1525differ only by the suffix "#" are considered different regardless of
1526the scheme.</t>
1527 
1528<t>Some IRI schemes allow the usage of Internationalized Domain
1529Names (IDN) <xref target='RFC5890'></xref> either in their ireg-name
part or elsewhere. When in use in IRIs, those names SHOULD
1531conform to the definition of U-Label in <xref
1532target='RFC5890'></xref>. An IRI containing an invalid IDN cannot
1533successfully be resolved. For legibility purposes, they
1534SHOULD NOT be converted into ASCII Compatible Encoding (ACE).</t>
1535
1536<t>Scheme-based normalization may also consider IDN
1537components and their conversions to punycode as equivalent. As an
1538example, "http://r&amp;#xE9;sum&amp;#xE9;.example.org" may be
1539considered equivalent to
1540"http://xn--rsum-bpad.example.org".</t><t>Other scheme-specific
1541normalizations are possible.</t>
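<figure><preamble>The equivalence between an IDN and its ASCII Compatible Encoding can be illustrated with the "idna" codec of the Python standard library. Note the assumption involved: that codec implements the older IDNA rules of RFC 3490, not those of RFC 5890, so results may differ for some names. The function names host_to_ace and hosts_equivalent are hypothetical.</preamble>
<artwork><![CDATA[
```python
def host_to_ace(host: str) -> str:
    """Convert a host name to its ASCII Compatible Encoding."""
    # Python's built-in "idna" codec follows RFC 3490 (IDNA2003).
    return host.encode("idna").decode("ascii")


def hosts_equivalent(a: str, b: str) -> bool:
    """Compare two host names after conversion to ACE form."""
    return host_to_ace(a) == host_to_ace(b)
```
]]></artwork>
<postamble>The ACE form is computed here only for comparison. For display and interchange, the recommendation above to keep the U-Label form still applies.</postamble></figure>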
1542
1543</section> <!-- schemenorm -->
1544
1545<section title="Protocol-Based Normalization">
1546
1547<t>Substantial effort to reduce the incidence of false negatives is
1548often cost-effective for web spiders. Consequently, they implement
1549even more aggressive techniques in IRI comparison. For example, if
1550they observe that an IRI such as</t>
1551
1552<figure><artwork>
1553   http://example.com/data</artwork></figure>
1554<t>redirects to an IRI differing only in the trailing slash</t>
1555<figure><artwork>
1556   http://example.com/data/</artwork></figure>
1557
1558<t>they will likely regard the two as equivalent in the future.  This
1559kind of technique is only appropriate when equivalence is clearly
1560indicated by both the result of accessing the resources and the common
1561conventions of their scheme's dereference algorithm (in this case, use
1562of redirection by HTTP origin servers to avoid problems with relative
1563references).</t>
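<figure><preamble>One way a spider might record such observations is sketched below in Python. The class AliasTable and its methods are hypothetical, and the sketch performs no network access; it assumes that the caller has already observed the redirect. Only redirects whose target differs from the source by a trailing slash are recorded.</preamble>
<artwork><![CDATA[
```python
class AliasTable:
    """Remembers IRIs observed to redirect to a trailing-slash form."""

    def __init__(self):
        self._canonical = {}

    def record_redirect(self, source: str, target: str) -> None:
        # Ignore redirects that change more than the trailing slash.
        if source + "/" == target:
            self._canonical[source] = target

    def canonical(self, iri: str) -> str:
        return self._canonical.get(iri, iri)

    def equivalent(self, a: str, b: str) -> bool:
        return self.canonical(a) == self.canonical(b)
```
]]></artwork>
<postamble>Before a redirect has been recorded, the two example IRIs above compare as different; afterwards they compare as equivalent.</postamble></figure>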
1564
1565</section> <!-- protonorm -->
1566</section> <!-- equivalence -->
1567</section> 
1568
1569<section title="Use of IRIs" anchor="IRIuse">
1570
1571<section title="Limitations on UCS Characters Allowed in IRIs" anchor="limitations">
1572
1573<t>This section discusses limitations on characters and character
1574sequences usable for IRIs beyond those given in <xref target="abnf"/>
1575and <xref target="visual"/>. The considerations in this section are
1576relevant when IRIs are created and when URIs are converted to
1577IRIs.</t>
1578
1579<t>
1580
1581<list style="hanging"><t hangText="a.">The repertoire of characters allowed
1582    in each IRI component is limited by the definition of that component.
1583    For example, the definition of the scheme component does not allow
1584    characters beyond US-ASCII.
1585    <vspace blankLines="1"/>
1586    (Note: In accordance with URI practice, generic IRI
1587    software cannot and should not check for such limitations.)</t>
1588
1589<t hangText="b.">The UCS contains many areas of characters for which
1590    there are strong visual look-alikes. Because of the likelihood of
1591    transcription errors, these also should be avoided. This includes
1592    the full-width equivalents of Latin characters, half-width
1593    Katakana characters for Japanese, and many others. It also
1594    includes many look-alikes of "space", "delims", and "unwise",
1595    characters excluded in <xref target="RFC3491"/>.</t>
1596   
1597</list>
1598</t>
1599
1600<t>Additional information is available from <xref target="UNIXML"/>.
1601    <xref target="UNIXML"/> is written in the context of running text
1602    rather than in that of identifiers. Nevertheless, it discusses
1603    many of the categories of characters not appropriate for IRIs.</t>
1604</section> <!-- limitations -->
1605
1606<section title="Software Interfaces and Protocols">
1607
1608<t>Although an IRI is defined as a sequence of characters, software
1609interfaces for URIs typically function on sequences of octets or other
1610kinds of code units. Thus, software interfaces and protocols MUST
1611define which character encoding is used.</t>
1612
1613<t>Intermediate software interfaces between IRI-capable components and
1614URI-only components MUST map the IRIs per <xref target="mapping"/>,
1615when transferring from IRI-capable to URI-only components.
1616
1617This mapping SHOULD be applied as late as possible. It SHOULD NOT be
1618applied between components that are known to be able to handle IRIs.</t>
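<figure><preamble>The percent-encoding step of such a mapping, applied at the boundary to a URI-only component, might look as follows in Python. This is a simplified sketch and not the full mapping referenced above: the name iri_to_uri is hypothetical, and any special treatment of the host component is left out.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import quote

# ASCII characters kept literally in addition to the unreserved set
# (which quote() never escapes): the reserved characters, and "%" so
# that existing percent-encodings are left alone.
_KEEP = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of non-ASCII characters."""
    return quote(iri, safe=_KEEP, encoding="utf-8")
```
]]></artwork>
<postamble>Calling the function only at the point where a URI-only component is reached keeps the IRI in its original form for as long as possible.</postamble></figure>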
1619</section> <!-- software -->
1620
1621<section title="Format of URIs and IRIs in Documents and Protocols">
1622
1623<t>Document formats that transport URIs may have to be upgraded to allow
1624the transport of IRIs. In cases where the document as a whole
1625has a native character encoding, IRIs MUST also be encoded in this
1626character encoding and converted accordingly by a parser or interpreter.
1627
1628IRI characters not expressible in the native character encoding SHOULD
1629be escaped by using the escaping conventions of the document format if
1630such conventions are available. Alternatively, they MAY be
1631percent-encoded according to <xref target="mapping"/>. For example, in
1632HTML or XML, numeric character references SHOULD be used. If a
1633document as a whole has a native character encoding and that character
1634encoding is not UTF-8, then IRIs MUST NOT be placed into the document
1635in the UTF-8 character encoding.</t>
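<figure><preamble>For HTML or XML documents, escaping with numeric character references can be sketched in Python with the standard "xmlcharrefreplace" error handler. The function name escape_for_document is hypothetical. The sketch only covers characters that the document encoding cannot express; markup-significant characters such as the ampersand need separate escaping.</preamble>
<artwork><![CDATA[
```python
def escape_for_document(iri: str, document_encoding: str) -> bytes:
    """Encode an IRI in the document's native character encoding.

    Characters the encoding cannot express become numeric character
    references instead of being percent-encoded.
    """
    return iri.encode(document_encoding, errors="xmlcharrefreplace")


iri = "http://www.example.org/r\u00e9sum\u00e9.html"
in_latin1 = escape_for_document(iri, "iso-8859-1")  # no escaping needed
in_ascii = escape_for_document(iri, "ascii")        # e-acute is escaped
```
]]></artwork></figure>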
1636
1637<t>((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1638although they use different terminology. HTML 4.0 <xref
1639target="HTML4"/> defines the conversion from IRIs to URIs as
1640error-avoiding behavior. XML 1.0 <xref target="XML1"/>, XLink <xref
1641target="XLink"/>, XML Schema <xref target="XMLSchema"/>, and
1642specifications based upon them allow IRIs. Also, it is expected that
1643all relevant new W3C formats and protocols will be required to handle
1644IRIs <xref target="CharMod"/>.</t>
1645
1646</section> <!-- format -->
1647
1648<section title="Use of UTF-8 for Encoding Original Characters" anchor="UTF8use">
1649
1650<t>This section discusses details and gives examples for point c) in
1651<xref target="Applicability"/>. To be able to use IRIs, the URI
1652corresponding to the IRI in question has to encode original characters
1653into octets by using UTF-8.  This can be specified for all URIs of a
1654URI scheme or can apply to individual URIs for schemes that do not
1655specify how to encode original characters.  It can apply to the whole
1656URI, or only to some part. For background information on encoding
1657characters into URIs, see also Section 2.5 of <xref
1658target="RFC3986"/>.</t>
1659
1660<t>For new URI schemes, using UTF-8 is recommended in <xref
1661target="RFC4395"/>.  Examples where UTF-8 is already used are the URN
1662syntax <xref target="RFC2141"/>, IMAP URLs <xref target="RFC2192"/>,
1663and POP URLs <xref target="RFC2384"/>.  On the other hand, because the
1664HTTP URI scheme does not specify how to encode original characters,
1665only some HTTP URLs can have corresponding but different IRIs.</t>
1666
1667<t>For example, for a document with a URI
1668of<vspace/>"http://www.example.org/r%C3%A9sum%C3%A9.html", it is
1669possible to construct a corresponding IRI (in XML notation, see <xref
1670target="sec-Notation"/>):
1671"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" ("&amp;#xE9;"
1672stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded
1673and percent-encoded representation of that character). On the other
1674hand, for a document with a URI of
"http://www.example.org/r%E9sum%E9.html", the percent-encoded octets
1676cannot be converted to actual characters in an IRI, as the
1677percent-encoding is not based on UTF-8.</t>
1678
1679<t>For most URI schemes, there is no need to upgrade their scheme
1680definition in order for them to work with IRIs.  The main case where
1681upgrading makes sense is when a scheme definition, or a particular
1682component of a scheme, is strictly limited to the use of US-ASCII
1683characters with no provision to include non-ASCII characters/octets
1684via percent-encoding, or if a scheme definition currently uses highly
1685scheme-specific provisions for the encoding of non-ASCII characters.
1686An example of this is the mailto: scheme <xref target="RFC2368"/>.</t>
1687
1688<t>This specification updates the IANA registry of URI schemes to note
1689their applicability to IRIs, see <xref target="iana"/>.  All IRIs use
1690URI schemes, and all URIs with URI schemes can be used as IRIs, even
1691though in some cases only by using URIs directly as IRIs, without any
1692conversion.</t>
1693
1694<t>Scheme definitions can impose restrictions on the syntax of
1695scheme-specific URIs; i.e., URIs that are admissible under the generic
1696URI syntax <xref target="RFC3986"/> may not be admissible due to
1697narrower syntactic constraints imposed by a URI scheme
1698specification. URI scheme definitions cannot broaden the syntactic
1699restrictions of the generic URI syntax; otherwise, it would be
1700possible to generate URIs that satisfied the scheme-specific syntactic
1701constraints without satisfying the syntactic constraints of the
1702generic URI syntax. However, additional syntactic constraints imposed
by URI scheme specifications are applicable to IRIs, as the
1704corresponding URI resulting from the mapping defined in <xref
1705target="mapping"/> MUST be a valid URI under the syntactic
1706restrictions of generic URI syntax and any narrower restrictions
1707imposed by the corresponding URI scheme specification.</t>
1708
1709<t>The requirement for the use of UTF-8 generally applies to all parts
1710of a URI.  However, it is possible that the capability of IRIs to
1711represent a wide range of characters directly is used just in some
1712parts of the IRI (or IRI reference). The other parts of the IRI may
1713only contain US-ASCII characters, or they may not be based on
1714UTF-8. They may be based on another character encoding, or they may
1715directly encode raw binary data (see also <xref
1716target="RFC2397"/>). </t>
1717
1718<t>For example, it is possible to have a URI reference
1719of<vspace/>"http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9",
1720where the document name is encoded in iso-8859-1 based on server
1721settings, but where the fragment identifier is encoded in UTF-8 according
1722to <xref target="XPointer"/>. The IRI corresponding to the above
1723URI would be (in XML notation)<vspace/>"http://www.example.org/r%E9sum%E9.xml#r&amp;#xE9;sum&amp;#xE9;".</t>
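<figure><preamble>The part-by-part behavior described in this section can be sketched in Python as follows. The function name uri_to_iri is hypothetical, and the sketch makes simplifying assumptions: it treats each run of percent-encoded non-ASCII octets as a unit, converts the run only if it is valid UTF-8, and does not check whether the resulting characters are otherwise allowed in IRIs.</preamble>
<artwork><![CDATA[
```python
import re
from urllib.parse import unquote_to_bytes

# Runs of percent-encoded octets with the high bit set (%80 to %FF).
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def uri_to_iri(uri: str) -> str:
    """Convert UTF-8 based percent-encodings to actual characters."""
    def convert(match):
        octets = unquote_to_bytes(match.group(0))
        try:
            return octets.decode("utf-8")   # strict UTF-8 only
        except UnicodeDecodeError:
            return match.group(0)           # keep the escapes as-is
    return _HIGH_RUN.sub(convert, uri)
```
]]></artwork>
<postamble>Percent-encodings of US-ASCII characters are never touched by this sketch, so reserved characters keep their encoded form.</postamble></figure>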
1724
1725<t>Similar considerations apply to query parts. The functionality
1726of IRIs (namely, to be able to include non-ASCII characters) can
1727only be used if the query part is encoded in UTF-8.</t>
1728
1729</section> <!-- utf8 -->
1730
1731<section title="Relative IRI References">
1732<t>Processing of relative IRI references against a base is handled
1733straightforwardly; the algorithms of <xref target="RFC3986"/> can
1734be applied directly, treating the characters additionally allowed
1735in IRI references in the same way that unreserved characters are in URI
1736references.</t>
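<figure><preamble>As a hedged illustration, the urljoin function of the Python standard library, which implements reference resolution for URIs, operates on character strings and passes non-ASCII characters through unchanged. The base and reference values below are made up for this example.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import urljoin

base = "http://www.example.org/dir/index.html"

# Non-ASCII characters are carried through like unreserved ones.
sibling = urljoin(base, "r\u00e9sum\u00e9.html")
parent = urljoin(base, "../r\u00e9sum\u00e9.html")
```
]]></artwork></figure>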
1737
1738</section> <!-- relative -->
1739</section> <!-- IRIuse -->
1740
1741<section title="Liberal handling of otherwise invalid IRIs" anchor="LEIRIHREF">
1742
1743<t>(EDITOR NOTE: This Section may move to an appendix.)
1744 
1745Some technical specifications and widely-deployed software have
1746allowed additional variations and extensions of IRIs to be used in
1747syntactic components. This section describes two widely-used
1748preprocessing agreements. Other technical specifications may wish to
1749reference a syntactic component which is "a valid IRI or a string that
1750will map to a valid IRI after this preprocessing algorithm". These two
1751variants are known as <xref target="LEIRI">Legacy Extended IRI or
LEIRI</xref>, and <xref target="HTML5">Web Address</xref>.
1753</t>
1754
1755<t>Future technical specifications SHOULD NOT allow conforming
1756producers to produce, or conforming content to contain, such forms,
1757as they are not interoperable with other IRI consuming software.</t>
1758
1759<section title="LEIRI processing"  anchor="LEIRIspec">
1760  <t>This section defines Legacy Extended IRIs (LEIRIs).
1761    The syntax of Legacy Extended IRIs is the same as that for IRIs,
1762    except that the ucschar production is replaced by the leiri-ucschar production:</t>
1763<figure>
1764
1765<artwork>
1766  leiri-ucschar  = " " / "&lt;" / "&gt;" / '"' / "{" / "}" / "|"
1767                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1768                   / %xE000-FFFD / %x10000-10FFFF
1769</artwork>
1770
1771<postamble>
1772  Among other extensions, processors based on this specification also
1773  did not enforce the restriction on bidirectional formatting
1774  characters in <xref target="visual"></xref>, and the iprivate
1775  production becomes redundant.</postamble>
1776</figure>
1777
1778<t>To convert a string allowed as a LEIRI to an IRI, each character
1779allowed in leiri-ucschar but not in ucschar must be percent-encoded
1780using <xref target="compmapping"/>.</t>
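<figure><preamble>The conversion from a LEIRI to an IRI might be sketched in Python as follows. All function names are hypothetical. The sketch assumes that ucschar has the ranges given in RFC 3987, since the production is defined outside this section, and it percent-encodes the UTF-8 octets of each affected character.</preamble>
<artwork><![CDATA[
```python
_ASCII_EXTRA = set(' <>"{}|\\^`')


def in_ucschar(cp: int) -> bool:
    """ucschar, assuming the ranges of RFC 3987."""
    if (0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF
            or 0xFDF0 <= cp <= 0xFFEF):
        return True
    if 0x10000 <= cp <= 0xDFFFD:
        # xFFFE and xFFFF are excluded in every plane.
        return (cp & 0xFFFF) <= 0xFFFD
    return 0xE1000 <= cp <= 0xEFFFD


def in_leiri_ucschar(cp: int) -> bool:
    """leiri-ucschar as given in the production above."""
    return (chr(cp) in _ASCII_EXTRA or cp <= 0x1F
            or 0x7F <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def leiri_to_iri(leiri: str) -> str:
    """Percent-encode characters in leiri-ucschar but not ucschar."""
    out = []
    for char in leiri:
        cp = ord(char)
        if in_leiri_ucschar(cp) and not in_ucschar(cp):
            octets = char.encode("utf-8")
            out.append("".join("%{:02X}".format(b) for b in octets))
        else:
            out.append(char)
    return "".join(out)
```
]]></artwork>
<postamble>Characters that are already allowed in IRIs, and existing percent-encodings, pass through unchanged.</postamble></figure>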
1781</section> <!-- leiriproc -->
1782
1783<section title="Web Address processing" anchor="webaddress">
1784
1785<t>Many popular web browsers have taken the approach of being quite
1786liberal in what is accepted as a "URL" or its relative
1787forms. This section describes their behavior in terms of a preprocessor
1788which maps strings into the IRI space for subsequent parsing and
1789interpretation as an IRI.</t>
1790
1791<t>In some situations, it might be appropriate to describe the syntax
1792that a liberal consumer implementation might accept as a "Web
1793Address" or "Hypertext Reference" or "HREF". However,
1794technical specifications SHOULD restrict the syntactic form allowed by compliant producers
1795to the IRI or IRI reference syntax defined in this document
1796even if they want to mandate this processing.</t>
1797
1798<t>
1799Summary:
1800<list style="symbols">
1801   <t>Leading and trailing whitespace is removed.</t>
1802   <t>Some additional characters are removed.</t>
1803   <t>Some additional characters are allowed and escaped (as with LEIRI).</t>
1804   <t>If interpreting an IRI as a URI, the pct-encoding of the query
1805   component of the parsed URI component depends on operational
1806   context.</t>
1807</list>
1808</t>
1809
<t>Each string provided may have an associated charset (called
the HREF-charset here); this defaults to UTF-8.
For web browsers interpreting HTML, the HREF-charset
of a string is determined as follows:

<list style="hanging">
<t hangText="If the string came from a script (e.g. as an argument to
 a method)">The HREF-charset is the script's charset.</t>

<t hangText="If the string came from a DOM node (e.g. from an
  element)">The node has a Document, and the HREF-charset is the
  Document's character encoding.</t>

<t hangText="If the string had a HREF-charset defined when the string was
created or defined">The HREF-charset is as defined.</t>

</list></t>

<t>If the resulting HREF-charset is a Unicode-based character encoding
(e.g., UTF-16), then use UTF-8 instead.</t>
1830
1831
1832<figure>
1833<preamble>The syntax for Web Addresses is obtained by replacing the 'ucschar',
1834  pct-form, and path-sep rules with the href-ucschar, href-pct-form, and href-path-sep
1835  rules below. In addition, some characters are stripped.</preamble>
1836
1837<artwork type='abnf'>
1838  href-ucschar  = " " / "&lt;" / "&gt;" / DQUOTE / "{" / "}" / "|"
1839                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1840                   / %xE000-FFFD / %x10000-10FFFF
1841  href-pct-form = pct-encoded / "%"
1842  href-path-sep = "/" / "\"
1843  href-strip    = &lt;to be done&gt;
1844</artwork>
1845
1846<postamble>
1847(NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT SENTENCE)
1848browsers did not enforce the restriction on bidirectional formatting
1849  characters in <xref target="visual"></xref>, and the iprivate
1850  production becomes redundant.</postamble>
1851</figure>
1852
1853<t>'Web Address processing' requires the following additional
1854preprocessing steps:
1855
1856<list style="numbers">
1857
<t>Leading and trailing instances of space (U+0020),
CR (U+000D), LF (U+000A), and TAB (U+0009) characters are removed.</t>

<t>Strip all characters in href-strip.</t>
1862  <t>Percent-encode all characters in href-ucschar not in ucschar.</t>
1863  <t>Replace occurrences of "%" not followed by two hexadecimal digits by "%25".</t>
1864  <t>Convert backslashes ('\') matching href-path-sep to forward slashes ('/').</t>
1865</list></t>
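<figure><preamble>The preprocessing steps above could be sketched in Python as shown below. Several simplifications are assumptions of this sketch and not statements of the draft: step 2 is skipped because href-strip is not yet defined; step 3 covers only the listed ASCII characters and the control characters; the backslash is exempted from step 3 so that step 5 can still act on it; and step 5 converts every backslash, not only those in path separator positions. The function names are hypothetical.</preamble>
<artwork><![CDATA[
```python
import re

_ASCII_EXTRA = set(' <>"{}|^`')   # backslash is left for step 5


def _pct(char: str) -> str:
    return "".join("%{:02X}".format(b) for b in char.encode("utf-8"))


def _needs_escape(char: str) -> bool:
    cp = ord(char)
    return char in _ASCII_EXTRA or cp <= 0x1F or 0x7F <= cp <= 0x9F


def preprocess_web_address(text: str) -> str:
    # Step 1: remove leading and trailing space, CR, LF, and TAB.
    text = text.strip(" \r\n\t")
    # Step 2 is skipped: href-strip is not yet defined.
    # Step 3 (subset): percent-encode the additional characters.
    text = "".join(_pct(c) if _needs_escape(c) else c for c in text)
    # Step 4: a "%" not followed by two hex digits becomes "%25".
    text = re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", text)
    # Step 5: convert backslashes to forward slashes.
    return text.replace("\\", "/")
```
]]></artwork>
<postamble>The output is meant to be handed to an ordinary IRI parser; the sketch itself does no parsing.</postamble></figure>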
1866</section> <!-- webaddress -->
1867
1868<section title="Characters not allowed in IRIs" anchor="notAllowed">
1869
1870<t>This section provides a list of the groups of characters and code
1871points that are allowed by LEIRI or HREF but are not allowed in IRIs or are
1872allowed in IRIs only in the query part. For each group of characters,
1873advice on the usage of these characters is also given, concentrating
1874on the reasons for why they are excluded from IRI use.</t>
1875
1876<t>
1877
1878<list><t>Space (U+0020): Some formats and applications use space as a
1879delimiter, e.g. for items in a list. Appendix C of <xref
1880target="RFC3986"></xref> also mentions that white space may have to be
1881added when displaying or printing long URIs; the same applies to long
IRIs. This means that spaces can disappear, or can cause what is
intended as a single IRI or IRI reference to be treated as two or more
separate IRIs.</t>
1885
1886<t>Delimiters "&lt;" (U+003C), "&gt;" (U+003E), and '"' (U+0022):
1887Appendix C of <xref target="RFC3986"></xref> suggests the use of
1888double-quotes ("http://example.com/") and angle brackets
1889(&lt;http://example.com/&gt;) as delimiters for URIs in plain
1890text. These conventions are often used, and also apply to IRIs.  Using
1891these characters in strings intended to be IRIs would result in the
1892IRIs being cut off at the wrong place.</t>
1893
1894<t>Unwise characters "\" (U+005C), "^" (U+005E), "`"
1895(U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These
characters were originally excluded from URIs because the
respective codepoints are assigned to different graphic characters in
some 7-bit or 8-bit encodings. Despite the move to Unicode, some of
1899these characters are still occasionally displayed differently on some
1900systems, e.g. U+005C may appear as a Japanese Yen symbol on some
1901systems. Also, the fact that these characters are not used in URIs or
1902IRIs has encouraged their use outside URIs or IRIs in contexts that
1903may include URIs or IRIs. If a string with such a character were used
1904as an IRI in such a context, it would likely be interpreted
1905piecemeal.</t>
1906
<t>The controls (C0 controls, DEL, and C1 controls, U+0000-001F and
U+007F-009F): There is generally no way to transmit these characters reliably
as text outside of a charset encoding.  Even when in encoded form,
many software components silently filter out some of these characters,
or may stop processing altogether when encountering some of
them. These characters may affect text display in subtle, unnoticeable
1913ways or in drastic, global, and irreversible ways depending on the
1914hardware and software involved. The use of some of these characters
1915would allow malicious users to manipulate the display of an IRI and
1916its context in many situations.</t>
1917
1918<t>Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1919characters affect the display ordering of characters. If IRIs were
1920allowed to contain these characters and the resulting visual display
transcribed, they could not be converted back to electronic form
(logical order) unambiguously. These characters, if allowed in IRIs,
might allow malicious users to manipulate the display of an IRI and its
1924context.</t>
1925
1926<t>Specials (U+FFF0-FFFD): These code points provide functionality
1927beyond that useful in an IRI, for example byte order identification,
1928annotation, and replacements for unknown characters and objects. Their
1929use and interpretation in an IRI would serve no purpose and might lead
1930to confusing display variations.</t>
1931
1932<t>Private use code points (U+E000-F8FF, U+F0000-FFFFD,
1933U+100000-10FFFD): Display and interpretation of these code points is
1934by definition undefined without private agreement. Therefore, these
1935code points are not suited for use on the Internet. They are not
1936interoperable and may have unpredictable effects.</t>
1937
1938<t>Tags (U+E0000-E0FFF): These characters provide a way to language
1939tag in Unicode plain text. They are not appropriate for IRIs because
1940language information in identifiers cannot reliably be input,
1941transmitted (e.g. on a visual medium such as paper), or
1942recognized.</t>
1943
1944<t>Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1945U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1946U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1947U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1948U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1949non-characters. Applications may use some of them internally, but are
1950not prepared to interchange them.</t>
1951
1952</list></t>
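
<t>As a rough illustration of the categories listed above, the following
Python sketch classifies a single character against the code point ranges
given in this section. The names EXCLUDED_RANGES, classify, and
find_problems are hypothetical and chosen only for this example; the
sketch covers only the ranges quoted in this section and is not a
complete validator.</t>

<figure><artwork><![CDATA[
```python
# Ranges copied from the list in this section; all names are illustrative.
EXCLUDED_RANGES = {
    "control": [(0x0000, 0x001F), (0x007F, 0x009F)],
    "bidi formatting": [(0x200E, 0x200F), (0x202A, 0x202E)],
    "special": [(0xFFF0, 0xFFFD)],
    "private use": [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD),
                    (0x100000, 0x10FFFD)],
    "tag": [(0xE0000, 0xE0FFF)],
}
UNWISE = set("\\^`{|}")
DELIMITERS = set('<>"')


def is_listed_noncharacter(cp):
    """True for U+FDD0-FDEF and for U+nFFFE/U+nFFFF in planes 1 to 16,
    which are the non-character ranges enumerated in this section."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return cp >= 0x1FFFE and (cp & 0xFFFE) == 0xFFFE


def classify(ch):
    """Return the category name for ch, or None if it is in no listed range."""
    cp = ord(ch)
    if ch in DELIMITERS:
        return "delimiter"
    if ch in UNWISE:
        return "unwise"
    for name, ranges in EXCLUDED_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    if is_listed_noncharacter(cp):
        return "non-character"
    return None


def find_problems(text):
    """Return (index, category) pairs for every listed character in text."""
    problems = []
    for index, ch in enumerate(text):
        category = classify(ch)
        if category is not None:
            problems.append((index, category))
    return problems


print(find_problems("http://example.com/a\u202eb|c"))
```
]]></artwork></figure>

<t>A caller could use such a check to warn about, rather than silently
drop, the affected characters.</t>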
1953
1954<t>LEIRI preprocessing disallowed some code points and
1955code units:
1956
1957<list><t>Surrogate code units (D800-DFFF): These do not represent
1958Unicode codepoints.</t></list></t>
1959</section> <!-- notallowed -->
1960</section> <!-- lieirihref -->
1961 
1962<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
1963
1964<t>This informative section provides guidelines for supporting IRIs in
1965the same software components and operations that currently process
1966URIs: Software interfaces that handle URIs, software that allows users
1967to enter URIs, software that creates or generates URIs, software that
1968displays URIs, formats and protocols that transport URIs, and software
1969that interprets URIs. These may all require modification before
1970functioning properly with IRIs. The considerations in this section
1971also apply to URI references and IRI references.</t>
1972
1973<section title="URI/IRI Software Interfaces">
1974<t>Software interfaces that handle URIs, such as URI-handling APIs and
1975protocols transferring URIs, need interfaces and protocol elements
1976that are designed to carry IRIs.</t>
1977
1978<t>In case the current handling in an API or protocol is based on
1979US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
1980it is compatible with US-ASCII, is in accordance with the
1981recommendations of <xref target="RFC2277"/>, and makes converting to
1982URIs easy. In any case, the API or protocol definition must clearly
1983define the character encoding to be used.</t>
1984
1985<t>The transfer from URI-only to IRI-capable components requires no
1986mapping, although the conversion described in <xref
1987target="URItoIRI"/> above may be performed. It is preferable not to
1988perform this inverse conversion unless it is certain this can be done
1989correctly.</t>
1990</section>
1991
1992<section title="URI/IRI Entry">
1993
1994<t>Some components allow users to enter URIs into the system
1995by typing or dictation, for example. This software must be updated to allow
1996for IRI entry.</t>
1997
1998<t>A person viewing a visual representation of an IRI (as a sequence
1999of glyphs, in some order, in some visual display) or hearing an IRI
2000will use an entry method for characters in the user's language to
2001input the IRI. Depending on the script and the input method used, this
2002may be a more or less complicated process.</t>
2003
2004<t>The process of IRI entry must ensure, as much as possible, that the
2005restrictions defined in <xref target="abnf"/> are met. This may be
2006done by choosing appropriate input methods or variants/settings
2007thereof, by appropriately converting the characters being input, by
2008eliminating characters that cannot be converted, and/or by issuing a
2009warning or error message to the user.</t>
2010
2011<t>As an example of variant settings, input method editors for East
Asian languages usually allow the input of Latin letters and related
2013characters in full-width or half-width versions. For IRI input, the
2014input method editor should be set so that it produces half-width Latin
2015letters and punctuation and full-width Katakana.</t>
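
<t>Where the input method cannot be configured, the entry software can
convert width variants itself. The Python sketch below assumes that
applying NFKC to the whole input is acceptable; NFKC does more than width
folding, so this is only one possible approach. The function name
fold_width is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def fold_width(text):
    """Fold full-width Latin to half-width and half-width Katakana to
    full-width by applying NFKC (which also folds other compatibility
    variants)."""
    return unicodedata.normalize("NFKC", text)


fullwidth_latin = "\uff48\uff54\uff54\uff50\uff1a\uff0f\uff0f"  # full-width "http://"
halfwidth_ka = "\uff76"                                         # half-width KA

print(fold_width(fullwidth_latin))          # http://
print(fold_width(halfwidth_ka) == "\u30ab")  # True
```
]]></artwork></figure>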
2016
2017<t>An input field primarily or solely used for the input of URIs/IRIs
2018might allow the user to view an IRI as it is mapped to a URI.  Places
2019where the input of IRIs is frequent may provide the possibility for
2020viewing an IRI as mapped to a URI. This will help users when some of
2021the software they use does not yet accept IRIs.</t>
2022
2023<t>An IRI input component interfacing to components that handle URIs,
2024but not IRIs, must map the IRI to a URI before passing it to these
2025components.</t>
2026
2027<t>For the input of IRIs with right-to-left characters, please see
2028<xref target="bidiInput"></xref>.</t>
2029</section>
2030
2031<section title="URI/IRI Transfer between Applications">
2032
2033<t>Many applications (for example, mail user agents) try to detect
2034URIs appearing in plain text. For this, they use some heuristics based
2035on URI syntax. They then allow the user to click on such URIs and
2036retrieve the corresponding resource in an appropriate (usually
2037scheme-dependent) application.</t>
2038
2039<t>Such applications would need to be upgraded, in order to use the
2040IRI syntax as a base for heuristics. In particular, a non-ASCII
2041character should not be taken as the indication of the end of an IRI.
2042Such applications also would need to make sure that they correctly
2043convert the detected IRI from the character encoding of the document
2044or application where the IRI appears, to the character encoding used
2045by the system-wide IRI invocation mechanism, or to a URI (according to
2046<xref target="mapping"/>) if the system-wide invocation mechanism only
2047accepts URIs.</t>
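
<t>The difference between an ASCII-only heuristic and an IRI-aware one
can be sketched in Python as follows. Both patterns are simplified
assumptions made for this example (they only recognize http and https,
and they do not handle trailing punctuation); they are not the heuristics
of any particular mail user agent.</t>

<figure><artwork><![CDATA[
```python
import re

# ASCII-only heuristic: printable US-ASCII except the delimiters " < >.
ASCII_ONLY = re.compile(r'https?://[!#-;=?-~]+')

# IRI-aware heuristic: stop only at whitespace or at a delimiter,
# so a non-ASCII character does not end the match.
IRI_AWARE = re.compile(r'https?://[^\s<>"]+')

text = "See <http://example.org/ros\u00e9/list> for details."

print(ASCII_ONLY.search(text).group())  # http://example.org/ros
print(IRI_AWARE.search(text).group())   # http://example.org/ros\u00e9/list
```
]]></artwork></figure>

<t>The detected string is a sequence of characters; it still has to be
converted to the character encoding of the invocation mechanism, or
mapped to a URI, as described above.</t>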
2048
2049<t>The clipboard is another frequently used way to transfer URIs and
2050IRIs from one application to another. On most platforms, the clipboard
2051is able to store and transfer text in many languages and scripts.
2052Correctly used, the clipboard transfers characters, not octets, which
2053will do the right thing with IRIs.</t>
2054</section>
2055
2056<section title="URI/IRI Generation">
2057
2058<t>Systems that offer resources through the Internet, where those
2059resources have logical names, sometimes automatically generate URIs
2060for the resources they offer. For example, some HTTP servers can
2061generate a directory listing for a file directory and then respond to
2062the generated URIs with the files.</t>
2063
2064<t>Many legacy character encodings are in use in various file systems.
2065Many currently deployed systems do not transform the local character
2066representation of the underlying system before generating URIs.</t>
2067
2068<t>For maximum interoperability, systems that generate resource
2069identifiers should make the appropriate transformations. For example,
2070if a file system contains a file named
2071"r&amp;#xE9;sum&amp;#xE9;.html", a server should expose this as
2072"r%C3%A9sum%C3%A9.html" in a URI, which allows use of
2073"r&amp;#xE9;sum&amp;#xE9;.html" in an IRI, even if locally the file
2074name is kept in a character encoding other than UTF-8.
2075</t>
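
<t>A minimal Python sketch of this transformation follows. It assumes
that the server knows the character encoding of its file system (here
ISO-8859-1, purely as an example) and that the name is a single path
segment. The function name filename_to_uri_segment is hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote


def filename_to_uri_segment(raw_name, fs_encoding):
    """Decode a file name from the local encoding, then percent-encode
    its UTF-8 form for use as one URI path segment."""
    name = raw_name.decode(fs_encoding)
    return quote(name, safe="")  # quote() uses UTF-8 by default


latin1_name = b"r\xe9sum\xe9.html"  # file name stored in ISO-8859-1

# Recommended: transcode to UTF-8 before percent-encoding.
print(filename_to_uri_segment(latin1_name, "latin-1"))  # r%C3%A9sum%C3%A9.html

# Not recommended: percent-encoding the raw local bytes.
print(quote(latin1_name))                               # r%E9sum%E9.html
```
]]></artwork></figure>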
2076
2077<t>This recommendation particularly applies to HTTP servers. For FTP
2078servers, similar considerations apply; see <xref target="RFC2640"/>.</t>
2079</section>
2080
2081<section title="URI/IRI Selection" anchor="selection">
2082<t>In some cases, resource owners and publishers have control over the
IRIs used to identify their resources. This control is mostly
exercised by controlling the resource names, such as file names,
2085directly.</t>
2086
2087<t>In these cases, it is recommended to avoid choosing IRIs that are
2088easily confused. For example, for US-ASCII, the lower-case ell ("l") is
2089easily confused with the digit one ("1"), and the upper-case oh ("O") is
2090easily confused with the digit zero ("0"). Publishers should avoid
2091confusing users with "br0ken" or "1ame" identifiers.</t>
2092
2093<t>Outside the US-ASCII repertoire, there are many more opportunities for
2094confusion; a complete set of guidelines is too lengthy to include
2095here. As long as names are limited to characters from a single script,
2096native writers of a given script or language will know best when
2097ambiguities can appear, and how they can be avoided. What may look
2098ambiguous to a stranger may be completely obvious to the average
2099native user. On the other hand, in some cases, the UCS contains
2100variants for compatibility reasons; for example, for typographic purposes.
2101These should be avoided wherever possible. Although there may be exceptions,
2102newly created resource names should generally be in NFKC
2103<xref target="UTR15"></xref> (which means that they are also in NFC).</t>
2104
2105<t>As an example, the UCS contains the "fi" ligature at U+FB01
2106for compatibility reasons.
2107Wherever possible, IRIs should use the two letters "f" and "i" rather
2108than the "fi" ligature. An example where the latter may be used is
in the query part of an IRI for an explicit search for a word written
with the "fi" ligature.</t>
2111
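<t>A publishing tool could check new resource names against NFKC before
they are created. The following Python sketch shows one way to do so for
the ligature case; the function name is_nfkc is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def is_nfkc(name):
    """True if name is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", name) == name


ligature_name = "\ufb01le.html"  # starts with U+FB01, the "fi" ligature

print(is_nfkc(ligature_name))                         # False
print(unicodedata.normalize("NFKC", ligature_name))   # file.html
print(is_nfkc("file.html"))                           # True
```
]]></artwork></figure>
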
2112<t>In certain cases, there is a chance that characters from different
2113scripts look the same. The best known example is the similarity of the
2114Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid such
2115cases, IRIs should only be created where all the characters in a
2116single component are used together in a given language. This usually
2117means that all of these characters will be from the same script, but
2118there are languages that mix characters from different scripts (such
2119as Japanese).  This is similar to the heuristics used to distinguish
2120between letters and numbers in the examples above. Also, for Latin,
2121Greek, and Cyrillic, using lowercase letters results in fewer
2122ambiguities than using uppercase letters would.</t>
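
<t>A crude way to flag components that mix scripts is sketched below in
Python. The standard library does not expose the Unicode Script property,
so this example approximates it with the first word of each character
name; that approximation, and the names scripts_in and looks_mixed, are
assumptions of this sketch. As noted above, a mix is not always an error:
the Japanese example is flagged although it is legitimate.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def scripts_in(component):
    """Approximate the set of scripts used by the letters in component,
    using the first word of each Unicode character name."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(component):
    """True if letters from more than one (approximate) script occur."""
    return len(scripts_in(component)) > 1


print(looks_mixed("example"))           # False: Latin only
print(looks_mixed("ex\u0430mple"))      # True: U+0430 is Cyrillic "a"
print(looks_mixed("\u65e5\u672c\u306e"))  # True: Han plus Hiragana
```
]]></artwork></figure>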
2123</section>
2124
2125<section title="Display of URIs/IRIs" anchor="display">
2126<t>
2127In situations where the rendering software is not expected to display
2128non-ASCII parts of the IRI correctly using the available layout and font
2129resources, these parts should be percent-encoded before being displayed.</t>
2130
2131<t>For display of Bidi IRIs, please see <xref target="visual"/>.</t>
2132</section>
2133
2134<section title="Interpretation of URIs and IRIs">
2135<t>Software that interprets IRIs as the names of local resources should
2136accept IRIs in multiple forms and convert and match them with the
2137appropriate local resource names.</t>
2138
<t>First, these forms include both IRIs in the native
character encoding of the protocol and their URI counterparts.</t>

<t>Second, they may include URIs constructed based on character
encodings other than UTF-8. These URIs may be produced by user agents that do
2144not conform to this specification and that use legacy character encodings to
2145convert non-ASCII characters to URIs. Whether this is necessary, and what
2146character encodings to cover, depends on a number of factors, such as
2147the legacy character encodings used locally and the distribution of
2148various versions of user agents. For example, software for Japanese
2149may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.</t>
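
<t>One possible matching strategy, sketched in Python, is to try a strict
UTF-8 interpretation first and fall back to a configured list of legacy
encodings. The order of the fallbacks and the function name
decode_segment are assumptions of this example; different legacy
encodings can accept the same octets, so the first successful decoding is
not guaranteed to be the intended one.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote, unquote_to_bytes


def decode_segment(segment, fallbacks=("shift_jis", "euc_jp")):
    """Undo percent-encoding, then decode as strict UTF-8 or, failing
    that, as the first legacy encoding that accepts the octets.
    Returns (text, encoding) or (None, None)."""
    raw = unquote_to_bytes(segment)
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


name = "\u65e5\u672c"                           # "Japan" in Kanji
from_utf8 = quote(name)                         # conforming user agent
from_sjis = quote(name, encoding="shift_jis")   # legacy user agent

print(decode_segment(from_utf8)[1])  # utf-8
print(decode_segment(from_sjis)[1])  # shift_jis
```
]]></artwork></figure>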
2150
<t>Third, they may include additional mappings to be more user-friendly
2152and robust against transmission errors. These would be similar to how
2153some servers currently treat URIs as case insensitive or perform
2154additional matching to account for spelling errors. For characters
2155beyond the US-ASCII repertoire, this may, for example, include
2156ignoring the accents on received IRIs or resource names. Please note
2157that such mappings, including case mappings, are language
2158dependent.</t>
2159
2160<t>It can be difficult to identify a resource unambiguously if too
2161many mappings are taken into consideration. However, percent-encoded
2162and not percent-encoded parts of IRIs can always be clearly distinguished.
2163Also, the regularity of UTF-8 (see <xref target="Duerst97"/>) makes the
2164potential for collisions lower than it may seem at first.</t>
2165</section>
2166
2167<section title="Upgrading Strategy">
2168<t>Where this recommendation places further constraints on software
2169for which many instances are already deployed, it is important to
2170introduce upgrades carefully and to be aware of the various
2171interdependencies.</t>
2172
2173<t>If IRIs cannot be interpreted correctly, they should not be created,
2174generated, or transported. This suggests that upgrading URI interpreting
2175software to accept IRIs should have highest priority.</t>
2176
2177<t>On the other hand, a single IRI is interpreted only by a single or
2178very few interpreters that are known in advance, although it may be
2179entered and transported very widely.</t>
2180
2181<t>Therefore, IRIs benefit most from a broad upgrade of software to be
2182able to enter and transport IRIs. However, before an
2183individual IRI is published, care should be taken to upgrade the corresponding
2184interpreting software in order to cover the forms expected to be
2185received by various versions of entry and transport software.</t>
2186
2187<t>The upgrade of generating software to generate IRIs instead of using a
2188local character encoding should happen only after the service is upgraded
2189to accept IRIs. Similarly, IRIs should only be generated when the service
accepts IRIs and the intervening infrastructure and protocol are known
2191to transport them safely.</t>
2192
2193<t>Software converting from URIs to IRIs for display should be upgraded
2194only after upgraded entry software has been widely deployed to the
2195population that will see the displayed result.</t>
2196
2197
2198<t>Where there is a free choice of character encodings, it is often
2199possible to reduce the effort and dependencies for upgrading to IRIs
2200by using UTF-8 rather than another encoding. For example, when a new
2201file-based Web server is set up, using UTF-8 as the character encoding
2202for file names will make the transition to IRIs easier. Likewise, when
2203a new Web form is set up using UTF-8 as the character encoding of the
2204form page, the returned query URIs will use UTF-8 as the character
2205encoding (unless the user, for whatever reason, changes the character
2206encoding) and will therefore be compatible with IRIs.</t>
2207
2208
2209<t>These recommendations, when taken together, will allow for the
2210extension from URIs to IRIs in order to handle characters other than
2211US-ASCII while minimizing interoperability problems. For
2212considerations regarding the upgrade of URI scheme definitions, see
2213<xref target="UTF8use"/>.</t>
2214
2215</section>
2216</section> <!-- guidelines -->
2217
2218<section title="IANA Considerations" anchor="iana">
2219
<t>RFC Editor and IANA note: Please replace RFC XXXX with the
number of this document when it is issued as an RFC.</t>
2222
2223<t>IANA maintains a registry of "URI schemes". A "URI scheme" also
serves as an "IRI scheme".</t>
2225
2226<t>To clarify that the URI scheme registration process also applies to
2227IRIs, change the description of the "URI schemes" registry
2228header to say "[RFC4395] defines an IANA-maintained registry of URI
2229Schemes. These registries include the Permanent and Provisional URI
2230Schemes.  RFC XXXX updates this registry to designate that schemes may
also indicate their usability as IRI schemes."</t>
2232
2233<t> Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".
2234</t>
2235
2236</section> <!-- IANA -->
2237   
2238<section title="Security Considerations" anchor="security">
2239<t>The security considerations discussed in <xref target="RFC3986"/>
2240also apply to IRIs. In addition, the following issues require
2241particular care for IRIs.</t>
2242<t>Incorrect encoding or decoding can lead to security problems.
2243In particular, some UTF-8 decoders do not check against overlong
2244byte sequences. As an example, a "/" is encoded with the byte 0x2F
2245both in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2246interpret the sequence 0xC0 0xAF as a "/". A sequence such as "%C0%AF.."
2247may pass some security tests and then be interpreted
2248as "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2249and checking are not done in the right order, and/or if reserved
2250characters and unreserved characters are not clearly distinguished.</t>
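
<t>The ordering issue can be illustrated with the following Python
sketch, which decodes with a strict UTF-8 decoder first and checks for
".." segments only afterwards. The function name safe_decode_path is
hypothetical, and the check shown is deliberately minimal; it is not a
complete path sanitizer.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def safe_decode_path(pct_path):
    """Undo percent-encoding, decode strictly as UTF-8, and only then
    check for ".." segments. Returns the decoded path or None."""
    raw = unquote_to_bytes(pct_path)
    try:
        # A strict decoder rejects overlong forms such as 0xC0 0xAF.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    if ".." in text.split("/"):
        return None
    return text


print(safe_decode_path("%C0%AF.."))          # None: overlong sequence
print(safe_decode_path("/a/%2E%2E/b"))       # None: ".." after decoding
print(safe_decode_path("/r%C3%A9sum%C3%A9"))
```
]]></artwork></figure>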
2251
2252<t>There are various ways in which "spoofing" can occur with IRIs.
2253"Spoofing" means that somebody may add a resource name that looks the
2254same or similar to the user, but that points to a different resource.
2255The added resource may pretend to be the real resource by looking
2256very similar but may contain all kinds of changes that may be
2257difficult to spot and that can cause all kinds of problems.
2258Most spoofing possibilities for IRIs are extensions of those for URIs.</t>
2259
<t>Spoofing can occur for various reasons. First, a user's normalization
expectations, or the actual normalization applied when entering an IRI or
transcoding an IRI from a legacy character encoding, may not match the
normalization used on the server side. Conceptually, this is no different from the problems
2264surrounding the use of case-insensitive web servers. For example,
2265a popular web page with a mixed-case name ("http://big.example.com/PopularPage.html")
2266might be "spoofed" by someone who is able to create "http://big.example.com/popularpage.html".
2267However, the use of unnormalized character sequences, and of additional
2268mappings for user convenience, may increase the chance for spoofing.
2269Protocols and servers that allow the creation of resources with
2270names that are not normalized are particularly vulnerable to such
2271attacks. This is an inherent
2272security problem of the relevant protocol, server, or resource
2273and is not specific to IRIs, but it is mentioned here for completeness.</t>
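
<t>The following Python sketch shows how two names that display
identically can differ as code point sequences, and how comparing them
after normalization removes the difference. NFC is used here only as an
example of a normalization form, and the function name same_name is
hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata

composed = "r\u00e9sum\u00e9"      # each accented letter is one code point
decomposed = "re\u0301sume\u0301"  # "e" followed by U+0301 COMBINING ACUTE


def same_name(a, b):
    """Compare two resource names after normalizing both to NFC."""
    nfc = unicodedata.normalize
    return nfc("NFC", a) == nfc("NFC", b)


print(composed == decomposed)           # False
print(same_name(composed, decomposed))  # True
```
]]></artwork></figure>

<t>A server that stores both forms as distinct resources leaves room for
the kind of spoofing described above.</t>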
2274
2275<t>Spoofing can occur in various IRI components, such as the
2276domain name part or a path part. For considerations specific
2277to the domain name part, see <xref target="RFC3491"/>.
2278For the path part, administrators of sites that allow independent
2279users to create resources in the same sub area may have to be careful
2280to check for spoofing.</t>
2281
2282<t>Spoofing can occur because in the UCS many characters look very similar. Details are discussed in <xref target="selection"/>.
2283Again, this is very similar to spoofing possibilities on US-ASCII,
2284e.g., using "br0ken" or "1ame" URIs.</t>
2285
2286<t>Spoofing can occur when URIs with percent-encodings based on various
2287character encodings are accepted to deal with older user agents. In some
2288cases, particularly for Latin-based resource names, this is usually easy to
2289detect because UTF-8-encoded names, when interpreted and viewed as
2290legacy character encodings, produce mostly garbage.</t><t>When
2291concurrently used character encodings have a similar structure but there
2292are no characters that have exactly the same encoding, detection is more
2293difficult.</t>
2294
2295<t>Spoofing can occur with bidirectional IRIs, if the restrictions
2296in <xref target="bidi-structure"/> are not followed. The same visual
2297representation may be interpreted as different logical representations,
2298and vice versa. It is also very important that a correct Unicode bidirectional
2299implementation be used.</t><t>The use of Legacy Extended IRIs introduces additional security issues.</t>
2300</section><!-- security -->
2301
2302<section title="Acknowledgements">
<t>For contributions to this update, we would like to thank Ian Hickson, Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin, Henry S. Thompson, and the XML Core Working Group of the W3C.</t>
2304
2305<t>The discussion on the issue addressed here started a long time
2306ago. There was a thread in the HTML working
2307group in August 1995 (under the topic of "Globalizing URIs") and in the
2308www-international mailing list in July 1996 (under the topic of
2309"Internationalization and URLs"), and there were ad-hoc meetings at the Unicode
2310conferences in September 1995 and September 1997.</t>
2311
2312<t>For contributions to the previous version of this document, RFC 3987, many thanks go to
2313Francois Yergeau, Matitiahu Allouche,
2314Roy Fielding, Tim Berners-Lee, Mark Davis,
2315M.T. Carrasco Benitez, James Clark, Tim Bray, Chris Wendt, Yaron Goland,
2316Andrea Vine, Misha Wolf, Leslie Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman,
2317Russ Housley, Makoto MURATA, Steven Atkin,
2318Ryan Stansifer, Tex Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs,
2319Adam Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown,
2320Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio,
2321Chris Haynes, Walter Underwood, and many others.</t>
<t>A definition of HyperText Reference was initially produced by Ian Hickson,
and further edited by Dan Connolly and C. M. Sperberg-McQueen.</t>
2324<t>Thanks to the Internationalization Working
2325Group (I18N WG) of the World Wide Web Consortium (W3C),
2326and the members of the W3C
2327I18N Working Group and Interest Group for their contributions and their
2328work on <xref target="CharMod"/>. Thanks also go
2329to the members of many other W3C Working Groups for adopting IRIs, and to
2330the members of the Montreal IAB Workshop on Internationalization and
2331Localization for their review.</t>
2332</section>
2333
2334
2335<section title="Change Log">
2336
2337<t>Note to RFC Editor: Please completely remove this section before publication.</t>
2338
2339<section title='Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00'>
2340     <t>Changed draft name, date, last paragraph of abstract, and titles in change log, and added this section
2341     in moving from draft-duerst-iri-bis-07 (personal submission) to draft-ietf-iri-3987bis-00 (WG document).</t>
2342</section>
2343
2344<section title="Changes from -06 to -07 of draft-duerst-iri-bis" anchor="forkChanges"><t>
2345
2346Major restructuring of IRI processing model to make scheme-specific translation necessary to handle IDNA requirements and for consistency with web implementations. </t>
2347<t>Starting with IRI, you want one of:
2348<list style="hanging">
2349<t hangText="a"> IRI components (IRI parsed into UTF8 pieces)</t>
2350<t hangText="b"> URI components (URI parsed into ASCII pieces, encoded correctly) </t>
2351<t hangText="c"> whole URI  (for passing on to some other system that wants whole URIs) </t>
2352</list></t>
2353
2354<section title="OLD WAY">
2355<t><list style="numbers">
2356
2357 <t>Pct-encoding on the whole thing to a URI.
2358 (c1) If you want a (maybe broken) whole URI, you might
2359        stop here.</t>
2360
2361 <t>Parsing the URI into URI components.
2362   (b1) If you want (maybe broken) URI components, stop here.</t>
2363
2364 <t> Decode the components (undoing the pct-encoding).
2365   (a) if you want IRI components, stop here.</t>
2366
 <t> reencode:  Either using a different encoding for some components
2368   (for domain names, and query components in web pages, which
2369   depends on the component, scheme and context), and otherwise
2370   using pct-encoding.
2371   (b2) if you want (good) URI components, stop here.</t>
2372
2373 <t> reassemble the reencoded components.
2374   (c2) if you want a (*good*) whole URI stop here.</t>
2375</list>
2376
2377</t>
2378
2379</section>
2380
2381<section title="NEW WAY">
2382<t>
2383<list style="numbers">
2384
2385<t> Parse the IRI into IRI components using the generic syntax.
2386   (a) if you want IRI components, stop here.</t>
2387
<t> Encode each component, using pct-encoding, IDN encoding, or
2389         special query part encoding depending on the component
2390         scheme or context. (b) If you want URI components, stop here.</t>
<t> reassemble a whole URI from URI components.
2392   (c) if you want a whole URI stop here.</t>
2393</list></t>
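
<t>The three steps above can be sketched in Python as follows. This is an
illustration under several assumptions, not the normative mapping: it
uses the generic parser in urllib.parse, the "idna" codec of the standard
library (which implements the older IDNA of RFC 3490), UTF-8 for the
query part, and it ignores userinfo and IPv6 literals. The names
iri_to_uri, PATH_SAFE, and QUERY_SAFE are hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched by percent-encoding in this sketch.
PATH_SAFE = "/%:@!$&'()*+,;=~"
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri):
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component with the encoding that suits it.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")  # IDN encoding
    if parts.port is not None:
        netloc += ":%d" % parts.port
    path = quote(parts.path, safe=PATH_SAFE)           # pct-encoding
    query = quote(parts.query, safe=QUERY_SAFE)        # UTF-8 assumed
    fragment = quote(parts.fragment, safe=QUERY_SAFE)

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.example/r\u00e9sum\u00e9.html?q=\u00e9"))
```
]]></artwork></figure>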
2394</section>
2395</section>
2396
2397<section title='Changes from -00 to -01'><t><list style="symbols">
2398  <t>Removed 'mailto:' before mail addresses of authors.</t>
2399  <t>Added "&lt;to be done&gt;" as right side of 'href-strip' rule. Fixed '|' to '/' for
2400    alternatives.</t>
2401</list></t>
2402</section>
2403
2404<section title="Changes from -05 to -06 of draft-duerst-iri-bis-00"><t><list style="symbols">
2405<t>Add HyperText Reference, change abstract, acks and references for it</t>
2406<t>Add Masinter back as another editor.</t>
2407<t>Masinter integrates HRef material from HTML5 spec.</t>
2408<t>Rewrite introduction sections to modernize.</t>
2409</list></t>
2410</section>
2411
2412<section title="Changes from -04 to -05 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated references.</t><t>Changed IPR text to pre5378Trust200902.</t></list></t>
2413</section>
2414
2415<section title="Changes from -03 to -04 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added explicit abbreviation for LEIRIs.</t><t>Mentioned LEIRI references.</t><t>Completed text in LEIRI section about tag characters and about specials.</t></list></t>
2416</section>
2417
<section title="Changes from -02 to -03 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated some references.</t><t>Updated Michel Suignard's coordinates.</t></list></t>
2419</section>
2420
2421<section title="Changes from -01 to -02 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added tag range to iprivate (issue private-include-tags-115).</t><t>Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.</t></list></t>
2422</section>
<section title="Changes from -00 to -01 of draft-duerst-iri-bis"><t><list style="symbols"><t>Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" based on input from the W3C XML Core WG. Moved the relevant subsections to the back and promoted them to a section.</t><t>Added some text re. Legacy Extended IRIs to the security section.</t><t>Added an IANA Considerations Section.</t><t>Added this Change Log Section.</t><t>Added a section about "IRIs with Spaces/Controls" (converting from a Note in RFC 3987).</t></list></t>
2424</section>
2425<section title="Changes from RFC 3987 to -00 of draft-duerst-iri-bis"><t><list><t>Fixed errata (see http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).</t></list></t>
2426</section>
2427</section>
2428</middle>
2429
2430<back>
2431<references title="Normative References">
2432
2433<reference anchor="ASCII">
2434<front>
2435<title>Coded Character Set -- 7-bit American Standard Code for Information
2436Interchange</title>
2437<author>
2438<organization>American National Standards Institute</organization>
2439</author>
2440<date year="1986"/>
2441</front>
2442<seriesInfo name="ANSI" value="X3.4"/>
2443</reference>
2444
2445<reference anchor="ISO10646">
2446<front>
2447<title>ISO/IEC 10646:2003: Information Technology -
2448Universal Multiple-Octet Coded Character Set (UCS)</title>
2449<author>
2450<organization>International Organization for Standardization</organization>
2451</author>
2452<date month="December" year="2003"/>
2453</front>
2454<seriesInfo name="ISO" value="Standard 10646"/>
2455</reference>
2456
2457&rfc2119;
2458&rfc3490;
2459&rfc3491;
2460&rfc3629;
2461&rfc3986;
2462
2463<reference anchor="STD68">
2464<front>
2465<title abbrev="ABNF">Augmented BNF for Syntax Specifications: ABNF</title>
2466<author initials="D." surname="Crocker" fullname="Dave Crocker"><organization/></author>
2467<author initials="P." surname="Overell" fullname="Paul Overell"><organization/></author>
2468<date month="January" year="2008"/></front>
2469<seriesInfo name="STD" value="68"/><seriesInfo name="RFC" value="5234"/>
2470</reference>
2471 
2472&rfc5890;
2473&rfc5891;
2474
2475<reference anchor="UNIV4">
2476<front>
2477<title>The Unicode Standard, Version 5.1.0, defined by: The Unicode Standard,
2478Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0),
as amended by Unicode 5.1.0 (http://www.unicode.org/versions/Unicode5.1.0/)</title>
2480<author><organization>The Unicode Consortium</organization></author>
2481<date year="2008" month="April"/>
2482</front>
2483</reference>
2484
2485<reference anchor="UNI9" target="http://www.unicode.org/reports/tr9/tr9-13.html">
2486<front>
2487<title>The Bidirectional Algorithm</title>
2488<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2489<date year="2004" month="March"/>
2490</front>
2491<seriesInfo name="Unicode Standard Annex" value="#9"/>
2492</reference>
2493
2494<reference anchor="UTR15" target="http://www.unicode.org/unicode/reports/tr15/tr15-23.html">
2495<front>
2496<title>Unicode Normalization Forms</title>
2497<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2498<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2499<date year="2008" month="March"/>
2500</front>
2501<seriesInfo name="Unicode Standard Annex" value="#15"/>
2502</reference>
2503
2504</references>
2505
2506<references title="Informative References">
2507
2508<reference anchor="BidiEx" target="http://www.w3.org/International/iri-edit/BidiExamples">
2509<front>
2510<title>Examples of bidirectional IRIs</title>
2511<author><organization/></author>
2512<date year="" month=""/>
2513</front>
2514</reference>
2515
2516<reference anchor="CharMod" target="http://www.w3.org/TR/charmod-resid">
2517<front>
2518<title>Character Model for the World Wide Web: Resource Identifiers</title>
2519<author initials="M." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2520<author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2521<author initials="R." surname="Ishida" fullname="Richard Ishida"><organization/></author>
2522<author initials="M." surname="Wolf" fullname="Misha Wolf"><organization/></author>
2523<author initials="T." surname="Texin" fullname="Tex Texin"><organization/></author>
2524<date year="2004" month="November" day="25"/>
2525</front>
2526<seriesInfo name="World Wide Web Consortium" value="Candidate Recommendation"/>
2527</reference>
2528
2529<reference anchor="Duerst97" target="http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf">
2530<front>
2531<title>The Properties and Promises of UTF-8</title>
2532<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2533<date year="1997" month="September"/>
2534</front>
2535<seriesInfo name="Proc. 11th International Unicode Conference, San Jose" value=""/>
2536</reference>
2537
2538<reference anchor="Gettys" target="http://www.w3.org/DesignIssues/ModelConsequences">
2539<front>
2540<title>URI Model Consequences</title>
2541<author initials="J." surname="Gettys" fullname="Jim Gettys"><organization/></author>
2542<date month="" year=""/>
2543</front>
2544</reference>
2545
2546<reference anchor="HTML4" target="http://www.w3.org/TR/html401/appendix/notes.html#h-B.2">
2547<front>
2548<title>HTML 4.01 Specification</title>
2549<author initials="D." surname="Raggett" fullname="Dave Raggett"><organization/></author>
2550<author initials="A." surname="Le Hors" fullname="Arnaud Le Hors"><organization/></author>
2551<author initials="I." surname="Jacobs" fullname="Ian Jacobs"><organization/></author>
2552<date year="1999" month="December" day="24"/>
2553</front>
2554<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2555</reference>
2556
2557<reference anchor="LEIRI" target="http://www.w3.org/TR/leiri/">
2558<front>
2559<title>Legacy extended IRIs for XML resource identification</title>
2560<author initials="H." surname="Thompson" fullname="Henry Thompson"><organization/></author>
2561<author initials="R." surname="Tobin"    fullname="Richard Tobin"><organization/></author>
2562<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2563  <date year="2008" month="November" day="3"/>
2564
2565</front>
2566<seriesInfo name="World Wide Web Consortium" value="Note"/>
2567</reference>
2568
2569
2570&rfc2045;
2571&rfc2130;
2572&rfc2141;
2573&rfc2192;
2574&rfc2277;
2575&rfc2368;
2576&rfc2384;
2577&rfc2396;
2578&rfc2397;
2579&rfc2616;
2580&rfc1738;
2581&rfc2640;
2582&rfc4395;
2583
2584<reference anchor="UNIXML" target="http://www.w3.org/TR/unicode-xml/">
2585<front>
2586<title>Unicode in XML and other Markup Languages</title>
2587<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2588<author initials="A." surname="Freytag" fullname="Asmus Freytag"><organization/></author>
2589<date year="2003" month="June" day="13"/>
2590</front>
2591<seriesInfo name="Unicode Technical Report" value="#20"/>
2592<seriesInfo name="World Wide Web Consortium" value="Note"/>
2593</reference>
2594 
2595<reference anchor="UTR36" target="http://unicode.org/reports/tr36/">
2596<front>
2597<title>Unicode Security Considerations</title>
2598<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2599<author initials="M." surname="Suignard" fullname="Michel Suignard"><organization/></author>
2600<date year="2010" month="August" day="4"/>
2601</front>
2602<seriesInfo name="Unicode Technical Report" value="#36"/>
2603</reference>
2604
2605<reference anchor="XLink" target="http://www.w3.org/TR/xlink/#link-locators">
2606<front>
2607<title>XML Linking Language (XLink) Version 1.0</title>
2608<author initials="S." surname="DeRose" fullname="Steve DeRose"><organization/></author>
2609<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2610<author initials="D." surname="Orchard" fullname="David Orchard"><organization/></author>
2611<date year="2001" month="June" day="27"/>
2612</front>
2613<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2614</reference>
2615
2616<reference anchor="XML1" target="http://www.w3.org/TR/REC-xml">
2617  <front>
2618    <title>Extensible Markup Language (XML) 1.0 (Fourth Edition)</title>
2619    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2620    <author initials="J." surname="Paoli" fullname="Jean Paoli"><organization/></author>
2621    <author initials="C.M." surname="Sperberg-McQueen" fullname="C. M. Sperberg-McQueen">
2622      <organization/></author>
2623    <author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2624    <author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2625    <date day="16" month="August" year="2006"/>
2626  </front>
2627  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2628</reference>
2629
2630<reference anchor="XMLNamespace" target="http://www.w3.org/TR/REC-xml-names">
2631  <front>
2632    <title>Namespaces in XML (Second Edition)</title>
2633    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2634    <author initials="D." surname="Hollander" fullname="Dave Hollander"><organization/></author>
2635    <author initials="A." surname="Layman" fullname="Andrew Layman"><organization/></author>
2636    <author initials="R." surname="Tobin" fullname="Richard Tobin"><organization></organization></author><date day="16" month="August" year="2006"/>
2637  </front>
2638  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2639</reference>
2640
2641<reference anchor="XMLSchema" target="http://www.w3.org/TR/xmlschema-2/#anyURI">
2642<front>
2643<title>XML Schema Part 2: Datatypes</title>
2644<author initials="P." surname="Biron" fullname="Paul Biron"><organization/></author>
2645<author initials="A." surname="Malhotra" fullname="Ashok Malhotra"><organization/></author>
2646<date year="2001" month="May" day="2"/>
2647</front>
2648<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2649</reference>
2650
2651<reference anchor="XPointer" target="http://www.w3.org/TR/xptr-framework/#escaping">
2652<front>
2653<title>XPointer Framework</title>
2654<author initials="P." surname="Grosso" fullname="Paul Grosso"><organization/></author>
2655<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2656<author initials="J." surname="Marsh" fullname="Jonathan Marsh"><organization/></author>
2657<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2658<date year="2003" month="March" day="25"/>
2659</front>
2660<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2661</reference>
2662
2663<reference anchor="HTML5" target="http://www.w3.org/TR/2009/WD-html5-20090423/">
2664<front>
2665<title>A vocabulary and associated APIs for HTML and XHTML</title>
2666<author initials="I." surname="Hickson" fullname="Ian Hickson"><organization>Google, Inc.</organization></author>
2667<author initials="D." surname="Hyatt" fullname="David Hyatt"><organization>Apple, Inc.</organization></author>
2668<date year="2009"  month="April" day="23"/>
2669</front>
2670<seriesInfo name="World Wide Web Consortium" value="Working Draft"/>
2671</reference>
2672
2673</references>
2674
2675<section title="Design Alternatives">
2676<t>This section briefly summarizes some design alternatives
2677considered earlier and the reasons why they were not chosen.</t>
2678<section title="New Scheme(s)">
2679<t>Introducing new schemes (for example, httpi:, ftpi:, ...) or a
2680new metascheme (e.g., i:, leading to URI/IRI prefixes such as
2681i:http:, i:ftp:, ...) was proposed to make IRI-to-URI conversion
2682scheme-dependent or to distinguish between percent-encodings
2683resulting from IRI-to-URI conversion and percent-encodings from
2684legacy character encodings.</t>
2685
2686<t>New schemes are not needed to distinguish URIs from true IRIs (i.e.,
2687  IRIs that contain non-ASCII characters). The benefit of being able
2688  to detect the origin of percent-encodings is marginal, as UTF-8
2689  can be detected with very high reliability. Deploying new schemes is
2690  extremely hard, so not requiring new schemes for IRIs makes
2691  deployment of IRIs vastly easier. Making conversion scheme-dependent
2692  is highly inadvisable, and separate schemes for IRIs would encourage it.
2693  Using a uniform convention for conversion from IRIs to URIs makes
2694  IRI implementation orthogonal to the introduction of actual new
2695  schemes.</t>
2696</section>
2697<section title="Character Encodings Other Than UTF-8">
2698<t>At an early stage, UTF-7 was considered as an alternative to
2699UTF-8 when IRIs are converted to URIs. UTF-7 would not have needed
2700percent-encoding and in most cases would have been shorter than
2701percent-encoded UTF-8.</t>
2702<t>Using UTF-8 avoids a double layering and overloading of the use of
2703   the "+" character. UTF-8 is fully compatible with US-ASCII and has
2704   therefore been recommended by the IETF, and is being used widely.</t>
2705 
2706  <t>UTF-7 has never been used much and is now clearly being
2707   discouraged. Requiring implementations to convert from UTF-8
2708   to UTF-7 and back would be an additional implementation burden.</t>
2709</section> <!-- notutf8 -->
2710<section title="New Encoding Convention">
2711<t>Instead of using the existing percent-encoding convention
2712of URIs, which is based on octets, the idea was to create a new
2713encoding convention; for example, to use "%u" to introduce
2714UCS code points.</t>
2715<t>Using the existing octet-based percent-encoding mechanism
2716requires neither an upgrade of the URI syntax nor
2717corresponding server upgrades.</t>
2718</section> <!-- new encoding -->
2719<section title="Indicating Character Encodings in the URI/IRI">
2720<t>Some proposals suggested indicating the character encodings used
2721in a URI or IRI with some new syntactic convention in the URI itself,
2722similar to the "charset" parameter for e-mails and Web pages.
2723As an example, the label in square brackets in
2724"http://www.example.org/ros[iso-8859-1]&amp;#xE9;" indicated that
2725the following "&amp;#xE9;" had to be interpreted as iso-8859-1.</t>
2726<t>If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed.
2727Exclusive use of UTF-8 avoids potentially multiple labels that would have to be
2728copied correctly in all cases, even on the side of a bus or on a napkin; such labels
2729would cause usability problems (and be prohibitively annoying).
2730Exclusively using UTF-8 also reduces transcoding errors and confusion.</t>
2731</section> <!-- indicating -->
2732</section>
2733</back>
2734</rfc>