1<?xml version="1.0"?>
2<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
3<!ENTITY rfc1738 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1738.xml">
4<!ENTITY rfc2045 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2045.xml">
5<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
6<!ENTITY rfc2130 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2130.xml">
7<!ENTITY rfc2141 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2141.xml">
8<!ENTITY rfc2192 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2192.xml">
9<!ENTITY rfc2277 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2277.xml">
10<!ENTITY rfc2368 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2368.xml">
11<!ENTITY rfc2384 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2384.xml">
12<!ENTITY rfc2396 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2396.xml">
13<!ENTITY rfc2397 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2397.xml">
14<!ENTITY rfc2616 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2616.xml">
15<!ENTITY rfc2640 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2640.xml">
16<!ENTITY rfc3490 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3490.xml">
17<!ENTITY rfc3491 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3491.xml">
18<!ENTITY rfc3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml">
19<!ENTITY rfc3986 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml">
20<!ENTITY rfc5890 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5890.xml">
21<!ENTITY rfc5891 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5891.xml">
22]>
23<?rfc strict='yes'?>
24
25<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
26<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
27<?rfc symrefs='yes'?>
28<?rfc sortrefs='yes'?>
29<?rfc iprnotified="no" ?>
30<?rfc toc='yes'?>
31<?rfc compact='yes'?>
32<?rfc subcompact='no'?>
33<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-03" category="std" xml:lang="en" obsoletes="3987">
34<front>
35<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>
36
37  <author initials="M.J." surname="Duerst" fullname='Martin Duerst'>
38    <!-- (Note: Please write "Duerst" with u-umlaut wherever
39      possible, for example as "D&#252;rst" in XML and HTML) -->
40  <organization abbrev="Aoyama Gakuin University">Aoyama Gakuin University</organization>
41  <address>
42  <postal>
43  <street>5-10-1 Fuchinobe</street>
44  <city>Sagamihara</city>
45  <region>Kanagawa</region>
46  <code>229-8558</code>
47  <country>Japan</country>
48  </postal>
49  <phone>+81 42 759 6329</phone>
50  <facsimile>+81 42 759 6495</facsimile>
51  <email>duerst@it.aoyama.ac.jp</email>
52  <uri>http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/<!-- (Note: This is the percent-encoded form of an IRI)--></uri>
53  </address>
54</author>
55
56<author initials="M.L." surname="Suignard" fullname="Michel Suignard">
57   <organization>Unicode Consortium</organization>
58   <address>
59   <postal>
60   <street></street>
61   <street>P.O. Box 391476</street>
62   <city>Mountain View</city>
63   <region>CA</region>
64   <code>94039-1476</code>
65   <country>U.S.A.</country>
66   </postal>
67   <phone>+1-650-693-3921</phone>
68   <email>michel@unicode.org</email>
69   <uri>http://www.suignard.com</uri>
70   </address>
71</author>
72<author initials="L." surname="Masinter" fullname="Larry Masinter">
73   <organization>Adobe</organization>
74   <address>
75   <postal>
76   <street>345 Park Ave</street>
77   <city>San Jose</city>
78   <region>CA</region>
79   <code>95110</code>
80   <country>U.S.A.</country>
81   </postal>
82   <phone>+1-408-536-3024</phone>
83   <email>masinter@adobe.com</email>
84   <uri>http://larry.masinter.net</uri>
85   </address>
86</author>
87
88<date year="2010" month="October" day="25"/>
89<area>Applications</area>
90<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
91<keyword>IRI</keyword>
92<keyword>Internationalized Resource Identifier</keyword>
93<keyword>UTF-8</keyword>
94<keyword>URI</keyword>
95<keyword>URL</keyword>
96<keyword>IDN</keyword>
97<keyword>LEIRI</keyword>
98
99<abstract>
100<t>This document defines the Internationalized Resource Identifier
101(IRI) protocol element, as an extension of the Uniform Resource
102Identifier (URI).  An IRI is a sequence of characters from the
103Universal Character Set (Unicode/ISO 10646). Grammar and processing
104rules are given for IRIs and related syntactic forms.</t>
105
106<t>In addition, this document provides named additional rule sets
107for processing otherwise invalid IRIs, in a way that supports
108other specifications that wish to mandate common behavior for
109'error' handling. In particular, rules used in some XML languages
110(LEIRI) and web applications are given.</t>
111
112<t>Defining IRI as a new protocol element (rather than updating or
113extending the definition of URI) allows independent orderly
114transitions: other protocols and languages that use URIs must
115explicitly choose to allow IRIs.</t>
116
117<t>Guidelines are provided for the use and deployment of IRIs and
118related protocol elements when revising protocols, formats, and
119software components that currently deal only with URIs.</t>
120
121</abstract>
122  <note title='RFC Editor: Please remove the next paragraph before publication.'>
123    <t>This document is intended to update RFC 3987 and move towards IETF
124    Draft Standard.  For discussion and comments on this
125    draft, please join the IETF IRI WG by subscribing to the mailing
126    list public-iri@w3.org. For a list of open issues, please see
127    the issue tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1.</t>
128</note>
129</front>
130<middle>
131
132<section title="Introduction">
133
134<section title="Overview and Motivation" anchor="overview">
135
136<t>A Uniform Resource Identifier (URI) is defined in <xref
137target="RFC3986"/> as a sequence of characters chosen from a limited
138subset of the repertoire of US-ASCII <xref target="ASCII"/>
139characters.</t>
140
141<t>The characters in URIs are frequently used for representing words
142of natural languages.  This usage has many advantages: Such URIs are
143easier to memorize, easier to interpret, easier to transcribe, easier
144to create, and easier to guess. For most languages other than English,
145however, the natural script uses characters other than A - Z. For many
146people, handling Latin characters is as difficult as handling the
147characters of other scripts is for those who use only the Latin
148alphabet. Many languages with non-Latin scripts are transcribed with
149Latin letters. These transcriptions are now often used in URIs, but
150they introduce additional difficulties.</t>
151
152<t>The infrastructure for the appropriate handling of characters from
153additional scripts is now widely deployed in operating system and
154application software. Software that can handle a wide variety of
155scripts and languages at the same time is increasingly common. Also,
156an increasing number of protocols and formats can carry a wide range of
157characters.</t>
158
159<t>URIs are used both as a protocol element (for transmission and
160processing by software) and also a presentation element (for display
161and handling by people who read, interpret, coin, or guess them). The
162transition between these roles is more difficult and complex when
163dealing with a larger set of characters than that allowed for URIs in
164<xref target="RFC3986"/>. </t>
165
166<t>This document defines the protocol element called Internationalized
167Resource Identifier (IRI), which allows applications of URIs to be
168extended to use resource identifiers that have a much wider repertoire
169of characters. It also provides corresponding "internationalized"
170versions of other constructs from <xref target="RFC3986"/>, such as
171URI references. The syntax of IRIs is defined in <xref
172target="syntax"/>.
173</t>
174
175<t>Using characters outside of A - Z in IRIs adds a number of
176difficulties. <xref target="Bidi"/> discusses the special case of
177bidirectional IRIs using characters from scripts written
178right-to-left.  <xref target="equivalence"/> discusses various forms
179of equivalence between IRIs. <xref target="IRIuse"/> discusses the use
180of IRIs in different situations.  <xref target="guidelines"/> gives
181additional informative guidelines.  <xref target="security"/>
182discusses IRI-specific security considerations.</t>
183</section> <!-- overview -->
184
185<section title="Applicability" anchor="Applicability">
186
187<t>IRIs are designed to allow protocols and software that deal with
188URIs to be updated to handle IRIs. A "URI scheme" (as defined by <xref
189target="RFC3986"/> and registered through the IANA process defined in
190<xref target="RFC4395bis"/>) also serves as an "IRI scheme". Processing of
191IRIs is accomplished by extending the URI syntax while retaining (and
192not expanding) the set of "reserved" characters, such that the syntax
193for any URI scheme may be uniformly extended to allow non-ASCII
194characters. In addition, following parsing of an IRI, it is possible
195to construct a corresponding URI by first encoding characters outside
196of the allowed URI range and then reassembling the components.
197</t>
198
199<t>Practical use of IRI forms in place of URI forms depends on the
200following conditions being met:</t>
201
202<t><list style="hanging">
203   
204<t hangText="a.">A protocol or format element MUST be explicitly designated to be
205  able to carry IRIs. The intent is to avoid introducing IRIs into
206  contexts that are not defined to accept them.  For example, XML
207  schema <xref target="XMLSchema"/> has an explicit type "anyURI" that
208  includes IRIs and IRI references. Therefore, IRIs and IRI references
209  can be in attributes and elements of type "anyURI".  On the other
210  hand, in the <xref target="RFC2616"/> definition of HTTP/1.1, the
211  Request URI is defined as a URI, which means that direct use of IRIs
212  is not allowed in HTTP requests.</t>
213
214<t hangText="b.">The protocol or format carrying the IRIs MUST have a
215  mechanism to represent the wide range of characters used in IRIs,
216  either natively or by some protocol- or format-specific escaping
217  mechanism (for example, numeric character references in <xref
218  target="XML1"/>).</t>
219
220<t hangText="c.">The URI scheme definition, if it explicitly allows a
221  percent sign ("%") in any syntactic component, SHOULD define the
222  interpretation of sequences of percent-encoded octets (using "%XX"
223  hex octets) as octets from sequences of UTF-8 encoded strings; this
224  is recommended in the guidelines for registering new schemes, <xref
225  target="RFC4395bis"/>.  For example, this is the practice for IMAP URLs
226  <xref target="RFC2192"/>, POP URLs <xref target="RFC2384"/> and the
227  URN syntax <xref target="RFC2141"/>. Note that use of
228  percent-encoding may also be restricted in some situations; for
229  example, URI schemes that disallow percent-encoding might still be
230  used with a fragment identifier which is percent-encoded (e.g.,
231  <xref target="XPointer"/>). See <xref target="UTF8use"/> for further
232  discussion.</t>
233</list></t>
234
235</section> <!-- applicability -->
236
237<section title="Definitions" anchor="sec-Definitions">
238 
239<t>The following definitions are used in this document; they follow the
240terms in <xref target="RFC2130"/>, <xref target="RFC2277"/>, and
241<xref target="ISO10646"/>.</t>
242<t><list style="hanging">
243   
244<t hangText="character:">A member of a set of elements used for the
245    organization, control, or representation of data. For example,
246    "LATIN CAPITAL LETTER A" names a character.</t>
247   
248<t hangText="octet:">An ordered sequence of eight bits considered as a
249    unit.</t>
250   
251<t hangText="character repertoire:">A set of characters (set in the
252    mathematical sense).</t>
253   
254<t hangText="sequence of characters:">A sequence of characters (one
255    after another).</t>
256   
257<t hangText="sequence of octets:">A sequence of octets (one after
258    another).</t>
259   
260<t hangText="character encoding:">A method of representing a sequence
261    of characters as a sequence of octets (maybe with variants). Also,
262    a method of (unambiguously) converting a sequence of octets into a
263    sequence of characters.</t>
264   
265<t hangText="charset:">The name of a parameter or attribute used to
266    identify a character encoding.</t>
267   
268<t hangText="UCS:">Universal Character Set. The coded character set
269    defined by ISO/IEC 10646 <xref target="ISO10646"/> and the Unicode
270    Standard <xref target="UNIV4"/>.</t>
271   
272<t hangText="IRI reference:">Denotes the common usage of an
273    Internationalized Resource Identifier. An IRI reference may be
274    absolute or relative.  However, the "IRI" that results from such a
275    reference only includes absolute IRIs; any relative IRI references
276    are resolved to their absolute form.  Note that in <xref
277    target="RFC2396"/> URIs did not include fragment identifiers, but
278    in <xref target="RFC3986"/> fragment identifiers are part of
279    URIs.</t>
280   
281<t hangText="URL:">The term "URL" was originally used <xref
282   target="RFC1738"/> for roughly what is now called a "URI".  Books,
283   software and documentation often refer to URIs and IRIs using the
284   "URL" term. Some usages restrict "URL" to those URIs which are not
285   URNs. Because of the ambiguity of the term, using the term "URL" is
286   NOT RECOMMENDED in formal documents.</t>
287
288<t hangText="LEIRI (Legacy Extended IRI) processing:">  This term was used in
289   various XML specifications to refer
290   to strings that, although not valid IRIs, were acceptable input to
291   the processing rules in <xref target="LEIRIspec" />.</t>
292
293<t hangText="(Web Address, Hypertext Reference, HREF):"> These terms have been
294   added in this document for convenience, to allow other
295   specifications to refer to those strings that, although not valid
296   IRIs, are acceptable input to the processing rules in <xref
297   target="webaddress"/>. This usage corresponds to the parsing rules
298   of some popular web browsing applications.
299   ISSUE: Need to find a good name/abbreviation for these.</t>
300   
301<t hangText="running text:">Human text (paragraphs, sentences,
302   phrases) with syntax according to orthographic conventions of a
303   natural language, as opposed to syntax defined for ease of
304   processing by machines (e.g., markup, programming languages).</t>
305   
306<t hangText="protocol element:">Any portion of a message that affects
307    processing of that message by the protocol in question.</t>
308   
309<t hangText="presentation element:">A presentation form corresponding
310    to a protocol element; for example, using a wider range of
311    characters.</t>
312   
313<t hangText="create (a URI or IRI):">With respect to URIs and IRIs,
314     the term is used for the initial creation. This may be the
315     initial creation of a resource with a certain identifier, or the
316     initial exposition of a resource under a particular
317     identifier.</t>
318   
319<t hangText="generate (a URI or IRI):">With respect to URIs and IRIs,
320     the term is used when the identifier is generated by derivation
321     from other information.</t>
322
323<t hangText="parsed URI component:">When a URI processor parses a URI
324   (following the generic syntax or a scheme-specific syntax), the result
325   is a set of parsed URI components, each of which has a type
326   (corresponding to the syntactic definition) and a sequence of URI
327   characters.  </t>
328
329<t hangText="parsed IRI component:">When an IRI processor parses
330   an IRI directly, following the general syntax or a scheme-specific
331   syntax, the result is a set of parsed IRI components, each of
332   which has a type (corresponding to the syntactic definition)
333   and a sequence of IRI characters. (This definition is analogous
334   to "parsed URI component".)</t>
335
336<t hangText="IRI scheme:">A URI scheme may also be known as
337   an "IRI scheme" if the scheme's syntax has been extended to
338   allow non-US-ASCII characters according to the rules in this
339   document.</t>
340
341</list></t>
342</section> <!-- definitions -->
343<section title="Notation" anchor="sec-Notation">
344     
345<t>RFCs and Internet Drafts currently do not allow any characters
346outside the US-ASCII repertoire. Therefore, this document uses various
347special notations to denote such characters in examples.</t>
348     
349<t>In text, characters outside US-ASCII are sometimes referenced by
350using a prefix of 'U+', followed by four to six hexadecimal
351digits.</t>
352
353<t>To represent characters outside US-ASCII in examples, this document
354uses two notations: 'XML Notation' and 'Bidi Notation'.</t>
355
356<t>XML Notation uses a leading '&amp;#x', a trailing ';', and the
357hexadecimal number of the character in the UCS in between. For
358example, &amp;#x44F; stands for CYRILLIC SMALL LETTER YA. In this
359notation, an actual '&amp;' is denoted by '&amp;amp;'.</t>
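<t>As an illustration only (not part of this specification), the XML
Notation described above can be decoded mechanically.  The following
Python sketch uses a hypothetical helper name, decode_xml_notation, and
handles only the two forms used in the examples of this document: a
hexadecimal character number and the escaped ampersand.</t>
<figure><artwork><![CDATA[
```python
import re

# '&#x' + hexadecimal digits + ';', or the escaped ampersand '&amp;'.
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def decode_xml_notation(text):
    """Turn this document's XML Notation into actual UCS characters.

    Hypothetical helper for illustration; it is not a general XML parser.
    """
    def _replace(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            # The '&amp;' alternative matched: an actual ampersand.
            return "&"
        return chr(int(hex_digits, 16))

    # A single pass, so a decoded '&' is never re-interpreted.
    return _XML_NOTATION.sub(_replace, text)
```
]]></artwork></figure>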
360
361<t>Bidi Notation is used for bidirectional examples: Lower case
362letters stand for Latin letters or other letters that are written left
363to right, whereas upper case letters represent Arabic or Hebrew
364letters that are written right to left.</t>
365
366<t>To denote actual octets in examples (as opposed to percent-encoded
367octets), the two hex digits denoting the octet are enclosed in "&lt;"
368and "&gt;".  For example, the octet often denoted as 0xc9 is denoted
369here as &lt;c9&gt;.</t>
370
371<t> In this document, the key words "MUST", "MUST NOT", "REQUIRED",
372"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
373and "OPTIONAL" are to be interpreted as described in <xref
374target="RFC2119"/>.</t>
375
376</section> <!-- notation -->
377</section> <!-- introduction -->
378
379<section title="IRI Syntax" anchor="syntax">
380<t>This section defines the syntax of Internationalized Resource
381Identifiers (IRIs).</t>
382
383<t>As with URIs, an IRI is defined as a sequence of characters, not as
384a sequence of octets. This definition accommodates the fact that IRIs
385may be written on paper or read over the radio as well as stored or
386transmitted digitally.  The same IRI might be represented as different
387sequences of octets in different protocols or documents if these
388protocols or documents use different character encodings (and/or
389transfer encodings).  Using the same character encoding as the
390containing protocol or document ensures that the characters in the IRI
391can be handled (e.g., searched, converted, displayed) in the same way
392as the rest of the protocol or document.</t>
393
394<section title="Summary of IRI Syntax" anchor="summary">
395
396<t>IRIs are defined by extending the URI syntax in <xref
397target="RFC3986"/>, but extending the class of unreserved characters
398by adding the characters of the UCS (Universal Character Set, <xref
399target="ISO10646"/>) beyond U+007F, subject to the limitations given
400in the syntax rules below and in <xref target="limitations"/>.</t>
401
402<t>The syntax and use of components and reserved characters is the
403same as that in <xref target="RFC3986"/>. Each "URI scheme" thus also
404functions as an "IRI scheme", in that scheme-specific parsing rules
405for URIs of a scheme are extended to allow parsing of IRIs using
406the same parsing rules.</t>
407
408<t>All the operations defined in <xref target="RFC3986"/>, such as the
409resolution of relative references, can be applied to IRIs by
410IRI-processing software in exactly the same way as they are for URIs
411by URI-processing software.</t>
412
413<t>Characters outside the US-ASCII repertoire MUST NOT be reserved and
414therefore MUST NOT be used for syntactical purposes, such as to
415delimit components in newly defined schemes. For example, U+00A2, CENT
416SIGN, is not allowed as a delimiter in IRIs, because it is in the
417'iunreserved' category. This is similar to the fact that it is not
418possible to use '-' as a delimiter in URIs, because it is in the
419'unreserved' category.</t>
420
421</section> <!-- summary -->
422<section title="ABNF for IRI References and IRIs" anchor="abnf">
423
424<t>An ABNF definition for IRI references (which are the most general
425concept and the start of the grammar) and IRIs is given here. The
426syntax of this ABNF is described in <xref target="STD68"/>. Character
427numbers are taken from the UCS, without implying any actual binary
428encoding. Terminals in the ABNF are characters, not octets.</t>
429
430<t>The following grammar closely follows the URI grammar in <xref
431target="RFC3986"/>, except that the range of unreserved characters is
432expanded to include UCS characters, with the restriction that private
433UCS characters can occur only in query parts. The grammar is split
434into two parts: Rules that differ from <xref target="RFC3986"/>
435because of the above-mentioned expansion, and rules that are the same
436as those in <xref target="RFC3986"/>. For rules that are different
437from those in <xref target="RFC3986"/>, the names of the non-terminals
438have been changed as follows. If the non-terminal contains 'URI', this
439has been changed to 'IRI'. Otherwise, an 'i' has been prefixed.</t>
440
441<!--
442for line length measuring in artwork (max 72 chars, three chars at start):
443      1         2         3         4         5         6         7
444456789012345678901234567890123456789012345678901234567890123456789012
445-->
446<figure>
447<preamble>The following rules are different from those in <xref target="RFC3986"/>:</preamble>
448<artwork>
449IRI            = scheme ":" ihier-part [ "?" iquery ]
450                 [ "#" ifragment ]
451
452ihier-part     = "//" iauthority ipath-abempty
453               / ipath-absolute
454               / ipath-rootless
455               / ipath-empty
456
457IRI-reference  = IRI / irelative-ref
458
459absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]
460
461irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
462
463irelative-part = "//" iauthority ipath-abempty
464               / ipath-absolute
465               / ipath-noscheme
466               / ipath-empty
467
468iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
469iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
470ihost          = IP-literal / IPv4address / ireg-name
471
472pct-form       = pct-encoded
473
474ireg-name      = *( iunreserved / sub-delims )
475
476ipath          = ipath-abempty   ; begins with "/" or is empty
477               / ipath-absolute  ; begins with "/" but not "//"
478               / ipath-noscheme  ; begins with a non-colon segment
479               / ipath-rootless  ; begins with a segment
480               / ipath-empty     ; zero characters
481
482ipath-abempty  = *( path-sep isegment )
483ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
484ipath-noscheme = isegment-nz-nc *( path-sep isegment )
485ipath-rootless = isegment-nz *( path-sep isegment )
486ipath-empty    = 0&lt;ipchar&gt;
487path-sep       = "/"
488
489isegment       = *ipchar
490isegment-nz    = 1*ipchar
491isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
492                     / "@" )
493               ; non-zero-length segment without any colon ":"                     
494
495ipchar         = iunreserved / pct-form / sub-delims / ":"
496               / "@"
497 
498iquery         = *( ipchar / iprivate / "/" / "?" )
499
500ifragment      = *( ipchar / "/" / "?" / "#" )
501
502iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
503
504ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
505               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
506               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
507               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
508               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
509               / %xD0000-DFFFD / %xE1000-EFFFD
510
511iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
512               / %x100000-10FFFD
513</artwork>
514</figure>
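<t>The character classes introduced by the rules above can be checked
with simple range tests.  The following Python sketch copies the code
point ranges from the 'ucschar' and 'iprivate' rules; the function names
are illustrative assumptions and are not defined by this document.</t>
<figure><artwork><![CDATA[
```python
import string

# Code point ranges copied from the 'ucschar' rule.
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]

# Code point ranges copied from the 'iprivate' rule.
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF), (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
_ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _in_ranges(ch, ranges):
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch):
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch):
    return _in_ranges(ch, IPRIVATE_RANGES)


def is_iunreserved(ch):
    # iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
    return ch in _ASCII_UNRESERVED or is_ucschar(ch)
```
]]></artwork></figure>
<t>Under these rules a private use character such as U+E000 is accepted
by is_iprivate (and therefore only in query parts), but not by
is_ucschar.</t>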
515
516<t>Some productions are ambiguous. The "first-match-wins" (a.k.a. "greedy")
517algorithm applies. For details, see <xref target="RFC3986"/>.</t>
518
519<figure>
520<preamble>The following rules are the same as those in <xref target="RFC3986"/>:</preamble>
521<artwork>
522scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
523 
524port           = *DIGIT
525 
526IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"
527 
528IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
529 
530IPv6address    =                            6( h16 ":" ) ls32
531               /                       "::" 5( h16 ":" ) ls32
532               / [               h16 ] "::" 4( h16 ":" ) ls32
533               / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
534               / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
535               / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
536               / [ *4( h16 ":" ) h16 ] "::"              ls32
537               / [ *5( h16 ":" ) h16 ] "::"              h16
538               / [ *6( h16 ":" ) h16 ] "::"
539               
540h16            = 1*4HEXDIG
541ls32           = ( h16 ":" h16 ) / IPv4address
542
543IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet
544
545dec-octet      = DIGIT                 ; 0-9
546               / %x31-39 DIGIT         ; 10-99
547               / "1" 2DIGIT            ; 100-199
548               / "2" %x30-34 DIGIT     ; 200-249
549               / "25" %x30-35          ; 250-255
550           
551pct-encoded    = "%" HEXDIG HEXDIG
552
553unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
554reserved       = gen-delims / sub-delims
555gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
556sub-delims     = "!" / "$" / "&amp;" / "'" / "(" / ")"
557               / "*" / "+" / "," / ";" / "="
558</artwork></figure>
559
560<t>This syntax does not support IPv6 scoped addressing zone identifiers.</t>
561
562</section> <!-- abnf -->
563
564</section> <!-- syntax -->
565
566<section title="Processing IRIs and related protocol elements" anchor="processing">
567
568<t>IRIs are meant to replace URIs in identifying resources within new
569versions of protocols, formats, and software components that use a
570UCS-based character repertoire.  Protocols and components may use and
571process IRIs directly. However, there are still numerous systems and
572protocols which only accept URIs or components of parsed URIs; that is,
573they only accept sequences of characters within the subset of US-ASCII
574characters allowed in URIs. </t>
575
576<t>This section defines specific processing steps for IRI consumers
577which establish the relationship between the string given and the
578interpreted derivatives. These
579processing steps apply to both IRIs and IRI references (i.e., absolute
580or relative forms); for IRIs, some steps are scheme specific. </t>
581
582<section title="Converting to UCS" anchor="ucsconv"> 
583 
584<t>Input that is already in a Unicode form (i.e., a sequence of Unicode
585 characters or an octet-stream representing a Unicode-based character
586 encoding such as UTF-8 or UTF-16) should be left as is and not
587 normalized (see <xref target="normalization"/>).</t>
588
589  <t>An IRI or IRI reference is a sequence of characters from the UCS.
590    For IRIs that are not already in a Unicode form
591    (as when written on paper, read aloud, or represented in a text stream
592    using a legacy character encoding), convert the IRI to Unicode.
593    Note that some character encodings or transcriptions can be converted
594    to or represented by more than one sequence of Unicode characters.
595    Ideally the resulting IRI would use a normalized form,
596    such as Unicode Normalization Form C <xref target="UTR15"/>
597    (see <xref target='ladder'/> Normalization and Comparison),
598    since that ensures a stable, consistent representation
599    that is most likely to produce the intended results.
600    Implementers and users are cautioned that, while denormalized character sequences are valid,
601    they might be difficult for other users or processes to reproduce
602    and might lead to unexpected results.
603  </t>
604
605<t> In other cases (written on paper, read aloud, or otherwise
606 represented independent of any character encoding) represent the IRI
607 as a sequence of characters from the UCS normalized according to
608 Unicode Normalization Form C (NFC, <xref target="UTR15"/>).</t>
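<t>A minimal Python sketch of the distinction drawn in this section
follows.  It assumes two hypothetical helpers: octets_to_ucs for input
that arrives as octets in a known character encoding, and
transcription_to_ucs for input that was typed in from paper or speech.
The set of encodings treated as "already Unicode" is an assumption of
this sketch.</t>
<figure><artwork><![CDATA[
```python
import codecs
import unicodedata

# Assumption: codec names (as reported by codecs.lookup) that count as
# Unicode-based encodings for the purpose of this sketch.
_UNICODE_ENCODINGS = {
    "utf-8", "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def octets_to_ucs(data, encoding):
    """Decode an IRI given as octets in the named character encoding."""
    text = data.decode(encoding)
    if codecs.lookup(encoding).name in _UNICODE_ENCODINGS:
        # Already in a Unicode form: leave as is, do not normalize.
        return text
    # Legacy encoding: convert, then prefer the NFC representation.
    return unicodedata.normalize("NFC", text)


def transcription_to_ucs(text):
    """Normalize an IRI that was transcribed from a non-digital form."""
    return unicodedata.normalize("NFC", text)
```
]]></artwork></figure>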
609</section> <!-- ucsconv -->
610
611<section title="Parse the IRI into IRI components">
612
613<t>Parse the IRI, either as a relative reference (no scheme)
614or using scheme specific processing (according to the scheme
615given); the result is a set of parsed IRI components.
616(NOTE: FIX BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES
617THAT USE GENERIC SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN
618ONLY USE AUTHORITY FOR NAMES THAT FOLLOW PUNICODE.)
619 </t>
620
621<t>NOTE: Parsing into components establishes a correspondence
622between substrings of the IRI and the part of the grammar that each
623substring matched.  For example, in <xref target="HTML5"/>, the protocol
624components of interest are SCHEME (scheme), HOST (ireg-name), PORT
625(port), the PATH (ipath after the initial "/"), QUERY (iquery),
626FRAGMENT (ifragment), and AUTHORITY (iauthority).
627</t>
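<t>For schemes that follow the generic syntax, the split into
components can be sketched with the regular expression given in
Appendix B of <xref target="RFC3986"/>, applied unchanged to the
characters of the IRI.  The Python below is an illustration under that
assumption: the function name is hypothetical, no scheme-specific
processing is done, the authority is not split further into user
information, host, and port, and the input is not validated against the
ABNF.</t>
<figure><artwork><![CDATA[
```python
import re

# RFC 3986, Appendix B, rewritten with named groups.
_GENERIC_SYNTAX = re.compile(
    r"^(?:(?P<scheme>[^:/?#]+):)?"
    r"(?://(?P<authority>[^/?#]*))?"
    r"(?P<path>[^?#]*)"
    r"(?:\?(?P<query>[^#]*))?"
    r"(?:#(?P<fragment>.*))?",
    re.DOTALL,
)


def split_iri_reference(iri_reference):
    """Return a dict of component substrings; absent ones are None."""
    # Every string matches this expression, so match() never fails.
    return _GENERIC_SYNTAX.match(iri_reference).groupdict()
```
]]></artwork></figure>
<t>A component that is absent (for example, the scheme of a relative
reference) is reported as None, which keeps it distinct from a
component that is present but empty.</t>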
628
629<t>Subsequent processing rules are sometimes used to define other
630syntactic components. For example, <xref target="HTML5"/> defines APIs
631for IRI processing; in these APIs:
632
633<list style="hanging">
634<t hangText="HOSTSPECIFIC"> the substring that follows
635the substring matched by the iauthority production, or the whole
636string if the iauthority production wasn't matched.</t>
637<t hangText="HOSTPORT"> if there is a scheme component and a port
638component and the port given by the port component is different from
639the default port defined for the protocol given by the scheme
640component, then HOSTPORT is the substring that starts with the
641substring matched by the host production and ends with the substring
642matched by the port production, and includes the colon in between the
643two. Otherwise, it is the same as the host component.
644</t>
645</list>
646</t>
647</section> <!-- parse -->
648
649<section title="General percent-encoding of IRI components" anchor="compmapping">
650   
651<t>For most IRI components, it is possible to map the IRI component
652to an equivalent URI component by percent-encoding those characters
653not allowed in URIs. Previous processing steps will have removed
654some characters, and the interpretation of reserved characters will
655have already been done (with the syntactic reserved characters outside
656of the IRI component). This mapping is defined for all sequences
657of Unicode characters, whether or not they are valid for the component
658in question. </t>
659   
660<t>For each character which is not allowed in a valid URI (NOTE: WHAT
661IS THE RIGHT REFERENCE HERE), apply the following steps. </t>
662
663<t><list style="hanging">
664
665<t hangText="Convert to UTF-8">Convert the character to a sequence of
666  one or more octets using UTF-8 <xref target="RFC3629"/>.</t>
667
668<t hangText="Percent encode">Convert each octet of this sequence to %HH,
669   where HH is the hexadecimal notation of the octet value. The
670   hexadecimal notation SHOULD use uppercase letters. (This is the
671   general URI percent-encoding mechanism in Section 2.1 of <xref
672   target="RFC3986"/>.)</t>
673   
674</list></t>
675
676<t>Note that the mapping is an identity transformation for parsed URI
677components of valid URIs, and is idempotent: applying the mapping a
678second time will not change anything.</t>
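
<t>The two steps above can be sketched as follows. Because the
reference for "allowed in a valid URI" is still open, this sketch
assumes that it means the reserved and unreserved characters of RFC
3986 plus "%"; the names URI_CHARS and map_component are
hypothetical.</t>

<figure><artwork><![CDATA[
```python
# Assumption: "allowed in a URI" means the RFC 3986 unreserved and
# reserved sets plus "%"; the draft leaves the exact reference open.
URI_CHARS = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"            # remaining unreserved characters
    ":/?#[]@"         # gen-delims
    "!$&'()*+,;="     # sub-delims
    "%"               # introduces an existing percent-encoding
)


def map_component(component: str) -> str:
    """Percent-encode every character outside URI_CHARS."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH
            # with uppercase hexadecimal digits.
            out.extend("%{:02X}".format(octet)
                       for octet in ch.encode("utf-8"))
    return "".join(out)
```
]]></artwork></figure>

<t>Under these assumptions, applying map_component to its own output
returns that output unchanged, which is the idempotence property noted
above.</t>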
679</section> <!-- general conversion -->
680
681<section title="Mapping ireg-name" anchor="dnsmapping">
682
<t>Schemes that allow non-ASCII characters
684in the reg-name (ireg-name) position MUST convert the ireg-name
685component of an IRI as follows:</t>
686
687<t>Replace the ireg-name part of the IRI by the part converted using
688the ToASCII operation specified in Section 4.1 of <xref
689target="RFC3490"/> on each dot-separated label, and by using U+002E
690(FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set
691to FALSE, and with the flag AllowUnassigned set to FALSE.
The ToASCII operation may fail, which means that the IRI cannot be
resolved. If the domain name conversion fails, then the
entire IRI conversion fails. Processors that have no mechanism for
696signalling a failure MAY instead substitute an otherwise
697invalid host name, although such processing SHOULD be avoided.
698 </t>
699
700<t>For example, the IRI
701<vspace/>"http://r&amp;#xE9;sum&amp;#xE9;.example.org"<vspace/> MAY be
converted to <vspace/>"http://xn--rsum-bpad.example.org"<vspace/>;
703conversion to percent-encoded form, e.g.,
704 <vspace/>"http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed. </t>
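
<t>As a minimal sketch, the label-by-label ToASCII conversion can be
approximated with the "idna" codec in Python's standard library, which
implements the IDNA operations of RFC 3490. The codec takes no
parameters for the UseSTD3ASCIIRules and AllowUnassigned flags, so
this illustrates the mapping rather than conforming to it; the
function name is hypothetical.</t>

<figure><artwork><![CDATA[
```python
def map_ireg_name(ireg_name: str) -> str:
    """Apply ToASCII to each dot-separated label of ireg_name."""
    # The built-in "idna" codec splits the name into labels and runs
    # ToASCII on each one.  A UnicodeError raised here corresponds to
    # a failed domain name conversion, and so to a failed IRI
    # conversion.
    return ireg_name.encode("idna").decode("ascii")
```
]]></artwork></figure>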
705
706<t><list style="hanging"> 
707
708<t hangText="Note:">Domain Names may appear in parts of an IRI other
709than the ireg-name part.  It is the responsibility of scheme-specific
710implementations (if the Internationalized Domain Name is part of the
711scheme syntax) or of server-side implementations (if the
712Internationalized Domain Name is part of 'iquery') to apply the
713necessary conversions at the appropriate point. Example: Trying to
714validate the Web page at<vspace/>
715http://r&amp;#xE9;sum&amp;#xE9;.example.org would lead to an IRI of
716<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr&amp;#xE9;sum&amp;#xE9;.<vspace/>example.org,
717which would convert to a URI
718of<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.<vspace/>example.org.
719The server-side implementation is responsible for making the
720necessary conversions to be able to retrieve the Web page.</t>
721
722<t hangText="Note:">In this process, characters allowed in URI
723references and existing percent-encoded sequences are not encoded further.
724(This mapping is similar to, but different from, the encoding applied
725when arbitrary content is included in some part of a URI.)
726
727For example, an IRI of
728<vspace/>"http://www.example.org/red%09ros&amp;#xE9;#red"
729(in XML notation) is converted to
730<vspace/>"http://www.example.org/red%09ros%C3%A9#red", not to
731something like
732<vspace/>"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
733((DESIGN QUESTION: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get converted to punycode, or not?))
734
735</t>
736
737</list></t>
738</section> <!-- dnsmapping -->
739
740<section title="Mapping query components" anchor="querymapping">
741
742<t>((NOTE: SEE ISSUES LIST))
743
744For compatibility with existing deployed HTTP infrastructure,
745the following special case applies for schemes "http" and "https"
746and IRIs whose origin has a document charset other than one which
747is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query"
748component of an IRI is mapped into a URI by using the document
749charset rather than UTF-8 as the binary representation before
750pct-encoding. This mapping is not applied for any other scheme
751or component.</t>
752
753</section> <!-- querymapping -->
754
755<section title="Mapping IRIs to URIs" anchor="mapping">
756
<t>The canonical mapping from an IRI to a URI is defined by applying the
758mapping above (from IRI to URI components) and then reassembling a URI
759from the parsed URI components using the original punctuation that
760delimited the IRI components. </t>
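
<t>A rough sketch of the overall mapping, using Python's standard
library, is shown below. It assumes that the authority consists of a
host only (no userinfo or port), treats the RFC 3986 reserved and
unreserved characters plus "%" as allowed, and always uses UTF-8 for
the query; the function name is hypothetical. Reassembly with
urlunsplit drops a delimiter that precedes an empty component, so the
original punctuation is not preserved in that corner case.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: reserved, unreserved, and "%".
SAFE = ":/?#[]@!$&'()*+,;=%-._~"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI, component by component."""
    parts = urlsplit(iri)
    # ireg-name: ToASCII on each label (host-only authority assumed).
    host = parts.netloc.encode("idna").decode("ascii")
    # Other components: UTF-8 octets written as %HH.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    # Reassemble a URI from the mapped components.
    return urlunsplit((parts.scheme, host, path, query, fragment))
```
]]></artwork></figure>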
761
762</section> <!-- mapping -->
763
764<section title="Converting URIs to IRIs" anchor="URItoIRI">
765
<t>In some situations, for presentation and further processing,
it is desirable to convert a URI into an equivalent IRI in which
natural characters are represented directly rather than
percent-encoded. Of course, every URI is already an IRI in
its own right without any conversion.
This section gives one procedure for this conversion.
772</t>
773
774<t>
775The conversion described in this section, if given a valid URI, will
776result in an IRI that maps back to the URI used as an input for the
777conversion (except for potential case differences in percent-encoding
778and for potential percent-encoded unreserved characters).
779
780However, the IRI resulting from this conversion may differ
781from the original IRI (if there ever was one).</t> 
782
783<t>URI-to-IRI conversion removes percent-encodings, but not all
784percent-encodings can be eliminated. There are several reasons for
785this:</t>
786
787<t><list style="hanging">
788
789<t hangText="1.">Some percent-encodings are necessary to distinguish
790    percent-encoded and unencoded uses of reserved characters.</t>
791
792<t hangText="2.">Some percent-encodings cannot be interpreted as sequences
793    of UTF-8 octets.<vspace blankLines="1"/>
794    (Note: The octet patterns of UTF-8 are highly regular.
795    Therefore, there is a very high probability, but no guarantee,
796    that percent-encodings that can be interpreted as sequences of UTF-8
797    octets actually originated from UTF-8. For a detailed discussion,
798    see <xref target="Duerst97"/>.)</t>
799
800<t hangText="3.">The conversion may result in a character that is not
801    appropriate in an IRI. See <xref target="abnf"/>, <xref target="visual"/>,
802      and <xref target="limitations"/> for further details.</t>
803
804<t hangText="4.">IRI to URI conversion has different rules for
805    dealing with domain names and query parameters.</t>
806
807</list></t>
808
809<t>Conversion from a URI to an IRI MAY be done by using the following
810steps:
811
812<list style="hanging">
813<t hangText="1.">Represent the URI as a sequence of octets in
814       US-ASCII.</t>
815
816<t hangText="2.">Convert all percent-encodings ("%" followed by two
817      hexadecimal digits) to the corresponding octets, except those
818      corresponding to "%", characters in "reserved", and characters
819      in US-ASCII not allowed in URIs.</t> 
820
821<t hangText="3.">Re-percent-encode any octet produced in step 2 that
822      is not part of a strictly legal UTF-8 octet sequence.</t>
823
824
825<t hangText="4.">Re-percent-encode all octets produced in step 3 that
826      in UTF-8 represent characters that are not appropriate according
827      to <xref target="abnf"/>, <xref target="visual"/>, and <xref
828      target="limitations"/>.</t> 
829
830<t hangText="5.">Interpret the resulting octet sequence as a sequence
831      of characters encoded in UTF-8.</t>
832
833<t hangText="6.">URIs known to contain domain names in the reg-name
834      component SHOULD convert punycode-encoded domain name labels to
835      the corresponding characters using the ToUnicode procedure. </t>
836</list></t>
837
838<t>This procedure will convert as many percent-encoded characters as
839possible to characters in an IRI. Because there are some choices when
840step 4 is applied (see <xref target="limitations"/>), results may
841vary.</t>
842
843<t>Conversions from URIs to IRIs MUST NOT use any character
844encoding other than UTF-8 in steps 3 and 4, even if it might be
845possible to guess from the context that another character encoding
846than UTF-8 was used in the URI.  For example, the URI
847"http://www.example.org/r%E9sum%E9.html" might with some guessing be
848interpreted to contain two e-acute characters encoded as
849iso-8859-1. It must not be converted to an IRI containing these
850e-acute characters. Otherwise, in the future the IRI will be mapped to
851"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
852URI from "http://www.example.org/r%E9sum%E9.html".</t>
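
<t>Steps 1 through 5 of the procedure can be sketched as follows.
This is an illustration under stated assumptions, not a complete
implementation: the set of characters treated as "not appropriate" in
step 4 is reduced to the bidirectional formatting characters, step 6
(ToUnicode) is omitted, and all names are hypothetical.</t>

<figure><artwork><![CDATA[
```python
import codecs
import re

# RFC 3986 "unreserved" characters, as octets.
UNRESERVED = (b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
              b"0123456789-._~")
# Assumption: only the bidirectional formatting characters stand in
# for "not appropriate in an IRI" (LRM, RLM, LRE, RLE, PDF, LRO, RLO).
INAPPROPRIATE = frozenset(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e")

PCT = re.compile(rb"%([0-9A-Fa-f]{2})")


def pct(octets: bytes) -> str:
    """Write each octet as %HH with uppercase hexadecimal digits."""
    return "".join("%{:02X}".format(octet) for octet in octets)


def repercent(error):
    """Step 3: octets that are not legal UTF-8 become %HH again."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    return pct(error.object[error.start:error.end]), error.end


codecs.register_error("iri-repercent", repercent)


def uri_to_iri(uri: str) -> str:
    # Step 1: the URI as a sequence of US-ASCII octets.
    octets = uri.encode("ascii")

    # Step 2: decode percent-encodings of non-ASCII octets and of
    # unreserved characters; "%", reserved characters, and US-ASCII
    # characters not allowed in URIs stay percent-encoded.
    def decode(match):
        value = int(match.group(1), 16)
        if value >= 0x80 or value in UNRESERVED:
            return bytes([value])
        return match.group(0)

    octets = PCT.sub(decode, octets)

    # Steps 3 and 5: interpret the octets as UTF-8 only; octets
    # outside a legal UTF-8 sequence are re-percent-encoded.
    text = octets.decode("utf-8", errors="iri-repercent")

    # Step 4: re-percent-encode characters that are not appropriate.
    return "".join(
        pct(ch.encode("utf-8")) if ch in INAPPROPRIATE else ch
        for ch in text
    )
```
]]></artwork></figure>

<t>Because the sketch never tries any encoding other than UTF-8, a
URI such as "http://www.example.org/r%E9sum%E9.html" is returned
unchanged.</t>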
853
854<section title="Examples">
855
856<t>This section shows various examples of converting URIs to IRIs.
857Each example shows the result after each of the steps 1 through 6 is
858applied. XML Notation is used for the final result.  Octets are
859denoted by "&lt;" followed by two hexadecimal digits followed by
860"&gt;".</t>
861
862<t>The following example contains the sequence "%C3%BC", which is a
863strictly legal UTF-8 sequence, and which is converted into the actual
864character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
865u-umlaut).
866
867<list style="hanging">
868<t hangText="1.">http://www.example.org/D%C3%BCrst</t>
869<t hangText="2.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
870<t hangText="3.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
871<t hangText="4.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
872<t hangText="5.">http://www.example.org/D&amp;#xFC;rst</t>
873<t hangText="6.">http://www.example.org/D&amp;#xFC;rst</t>
874</list>
875</t>
876
877<t>The following example contains the sequence "%FC", which might
878represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in
879the<vspace/>iso-8859-1 character encoding.  (It might represent other
880characters in other character encodings. For example, the octet
881&lt;fc&gt; in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER
882KJE.)  Because &lt;fc&gt; is not part of a strictly legal UTF-8
883sequence, it is re-percent-encoded in step 3.
884
885
886<list style="hanging">
887<t hangText="1.">http://www.example.org/D%FCrst</t>
888<t hangText="2.">http://www.example.org/D&lt;fc&gt;rst</t>
889<t hangText="3.">http://www.example.org/D%FCrst</t>
890<t hangText="4.">http://www.example.org/D%FCrst</t>
891<t hangText="5.">http://www.example.org/D%FCrst</t>
892<t hangText="6.">http://www.example.org/D%FCrst</t>
893</list>
894</t>
895
896<t>The following example contains "%e2%80%ae", which is the percent-encoded<vspace/>UTF-8
897character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. <xref target="visual"/>
898forbids the direct use of this character in an IRI. Therefore, the
899corresponding octets are re-percent-encoded in step 4. This example shows
900that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved.
The example also contains a punycode-encoded domain name label (xn--99zt52a),
which is converted to the corresponding characters in step 6.
903
904<list style="hanging">
905<t hangText="1.">http://xn--99zt52a.example.org/%e2%80%ae</t>
906<t hangText="2.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
907<t hangText="3.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
908<t hangText="4.">http://xn--99zt52a.example.org/%E2%80%AE</t>
909<t hangText="5.">http://xn--99zt52a.example.org/%E2%80%AE</t>
910<t hangText="6.">http://&amp;#x7D0D;&amp;#x8C46;.example.org/%E2%80%AE</t>
911</list></t>
912
913<t>Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
914(Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this note.))</t>
915
916</section> <!-- examples -->
917</section> <!-- URItoIRI -->
918</section> <!-- processing -->
919<section title="Bidirectional IRIs for Right-to-Left Languages" anchor="Bidi">
920
921<t>Some UCS characters, such as those used in the Arabic and Hebrew
922scripts, have an inherent right-to-left (rtl) writing direction. IRIs
923containing these characters (called bidirectional IRIs or Bidi IRIs)
924require additional attention because of the non-trivial relation
925between logical representation (used for digital representation and
926for reading/spelling) and visual representation (used for
927display/printing).</t>
928
929<t>Because of the complex interaction between the logical representation,
930the visual representation, and the syntax of a Bidi IRI, a balance is
931needed between various requirements.
932The main requirements are<list style="hanging">
933<t hangText="1.">user-predictable conversion between visual and
934    logical representation;</t>
935<t hangText="2.">the ability to include a wide range of characters
936    in various parts of the IRI; and</t>
937<t hangText="3.">minor or no changes or restrictions for
938      implementations.</t>
939</list></t>
940
941<section title="Logical Storage and Visual Presentation" anchor="visual">
942
943<t>When stored or transmitted in digital representation, bidirectional
944IRIs MUST be in full logical order and MUST conform to the IRI syntax
rules (which include the rules relevant to their scheme). This
946ensures that bidirectional IRIs can be processed in the same way as
947other IRIs.</t> <t>Bidirectional IRIs MUST be rendered by using the
948Unicode Bidirectional Algorithm <xref target="UNIV4"/>, <xref
949target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
950as they would be if they were in a left-to-right embedding; i.e., as
951if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
952followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
953embedding direction can also be done in a higher-level protocol (e.g.,
954the dir='ltr' attribute in HTML).</t> 
955
956<t>There is no requirement to use the above embedding if the display
957is still the same without the embedding. For example, a bidirectional
958IRI in a text with left-to-right base directionality (such as used for
959English or Cyrillic) that is preceded and followed by whitespace and
960strong left-to-right characters does not need an embedding.  Also, a
961bidirectional relative IRI reference that only contains strong
962right-to-left characters and weak characters and that starts and ends
963with a strong right-to-left character and appears in a text with
964right-to-left base directionality (such as used for Arabic or Hebrew)
965and is preceded and followed by whitespace and strong characters does
966not need an embedding.</t>
967
968<t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
969sufficient to force the correct display behavior.  However, the
970details of the Unicode Bidirectional algorithm are not always easy to
971understand. Implementers are strongly advised to err on the side of
972caution and to use embedding in all cases where they are not
973completely sure that the display behavior is unaffected without the
974embedding.</t>
975
976<t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
9774.3) permits higher-level protocols to influence bidirectional
978rendering. Such changes by higher-level protocols MUST NOT be used if
979they change the rendering of IRIs.</t> 
980
981<t>The bidirectional formatting characters that may be used before or
982after the IRI to ensure correct display are not themselves part of the
983IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
984RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
985the IRI but do not appear themselves. It would therefore not be
986possible to input an IRI with such characters correctly.</t>
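
<t>As a small sketch of the two rules in this section, the following
code rejects IRIs that contain bidirectional formatting characters and
wraps an IRI in a left-to-right embedding for display. The function
names are hypothetical, and the wrapped string is meant for rendering
only; it is no longer an IRI.</t>

<figure><artwork><![CDATA[
```python
# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_FORMATTING = frozenset(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def has_bidi_formatting(iri: str) -> bool:
    """True if the string contains a bidi formatting character."""
    return any(ch in BIDI_FORMATTING for ch in iri)


def embed_for_display(iri: str) -> str:
    """Wrap an IRI in LRE ... PDF for rendering."""
    if has_bidi_formatting(iri):
        raise ValueError(
            "IRIs must not contain bidi formatting characters")
    # U+202A LEFT-TO-RIGHT EMBEDDING before the IRI,
    # U+202C POP DIRECTIONAL FORMATTING after it.
    return "\u202a" + iri + "\u202c"
```
]]></artwork></figure>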
987
988</section> <!-- visual -->
989<section title="Bidi IRI Structure" anchor="bidi-structure">
990
991<t>The Unicode Bidirectional Algorithm is designed mainly for running
992text.  To make sure that it does not affect the rendering of
993bidirectional IRIs too much, some restrictions on bidirectional IRIs
994are necessary. These restrictions are given in terms of delimiters
995(structural characters, mostly punctuation such as "@", ".", ":",
996and<vspace/>"/") and components (usually consisting mostly of letters
997and digits).</t>
998
999<t>The following syntax rules from <xref target="abnf"/> correspond to
components for the purpose of Bidi behavior: iuserinfo, ireg-name,
isegment, isegment-nz, isegment-nz-nc, iquery, and
1002ifragment.</t>
1003
1004<t>Specifications that define the syntax of any of the above
1005components MAY divide them further and define smaller parts to be
1006components according to this document. As an example, the restrictions
1007of <xref target="RFC3490"/> on bidirectional domain names correspond
1008to treating each label of a domain name as a component for schemes
1009with ireg-name as a domain name.  Even where the components are not
1010defined formally, it may be helpful to think about some syntax in
1011terms of components and to apply the relevant restrictions.  For
1012example, for the usual name/value syntax in query parts, it is
1013convenient to treat each name and each value as a component. As
1014another example, the extensions in a resource name can be treated as
1015separate components.</t>
1016
1017<t>For each component, the following restrictions apply:</t>
1018<t>
1019<list style="hanging">
1020
1021<t hangText="1.">A component SHOULD NOT use both right-to-left and
1022  left-to-right characters.</t>
1023
1024<t hangText="2.">A component using right-to-left characters SHOULD
1025  start and end with right-to-left characters.</t>
1026
1027</list></t>
1028
1029<t>The above restrictions are given as "SHOULD"s, rather than as
1030"MUST"s.  For IRIs that are never presented visually, they are not
1031relevant.  However, for IRIs in general, they are very important to
1032ensure consistent conversion between visual presentation and logical
1033representation, in both directions.</t>
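
<t>The two restrictions can be checked mechanically. The sketch below
uses the Unicode bidirectional classes reported by Python's
unicodedata module, treating "R" and "AL" as right-to-left and "L" as
left-to-right; this classification and the function name are
assumptions of the sketch.</t>

<figure><artwork><![CDATA[
```python
import unicodedata

RTL_CLASSES = {"R", "AL"}   # strong right-to-left
LTR_CLASSES = {"L"}         # strong left-to-right


def check_component(component: str) -> list:
    """Return the restrictions that a component does not meet."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c in LTR_CLASSES for c in classes)
    problems = []
    # Restriction 1: do not mix rtl and ltr characters.
    if has_rtl and has_ltr:
        problems.append("mixes rtl and ltr characters")
    # Restriction 2: an rtl component starts and ends with rtl.
    if has_rtl and not (classes[0] in RTL_CLASSES
                        and classes[-1] in RTL_CLASSES):
        problems.append("does not start and end with rtl characters")
    return problems
```
]]></artwork></figure>

<t>A component with a digit in the middle of right-to-left
characters passes both checks, while one that ends in a digit fails
the second check, matching Examples 7 and 8 below.</t>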
1034
1035<t><list style="hanging">
1036
1037<t hangText="Note:">In some components, the above restrictions may
1038  actually be strictly enforced.  For example, <xref
1039  target="RFC3490"></xref> requires that these restrictions apply to
1040  the labels of a host name for those schemes where ireg-name is a
1041  host name.  In some other components (for example, path components)
1042  following these restrictions may not be too difficult.  For other
1043  components, such as parts of the query part, it may be very
1044  difficult to enforce the restrictions because the values of query
1045  parameters may be arbitrary character sequences.</t>
1046
1047</list></t>
1048
1049<t>If the above restrictions cannot be satisfied otherwise, the
1050affected component can always be mapped to URI notation as described
1051in <xref target="compmapping"/>. Please note that the whole component
1052has to be mapped (see also Example 9 below).</t>
1053
1054</section> <!-- bidi-structure -->
1055
1056<section title="Input of Bidi IRIs" anchor="bidiInput">
1057
1058<t>Bidi input methods MUST generate Bidi IRIs in logical order while
1059rendering them according to <xref target="visual"/>.  During input,
1060rendering SHOULD be updated after every new character is input to
1061avoid end-user confusion.</t>
1062
1063</section> <!-- bidiInput -->
1064
1065<section title="Examples">
1066
1067<t>This section gives examples of bidirectional IRIs, in Bidi
1068Notation.  It shows legal IRIs with the relationship between logical
1069and visual representation and explains how certain phenomena in this
1070relationship may look strange to somebody not familiar with
1071bidirectional behavior, but familiar to users of Arabic and Hebrew. It
1072also shows what happens if the restrictions given in <xref
1073target="bidi-structure"/> are not followed. The examples below can be
1074seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
1075variants.</t>
1076
1077<t>To read the bidi text in the examples, read the visual
1078representation from left to right until you encounter a block of rtl
1079text. Read the rtl block (including slashes and other special
1080characters) from right to left, then continue at the next unread ltr
1081character.</t>
1082
1083<t>Example 1: A single component with rtl characters is inverted:
1084<vspace/>Logical representation:
1085"http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
1086"http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
1087one by one, and each component can be read in its natural
1088direction.</t>
1089
1090<t>Example 2: More than one consecutive component with rtl characters
1091is inverted as a whole: <vspace/>Logical representation:
1092"http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
1093"http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
1094components is read rtl, in the same way as a sequence of rtl words is
1095read rtl in a bidi text.</t>
1096
1097<t>Example 3: All components of an IRI (except for the scheme) are
1098rtl.  All rtl components are inverted overall: <vspace/>Logical
1099representation:
1100"http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
1101representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
1102whole IRI (except the scheme) is read rtl. Delimiters between rtl
1103components stay between the respective components; delimiters between
1104ltr and rtl components don't move.</t>
1105
1106<t>Example 4: Each of several sequences of rtl components is inverted
1107on its own: <vspace/>Logical representation:
1108"http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
1109"http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
1110components is read rtl, in the same way as each sequence of rtl words
1111in an ltr text is read rtl.</t>
1112
1113<t>Example 5: Example 2, applied to components of different kinds:
1114<vspace/>Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
1115<vspace/>Visual representation:
1116"http://ab.cd.HG/FE/ij/kl.html"<vspace/> The inversion of the domain
1117name label and the path component may be unexpected, but it is
1118consistent with other bidi behavior.  For reassurance that the domain
1119component really is "ab.cd.EF", it may be helpful to read aloud the
1120visual representation following the bidi algorithm. After
1121"http://ab.cd." one reads the RTL block "E-F-slash-G-H", which
1122corresponds to the logical representation.
1123</t>
1124
1125<t>Example 6: Same as Example 5, with more rtl components:
1126<vspace/>Logical representation:
1127"http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
1128"http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
1129name labels and the path components may be easier to identify because
1130the delimiters also move.</t>
1131
1132<t>Example 7: A single rtl component includes digits: <vspace/>Logical
1133representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
1134representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
1135Numbers are written ltr in all cases but are treated as an additional
1136embedding inside a run of rtl characters. This is completely
1137consistent with usual bidirectional text.</t>
1138
1139<t>Example 8 (not allowed): Numbers are at the start or end of an rtl
1140component:<vspace/>Logical representation:
1141"http://ab.cd.ef/GH1/2IJ/KL.html"<vspace/>Visual representation:
1142"http://ab.cd.ef/LK/JI1/2HG.html"<vspace/> The sequence "1/2" is
1143interpreted by the bidi algorithm as a fraction, fragmenting the
1144components and leading to confusion. There are other characters that
1145are interpreted in a special way close to numbers; in particular, "+",
1146"-", "#", "$", "%", ",", ".", and ":".</t>
1147
1148<t>Example 9 (not allowed): The numbers in the previous example are
1149percent-encoded: <vspace/>Logical representation:
1150"http://ab.cd.ef/GH%31/%32IJ/KL.html",<vspace/>Visual representation:
1151"http://ab.cd.ef/LK/JI%32/%31HG.html"</t>
1152
1153<t>Example 10 (allowed but not recommended): <vspace/>Logical
1154representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
1155representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
1156Components consisting of only numbers are allowed (it would be rather
1157difficult to prohibit them), but these may interact with adjacent RTL
1158components in ways that are not easy to predict.</t>
1159
1160<t>Example 11 (allowed but not recommended): <vspace/>Logical
1161representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
1162representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
1163Components consisting of numbers and left-to-right characters are
1164allowed, but these may interact with adjacent RTL components in ways
1165that are not easy to predict.</t>
1166</section><!-- examples -->
1167</section><!-- bidi -->
1168
1169<section title="Normalization and Comparison" anchor="equivalence">
1170
1171<t><list style="hanging"><t hangText="Note:">The structure and much of
1172  the material for this section is taken from section 6 of <xref
1173  target="RFC3986"></xref>; the differences are due to the specifics
1174  of IRIs.</t></list></t>
1175
1176<t>One of the most common operations on IRIs is simple comparison:
1177Determining whether two IRIs are equivalent, without using the IRIs to
1178access their respective resource(s). A comparison is performed
1179whenever a response cache is accessed, a browser checks its history to
1180color a link, or an XML parser processes tags within a
1181namespace. Extensive normalization prior to comparison of IRIs may be
1182used by spiders and indexing engines to prune a search space or reduce
1183duplication of request actions and response storage.</t>
1184
1185<t>IRI comparison is performed for some particular purpose. Protocols
1186or implementations that compare IRIs for different purposes will often
be subject to differing design trade-offs in regard to how much
1188effort should be spent in reducing aliased identifiers. This section
1189describes various methods that may be used to compare IRIs, the
1190trade-offs between them, and the types of applications that might use
1191them.</t>
1192
1193<section title="Equivalence">
1194
1195<t>Because IRIs exist to identify resources, presumably they should be
1196considered equivalent when they identify the same resource. However,
1197this definition of equivalence is not of much practical use, as there
1198is no way for an implementation to compare two resources to determine
1199if they are "the same" unless it has full knowledge or control of
1200them. For this reason, determination of equivalence or difference of
1201IRIs is based on string comparison, perhaps augmented by reference to
1202additional rules provided by URI scheme definitions.  We use the terms
1203"different" and "equivalent" to describe the possible outcomes of such
1204comparisons, but there are many application-dependent versions of
1205equivalence.</t>
1206
1207<t>Even when it is possible to determine that two IRIs are equivalent,
1208IRI comparison is not sufficient to determine whether two IRIs
1209identify different resources. For example, an owner of two different
1210domain names could decide to serve the same resource from both,
1211resulting in two different IRIs. Therefore, comparison methods are
1212designed to minimize false negatives while strictly avoiding false
1213positives.</t>
1214
1215<t>In testing for equivalence, applications should not directly
1216compare relative references; the references should be converted to
1217their respective target IRIs before comparison. When IRIs are compared
1218to select (or avoid) a network action, such as retrieval of a
1219representation, fragment components (if any) should be excluded from
1220the comparison.</t>
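
<t>A minimal sketch of this preparation, using Python's urllib.parse,
is shown below. That module implements URI reference resolution, and
this sketch assumes the same steps are adequate for IRI references;
the function name is hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import urldefrag, urljoin


def retrieval_key(reference: str, base: str) -> str:
    """Resolve a reference and drop its fragment before comparing."""
    # Convert the (possibly relative) reference to its target IRI.
    target = urljoin(base, reference)
    # Exclude the fragment component from the comparison.
    return urldefrag(target).url
```
]]></artwork></figure>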
1221
1222<t>Applications using IRIs as identity tokens with no relationship to
1223a protocol MUST use the Simple String Comparison (see <xref
1224target="stringcomp"></xref>).  All other applications MUST select one
1225of the comparison practices from the Comparison Ladder (see <xref
target="ladder"></xref>).</t>
1227</section> <!-- equivalence -->
1228
1229
1230<section title="Preparation for Comparison">
1231<t>Any kind of IRI comparison REQUIRES that any additional contextual
1232processing is first performed, including undoing higher-level
1233escapings or encodings in the protocol or format that carries an
1234IRI. This preprocessing is usually done when the protocol or format is
1235parsed.</t>
1236
1237<t>Examples of contextual preprocessing steps are described in <xref
1238target="LEIRIHREF"/>. </t>
1239
1240<t>Examples of such escapings or encodings are entities and
1241numeric character references in <xref target="HTML4"></xref> and <xref
1242target="XML1"></xref>. As an example,
1243"http://example.org/ros&amp;eacute;" (in HTML),
1244"http://example.org/ros&amp;#233;" (in HTML or XML), and
1245<vspace/>"http://example.org/ros&amp;#xE9;" (in HTML or XML) are all
1246resolved into what is denoted in this document (see <xref
1247target="sec-Notation"></xref>) as "http://example.org/ros&amp;#xE9;"
1248(the "&amp;#xE9;" here standing for the actual e-acute character, to
1249compensate for the fact that this document cannot contain non-ASCII
1250characters).</t>
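
<t>For illustration, the three HTML forms above can be resolved with
the html module of Python's standard library. The function name is
hypothetical, and a real HTML or XML parser performs this step as
part of parsing.</t>

<figure><artwork><![CDATA[
```python
import html


def resolve_html_escapes(value: str) -> str:
    """Undo entities and numeric character references."""
    return html.unescape(value)


forms = [
    "http://example.org/ros&eacute;",   # named entity (HTML)
    "http://example.org/ros&#233;",     # decimal reference
    "http://example.org/ros&#xE9;",     # hexadecimal reference
]
resolved = [resolve_html_escapes(form) for form in forms]
```
]]></artwork></figure>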
1251
1252<t>Similar considerations apply to encodings such as Transfer Codings
1253in HTTP (see <xref target="RFC2616"></xref>) and Content Transfer
1254Encodings in MIME (<xref target="RFC2045"></xref>), although in these
1255cases, the encoding is based not on characters but on octets, and
1256additional care is required to make sure that characters, and not just
1257arbitrary octets, are compared (see <xref
1258target="stringcomp"></xref>).</t>
1259
1260</section> <!-- preparation -->
1261
1262<section title="Comparison Ladder" anchor="ladder">
1263
1264<t>In practice, a variety of methods are used to test IRI
1265equivalence. These methods fall into a range distinguished by the
1266amount of processing required and the degree to which the probability
1267of false negatives is reduced. As noted above, false negatives cannot
1268be eliminated. In practice, their probability can be reduced, but this
1269reduction requires more processing and is not cost-effective for all
1270applications.</t>
1271
1272
1273<t>If this range of comparison practices is considered as a ladder,
1274the following discussion will climb the ladder, starting with
1275practices that are cheap but have a relatively higher chance of
1276producing false negatives, and proceeding to those that have higher
1277computational cost and lower risk of false negatives.</t>
1278
1279<section title="Simple String Comparison" anchor="stringcomp">
1280
1281<t>If two IRIs, when considered as character strings, are identical,
1282then it is safe to conclude that they are equivalent.  This type of
1283equivalence test has very low computational cost and is in wide use in
1284a variety of applications, particularly in the domain of parsing. It
1285is also used when a definitive answer to the question of IRI
1286equivalence is needed that is independent of the scheme used and that
1287can be calculated quickly and without accessing a network. An example
1288of such a case is XML Namespaces (<xref
1289target="XMLNamespace"></xref>).</t>
1290
1291
1292<t>Testing strings for equivalence requires some basic precautions.
1293This procedure is often referred to as "bit-for-bit" or
1294"byte-for-byte" comparison, which is potentially misleading. Testing
1295strings for equality is normally based on pair comparison of the
1296characters that make up the strings, starting from the first and
1297proceeding until both strings are exhausted and all characters are
1298found to be equal, until a pair of characters compares unequal, or
1299until one of the strings is exhausted before the other.</t>
1300
1301<t>This character comparison requires that each pair of characters be
1302put in comparable encoding form. For example, should one IRI be stored
1303in a byte array in UTF-8 encoding form and the second in a UTF-16
1304encoding form, bit-for-bit comparisons applied naively will produce
1305errors. It is better to speak of equality on a character-for-character
1306rather than on a byte-for-byte or bit-for-bit basis.  In practical
1307terms, character-by-character comparisons should be done codepoint by
1308codepoint after conversion to a common character encoding form.
1309
1310When comparing character by character, the comparison function MUST
1311NOT map IRIs to URIs, because such a mapping would create additional
1312spurious equivalences. It follows that an IRI SHOULD NOT be modified
1313when being transported if there is any chance that this IRI might be
1314used in a context that uses Simple String Comparison.</t>
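
<figure><preamble>As a non-normative illustration, the following Python
sketch compares two stored IRIs code point by code point after
converting both to a common form. The function name iris_equal_simple
and its parameters are hypothetical and not defined by this
specification; the sketch assumes that each IRI is held as a byte
sequence together with the name of its character encoding.</preamble>

<artwork type="python"><![CDATA[
```python
def iris_equal_simple(a, a_encoding, b, b_encoding):
    """Hypothetical helper: Simple String Comparison of two stored IRIs.

    Both byte sequences are decoded to Python strings, which compare
    code point by code point. No IRI-to-URI mapping and no character
    normalization is applied.
    """
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://www.example.org/r\u00e9sum\u00e9.html"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

# A naive byte-for-byte test reports a difference ...
print(as_utf8 == as_utf16)
# ... while the character-for-character test does not.
print(iris_equal_simple(as_utf8, "utf-8", as_utf16, "utf-16-le"))
```
]]></artwork>

<postamble>Because nothing is normalized, an IRI and an alias of it
(for example, one that differs only in percent-encoding) still compare
as different under this test.</postamble></figure>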
1315
1316
1317<t>False negatives are caused by the production and use of IRI
1318aliases. Unnecessary aliases can be reduced, regardless of the
1319comparison method, by consistently providing IRI references in an
1320already normalized form (i.e., a form identical to what would be
1321produced after normalization is applied, as described below).
1322Protocols and data formats often limit some IRI comparisons to simple
1323string comparison, based on the theory that people and implementations
1324will, in their own best interest, be consistent in providing IRI
1325references, or at least be consistent enough to negate any efficiency
1326that might be obtained from further normalization.</t>
1327</section> <!-- stringcomp -->
1328
1329<section title="Syntax-Based Normalization">
1330
1331<figure><preamble>Implementations may use logic based on the
1332definitions provided by this specification to reduce the probability
1333of false negatives. This processing is moderately higher in cost than
1334character-for-character string comparison. For example, an application
1335using this approach could reasonably consider the following two IRIs
1336equivalent:</preamble>
1337
1338<artwork>
1339   example://a/b/c/%7Bfoo%7D/ros&amp;#xE9;
1340   eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
1341</artwork></figure>
1342
1343<t>Web user agents, such as browsers, typically apply this type of IRI
1344normalization when determining whether a cached response is
1345available. Syntax-based normalization includes such techniques as case
1346normalization, character normalization, percent-encoding
1347normalization, and removal of dot-segments.</t>
1348
1349<section title="Case Normalization">
1350
1351<t>For all IRIs, the hexadecimal digits within a percent-encoding
1352triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1353should be normalized to use uppercase letters for the digits A-F.</t>
1354
1355<t>When an IRI uses components of the generic syntax, the component
1356syntax equivalence rules always apply; namely, that the scheme and
1357US-ASCII only host are case insensitive and therefore should be
1358normalized to lowercase. For example, the URI
1359"HTTP://www.EXAMPLE.com/" is equivalent to
1360"http://www.example.com/". Case equivalence for non-ASCII characters
in IRI components that are IDNs is discussed in <xref
1362target="schemecomp"></xref>.  The other generic syntax components are
1363assumed to be case sensitive unless specifically defined otherwise by
1364the scheme.</t>
1365
1366<t>Creating schemes that allow case-insensitive syntax components
1367containing non-ASCII characters should be avoided. Case normalization
1368of non-ASCII characters can be culturally dependent and is always a
1369complex operation. The only exception concerns non-ASCII host names
1370for which the character normalization includes a mapping step derived
1371from case folding.</t>
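
<figure><preamble>The case rules above might be sketched as follows.
This is a non-normative illustration: the name normalize_case is
hypothetical, components are split with the regular expression from
Appendix B of RFC 3986, and only the letters A-Z in the scheme and host
are lowercased, so non-ASCII host names are deliberately left
untouched.</preamble>

<artwork type="python"><![CDATA[
```python
import re

# Component-splitting regular expression from Appendix B of RFC 3986.
_URI = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def _lower_ascii(text):
    """Lowercase A-Z only; non-ASCII characters are left untouched."""
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in text)


def normalize_case(iri):
    """Hypothetical helper: case-normalize scheme, host, pct triplets."""
    m = _URI.match(iri)
    out = ""
    if m.group(2) is not None:
        out += _lower_ascii(m.group(2)) + ":"
    if m.group(4) is not None:
        # Keep userinfo as is; lowercase only the host (and port) part.
        userinfo, at, hostport = m.group(4).rpartition("@")
        out += "//" + userinfo + at + _lower_ascii(hostport)
    # Path, query, and fragment are assumed to be case sensitive.
    out += m.group(5) + (m.group(6) or "") + (m.group(8) or "")
    # Uppercase the hexadecimal digits of every percent-encoding triplet.
    return _PCT.sub(lambda t: t.group(0).upper(), out)


print(normalize_case("HTTP://www.EXAMPLE.com/Some%3aPath"))
```
]]></artwork></figure>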
1372
1373</section> <!-- casenorm -->
1374
1375<section title="Character Normalization" anchor="normalization">
1376
1377<t>The Unicode Standard <xref target="UNIV4"></xref> defines various
1378equivalences between sequences of characters for various
1379purposes. Unicode Standard Annex #15 <xref target="UTR15"></xref>
1380defines various Normalization Forms for these equivalences, in
1381particular Normalization Form C (NFC, Canonical Decomposition,
1382followed by Canonical Composition) and Normalization Form KC (NFKC,
1383Compatibility Decomposition, followed by Canonical Composition).</t>
1384
1385<t> IRIs already in Unicode MUST NOT be normalized before parsing or
1386interpreting. In many non-Unicode character encodings, some text
1387cannot be represented directly. For example, the word "Vietnam" is
1388natively written "Vi&amp;#x1EC7;t Nam" (containing a LATIN SMALL
1389LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct
1390transcoding from the windows-1258 character encoding leads to
1391"Vi&amp;#xEA;&amp;#x323;t Nam" (containing a LATIN SMALL LETTER E WITH
1392CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct transcoding of
1393other 8-bit encodings of Vietnamese may lead to other
1394representations.</t>
1395
1396<t>Equivalence of IRIs MUST rely on the assumption that IRIs are
1397appropriately pre-character-normalized rather than apply character
1398normalization when comparing two IRIs. The exceptions are conversion
1399from a non-digital form, and conversion from a non-UCS-based character
1400encoding to a UCS-based character encoding. In these cases, NFC or a
1401normalizing transcoder using NFC MUST be used for interoperability. To
1402avoid false negatives and problems with transcoding, IRIs SHOULD be
1403created by using NFC. Using NFKC may avoid even more problems; for
1404example, by choosing half-width Latin letters instead of full-width
1405ones, and full-width instead of half-width Katakana.</t>
1406
1407
1408<t>As an example,
1409"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" (in XML
1410Notation) is in NFC. On the other hand,
1411"http://www.example.org/re&amp;#x301;sume&amp;#x301;.html" is not in
1412NFC.</t>
1413
1414<t>The former uses precombined e-acute characters, and the latter uses
1415"e" characters followed by combining acute accents. Both usages are
1416defined as canonically equivalent in <xref target="UNIV4"></xref>.</t>
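
<figure><preamble>The difference between the two forms can be checked
with the unicodedata module of the Python standard library, as in the
non-normative sketch below. The helper names is_nfc and
to_nfc_at_creation are hypothetical; the second one is meant for use
when an IRI is created or transcoded, not when two IRIs are
compared.</preamble>

<artwork type="python"><![CDATA[
```python
import unicodedata


def is_nfc(iri):
    """Return True if the string is already in Normalization Form C."""
    return unicodedata.is_normalized("NFC", iri)


def to_nfc_at_creation(text):
    """Hypothetical helper: apply NFC when an IRI is first created."""
    return unicodedata.normalize("NFC", text)


precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

print(is_nfc(precomposed), is_nfc(decomposed))
# The two strings differ code point by code point until NFC is applied.
print(precomposed == decomposed)
print(to_nfc_at_creation(decomposed) == precomposed)
```
]]></artwork></figure>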
1417
1418<t><list style="hanging">
1419
1420<t hangText="Note:">
1421Because it is unknown how a particular sequence of characters is being
1422treated with respect to character normalization, it would be
1423inappropriate to allow third parties to normalize an IRI
1424arbitrarily. This does not contradict the recommendation that when a
1425resource is created, its IRI should be as character normalized as
1426possible (i.e., NFC or even NFKC). This is similar to the
1427uppercase/lowercase problems.  Some parts of a URI are case
1428insensitive (for example, the domain name). For others, it is unclear
1429whether they are case sensitive, case insensitive, or something in
1430between (e.g., case sensitive, but with a multiple choice selection if
1431the wrong case is used, instead of a direct negative result).  The
1432best recipe is that the creator use a reasonable capitalization and,
1433when transferring the URI, capitalization never be
1434changed.</t></list></t>
1435
1436<t>Various IRI schemes may allow the usage of Internationalized Domain
1437Names (IDN) <xref target="RFC3490"></xref> either in the ireg-name
1438part or elsewhere. Character Normalization also applies to IDNs, as
1439discussed in <xref target="schemecomp"></xref>.</t>
1440</section> <!-- charnorm -->
1441
1442<section title="Percent-Encoding Normalization">
1443
1444<t>The percent-encoding mechanism (Section 2.1 of <xref
1445target="RFC3986"></xref>) is a frequent source of variance among
1446otherwise identical IRIs. In addition to the case normalization issue
1447noted above, some IRI producers percent-encode octets that do not
1448require percent-encoding, resulting in IRIs that are equivalent to
1449their nonencoded counterparts. These IRIs should be normalized by
1450decoding any percent-encoded octet sequence that corresponds to an
1451unreserved character, as described in section 2.3 of <xref
1452target="RFC3986"></xref>.</t>
1453
1454<t>For actual resolution, differences in percent-encoding (except for
1455the percent-encoding of reserved characters) MUST always result in the
1456same resource.  For example, "http://example.org/~user",
1457"http://example.org/%7euser", and "http://example.org/%7Euser", must
1458resolve to the same resource.</t>
1459
1460<t>If this kind of equivalence is to be tested, the percent-encoding
1461of both IRIs to be compared has to be aligned; for example, by
1462converting both IRIs to URIs (see Section 3.1), eliminating escape
1463differences in the resulting URIs, and making sure that the case of
1464the hexadecimal characters in the percent-encoding is always the same
1465(preferably upper case). If the IRI is to be passed to another
1466application or used further in some other way, its original form MUST
1467be preserved.  The conversion described here should be performed only
1468for local comparison.</t>
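
<figure><preamble>A non-normative sketch of this alignment, operating on
an IRI that has already been converted to a URI, is given below. The
function name normalize_percent_encoding is hypothetical. The result is
intended for local comparison only; the original string is left
unchanged.</preamble>

<artwork type="python"><![CDATA[
```python
import re

# Unreserved characters per Section 2.3 of RFC 3986.
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri):
    """Hypothetical helper: align percent-encoding for comparison."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            # Unnecessarily encoded: decode it.
            return char
        # Anything else stays encoded, with uppercase hex digits.
        return match.group(0).upper()
    return _PCT.sub(fix, uri)


for uri in ("http://example.org/~user",
            "http://example.org/%7euser",
            "http://example.org/%7Euser"):
    print(normalize_percent_encoding(uri))
```
]]></artwork></figure>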
1469
1470</section> <!-- pctnorm -->
1471
1472<section title="Path Segment Normalization">
1473
1474<t>The complete path segments "." and ".." are intended only for use
1475within relative references (Section 4.1 of <xref
1476target="RFC3986"></xref>) and are removed as part of the reference
1477resolution process (Section 5.2 of <xref target="RFC3986"></xref>).
1478However, some implementations may incorrectly assume that reference
1479resolution is not necessary when the reference is already an IRI, and
1480thus fail to remove dot-segments when they occur in non-relative
1481paths.  IRI normalizers should remove dot-segments by applying the
1482remove_dot_segments algorithm to the path, as described in Section
14835.2.4 of <xref target="RFC3986"></xref>.</t>
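
<figure><preamble>For illustration only, the remove_dot_segments
algorithm of Section 5.2.4 of RFC 3986 might be transcribed into Python
as follows. The sketch works on the path component alone and assumes
that the caller has already separated the path from the rest of the
IRI.</preamble>

<artwork type="python"><![CDATA[
```python
def remove_dot_segments(path):
    """Sketch of remove_dot_segments (RFC 3986, Section 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # drop the last output segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any)
            # from the input to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)


print(remove_dot_segments("/a/./b/../b/c"))
```
]]></artwork></figure>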
1484
1485</section> <!-- pathnorm -->
1486</section> <!-- ladder -->
1487
1488<section title="Scheme-Based Normalization" anchor="schemecomp">
1489
1490<t>The syntax and semantics of IRIs vary from scheme to scheme, as
1491described by the defining specification for each
1492scheme. Implementations may use scheme-specific rules, at further
1493processing cost, to reduce the probability of false negatives. For
1494example, because the "http" scheme makes use of an authority
1495component, has a default port of "80", and defines an empty path to be
1496equivalent to "/", the following four IRIs are equivalent:</t>
1497
1498<figure><artwork>
1499   http://example.com
1500   http://example.com/
1501   http://example.com:/
1502   http://example.com:80/</artwork></figure>
1503
1504<t>In general, an IRI that uses the generic syntax for authority with
1505an empty path should be normalized to a path of "/". Likewise, an
1506explicit ":port", for which the port is empty or the default for the
1507scheme, is equivalent to one where the port and its ":" delimiter are
1508elided and thus should be removed by scheme-based normalization. For
1509example, the second IRI above is the normal form for the "http"
1510scheme.</t>
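
<figure><preamble>The two rules for the "http" scheme can be sketched as
follows. This is a non-normative illustration; the function name
normalize_http is hypothetical, and the sketch handles only the empty
path and the empty or default port, leaving every other part of the IRI
as it is.</preamble>

<artwork type="python"><![CDATA[
```python
import re

# Component-splitting regular expression from Appendix B of RFC 3986.
_URI = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def normalize_http(iri):
    """Hypothetical helper: scheme-based normalization for "http"."""
    m = _URI.match(iri)
    scheme, authority, path = m.group(2), m.group(4), m.group(5)
    if scheme is None or scheme.lower() != "http" or authority is None:
        return iri
    # Remove ":port" when the port is empty or the default (80).
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", "80"):
        authority = host
    # An empty path is equivalent to "/".
    if path == "":
        path = "/"
    # The "?" and "#" delimiters are kept even if what follows is empty.
    return (scheme + "://" + authority + path
            + (m.group(6) or "") + (m.group(8) or ""))


for iri in ("http://example.com", "http://example.com/",
            "http://example.com:/", "http://example.com:80/"):
    print(normalize_http(iri))
```
]]></artwork>

<postamble>All four inputs yield "http://example.com/", whereas
"http://example.com/?" keeps its "?" and therefore remains
distinct.</postamble></figure>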
1511
1512<t>Another case where normalization varies by scheme is in the
1513handling of an empty authority component or empty host
1514subcomponent. For many scheme specifications, an empty authority or
1515host is considered an error; for others, it is considered equivalent
1516to "localhost" or the end-user's host. When a scheme defines a default
1517for authority and an IRI reference to that default is desired, the
1518reference should be normalized to an empty authority for the sake of
1519uniformity, brevity, and internationalization. If, however, either the
1520userinfo or port subcomponents are non-empty, then the host should be
1521given explicitly even if it matches the default.</t>
1522
1523<t>Normalization should not remove delimiters when their associated
1524component is empty unless it is licensed to do so by the scheme
1525specification. For example, the IRI "http://example.com/?" cannot be
1526assumed to be equivalent to any of the examples above. Likewise, the
1527presence or absence of delimiters within a userinfo subcomponent is
1528usually significant to its interpretation.  The fragment component is
1529not subject to any scheme-based normalization; thus, two IRIs that
1530differ only by the suffix "#" are considered different regardless of
1531the scheme.</t>
1532 
1533<t>Some IRI schemes allow the usage of Internationalized Domain
1534Names (IDN) <xref target='RFC5890'></xref> either in their ireg-name
part or elsewhere. When in use in IRIs, those names SHOULD
1536conform to the definition of U-Label in <xref
1537target='RFC5890'></xref>. An IRI containing an invalid IDN cannot
1538successfully be resolved. For legibility purposes, they
1539SHOULD NOT be converted into ASCII Compatible Encoding (ACE).</t>
1540
1541<t>Scheme-based normalization may also consider IDN
1542components and their conversions to punycode as equivalent. As an
1543example, "http://r&amp;#xE9;sum&amp;#xE9;.example.org" may be
1544considered equivalent to
"http://xn--rsum-bpad.example.org".</t>

<t>Other scheme-specific
normalizations are possible.</t>
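
<figure><preamble>The equivalence between an IDN and its ASCII
Compatible Encoding can be tested as in the following non-normative
sketch. It relies on the "idna" codec of the Python standard library,
which implements the older IDNA 2003 rules of RFC 3490 rather than those
of RFC 5890, so results may differ for some names. The function names
host_to_ace and hosts_equivalent are hypothetical.</preamble>

<artwork type="python"><![CDATA[
```python
def host_to_ace(host):
    """Convert a host name to its ACE form (IDNA 2003 rules)."""
    return host.encode("idna").decode("ascii")


def hosts_equivalent(a, b):
    """Hypothetical helper: compare two host names via their ACE form."""
    return host_to_ace(a).lower() == host_to_ace(b).lower()


print(host_to_ace("r\u00e9sum\u00e9.example.org"))
print(hosts_equivalent("r\u00e9sum\u00e9.example.org",
                       "xn--rsum-bpad.example.org"))
```
]]></artwork>

<postamble>The conversion is used for comparison only; for legibility,
the IRI itself keeps the name in its original
form.</postamble></figure>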
1547
1548</section> <!-- schemenorm -->
1549
1550<section title="Protocol-Based Normalization">
1551
1552<t>Substantial effort to reduce the incidence of false negatives is
1553often cost-effective for web spiders. Consequently, they implement
1554even more aggressive techniques in IRI comparison. For example, if
1555they observe that an IRI such as</t>
1556
1557<figure><artwork>
1558   http://example.com/data</artwork></figure>
1559<t>redirects to an IRI differing only in the trailing slash</t>
1560<figure><artwork>
1561   http://example.com/data/</artwork></figure>
1562
1563<t>they will likely regard the two as equivalent in the future.  This
1564kind of technique is only appropriate when equivalence is clearly
1565indicated by both the result of accessing the resources and the common
1566conventions of their scheme's dereference algorithm (in this case, use
1567of redirection by HTTP origin servers to avoid problems with relative
1568references).</t>
1569
1570</section> <!-- protonorm -->
1571</section> <!-- equivalence -->
1572</section> 
1573
1574<section title="Use of IRIs" anchor="IRIuse">
1575
1576<section title="Limitations on UCS Characters Allowed in IRIs" anchor="limitations">
1577
1578<t>This section discusses limitations on characters and character
1579sequences usable for IRIs beyond those given in <xref target="abnf"/>
1580and <xref target="visual"/>. The considerations in this section are
1581relevant when IRIs are created and when URIs are converted to
1582IRIs.</t>
1583
1584<t>
1585
1586<list style="hanging"><t hangText="a.">The repertoire of characters allowed
1587    in each IRI component is limited by the definition of that component.
1588    For example, the definition of the scheme component does not allow
1589    characters beyond US-ASCII.
1590    <vspace blankLines="1"/>
1591    (Note: In accordance with URI practice, generic IRI
1592    software cannot and should not check for such limitations.)</t>
1593
1594<t hangText="b.">The UCS contains many areas of characters for which
1595    there are strong visual look-alikes. Because of the likelihood of
1596    transcription errors, these also should be avoided. This includes
1597    the full-width equivalents of Latin characters, half-width
1598    Katakana characters for Japanese, and many others. It also
1599    includes many look-alikes of "space", "delims", and "unwise",
1600    characters excluded in <xref target="RFC3491"/>.</t>
1601   
1602</list>
1603</t>
1604
1605<t>Additional information is available from <xref target="UNIXML"/>.
1606    <xref target="UNIXML"/> is written in the context of running text
1607    rather than in that of identifiers. Nevertheless, it discusses
1608    many of the categories of characters not appropriate for IRIs.</t>
1609</section> <!-- limitations -->
1610
1611<section title="Software Interfaces and Protocols">
1612
1613<t>Although an IRI is defined as a sequence of characters, software
1614interfaces for URIs typically function on sequences of octets or other
1615kinds of code units. Thus, software interfaces and protocols MUST
1616define which character encoding is used.</t>
1617
1618<t>Intermediate software interfaces between IRI-capable components and
1619URI-only components MUST map the IRIs per <xref target="mapping"/>,
1620when transferring from IRI-capable to URI-only components.
1621
1622This mapping SHOULD be applied as late as possible. It SHOULD NOT be
1623applied between components that are known to be able to handle IRIs.</t>
1624</section> <!-- software -->
1625
1626<section title="Format of URIs and IRIs in Documents and Protocols">
1627
1628<t>Document formats that transport URIs may have to be upgraded to allow
1629the transport of IRIs. In cases where the document as a whole
1630has a native character encoding, IRIs MUST also be encoded in this
1631character encoding and converted accordingly by a parser or interpreter.
1632
1633IRI characters not expressible in the native character encoding SHOULD
1634be escaped by using the escaping conventions of the document format if
1635such conventions are available. Alternatively, they MAY be
1636percent-encoded according to <xref target="mapping"/>. For example, in
1637HTML or XML, numeric character references SHOULD be used. If a
1638document as a whole has a native character encoding and that character
1639encoding is not UTF-8, then IRIs MUST NOT be placed into the document
1640in the UTF-8 character encoding.</t>
1641
1642<t>((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1643although they use different terminology. HTML 4.0 <xref
1644target="HTML4"/> defines the conversion from IRIs to URIs as
1645error-avoiding behavior. XML 1.0 <xref target="XML1"/>, XLink <xref
1646target="XLink"/>, XML Schema <xref target="XMLSchema"/>, and
1647specifications based upon them allow IRIs. Also, it is expected that
1648all relevant new W3C formats and protocols will be required to handle
1649IRIs <xref target="CharMod"/>.</t>
1650
1651</section> <!-- format -->
1652
1653<section title="Use of UTF-8 for Encoding Original Characters" anchor="UTF8use">
1654
1655<t>This section discusses details and gives examples for point c) in
1656<xref target="Applicability"/>. To be able to use IRIs, the URI
1657corresponding to the IRI in question has to encode original characters
1658into octets by using UTF-8.  This can be specified for all URIs of a
1659URI scheme or can apply to individual URIs for schemes that do not
1660specify how to encode original characters.  It can apply to the whole
1661URI, or only to some part. For background information on encoding
1662characters into URIs, see also Section 2.5 of <xref
1663target="RFC3986"/>.</t>
1664
1665<t>For new URI schemes, using UTF-8 is recommended in <xref
1666target="RFC4395bis"/>.  Examples where UTF-8 is already used are the URN
1667syntax <xref target="RFC2141"/>, IMAP URLs <xref target="RFC2192"/>,
1668and POP URLs <xref target="RFC2384"/>.  On the other hand, because the
1669HTTP URI scheme does not specify how to encode original characters,
1670only some HTTP URLs can have corresponding but different IRIs.</t>
1671
1672<t>For example, for a document with a URI
1673of<vspace/>"http://www.example.org/r%C3%A9sum%C3%A9.html", it is
1674possible to construct a corresponding IRI (in XML notation, see <xref
1675target="sec-Notation"/>):
1676"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" ("&amp;#xE9;"
1677stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded
1678and percent-encoded representation of that character). On the other
1679hand, for a document with a URI of
1680"http://www.example.org/r%E9sum%E9.html", the percent-encoding octets
1681cannot be converted to actual characters in an IRI, as the
1682percent-encoding is not based on UTF-8.</t>
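
<figure><preamble>The distinction can be made mechanically, as the
following non-normative sketch suggests. The function name
uri_to_iri_sketch is hypothetical, and the sketch is deliberately
simplified: a run of percent-encoded octets is converted only if the
whole run is valid UTF-8 and decodes exclusively to characters at or
above U+00A0. It does not check the full ucschar production or the
restrictions on bidirectional formatting characters.</preamble>

<artwork type="python"><![CDATA[
```python
import re

# One or more consecutive percent-encoding triplets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri_sketch(uri):
    """Hypothetical helper: decode UTF-8 based percent-encoding."""
    def decode_run(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            # Not based on UTF-8: keep the percent-encoding.
            return match.group(0)
        if all(ord(ch) >= 0xA0 for ch in text):
            return text
        # Runs containing US-ASCII or C1 controls are left encoded.
        return match.group(0)
    return _PCT_RUN.sub(decode_run, uri)


print(uri_to_iri_sketch("http://www.example.org/r%C3%A9sum%C3%A9.html"))
print(uri_to_iri_sketch("http://www.example.org/r%E9sum%E9.html"))
```
]]></artwork></figure>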
1683
1684<t>For most URI schemes, there is no need to upgrade their scheme
1685definition in order for them to work with IRIs.  The main case where
1686upgrading makes sense is when a scheme definition, or a particular
1687component of a scheme, is strictly limited to the use of US-ASCII
1688characters with no provision to include non-ASCII characters/octets
1689via percent-encoding, or if a scheme definition currently uses highly
1690scheme-specific provisions for the encoding of non-ASCII characters.
1691An example of this is the mailto: scheme <xref target="RFC2368"/>.</t>
1692
1693<t>This specification updates the IANA registry of URI schemes to note
1694their applicability to IRIs, see <xref target="iana"/>.  All IRIs use
1695URI schemes, and all URIs with URI schemes can be used as IRIs, even
1696though in some cases only by using URIs directly as IRIs, without any
1697conversion.</t>
1698
1699<t>Scheme definitions can impose restrictions on the syntax of
1700scheme-specific URIs; i.e., URIs that are admissible under the generic
1701URI syntax <xref target="RFC3986"/> may not be admissible due to
1702narrower syntactic constraints imposed by a URI scheme
1703specification. URI scheme definitions cannot broaden the syntactic
1704restrictions of the generic URI syntax; otherwise, it would be
1705possible to generate URIs that satisfied the scheme-specific syntactic
1706constraints without satisfying the syntactic constraints of the
1707generic URI syntax. However, additional syntactic constraints imposed
by URI scheme specifications are applicable to IRIs, as the
1709corresponding URI resulting from the mapping defined in <xref
1710target="mapping"/> MUST be a valid URI under the syntactic
1711restrictions of generic URI syntax and any narrower restrictions
1712imposed by the corresponding URI scheme specification.</t>
1713
1714<t>The requirement for the use of UTF-8 generally applies to all parts
1715of a URI.  However, it is possible that the capability of IRIs to
1716represent a wide range of characters directly is used just in some
1717parts of the IRI (or IRI reference). The other parts of the IRI may
1718only contain US-ASCII characters, or they may not be based on
1719UTF-8. They may be based on another character encoding, or they may
1720directly encode raw binary data (see also <xref
1721target="RFC2397"/>). </t>
1722
1723<t>For example, it is possible to have a URI reference
1724of<vspace/>"http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9",
1725where the document name is encoded in iso-8859-1 based on server
1726settings, but where the fragment identifier is encoded in UTF-8 according
1727to <xref target="XPointer"/>. The IRI corresponding to the above
1728URI would be (in XML notation)<vspace/>"http://www.example.org/r%E9sum%E9.xml#r&amp;#xE9;sum&amp;#xE9;".</t>
1729
1730<t>Similar considerations apply to query parts. The functionality
1731of IRIs (namely, to be able to include non-ASCII characters) can
1732only be used if the query part is encoded in UTF-8.</t>
1733
1734</section> <!-- utf8 -->
1735
1736<section title="Relative IRI References">
1737<t>Processing of relative IRI references against a base is handled
1738straightforwardly; the algorithms of <xref target="RFC3986"/> can
1739be applied directly, treating the characters additionally allowed
1740in IRI references in the same way that unreserved characters are in URI
1741references.</t>
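
<figure><preamble>As a non-normative illustration, the urljoin function
of the Python standard library, which follows the reference resolution
rules of RFC 3986 for schemes such as "http", accepts non-ASCII
characters in both the base and the relative reference without any
prior conversion. The example IRIs are made up for this
sketch.</preamble>

<artwork type="python"><![CDATA[
```python
from urllib.parse import urljoin

base = "http://www.example.org/r\u00e9sum\u00e9/index.html"

# Non-ASCII characters pass through resolution like unreserved ones.
print(urljoin(base, "../caf\u00e9.html"))
print(urljoin(base, "d\u00e9tail.html"))
```
]]></artwork></figure>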
1742
1743</section> <!-- relative -->
1744</section> <!-- IRIuse -->
1745
1746<section title="Liberal handling of otherwise invalid IRIs" anchor="LEIRIHREF">
1747
1748<t>(EDITOR NOTE: This Section may move to an appendix.)
1749 
1750Some technical specifications and widely-deployed software have
1751allowed additional variations and extensions of IRIs to be used in
1752syntactic components. This section describes two widely-used
1753preprocessing agreements. Other technical specifications may wish to
1754reference a syntactic component which is "a valid IRI or a string that
1755will map to a valid IRI after this preprocessing algorithm". These two
1756variants are known as <xref target="LEIRI">Legacy Extended IRI or
LEIRI</xref>, and <xref target="HTML5">Web Address</xref>.
1758</t>
1759
1760<t>Future technical specifications SHOULD NOT allow conforming
1761producers to produce, or conforming content to contain, such forms,
1762as they are not interoperable with other IRI consuming software.</t>
1763
1764<section title="LEIRI processing"  anchor="LEIRIspec">
1765  <t>This section defines Legacy Extended IRIs (LEIRIs).
1766    The syntax of Legacy Extended IRIs is the same as that for IRIs,
1767    except that the ucschar production is replaced by the leiri-ucschar production:</t>
1768<figure>
1769
1770<artwork>
  leiri-ucschar  = " " / "&lt;" / "&gt;" / DQUOTE / "{" / "}" / "|"
1772                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1773                   / %xE000-FFFD / %x10000-10FFFF
1774</artwork>
1775
1776<postamble>
1777  Among other extensions, processors based on this specification also
1778  did not enforce the restriction on bidirectional formatting
1779  characters in <xref target="visual"></xref>, and the iprivate
1780  production becomes redundant.</postamble>
1781</figure>
1782
1783<t>To convert a string allowed as a LEIRI to an IRI, each character
1784allowed in leiri-ucschar but not in ucschar must be percent-encoded
1785using <xref target="compmapping"/>.</t>
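
<figure><preamble>A non-normative sketch of this conversion follows. It
assumes that the ucschar production has the ranges given in RFC 3987 and
that the percent-encoding step operates on the UTF-8 octets of each
character. The names is_ucschar, is_leiri_extra, and leiri_to_iri are
hypothetical.</preamble>

<artwork type="python"><![CDATA[
```python
# US-ASCII characters added by the leiri-ucschar production.
_EXTRA_ASCII = set(' <>"{}|\\^`')


def is_ucschar(cp):
    """ucschar ranges as given in RFC 3987 (an assumption here)."""
    if (0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF
            or 0xFDF0 <= cp <= 0xFFEF):
        return True
    if 0x10000 <= cp <= 0xDFFFD:
        # Planes 1 to 13, minus the last two code points of each plane.
        return (cp & 0xFFFF) <= 0xFFFD
    return 0xE1000 <= cp <= 0xEFFFD


def is_leiri_extra(ch):
    """True if ch is allowed by leiri-ucschar but not by ucschar."""
    cp = ord(ch)
    if ch in _EXTRA_ASCII or cp <= 0x1F:
        return True
    in_leiri = (0x7F <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    return in_leiri and not is_ucschar(cp)


def leiri_to_iri(leiri):
    """Hypothetical helper: percent-encode the LEIRI-only characters."""
    out = []
    for ch in leiri:
        if is_leiri_extra(ch):
            out.append("".join("%{:02X}".format(octet)
                               for octet in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(leiri_to_iri("http://example.org/a b{c}"))
```
]]></artwork></figure>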
1786</section> <!-- leiriproc -->
1787
1788<section title="Web Address processing" anchor="webaddress">
1789
1790<t>Many popular web browsers have taken the approach of being quite
1791liberal in what is accepted as a "URL" or its relative
1792forms. This section describes their behavior in terms of a preprocessor
1793which maps strings into the IRI space for subsequent parsing and
1794interpretation as an IRI.</t>
1795
1796<t>In some situations, it might be appropriate to describe the syntax
1797that a liberal consumer implementation might accept as a "Web
1798Address" or "Hypertext Reference" or "HREF". However,
1799technical specifications SHOULD restrict the syntactic form allowed by compliant producers
1800to the IRI or IRI reference syntax defined in this document
1801even if they want to mandate this processing.</t>
1802
1803<t>
1804Summary:
1805<list style="symbols">
1806   <t>Leading and trailing whitespace is removed.</t>
1807   <t>Some additional characters are removed.</t>
1808   <t>Some additional characters are allowed and escaped (as with LEIRI).</t>
1809   <t>If interpreting an IRI as a URI, the pct-encoding of the query
1810   component of the parsed URI component depends on operational
1811   context.</t>
1812</list>
1813</t>
1814
<t>Each string provided may have an associated charset (called
the HRef-charset here); this defaults to UTF-8.
For web browsers interpreting HTML, the HRef-charset
of a string is determined as follows:
1819
1820<list style="hanging">
1821<t hangText="If the string came from a script (e.g. as an argument to
1822 a method)">The HRef-charset is the script's charset.</t>
1823
1824<t hangText="If the string came from a DOM node (e.g. from an
1825  element)">The node has a Document, and the HRef-charset is the
1826  Document's character encoding.</t>
1827
1828<t hangText="If the string had a HRef-charset defined when the string was
1829created or defined">The HRef-charset is as defined.</t>
1830
1831</list></t>
1832
<t>If the resulting HRef-charset is a Unicode-based character encoding
1834(e.g., UTF-16), then use UTF-8 instead.</t>
1835
1836
1837<figure>
1838<preamble>The syntax for Web Addresses is obtained by replacing the 'ucschar',
1839  pct-form, and path-sep rules with the href-ucschar, href-pct-form, and href-path-sep
1840  rules below. In addition, some characters are stripped.</preamble>
1841
1842<artwork type='abnf'>
1843  href-ucschar  = " " / "&lt;" / "&gt;" / DQUOTE / "{" / "}" / "|"
1844                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1845                   / %xE000-FFFD / %x10000-10FFFF
1846  href-pct-form = pct-encoded / "%"
1847  href-path-sep = "/" / "\"
1848  href-strip    = &lt;to be done&gt;
1849</artwork>
1850
1851<postamble>
1852(NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT SENTENCE)
1853browsers did not enforce the restriction on bidirectional formatting
1854  characters in <xref target="visual"></xref>, and the iprivate
1855  production becomes redundant.</postamble>
1856</figure>
1857
1858<t>'Web Address processing' requires the following additional
1859preprocessing steps:
1860
1861<list style="numbers">
1862
1863<t>Leading and trailing instances of space (U+0020),
CR (U+000D), LF (U+000A), and TAB (U+0009) characters are removed.</t>
1865
<t>Strip all characters in href-strip.</t>
1867  <t>Percent-encode all characters in href-ucschar not in ucschar.</t>
1868  <t>Replace occurrences of "%" not followed by two hexadecimal digits by "%25".</t>
1869  <t>Convert backslashes ('\') matching href-path-sep to forward slashes ('/').</t>
1870</list></t>
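
<figure><preamble>Part of this preprocessing might be sketched as
follows. The sketch is non-normative and incomplete: it covers only the
removal of leading and trailing whitespace, the repair of stray "%"
characters, and the conversion of backslashes. The href-strip step is
omitted because that rule is not yet defined, and the percent-encoding
of the remaining href-ucschar characters is omitted as well. As a
further simplification, every backslash is converted, not only those in
a path separator position. The function name preprocess_web_address is
hypothetical.</preamble>

<artwork type="python"><![CDATA[
```python
import re

# A "%" that is not followed by two hexadecimal digits.
_STRAY_PCT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def preprocess_web_address(text):
    """Hypothetical helper: partial Web Address preprocessing."""
    # Remove leading and trailing space, CR, LF, and TAB.
    text = text.strip(" \r\n\t")
    # Replace stray "%" characters by "%25".
    text = _STRAY_PCT.sub("%25", text)
    # Convert backslashes to forward slashes (simplified: all of them).
    return text.replace("\\", "/")


print(preprocess_web_address("  http:\\\\example.com\\a%zz%41 \r\n"))
```
]]></artwork></figure>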
1871</section> <!-- webaddress -->
1872
1873<section title="Characters not allowed in IRIs" anchor="notAllowed">
1874
1875<t>This section provides a list of the groups of characters and code
1876points that are allowed by LEIRI or HREF but are not allowed in IRIs or are
1877allowed in IRIs only in the query part. For each group of characters,
1878advice on the usage of these characters is also given, concentrating
1879on the reasons for why they are excluded from IRI use.</t>
1880
1881<t>
1882
1883<list><t>Space (U+0020): Some formats and applications use space as a
1884delimiter, e.g. for items in a list. Appendix C of <xref
1885target="RFC3986"></xref> also mentions that white space may have to be
1886added when displaying or printing long URIs; the same applies to long
IRIs. This means that spaces can disappear, or can cause what is
intended as a single IRI or IRI reference to be treated as two or more
1889separate IRIs.</t>
1890
1891<t>Delimiters "&lt;" (U+003C), "&gt;" (U+003E), and '"' (U+0022):
1892Appendix C of <xref target="RFC3986"></xref> suggests the use of
1893double-quotes ("http://example.com/") and angle brackets
1894(&lt;http://example.com/&gt;) as delimiters for URIs in plain
1895text. These conventions are often used, and also apply to IRIs.  Using
1896these characters in strings intended to be IRIs would result in the
1897IRIs being cut off at the wrong place.</t>
1898
1899<t>Unwise characters "\" (U+005C), "^" (U+005E), "`"
1900(U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These
characters were originally excluded from URIs because the
respective codepoints are assigned to different graphic characters in
some 7-bit or 8-bit encodings. Despite the move to Unicode, some of
1904these characters are still occasionally displayed differently on some
1905systems, e.g. U+005C may appear as a Japanese Yen symbol on some
1906systems. Also, the fact that these characters are not used in URIs or
1907IRIs has encouraged their use outside URIs or IRIs in contexts that
1908may include URIs or IRIs. If a string with such a character were used
1909as an IRI in such a context, it would likely be interpreted
1910piecemeal.</t>
1911
1912<t>The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
1913#x9F): There is generally no way to transmit these characters reliably
1914as text outside of a charset encoding.  Even when in encoded form,
1915many software components silently filter out some of these characters,
or may stop processing altogether when encountering some of
them. These characters may affect text display in subtle, unnoticeable
1918ways or in drastic, global, and irreversible ways depending on the
1919hardware and software involved. The use of some of these characters
1920would allow malicious users to manipulate the display of an IRI and
1921its context in many situations.</t>
1922
1923<t>Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
1924characters affect the display ordering of characters. If IRIs were
1925allowed to contain these characters and the resulting visual display
transcribed, they could not be converted back to electronic form
(logical order) unambiguously. These characters, if allowed in IRIs,
might allow malicious users to manipulate the display of an IRI and its
1929context.</t>
1930
1931<t>Specials (U+FFF0-FFFD): These code points provide functionality
1932beyond that useful in an IRI, for example byte order identification,
1933annotation, and replacements for unknown characters and objects. Their
1934use and interpretation in an IRI would serve no purpose and might lead
1935to confusing display variations.</t>
1936
1937<t>Private use code points (U+E000-F8FF, U+F0000-FFFFD,
1938U+100000-10FFFD): Display and interpretation of these code points is
1939by definition undefined without private agreement. Therefore, these
1940code points are not suited for use on the Internet. They are not
1941interoperable and may have unpredictable effects.</t>
1942
1943<t>Tags (U+E0000-E0FFF): These characters provide a way to language
1944tag in Unicode plain text. They are not appropriate for IRIs because
1945language information in identifiers cannot reliably be input,
1946transmitted (e.g. on a visual medium such as paper), or
1947recognized.</t>
1948
1949<t>Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1950U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1951U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1952U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1953U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1954non-characters. Applications may use some of them internally, but are
1955not prepared to interchange them.</t>
1956
1957</list></t>
1958
1959<t>LEIRI preprocessing disallowed some code points and
1960code units:
1961
1962<list><t>Surrogate code units (D800-DFFF): These do not represent
1963Unicode codepoints.</t></list></t>
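
<t>The ranges listed above can be collected into a small lookup. The
following is a minimal sketch, not part of this specification: the
function names and category labels are hypothetical, and only the
ranges spelled out in the list entries above are covered.</t>

<figure><artwork><![CDATA[
```python
# Ranges transcribed from the list above; the category labels are
# illustrative names chosen for this sketch only.
RANGES = {
    "delimiter": [(0x22, 0x22), (0x3C, 0x3C), (0x3E, 0x3E)],
    "unwise": [(0x5C, 0x5C), (0x5E, 0x5E), (0x60, 0x60), (0x7B, 0x7D)],
    "control": [(0x00, 0x1F), (0x7F, 0x9F)],
    "bidi formatting": [(0x200E, 0x200F), (0x202A, 0x202E)],
    "special": [(0xFFF0, 0xFFFD)],
    "private use": [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD),
                    (0x100000, 0x10FFFD)],
    "tag": [(0xE0000, 0xE0FFF)],
    "surrogate": [(0xD800, 0xDFFF)],
}


def is_noncharacter(cp):
    """True for U+FDD0-FDEF and the last two code points of planes 1-16."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return 0x1FFFE <= cp <= 0x10FFFF and (cp & 0xFFFF) >= 0xFFFE


def classify(cp):
    """Return the category label for a code point, or None if unlisted."""
    for label, spans in RANGES.items():
        if any(lo <= cp <= hi for lo, hi in spans):
            return label
    return "non-character" if is_noncharacter(cp) else None
```
]]></artwork></figure>

<t>A caller might run classify(ord(ch)) over every character of a
candidate string and reject or percent-encode anything that returns a
label; which of the two is appropriate depends on the context.</t>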
1964</section> <!-- notallowed -->
1965</section> <!-- lieirihref -->
1966 
1967<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
1968
1969<t>This informative section provides guidelines for supporting IRIs in
1970the same software components and operations that currently process
1971URIs: Software interfaces that handle URIs, software that allows users
1972to enter URIs, software that creates or generates URIs, software that
1973displays URIs, formats and protocols that transport URIs, and software
1974that interprets URIs. These may all require modification before
1975functioning properly with IRIs. The considerations in this section
1976also apply to URI references and IRI references.</t>
1977
1978<section title="URI/IRI Software Interfaces">
1979<t>Software interfaces that handle URIs, such as URI-handling APIs and
1980protocols transferring URIs, need interfaces and protocol elements
1981that are designed to carry IRIs.</t>
1982
1983<t>In case the current handling in an API or protocol is based on
1984US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
1985it is compatible with US-ASCII, is in accordance with the
1986recommendations of <xref target="RFC2277"/>, and makes converting to
1987URIs easy. In any case, the API or protocol definition must clearly
1988define the character encoding to be used.</t>
1989
1990<t>The transfer from URI-only to IRI-capable components requires no
1991mapping, although the conversion described in <xref
1992target="URItoIRI"/> above may be performed. It is preferable not to
1993perform this inverse conversion unless it is certain this can be done
1994correctly.</t>
1995</section>
1996
1997<section title="URI/IRI Entry">
1998
1999<t>Some components allow users to enter URIs into the system
2000by typing or dictation, for example. This software must be updated to allow
2001for IRI entry.</t>
2002
2003<t>A person viewing a visual representation of an IRI (as a sequence
2004of glyphs, in some order, in some visual display) or hearing an IRI
2005will use an entry method for characters in the user's language to
2006input the IRI. Depending on the script and the input method used, this
2007may be a more or less complicated process.</t>
2008
2009<t>The process of IRI entry must ensure, as much as possible, that the
2010restrictions defined in <xref target="abnf"/> are met. This may be
2011done by choosing appropriate input methods or variants/settings
2012thereof, by appropriately converting the characters being input, by
2013eliminating characters that cannot be converted, and/or by issuing a
2014warning or error message to the user.</t>
2015
2016<t>As an example of variant settings, input method editors for East
2017Asian Languages usually allow the input of Latin letters and related
2018characters in full-width or half-width versions. For IRI input, the
2019input method editor should be set so that it produces half-width Latin
2020letters and punctuation and full-width Katakana.</t>
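
<t>Where the input method cannot be configured, the entry component
can convert the characters itself. The sketch below assumes that
mapping the full-width forms U+FF01 to U+FF5E onto their US-ASCII
counterparts is the desired conversion; the function name is
hypothetical, and Katakana is deliberately left untouched.</t>

<figure><artwork><![CDATA[
```python
# Full-width forms U+FF01..U+FF5E mirror US-ASCII U+0021..U+007E at a
# fixed offset of 0xFEE0, so a translation table covers them all.
FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def fold_fullwidth_ascii(text):
    """Replace full-width Latin letters, digits, and punctuation."""
    return text.translate(FULLWIDTH_TO_ASCII)
```
]]></artwork></figure>

<t>For instance, a full-width "http://" typed by the user comes out as
the seven US-ASCII characters, while any Katakana in the same string
is passed through unchanged.</t>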
2021
2022<t>An input field primarily or solely used for the input of URIs/IRIs
2023might allow the user to view an IRI as it is mapped to a URI.  Places
2024where the input of IRIs is frequent may provide the possibility for
2025viewing an IRI as mapped to a URI. This will help users when some of
2026the software they use does not yet accept IRIs.</t>
2027
2028<t>An IRI input component interfacing to components that handle URIs,
2029but not IRIs, must map the IRI to a URI before passing it to these
2030components.</t>
2031
2032<t>For the input of IRIs with right-to-left characters, please see
2033<xref target="bidiInput"></xref>.</t>
2034</section>
2035
2036<section title="URI/IRI Transfer between Applications">
2037
2038<t>Many applications (for example, mail user agents) try to detect
2039URIs appearing in plain text. For this, they use some heuristics based
2040on URI syntax. They then allow the user to click on such URIs and
2041retrieve the corresponding resource in an appropriate (usually
2042scheme-dependent) application.</t>
2043
2044<t>Such applications would need to be upgraded, in order to use the
2045IRI syntax as a base for heuristics. In particular, a non-ASCII
2046character should not be taken as the indication of the end of an IRI.
2047Such applications also would need to make sure that they correctly
2048convert the detected IRI from the character encoding of the document
2049or application where the IRI appears, to the character encoding used
2050by the system-wide IRI invocation mechanism, or to a URI (according to
2051<xref target="mapping"/>) if the system-wide invocation mechanism only
2052accepts URIs.</t>
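
<t>As an illustration of such a heuristic, the sketch below ends a
candidate only at white space or at the plain-text delimiters, and
never at a non-ASCII character. It is deliberately simplistic: the
restriction to "http" and "https" and the names used are assumptions
of this example, and trailing punctuation is not handled.</t>

<figure><artwork><![CDATA[
```python
import re

# A candidate runs until white space or one of the delimiters <, >, ".
# Non-ASCII characters are ordinary members of the match.
IRI_CANDIDATE = re.compile(r'\bhttps?://[^\s<>"]+')


def find_iri_candidates(text):
    """Return substrings of text that look like http(s) IRIs."""
    return IRI_CANDIDATE.findall(text)
```
]]></artwork></figure>

<t>Each candidate found this way would still have to be converted to
the character encoding, or to the URI form, that the invocation
mechanism expects.</t>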
2053
2054<t>The clipboard is another frequently used way to transfer URIs and
2055IRIs from one application to another. On most platforms, the clipboard
2056is able to store and transfer text in many languages and scripts.
2057Correctly used, the clipboard transfers characters, not octets, which
2058will do the right thing with IRIs.</t>
2059</section>
2060
2061<section title="URI/IRI Generation">
2062
2063<t>Systems that offer resources through the Internet, where those
2064resources have logical names, sometimes automatically generate URIs
2065for the resources they offer. For example, some HTTP servers can
2066generate a directory listing for a file directory and then respond to
2067the generated URIs with the files.</t>
2068
2069<t>Many legacy character encodings are in use in various file systems.
2070Many currently deployed systems do not transform the local character
2071representation of the underlying system before generating URIs.</t>
2072
2073<t>For maximum interoperability, systems that generate resource
2074identifiers should make the appropriate transformations. For example,
2075if a file system contains a file named
2076"r&amp;#xE9;sum&amp;#xE9;.html", a server should expose this as
2077"r%C3%A9sum%C3%A9.html" in a URI, which allows use of
2078"r&amp;#xE9;sum&amp;#xE9;.html" in an IRI, even if locally the file
2079name is kept in a character encoding other than UTF-8.
2080</t>
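
<t>The transformation in this example can be sketched as follows,
assuming the server knows the character encoding of its file names.
The function name and its parameters are hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote


def uri_segment_from_local_name(raw_name, local_encoding):
    """Turn file-name bytes into a UTF-8-based URI path segment."""
    # Interpret the stored bytes as characters first ...
    name = raw_name.decode(local_encoding)
    # ... then percent-encode the UTF-8 form of those characters.
    return quote(name, safe="")
```
]]></artwork></figure>

<t>With a file name stored in ISO-8859-1, the bytes for
"r&amp;#xE9;sum&amp;#xE9;.html" yield "r%C3%A9sum%C3%A9.html", the same
result as for a file name stored in UTF-8.</t>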
2081
2082<t>This recommendation particularly applies to HTTP servers. For FTP
2083servers, similar considerations apply; see <xref target="RFC2640"/>.</t>
2084</section>
2085
2086<section title="URI/IRI Selection" anchor="selection">
2087<t>In some cases, resource owners and publishers have control over the
2088IRIs used to identify their resources. This control is mostly
exercised by controlling the resource names, such as file names,
2090directly.</t>
2091
2092<t>In these cases, it is recommended to avoid choosing IRIs that are
2093easily confused. For example, for US-ASCII, the lower-case ell ("l") is
2094easily confused with the digit one ("1"), and the upper-case oh ("O") is
2095easily confused with the digit zero ("0"). Publishers should avoid
2096confusing users with "br0ken" or "1ame" identifiers.</t>
2097
2098<t>Outside the US-ASCII repertoire, there are many more opportunities for
2099confusion; a complete set of guidelines is too lengthy to include
2100here. As long as names are limited to characters from a single script,
2101native writers of a given script or language will know best when
2102ambiguities can appear, and how they can be avoided. What may look
2103ambiguous to a stranger may be completely obvious to the average
2104native user. On the other hand, in some cases, the UCS contains
2105variants for compatibility reasons; for example, for typographic purposes.
2106These should be avoided wherever possible. Although there may be exceptions,
2107newly created resource names should generally be in NFKC
2108<xref target="UTR15"></xref> (which means that they are also in NFC).</t>
2109
2110<t>As an example, the UCS contains the "fi" ligature at U+FB01
2111for compatibility reasons.
2112Wherever possible, IRIs should use the two letters "f" and "i" rather
2113than the "fi" ligature. An example where the latter may be used is
in the query part of an IRI for an explicit search for a word written
with the "fi" ligature.</t>
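
<t>A publishing tool could check a proposed resource name against
NFKC before the name is created. The following is a minimal sketch
using a general-purpose Unicode library; the function name is
hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def nfkc_report(name):
    """Return the NFKC form of name and whether name already was NFKC."""
    normalized = unicodedata.normalize("NFKC", name)
    return normalized, normalized == name
```
]]></artwork></figure>

<t>A name containing the ligature U+FB01 is reported as not
normalized, and the returned form spells it with the two letters "f"
and "i".</t>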
2116
2117<t>In certain cases, there is a chance that characters from different
2118scripts look the same. The best known example is the similarity of the
2119Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid such
2120cases, IRIs should only be created where all the characters in a
2121single component are used together in a given language. This usually
2122means that all of these characters will be from the same script, but
2123there are languages that mix characters from different scripts (such
2124as Japanese).  This is similar to the heuristics used to distinguish
2125between letters and numbers in the examples above. Also, for Latin,
2126Greek, and Cyrillic, using lowercase letters results in fewer
2127ambiguities than using uppercase letters would.</t>
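
<t>A rough check for mixed Latin, Greek, and Cyrillic letters in one
component is sketched below. It is a heuristic only: it infers the
script from the first word of the Unicode character name, which is an
assumption of this example and not a script property lookup, and it
ignores all other scripts. The names are hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata

# The three scripts named in the text above.
SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")


def scripts_used(component):
    """Return which of SCRIPTS occur among the letters of component."""
    found = set()
    for ch in component:
        if not ch.isalpha():
            continue
        # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
        first_word = unicodedata.name(ch, "").split(" ")[0]
        if first_word in SCRIPTS:
            found.add(first_word)
    return found
```
]]></artwork></figure>

<t>A component for which more than one script is returned deserves a
closer look before the IRI is published.</t>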
2128</section>
2129
2130<section title="Display of URIs/IRIs" anchor="display">
2131<t>
2132In situations where the rendering software is not expected to display
2133non-ASCII parts of the IRI correctly using the available layout and font
2134resources, these parts should be percent-encoded before being displayed.</t>
2135
2136<t>For display of Bidi IRIs, please see <xref target="visual"/>.</t>
2137</section>
2138
2139<section title="Interpretation of URIs and IRIs">
2140<t>Software that interprets IRIs as the names of local resources should
2141accept IRIs in multiple forms and convert and match them with the
2142appropriate local resource names.</t>
2143
<t>First, these forms include both IRIs in the native
character encoding of the protocol and their URI counterparts.</t>

<t>Second, they may include URIs constructed based on character
2148encodings other than UTF-8. These URIs may be produced by user agents that do
2149not conform to this specification and that use legacy character encodings to
2150convert non-ASCII characters to URIs. Whether this is necessary, and what
2151character encodings to cover, depends on a number of factors, such as
2152the legacy character encodings used locally and the distribution of
2153various versions of user agents. For example, software for Japanese
2154may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.</t>
2155
<t>Third, they may include additional mappings to be more user-friendly
2157and robust against transmission errors. These would be similar to how
2158some servers currently treat URIs as case insensitive or perform
2159additional matching to account for spelling errors. For characters
2160beyond the US-ASCII repertoire, this may, for example, include
2161ignoring the accents on received IRIs or resource names. Please note
2162that such mappings, including case mappings, are language
2163dependent.</t>
2164
2165<t>It can be difficult to identify a resource unambiguously if too
2166many mappings are taken into consideration. However, percent-encoded
2167and not percent-encoded parts of IRIs can always be clearly distinguished.
2168Also, the regularity of UTF-8 (see <xref target="Duerst97"/>) makes the
2169potential for collisions lower than it may seem at first.</t>
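
<t>The sketch below shows one way to accept percent-encoded names in
UTF-8 as well as in legacy character encodings. It tries strict UTF-8
first, relying on the regularity mentioned above, and only then falls
back. The function name, the default list of legacy encodings, and
their order are assumptions of this example; the legacy encodings can
be ambiguous among themselves.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def decode_path_segment(segment, legacy_encodings=("shift_jis", "euc_jp")):
    """Decode a percent-encoded segment; return (text, encoding used)."""
    raw = unquote_to_bytes(segment)
    # Strict UTF-8 rarely accepts bytes produced by a legacy encoding,
    # so it is tried before any of the fallbacks.
    for encoding in ("utf-8",) + tuple(legacy_encodings):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("segment is not valid in any accepted encoding")
```
]]></artwork></figure>

<t>The decoded text can then be matched against the local resource
names.</t>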
2170</section>
2171
2172<section title="Upgrading Strategy">
2173<t>Where this recommendation places further constraints on software
2174for which many instances are already deployed, it is important to
2175introduce upgrades carefully and to be aware of the various
2176interdependencies.</t>
2177
2178<t>If IRIs cannot be interpreted correctly, they should not be created,
2179generated, or transported. This suggests that upgrading URI interpreting
2180software to accept IRIs should have highest priority.</t>
2181
2182<t>On the other hand, a single IRI is interpreted only by a single or
2183very few interpreters that are known in advance, although it may be
2184entered and transported very widely.</t>
2185
2186<t>Therefore, IRIs benefit most from a broad upgrade of software to be
2187able to enter and transport IRIs. However, before an
2188individual IRI is published, care should be taken to upgrade the corresponding
2189interpreting software in order to cover the forms expected to be
2190received by various versions of entry and transport software.</t>
2191
2192<t>The upgrade of generating software to generate IRIs instead of using a
2193local character encoding should happen only after the service is upgraded
2194to accept IRIs. Similarly, IRIs should only be generated when the service
accepts IRIs and the intervening infrastructure and protocol are known
2196to transport them safely.</t>
2197
2198<t>Software converting from URIs to IRIs for display should be upgraded
2199only after upgraded entry software has been widely deployed to the
2200population that will see the displayed result.</t>
2201
2202
2203<t>Where there is a free choice of character encodings, it is often
2204possible to reduce the effort and dependencies for upgrading to IRIs
2205by using UTF-8 rather than another encoding. For example, when a new
2206file-based Web server is set up, using UTF-8 as the character encoding
2207for file names will make the transition to IRIs easier. Likewise, when
2208a new Web form is set up using UTF-8 as the character encoding of the
2209form page, the returned query URIs will use UTF-8 as the character
2210encoding (unless the user, for whatever reason, changes the character
2211encoding) and will therefore be compatible with IRIs.</t>
2212
2213
2214<t>These recommendations, when taken together, will allow for the
2215extension from URIs to IRIs in order to handle characters other than
2216US-ASCII while minimizing interoperability problems. For
2217considerations regarding the upgrade of URI scheme definitions, see
2218<xref target="UTF8use"/>.</t>
2219
2220</section>
2221</section> <!-- guidelines -->
2222
2223<section title="IANA Considerations" anchor="iana">
2224
<t>RFC Editor and IANA note: Please replace RFC XXXX with the
2226number of this document when it issues as an RFC. </t>
2227
2228<t>IANA maintains a registry of "URI schemes". A "URI scheme" also
serves as an "IRI scheme". </t>
2230
2231<t>To clarify that the URI scheme registration process also applies to
2232IRIs, change the description of the "URI schemes" registry
2233header to say "[RFC4395] defines an IANA-maintained registry of URI
2234Schemes. These registries include the Permanent and Provisional URI
2235Schemes.  RFC XXXX updates this registry to designate that schemes may
also indicate their usability as IRI schemes."</t>
2237
2238<t> Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".
2239</t>
2240
2241</section> <!-- IANA -->
2242   
2243<section title="Security Considerations" anchor="security">
2244<t>The security considerations discussed in <xref target="RFC3986"/>
2245also apply to IRIs. In addition, the following issues require
2246particular care for IRIs.</t>
2247<t>Incorrect encoding or decoding can lead to security problems.
2248In particular, some UTF-8 decoders do not check against overlong
2249byte sequences. As an example, a "/" is encoded with the byte 0x2F
2250both in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2251interpret the sequence 0xC0 0xAF as a "/". A sequence such as "%C0%AF.."
2252may pass some security tests and then be interpreted
2253as "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2254and checking are not done in the right order, and/or if reserved
2255characters and unreserved characters are not clearly distinguished.</t>
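
<t>A minimal sketch of doing conversion and checking in the right
order follows: percent-decoding first, then strict UTF-8 decoding,
and only then the security test. The function name is hypothetical,
and treating a decoded "/" as a separator is a deliberately
conservative choice of this example.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def decode_then_check(path):
    """Decode a percent-encoded path strictly, then reject '..' segments."""
    raw = unquote_to_bytes(path)
    # A strict decoder raises UnicodeDecodeError on overlong forms such
    # as 0xC0 0xAF instead of turning them into "/".
    text = raw.decode("utf-8", errors="strict")
    # The test runs on the decoded text, so nothing can be decoded
    # into a dangerous form after the check has passed.
    if ".." in text.split("/"):
        raise ValueError("path contains a '..' segment")
    return text
```
]]></artwork></figure>

<t>With this ordering, "%C0%AF.." is rejected by the decoder and
"%2F.." is rejected by the test.</t>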
2256
2257<t>There are various ways in which "spoofing" can occur with IRIs.
2258"Spoofing" means that somebody may add a resource name that looks the
2259same or similar to the user, but that points to a different resource.
2260The added resource may pretend to be the real resource by looking
2261very similar but may contain all kinds of changes that may be
2262difficult to spot and that can cause all kinds of problems.
2263Most spoofing possibilities for IRIs are extensions of those for URIs.</t>
2264
<t>Spoofing can occur for various reasons. First, a user's normalization expectations, or the actual normalization
applied when entering an IRI or transcoding an IRI from a legacy character
encoding, may not match the normalization used on the
2268server side. Conceptually, this is no different from the problems
2269surrounding the use of case-insensitive web servers. For example,
2270a popular web page with a mixed-case name ("http://big.example.com/PopularPage.html")
2271might be "spoofed" by someone who is able to create "http://big.example.com/popularpage.html".
2272However, the use of unnormalized character sequences, and of additional
2273mappings for user convenience, may increase the chance for spoofing.
2274Protocols and servers that allow the creation of resources with
2275names that are not normalized are particularly vulnerable to such
2276attacks. This is an inherent
2277security problem of the relevant protocol, server, or resource
2278and is not specific to IRIs, but it is mentioned here for completeness.</t>
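
<t>A server that lets users create resources might compare a new name
with existing names under normalization and case folding before
accepting it. The sketch below is one possible check, assuming NFC
plus case folding is the comparison the site wants; the function names
are hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def comparison_key(name):
    """Fold a name to a form in which look-alike spellings coincide."""
    folded = unicodedata.normalize("NFC", name).casefold()
    # Case folding can undo normalization, so normalize once more.
    return unicodedata.normalize("NFC", folded)


def colliding_names(new_name, existing_names):
    """Return the existing names that the new name would shadow."""
    key = comparison_key(new_name)
    return [n for n in existing_names if comparison_key(n) == key]
```
]]></artwork></figure>

<t>A non-empty result signals that the new name differs from an
existing one only in case or in normalization.</t>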
2279
2280<t>Spoofing can occur in various IRI components, such as the
2281domain name part or a path part. For considerations specific
2282to the domain name part, see <xref target="RFC3491"/>.
2283For the path part, administrators of sites that allow independent
2284users to create resources in the same sub area may have to be careful
2285to check for spoofing.</t>
2286
2287<t>Spoofing can occur because in the UCS many characters look very similar. Details are discussed in <xref target="selection"/>.
2288Again, this is very similar to spoofing possibilities on US-ASCII,
2289e.g., using "br0ken" or "1ame" URIs.</t>
2290
2291<t>Spoofing can occur when URIs with percent-encodings based on various
2292character encodings are accepted to deal with older user agents. In some
2293cases, particularly for Latin-based resource names, this is usually easy to
2294detect because UTF-8-encoded names, when interpreted and viewed as
2295legacy character encodings, produce mostly garbage.</t><t>When
2296concurrently used character encodings have a similar structure but there
2297are no characters that have exactly the same encoding, detection is more
2298difficult.</t>
2299
2300<t>Spoofing can occur with bidirectional IRIs, if the restrictions
2301in <xref target="bidi-structure"/> are not followed. The same visual
2302representation may be interpreted as different logical representations,
2303and vice versa. It is also very important that a correct Unicode bidirectional
2304implementation be used.</t><t>The use of Legacy Extended IRIs introduces additional security issues.</t>
2305</section><!-- security -->
2306
2307<section title="Acknowledgements">
<t>For contributions to this update, we would like to thank Ian Hickson, Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin, Henry S. Thompson, and the XML Core Working Group of the W3C.</t>
2309
2310<t>The discussion on the issue addressed here started a long time
2311ago. There was a thread in the HTML working
2312group in August 1995 (under the topic of "Globalizing URIs") and in the
2313www-international mailing list in July 1996 (under the topic of
2314"Internationalization and URLs"), and there were ad-hoc meetings at the Unicode
2315conferences in September 1995 and September 1997.</t>
2316
2317<t>For contributions to the previous version of this document, RFC 3987, many thanks go to
2318Francois Yergeau, Matitiahu Allouche,
2319Roy Fielding, Tim Berners-Lee, Mark Davis,
2320M.T. Carrasco Benitez, James Clark, Tim Bray, Chris Wendt, Yaron Goland,
2321Andrea Vine, Misha Wolf, Leslie Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman,
2322Russ Housley, Makoto MURATA, Steven Atkin,
2323Ryan Stansifer, Tex Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs,
2324Adam Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown,
2325Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio,
2326Chris Haynes, Walter Underwood, and many others.</t>
<t>A definition of HyperText Reference was initially produced by Ian Hickson,
and further edited by Dan Connolly and C. M. Sperberg-McQueen.</t>
2329<t>Thanks to the Internationalization Working
2330Group (I18N WG) of the World Wide Web Consortium (W3C),
2331and the members of the W3C
2332I18N Working Group and Interest Group for their contributions and their
2333work on <xref target="CharMod"/>. Thanks also go
2334to the members of many other W3C Working Groups for adopting IRIs, and to
2335the members of the Montreal IAB Workshop on Internationalization and
2336Localization for their review.</t>
2337</section>
2338
2339
2340<section title="Change Log">
2341
2342<t>Note to RFC Editor: Please completely remove this section before publication.</t>
2343
2344<section title='Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00'>
2345     <t>Changed draft name, date, last paragraph of abstract, and titles in change log, and added this section
2346     in moving from draft-duerst-iri-bis-07 (personal submission) to draft-ietf-iri-3987bis-00 (WG document).</t>
2347</section>
2348
2349<section title="Changes from -06 to -07 of draft-duerst-iri-bis" anchor="forkChanges"><t>
2350
2351Major restructuring of IRI processing model to make scheme-specific translation necessary to handle IDNA requirements and for consistency with web implementations. </t>
2352<t>Starting with IRI, you want one of:
2353<list style="hanging">
2354<t hangText="a"> IRI components (IRI parsed into UTF8 pieces)</t>
2355<t hangText="b"> URI components (URI parsed into ASCII pieces, encoded correctly) </t>
2356<t hangText="c"> whole URI  (for passing on to some other system that wants whole URIs) </t>
2357</list></t>
2358
2359<section title="OLD WAY">
2360<t><list style="numbers">
2361
2362 <t>Pct-encoding on the whole thing to a URI.
2363 (c1) If you want a (maybe broken) whole URI, you might
2364        stop here.</t>
2365
2366 <t>Parsing the URI into URI components.
2367   (b1) If you want (maybe broken) URI components, stop here.</t>
2368
2369 <t> Decode the components (undoing the pct-encoding).
2370   (a) if you want IRI components, stop here.</t>
2371
 <t> Reencode: either using a different encoding for some components
2373   (for domain names, and query components in web pages, which
2374   depends on the component, scheme and context), and otherwise
2375   using pct-encoding.
2376   (b2) if you want (good) URI components, stop here.</t>
2377
2378 <t> reassemble the reencoded components.
2379   (c2) if you want a (*good*) whole URI stop here.</t>
2380</list>
2381
2382</t>
2383
2384</section>
2385
2386<section title="NEW WAY">
2387<t>
2388<list style="numbers">
2389
2390<t> Parse the IRI into IRI components using the generic syntax.
2391   (a) if you want IRI components, stop here.</t>
2392
<t> Encode each component, using pct-encoding, IDN encoding, or
         special query part encoding depending on the component,
         scheme, or context. (b) If you want URI components, stop here.</t>
<t> Reassemble a whole URI from the URI components.
2397   (c) if you want a whole URI stop here.</t>
2398</list></t>
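
<t>The three steps can be sketched as follows. This is an illustration
only, under several assumptions: the function name is hypothetical, a
generic URI library stands in for the generic syntax parser, the
library's built-in "idna" codec (which follows RFC 3490) stands in for
IDN encoding, the query is simply percent-encoded as UTF-8, and any
userinfo part is dropped.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left alone by the percent-encoding step; "%" is included
# so that existing percent-encodings are not encoded a second time.
SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri):
    """Map an IRI to a URI component by component."""
    # Step 1: parse into components using the generic syntax.
    parts = urlsplit(iri)
    # Step 2: encode each component in the way that suits it.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, host, path, query, fragment))
```
]]></artwork></figure>

<t>Stopping after step 1 gives the IRI components (a), stopping after
step 2 gives the URI components (b), and the return value is the whole
URI (c).</t>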
2399</section>
2400</section>
2401
2402<section title='Changes from -00 to -01'><t><list style="symbols">
2403  <t>Removed 'mailto:' before mail addresses of authors.</t>
2404  <t>Added "&lt;to be done&gt;" as right side of 'href-strip' rule. Fixed '|' to '/' for
2405    alternatives.</t>
2406</list></t>
2407</section>
2408
2409<section title="Changes from -05 to -06 of draft-duerst-iri-bis-00"><t><list style="symbols">
2410<t>Add HyperText Reference, change abstract, acks and references for it</t>
2411<t>Add Masinter back as another editor.</t>
2412<t>Masinter integrates HRef material from HTML5 spec.</t>
2413<t>Rewrite introduction sections to modernize.</t>
2414</list></t>
2415</section>
2416
2417<section title="Changes from -04 to -05 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated references.</t><t>Changed IPR text to pre5378Trust200902.</t></list></t>
2418</section>
2419
2420<section title="Changes from -03 to -04 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added explicit abbreviation for LEIRIs.</t><t>Mentioned LEIRI references.</t><t>Completed text in LEIRI section about tag characters and about specials.</t></list></t>
2421</section>
2422
<section title="Changes from -02 to -03 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated some references.</t><t>Updated Michel Suignard's coordinates.</t></list></t>
2424</section>
2425
2426<section title="Changes from -01 to -02 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added tag range to iprivate (issue private-include-tags-115).</t><t>Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.</t></list></t>
2427</section>
<section title="Changes from -00 to -01 of draft-duerst-iri-bis"><t><list style="symbols"><t>Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" based on input from the W3C XML Core WG. Moved the relevant subsections to the back and promoted them to a section.</t><t>Added some text re. Legacy Extended IRIs to the security section.</t><t>Added an IANA Consideration Section.</t><t>Added this Change Log Section.</t><t>Added a section about "IRIs with Spaces/Controls" (converting from a Note in RFC 3987).</t></list></t>
2429</section>
2430<section title="Changes from RFC 3987 to -00 of draft-duerst-iri-bis"><t><list><t>Fixed errata (see http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).</t></list></t>
2431</section>
2432</section>
2433</middle>
2434
2435<back>
2436<references title="Normative References">
2437
2438<reference anchor="ASCII">
2439<front>
2440<title>Coded Character Set -- 7-bit American Standard Code for Information
2441Interchange</title>
2442<author>
2443<organization>American National Standards Institute</organization>
2444</author>
2445<date year="1986"/>
2446</front>
2447<seriesInfo name="ANSI" value="X3.4"/>
2448</reference>
2449
2450<reference anchor="ISO10646">
2451<front>
2452<title>ISO/IEC 10646:2003: Information Technology -
2453Universal Multiple-Octet Coded Character Set (UCS)</title>
2454<author>
2455<organization>International Organization for Standardization</organization>
2456</author>
2457<date month="December" year="2003"/>
2458</front>
2459<seriesInfo name="ISO" value="Standard 10646"/>
2460</reference>
2461
2462&rfc2119;
2463&rfc3490;
2464&rfc3491;
2465&rfc3629;
2466&rfc3986;
2467
2468<reference anchor="STD68">
2469<front>
2470<title abbrev="ABNF">Augmented BNF for Syntax Specifications: ABNF</title>
2471<author initials="D." surname="Crocker" fullname="Dave Crocker"><organization/></author>
2472<author initials="P." surname="Overell" fullname="Paul Overell"><organization/></author>
2473<date month="January" year="2008"/></front>
2474<seriesInfo name="STD" value="68"/><seriesInfo name="RFC" value="5234"/>
2475</reference>
2476 
2477&rfc5890;
2478&rfc5891;
2479
2480<reference anchor="UNIV4">
2481<front>
2482<title>The Unicode Standard, Version 5.1.0, defined by: The Unicode Standard,
2483Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0),
as amended by Unicode 5.1.0 (http://www.unicode.org/versions/Unicode5.1.0/)</title>
2485<author><organization>The Unicode Consortium</organization></author>
2486<date year="2008" month="April"/>
2487</front>
2488</reference>
2489
2490<reference anchor="UNI9" target="http://www.unicode.org/reports/tr9/tr9-13.html">
2491<front>
2492<title>The Bidirectional Algorithm</title>
2493<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2494<date year="2004" month="March"/>
2495</front>
2496<seriesInfo name="Unicode Standard Annex" value="#9"/>
2497</reference>
2498
2499<reference anchor="UTR15" target="http://www.unicode.org/unicode/reports/tr15/tr15-23.html">
2500<front>
2501<title>Unicode Normalization Forms</title>
2502<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2503<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2504<date year="2008" month="March"/>
2505</front>
2506<seriesInfo name="Unicode Standard Annex" value="#15"/>
2507</reference>
2508
2509</references>
2510
2511<references title="Informative References">
2512
2513<reference anchor="BidiEx" target="http://www.w3.org/International/iri-edit/BidiExamples">
2514<front>
2515<title>Examples of bidirectional IRIs</title>
2516<author><organization/></author>
2517<date year="" month=""/>
2518</front>
2519</reference>
2520
2521<reference anchor="CharMod" target="http://www.w3.org/TR/charmod-resid">
2522<front>
2523<title>Character Model for the World Wide Web: Resource Identifiers</title>
2524<author initials="M." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2525<author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2526<author initials="R." surname="Ishida" fullname="Richard Ishida"><organization/></author>
2527<author initials="M." surname="Wolf" fullname="Misha Wolf"><organization/></author>
2528<author initials="T." surname="Texin" fullname="Tex Texin"><organization/></author>
2529<date year="2004" month="November" day="25"/>
2530</front>
2531<seriesInfo name="World Wide Web Consortium" value="Candidate Recommendation"/>
2532</reference>
2533
2534<reference anchor="Duerst97" target="http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf">
2535<front>
2536<title>The Properties and Promises of UTF-8</title>
2537<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2538<date year="1997" month="September"/>
2539</front>
2540<seriesInfo name="Proc. 11th International Unicode Conference, San Jose" value=""/>
2541</reference>
2542
2543<reference anchor="Gettys" target="http://www.w3.org/DesignIssues/ModelConsequences">
2544<front>
2545<title>URI Model Consequences</title>
2546<author initials="J." surname="Gettys" fullname="Jim Gettys"><organization/></author>
2547<date month="" year=""/>
2548</front>
2549</reference>
2550
2551<reference anchor="HTML4" target="http://www.w3.org/TR/html401/appendix/notes.html#h-B.2">
2552<front>
2553<title>HTML 4.01 Specification</title>
2554<author initials="D." surname="Raggett" fullname="Dave Raggett"><organization/></author>
2555<author initials="A." surname="Le Hors" fullname="Arnaud Le Hors"><organization/></author>
2556<author initials="I." surname="Jacobs" fullname="Ian Jacobs"><organization/></author>
2557<date year="1999" month="December" day="24"/>
2558</front>
2559<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2560</reference>
2561
2562<reference anchor="LEIRI" target="http://www.w3.org/TR/leiri/">
2563<front>
2564<title>Legacy extended IRIs for XML resource identification</title>
2565<author initials="H." surname="Thompson" fullname="Henry Thompson"><organization/></author>
2566<author initials="R." surname="Tobin"    fullname="Richard Tobin"><organization/></author>
2567<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2568  <date year="2008" month="November" day="3"/>
2569
2570</front>
2571<seriesInfo name="World Wide Web Consortium" value="Note"/>
2572</reference>
2573
2574
2575&rfc2045;
2576&rfc2130;
2577&rfc2141;
2578&rfc2192;
2579&rfc2277;
2580&rfc2368;
2581&rfc2384;
2582&rfc2396;
2583&rfc2397;
2584&rfc2616;
2585&rfc1738;
2586&rfc2640;
2587<reference anchor='RFC4395bis'>
2588  <front>
2589    <title>Guidelines and Registration Procedures for New URI/IRI Schemes</title>
2590    <author initials='T.' surname='Hansen' fullname="Tony Hansen"><organization/></author>
2591    <author initials='T.' surname='Hardie' fullname="Ted Hardie"><organization/></author>
2592    <author initials='L.' surname='Masinter' fullname="Larry Masinter"><organization/></author>
2593    <date year="2010" month='September' day="30"/>
2594    <workgroup>IRI</workgroup>
2595  </front>
2596  <seriesInfo name="Internet-Draft" value="draft-hansen-iri-4395bis-irireg-00"/>
2597</reference>
2598 
2599 
2600<reference anchor="UNIXML" target="http://www.w3.org/TR/unicode-xml/">
2601<front>
2602<title>Unicode in XML and other Markup Languages</title>
2603<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2604<author initials="A." surname="Freytag" fullname="Asmus Freytag"><organization/></author>
2605<date year="2003" month="June" day="18"/>
2606</front>
2607<seriesInfo name="Unicode Technical Report" value="#20"/>
2608<seriesInfo name="World Wide Web Consortium" value="Note"/>
2609</reference>
2610 
2611<reference anchor="UTR36" target="http://unicode.org/reports/tr36/">
2612<front>
2613<title>Unicode Security Considerations</title>
2614<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2615<author initials="M." surname="Suignard" fullname="Michel Suignard"><organization/></author>
2616<date year="2010" month="August" day="4"/>
2617</front>
2618<seriesInfo name="Unicode Technical Report" value="#36"/>
2619</reference>
2620
2621<reference anchor="XLink" target="http://www.w3.org/TR/xlink/#link-locators">
2622<front>
2623<title>XML Linking Language (XLink) Version 1.0</title>
2624<author initials="S." surname="DeRose" fullname="Steve DeRose"><organization/></author>
2625<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2626<author initials="D." surname="Orchard" fullname="David Orchard"><organization/></author>
2627<date year="2001" month="June" day="27"/>
2628</front>
2629<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2630</reference>
2631
2632<reference anchor="XML1" target="http://www.w3.org/TR/REC-xml">
2633  <front>
2634    <title>Extensible Markup Language (XML) 1.0 (Fourth Edition)</title>
2635    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2636    <author initials="J." surname="Paoli" fullname="Jean Paoli"><organization/></author>
2637    <author initials="C.M." surname="Sperberg-McQueen" fullname="C. M. Sperberg-McQueen">
2638      <organization/></author>
2639    <author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2640    <author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
2641    <date day="16" month="August" year="2006"/>
2642  </front>
2643  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2644</reference>
2645
2646<reference anchor="XMLNamespace" target="http://www.w3.org/TR/REC-xml-names">
2647  <front>
2648    <title>Namespaces in XML (Second Edition)</title>
2649    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
2650    <author initials="D." surname="Hollander" fullname="Dave Hollander"><organization/></author>
2651    <author initials="A." surname="Layman" fullname="Andrew Layman"><organization/></author>
2652    <author initials="R." surname="Tobin" fullname="Richard Tobin"><organization></organization></author><date day="16" month="August" year="2006"/>
2653  </front>
2654  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2655</reference>
2656
2657<reference anchor="XMLSchema" target="http://www.w3.org/TR/xmlschema-2/#anyURI">
2658<front>
2659<title>XML Schema Part 2: Datatypes</title>
2660<author initials="P." surname="Biron" fullname="Paul Biron"><organization/></author>
2661<author initials="A." surname="Malhotra" fullname="Ashok Malhotra"><organization/></author>
2662<date year="2001" month="May" day="2"/>
2663</front>
2664<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2665</reference>
2666
2667<reference anchor="XPointer" target="http://www.w3.org/TR/xptr-framework/#escaping">
2668<front>
2669<title>XPointer Framework</title>
2670<author initials="P." surname="Grosso" fullname="Paul Grosso"><organization/></author>
2671<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
2672<author initials="J." surname="Marsh" fullname="Jonathan Marsh"><organization/></author>
2673<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
2674<date year="2003" month="March" day="25"/>
2675</front>
2676<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
2677</reference>
2678
2679<reference anchor="HTML5" target="http://www.w3.org/TR/2009/WD-html5-20090423/">
2680<front>
2681<title>A vocabulary and associated APIs for HTML and XHTML</title>
2682<author initials="I." surname="Hickson" fullname="Ian Hickson"><organization>Google, Inc.</organization></author>
2683<author initials="D." surname="Hyatt" fullname="David Hyatt"><organization>Apple, Inc.</organization></author>
2684<date year="2009"  month="April" day="23"/>
2685</front>
2686<seriesInfo name="World Wide Web Consortium" value="Working Draft"/>
2687</reference>
2688
2689</references>
2690
2691<section title="Design Alternatives">
2692<t>This section briefly summarizes some design alternatives
2693considered earlier and the reasons why they were not chosen.</t>
2694<section title="New Scheme(s)">
2695<t>Introducing new schemes (for example, httpi:, ftpi:,...) or a
2696new metascheme (e.g., i:, leading to URI/IRI prefixes such as
2697i:http:, i:ftp:,...) was proposed to make IRI-to-URI conversion
2698scheme dependent or to distinguish between percent-encodings
2699resulting from IRI-to-URI conversion and percent-encodings from
2700legacy character encodings.</t>
2701
2702<t>New schemes are not needed to distinguish URIs from true IRIs (i.e.,
2703  IRIs that contain non-ASCII characters). The benefit of being able
2704  to detect the origin of percent-encodings is marginal, as UTF-8
2705  can be detected with very high reliability. Deploying new schemes is
2706  extremely hard, so not requiring new schemes for IRIs makes
2707  deployment of IRIs vastly easier. Making conversion scheme dependent
2708  is highly inadvisable, and separate schemes for IRIs would encourage it.
2709  Using a uniform convention for conversion from IRIs to URIs makes
2710  IRI implementation orthogonal to the introduction of actual new
2711  schemes.</t>
2712</section>
2713<section title="Character Encodings Other Than UTF-8">
2714<t>At an early stage, UTF-7 was considered as an alternative to
2715UTF-8 when IRIs are converted to URIs. UTF-7 would not have needed
2716percent-encoding and in most cases would have been shorter than
2717percent-encoded UTF-8.</t>
2718<t>Using UTF-8 avoids a double layering and overloading of the use of
2719   the "+" character. UTF-8 is fully compatible with US-ASCII and has
2720   therefore been recommended by the IETF, and is being used widely.</t>
2721 
2722  <t>UTF-7 has never been used much and is now clearly being
2723   discouraged. Requiring implementations to convert from UTF-8
2724   to UTF-7 and back would be an additional implementation burden.</t>
2725</section> <!-- notutf8 -->
2726<section title="New Encoding Convention">
2727<t>Instead of using the existing percent-encoding convention
2728of URIs, which is based on octets, the idea was to create a new
2729encoding convention; for example, to use "%u" to introduce
2730UCS code points.</t>
2731<t>Using the existing octet-based percent-encoding mechanism
2732requires no upgrade of the URI syntax and no
2733corresponding server upgrades.</t>
2734</section> <!-- new encoding -->
2735<section title="Indicating Character Encodings in the URI/IRI">
2736<t>Some proposals suggested indicating the character encodings used
2737in a URI or IRI with some new syntactic convention in the URI itself,
2738similar to the "charset" parameter for e-mails and Web pages.
2739As an example, the label in square brackets in
2740"http://www.example.org/ros[iso-8859-1]&amp;#xE9;" indicated that
2741the following "&amp;#xE9;" had to be interpreted as iso-8859-1.</t>
2742<t>If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed.
2743Doing so also avoids labels, potentially several per identifier, that would have to
2744be copied correctly in all cases, even on the side of a bus or on a napkin,
2745which would cause usability problems (and be prohibitively annoying).
2746Exclusively using UTF-8 also reduces transcoding errors and confusion.</t>
2747</section> <!-- indicating -->
2748</section>
2749</back>
2750</rfc>