source: draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml @ 26
<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc1738 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1738.xml">
<!ENTITY rfc2045 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2045.xml">
<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY rfc2130 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2130.xml">
<!ENTITY rfc2141 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2141.xml">
<!ENTITY rfc2192 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2192.xml">
<!ENTITY rfc2277 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2277.xml">
<!ENTITY rfc2368 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2368.xml">
<!ENTITY rfc2384 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2384.xml">
<!ENTITY rfc2396 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2396.xml">
<!ENTITY rfc2397 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2397.xml">
<!ENTITY rfc2616 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2616.xml">
<!ENTITY rfc2640 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2640.xml">
<!ENTITY rfc3490 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3490.xml">
<!ENTITY rfc3491 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3491.xml">
<!ENTITY rfc3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml">
<!ENTITY rfc3986 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml">
<!ENTITY rfc5890 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5890.xml">
<!ENTITY rfc5891 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5891.xml">
]>
<?rfc strict='yes'?>
<!--     complains about too long lines (2 cases)
     and appendix, but otherwise is okay
-->
<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc symrefs='yes'?>
<?rfc sortrefs='yes'?>
<?rfc iprnotified="no" ?>
<?rfc toc='yes'?>
<?rfc compact='yes'?>
<?rfc subcompact='no'?>
<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-03" category="std" xml:lang="en" obsoletes="3987">
<front>
<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>

  <author initials="M.J." surname="Duerst" fullname='Martin Duerst'>
    <!-- (Note: Please write "Duerst" with u-umlaut wherever
      possible, for example as "D&#252;rst" in XML and HTML) -->
  <organization abbrev="Aoyama Gakuin University">Aoyama Gakuin University</organization>
  <address>
  <postal>
  <street>5-10-1 Fuchinobe</street>
  <city>Sagamihara</city>
  <region>Kanagawa</region>
  <code>229-8558</code>
  <country>Japan</country>
  </postal>
  <phone>+81 42 759 6329</phone>
  <facsimile>+81 42 759 6495</facsimile>
  <email>duerst@it.aoyama.ac.jp</email>
  <uri>http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/<!-- (Note: This is the percent-encoded form of an IRI)--></uri>
  </address>
</author>

<author initials="M.L." surname="Suignard" fullname="Michel Suignard">
   <organization>Unicode Consortium</organization>
   <address>
   <postal>
   <street></street>
   <street>P.O. Box 391476</street>
   <city>Mountain View</city>
   <region>CA</region>
   <code>94039-1476</code>
   <country>U.S.A.</country>
   </postal>
   <phone>+1-650-693-3921</phone>
   <email>michel@unicode.org</email>
   <uri>http://www.suignard.com</uri>
   </address>
</author>
<author initials="L." surname="Masinter" fullname="Larry Masinter">
   <organization>Adobe</organization>
   <address>
   <postal>
   <street>345 Park Ave</street>
   <city>San Jose</city>
   <region>CA</region>
   <code>95110</code>
   <country>U.S.A.</country>
   </postal>
   <phone>+1-408-536-3024</phone>
   <email>masinter@adobe.com</email>
   <uri>http://larry.masinter.net</uri>
   </address>
</author>

<date year="2010" month="October" day="25"/>
<area>Applications</area>
<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
<keyword>IRI</keyword>
<keyword>Internationalized Resource Identifier</keyword>
<keyword>UTF-8</keyword>
<keyword>URI</keyword>
<keyword>URL</keyword>
<keyword>IDN</keyword>
<keyword>LEIRI</keyword>

<abstract>
<t>This document defines the Internationalized Resource Identifier
(IRI) protocol element, as an extension of the Uniform Resource
Identifier (URI).  An IRI is a sequence of characters from the
Universal Character Set (Unicode/ISO 10646). Grammar and processing
rules are given for IRIs and related syntactic forms.</t>

<t>In addition, this document provides named additional rule sets
for processing otherwise invalid IRIs, in a way that supports
other specifications that wish to mandate common behavior for
'error' handling. In particular, rules used in some XML languages
(LEIRI) and web applications are given.</t>

<t>Defining IRI as a new protocol element (rather than updating or
extending the definition of URI) allows independent orderly
transitions: other protocols and languages that use URIs must
explicitly choose to allow IRIs.</t>

<t>Guidelines are provided for the use and deployment of IRIs and
related protocol elements when revising protocols, formats, and
software components that currently deal only with URIs.</t>

</abstract>
  <note title='RFC Editor: Please remove the next paragraph before publication.'>
    <t>This document is intended to update RFC 3987 and move towards IETF
    Draft Standard.  For discussion and comments on this
    draft, please join the IETF IRI WG by subscribing to the mailing
    list public-iri@w3.org. For a list of open issues, please see
    the issue tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1.</t>
</note>
</front>
<middle>

<section title="Introduction">

<section title="Overview and Motivation" anchor="overview">

<t>A Uniform Resource Identifier (URI) is defined in <xref
target="RFC3986"/> as a sequence of characters chosen from a limited
subset of the repertoire of US-ASCII <xref target="ASCII"/>
characters.</t>

<t>The characters in URIs are frequently used for representing words
of natural languages.  This usage has many advantages: Such URIs are
easier to memorize, easier to interpret, easier to transcribe, easier
to create, and easier to guess. For most languages other than English,
however, the natural script uses characters other than A - Z. For many
people, handling Latin characters is as difficult as handling the
characters of other scripts is for those who use only the Latin
alphabet. Many languages with non-Latin scripts are transcribed with
Latin letters. These transcriptions are now often used in URIs, but
they introduce additional difficulties.</t>

<t>The infrastructure for the appropriate handling of characters from
additional scripts is now widely deployed in operating system and
application software. Software that can handle a wide variety of
scripts and languages at the same time is increasingly common. Also,
an increasing number of protocols and formats can carry a wide range of
characters.</t>

<t>URIs are used both as a protocol element (for transmission and
processing by software) and also a presentation element (for display
and handling by people who read, interpret, coin, or guess them). The
transition between these roles is more difficult and complex when
dealing with the larger set of characters than allowed for URIs in
<xref target="RFC3986"/>. </t>

<t>This document defines the protocol element called Internationalized
Resource Identifier (IRI), which allows applications of URIs to be
extended to use resource identifiers that have a much wider repertoire
of characters. It also provides corresponding "internationalized"
versions of other constructs from <xref target="RFC3986"/>, such as
URI references. The syntax of IRIs is defined in <xref
target="syntax"/>.
</t>

<t>Using characters outside of A - Z in IRIs adds a number of
difficulties. <xref target="Bidi"/> discusses the special case of
bidirectional IRIs using characters from scripts written
right-to-left.  <xref target="equivalence"/> discusses various forms
of equivalence between IRIs. <xref target="IRIuse"/> discusses the use
of IRIs in different situations.  <xref target="guidelines"/> gives
additional informative guidelines.  <xref target="security"/>
discusses IRI-specific security considerations.</t>
</section> <!-- overview -->

<section title="Applicability" anchor="Applicability">

<t>IRIs are designed to allow protocols and software that deal with
URIs to be updated to handle IRIs. A "URI scheme" (as defined by <xref
target="RFC3986"/> and registered through the IANA process defined in
<xref target="RFC4395bis"/>) also serves as an "IRI scheme". Processing of
IRIs is accomplished by extending the URI syntax while retaining (and
not expanding) the set of "reserved" characters, such that the syntax
for any URI scheme may be uniformly extended to allow non-ASCII
characters. In addition, following parsing of an IRI, it is possible
to construct a corresponding URI by first encoding characters outside
of the allowed URI range and then reassembling the components.
</t>

<t>Practical use of IRI forms in place of URI forms depends on the
following conditions being met:</t>

<t><list style="hanging">

<t hangText="a.">A protocol or format element MUST be explicitly designated to be
  able to carry IRIs. The intent is to avoid introducing IRIs into
  contexts that are not defined to accept them.  For example, XML
  schema <xref target="XMLSchema"/> has an explicit type "anyURI" that
  includes IRIs and IRI references. Therefore, IRIs and IRI references
  can be in attributes and elements of type "anyURI".  On the other
  hand, in the <xref target="RFC2616"/> definition of HTTP/1.1, the
  Request URI is defined as a URI, which means that direct use of IRIs
  is not allowed in HTTP requests.</t>

<t hangText="b.">The protocol or format carrying the IRIs MUST have a
  mechanism to represent the wide range of characters used in IRIs,
  either natively or by some protocol- or format-specific escaping
  mechanism (for example, numeric character references in <xref
  target="XML1"/>).</t>

<t hangText="c.">The URI scheme definition, if it explicitly allows a
  percent sign ("%") in any syntactic component, SHOULD define the
  interpretation of sequences of percent-encoded octets (using "%XX"
  hex octets) as octets from sequences of UTF-8 encoded strings; this
  is recommended in the guidelines for registering new schemes, <xref
  target="RFC4395bis"/>.  For example, this is the practice for IMAP URLs
  <xref target="RFC2192"/>, POP URLs <xref target="RFC2384"/> and the
  URN syntax <xref target="RFC2141"/>. Note that use of
  percent-encoding may also be restricted in some situations; for
  example, URI schemes that disallow percent-encoding might still be
  used with a fragment identifier which is percent-encoded (e.g.,
  <xref target="XPointer"/>). See <xref target="UTF8use"/> for further
  discussion.</t>
</list></t>
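<figure>
<preamble>As an illustration of condition c., the following Python
sketch interprets sequences of percent-encoded octets as UTF-8. The
function name decode_pct_utf8 is hypothetical, and the strict error
handling is an assumption of this sketch rather than a requirement of
this document.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import unquote


def decode_pct_utf8(component: str) -> str:
    """Interpret %XX octets as UTF-8 (strict: bad data raises)."""
    return unquote(component, encoding="utf-8", errors="strict")
```
]]></artwork>
<postamble>For instance, decode_pct_utf8("D%C3%BCrst") yields
"D&amp;#xFC;rst" (written here in XML Notation), whereas an octet
sequence that is not valid UTF-8, such as "%FC", is rejected.</postamble>
</figure>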

</section> <!-- applicability -->

<section title="Definitions" anchor="sec-Definitions">

<t>The following definitions are used in this document; they follow the
terms in <xref target="RFC2130"/>, <xref target="RFC2277"/>, and
<xref target="ISO10646"/>.</t>
<t><list style="hanging">

<t hangText="character:">A member of a set of elements used for the
    organization, control, or representation of data. For example,
    "LATIN CAPITAL LETTER A" names a character.</t>

<t hangText="octet:">An ordered sequence of eight bits considered as a
    unit.</t>

<t hangText="character repertoire:">A set of characters (set in the
    mathematical sense).</t>

<t hangText="sequence of characters:">A sequence of characters (one
    after another).</t>

<t hangText="sequence of octets:">A sequence of octets (one after
    another).</t>

<t hangText="character encoding:">A method of representing a sequence
    of characters as a sequence of octets (maybe with variants). Also,
    a method of (unambiguously) converting a sequence of octets into a
    sequence of characters.</t>

<t hangText="charset:">The name of a parameter or attribute used to
    identify a character encoding.</t>

<t hangText="UCS:">Universal Character Set. The coded character set
    defined by ISO/IEC 10646 <xref target="ISO10646"/> and the Unicode
    Standard <xref target="UNIV4"/>.</t>

<t hangText="IRI reference:">Denotes the common usage of an
    Internationalized Resource Identifier. An IRI reference may be
    absolute or relative.  However, the "IRI" that results from such a
    reference only includes absolute IRIs; any relative IRI references
    are resolved to their absolute form.  Note that in <xref
    target="RFC2396"/> URIs did not include fragment identifiers, but
    in <xref target="RFC3986"/> fragment identifiers are part of
    URIs.</t>

<t hangText="URL:">The term "URL" was originally used <xref
   target="RFC1738"/> for roughly what is now called a "URI".  Books,
   software and documentation often refer to URIs and IRIs using the
   "URL" term. Some usages restrict "URL" to those URIs which are not
   URNs. Because of the ambiguity of the term, using the term "URL" is
   NOT RECOMMENDED in formal documents.</t>

<t hangText="LEIRI (Legacy Extended IRI) processing:">  This term was used in
   various XML specifications to refer
   to strings that, although not valid IRIs, were acceptable input to
   the processing rules in <xref target="LEIRIspec" />.</t>

<t hangText="(Web Address, Hypertext Reference, HREF):"> These terms have been
   added in this document for convenience, to allow other
   specifications to refer to those strings that, although not valid
   IRIs, are acceptable input to the processing rules in <xref
   target="webaddress"/>. This usage corresponds to the parsing rules
   of some popular web browsing applications.
   ISSUE: Need to find a good name/abbreviation for these.</t>

<t hangText="running text:">Human text (paragraphs, sentences,
   phrases) with syntax according to orthographic conventions of a
   natural language, as opposed to syntax defined for ease of
   processing by machines (e.g., markup, programming languages).</t>

<t hangText="protocol element:">Any portion of a message that affects
    processing of that message by the protocol in question.</t>

<t hangText="presentation element:">A presentation form corresponding
    to a protocol element; for example, using a wider range of
    characters.</t>

<t hangText="create (a URI or IRI):">With respect to URIs and IRIs,
     the term is used for the initial creation. This may be the
     initial creation of a resource with a certain identifier, or the
     initial exposition of a resource under a particular
     identifier.</t>

<t hangText="generate (a URI or IRI):">With respect to URIs and IRIs,
     the term is used when the identifier is generated by derivation
     from other information.</t>

<t hangText="parsed URI component:">When a URI processor parses a URI
   (following the generic syntax or a scheme-specific syntax), the result
   is a set of parsed URI components, each of which has a type
   (corresponding to the syntactic definition) and a sequence of URI
   characters.  </t>

<t hangText="parsed IRI component:">When an IRI processor parses
   an IRI directly, following the general syntax or a scheme-specific
   syntax, the result is a set of parsed IRI components, each of
   which has a type (corresponding to the syntactic definition)
   and a sequence of IRI characters. (This definition is analogous
   to "parsed URI component".)</t>

<t hangText="IRI scheme:">A URI scheme may also be known as
   an "IRI scheme" if the scheme's syntax has been extended to
   allow non-US-ASCII characters according to the rules in this
   document.</t>

</list></t>
</section> <!-- definitions -->
<section title="Notation" anchor="sec-Notation">

<t>RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses various
special notations to denote such characters in examples.</t>

<t>In text, characters outside US-ASCII are sometimes referenced by
using a prefix of 'U+', followed by four to six hexadecimal
digits.</t>

<t>To represent characters outside US-ASCII in examples, this document
uses two notations: 'XML Notation' and 'Bidi Notation'.</t>

<t>XML Notation uses a leading '&amp;#x', a trailing ';', and the
hexadecimal number of the character in the UCS in between. For
example, &amp;#x44F; stands for CYRILLIC SMALL LETTER YA. In this
notation, an actual '&amp;' is denoted by '&amp;amp;'.</t>
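<figure>
<preamble>The XML Notation described above can be produced
mechanically. The following Python sketch is illustrative only; the
function name to_xml_notation is hypothetical and is not used elsewhere
in this document.</preamble>
<artwork><![CDATA[
```python
def to_xml_notation(text: str) -> str:
    """Render a string in the XML Notation used for examples."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")          # an actual ampersand
        elif ord(ch) > 0x7F:
            # hexadecimal UCS number between '&#x' and ';'
            out.append("&#x%X;" % ord(ch))
        else:
            out.append(ch)               # US-ASCII is kept as is
    return "".join(out)
```
]]></artwork>
<postamble>Applied to the single character U+044F, the sketch returns
"&amp;#x44F;"; applied to "a&amp;b", it returns
"a&amp;amp;b".</postamble>
</figure>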

<t>Bidi Notation is used for bidirectional examples: Lower case
letters stand for Latin letters or other letters that are written left
to right, whereas upper case letters represent Arabic or Hebrew
letters that are written right to left.</t>

<t>To denote actual octets in examples (as opposed to percent-encoded
octets), the two hex digits denoting the octet are enclosed in "&lt;"
and "&gt;".  For example, the octet often denoted as 0xc9 is denoted
here as &lt;c9&gt;.</t>

<t> In this document, the key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" are to be interpreted as described in <xref
target="RFC2119"/>.</t>

</section> <!-- notation -->
</section> <!-- introduction -->

<section title="IRI Syntax" anchor="syntax">
<t>This section defines the syntax of Internationalized Resource
Identifiers (IRIs).</t>

<t>As with URIs, an IRI is defined as a sequence of characters, not as
a sequence of octets. This definition accommodates the fact that IRIs
may be written on paper or read over the radio as well as stored or
transmitted digitally.  The same IRI might be represented as different
sequences of octets in different protocols or documents if these
protocols or documents use different character encodings (and/or
transfer encodings).  Using the same character encoding as the
containing protocol or document ensures that the characters in the IRI
can be handled (e.g., searched, converted, displayed) in the same way
as the rest of the protocol or document.</t>

<section title="Summary of IRI Syntax" anchor="summary">

<t>IRIs are defined by extending the URI syntax in <xref
target="RFC3986"/>, but extending the class of unreserved characters
by adding the characters of the UCS (Universal Character Set, <xref
target="ISO10646"/>) beyond U+007F, subject to the limitations given
in the syntax rules below and in <xref target="limitations"/>.</t>

<t>The syntax and use of components and reserved characters is the
same as that in <xref target="RFC3986"/>. Each "URI scheme" thus also
functions as an "IRI scheme", in that scheme-specific parsing rules
for URIs of a scheme are extended to allow parsing of IRIs using
the same parsing rules.</t>

<t>All the operations defined in <xref target="RFC3986"/>, such as the
resolution of relative references, can be applied to IRIs by
IRI-processing software in exactly the same way as they are for URIs
by URI-processing software.</t>

<t>Characters outside the US-ASCII repertoire MUST NOT be reserved and
therefore MUST NOT be used for syntactical purposes, such as to
delimit components in newly defined schemes. For example, U+00A2, CENT
SIGN, is not allowed as a delimiter in IRIs, because it is in the
'iunreserved' category. This is similar to the fact that it is not
possible to use '-' as a delimiter in URIs, because it is in the
'unreserved' category.</t>

</section> <!-- summary -->
<section title="ABNF for IRI References and IRIs" anchor="abnf">

<t>An ABNF definition for IRI references (which are the most general
concept and the start of the grammar) and IRIs is given here. The
syntax of this ABNF is described in <xref target="STD68"/>. Character
numbers are taken from the UCS, without implying any actual binary
encoding. Terminals in the ABNF are characters, not octets.</t>

<t>The following grammar closely follows the URI grammar in <xref
target="RFC3986"/>, except that the range of unreserved characters is
expanded to include UCS characters, with the restriction that private
UCS characters can occur only in query parts. The grammar is split
into two parts: Rules that differ from <xref target="RFC3986"/>
because of the above-mentioned expansion, and rules that are the same
as those in <xref target="RFC3986"/>. For rules that are different
than those in <xref target="RFC3986"/>, the names of the non-terminals
have been changed as follows. If the non-terminal contains 'URI', this
has been changed to 'IRI'. Otherwise, an 'i' has been prefixed.</t>

<!--
for line length measuring in artwork (max 72 chars, three chars at start):
      1         2         3         4         5         6         7
456789012345678901234567890123456789012345678901234567890123456789012
-->
<figure>
<preamble>The following rules are different from those in <xref target="RFC3986"/>:</preamble>
<artwork>
IRI            = scheme ":" ihier-part [ "?" iquery ]
                 [ "#" ifragment ]

ihier-part     = "//" iauthority ipath-abempty
               / ipath-absolute
               / ipath-rootless
               / ipath-empty

IRI-reference  = IRI / irelative-ref

absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

irelative-part = "//" iauthority ipath-abempty
               / ipath-absolute
               / ipath-noscheme
               / ipath-empty

iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
iuserinfo      = *( iunreserved / pct-form / sub-delims / ":" )
ihost          = IP-literal / IPv4address / ireg-name

pct-form       = pct-encoded

ireg-name      = *( iunreserved / sub-delims )

ipath          = ipath-abempty   ; begins with "/" or is empty
               / ipath-absolute  ; begins with "/" but not "//"
               / ipath-noscheme  ; begins with a non-colon segment
               / ipath-rootless  ; begins with a segment
               / ipath-empty     ; zero characters

ipath-abempty  = *( path-sep isegment )
ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
ipath-noscheme = isegment-nz-nc *( path-sep isegment )
ipath-rootless = isegment-nz *( path-sep isegment )
ipath-empty    = 0&lt;ipchar&gt;
path-sep       = "/"

isegment       = *ipchar
isegment-nz    = 1*ipchar
isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
                     / "@" )
               ; non-zero-length segment without any colon ":"

ipchar         = iunreserved / pct-form / sub-delims / ":"
               / "@"

iquery         = *( ipchar / iprivate / "/" / "?" )

ifragment      = *( ipchar / "/" / "?" / "#" )

iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
               / %xD0000-DFFFD / %xE1000-EFFFD

iprivate       = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
               / %x100000-10FFFD
</artwork>
</figure>
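<figure>
<preamble>For illustration, the 'ucschar' and 'iprivate' productions
above can be checked with simple range tables. The following Python
sketch copies the ranges from the ABNF; the names UCSCHAR, IPRIVATE,
is_ucschar, and is_iprivate are hypothetical.</preamble>
<artwork><![CDATA[
```python
# Code point ranges copied from the 'ucschar' production.
UCSCHAR = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]

# Code point ranges copied from the 'iprivate' production.
IPRIVATE = [
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF), (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_ucschar(ch: str) -> bool:
    """True if the single character ch matches 'ucschar'."""
    return _in_ranges(ch, UCSCHAR)


def is_iprivate(ch: str) -> bool:
    """True if the single character ch matches 'iprivate'."""
    return _in_ranges(ch, IPRIVATE)
```
]]></artwork>
<postamble>With these tables, U+00E9 matches 'ucschar', whereas U+E000
matches only 'iprivate' and is therefore acceptable only where the
grammar allows 'iprivate', that is, in 'iquery'.</postamble>
</figure>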

<t>Some productions are ambiguous. The "first-match-wins" (a.k.a. "greedy")
algorithm applies. For details, see <xref target="RFC3986"/>.</t>

<figure>
<preamble>The following rules are the same as those in <xref target="RFC3986"/>:</preamble>
<artwork>
scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

port           = *DIGIT

IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

IPv6address    =                            6( h16 ":" ) ls32
               /                       "::" 5( h16 ":" ) ls32
               / [               h16 ] "::" 4( h16 ":" ) ls32
               / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
               / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
               / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
               / [ *4( h16 ":" ) h16 ] "::"              ls32
               / [ *5( h16 ":" ) h16 ] "::"              h16
               / [ *6( h16 ":" ) h16 ] "::"

h16            = 1*4HEXDIG
ls32           = ( h16 ":" h16 ) / IPv4address

IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

dec-octet      = DIGIT                 ; 0-9
               / %x31-39 DIGIT         ; 10-99
               / "1" 2DIGIT            ; 100-199
               / "2" %x30-34 DIGIT     ; 200-249
               / "25" %x30-35          ; 250-255

pct-encoded    = "%" HEXDIG HEXDIG

unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved       = gen-delims / sub-delims
gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims     = "!" / "$" / "&amp;" / "'" / "(" / ")"
               / "*" / "+" / "," / ";" / "="
</artwork></figure>
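<figure>
<preamble>As a small worked example of reading the rules above, the
'dec-octet' and 'IPv4address' productions translate directly into a
regular expression. The following Python sketch is one possible
transcription; the names DEC_OCTET, IPV4ADDRESS, and is_ipv4address
are hypothetical.</preamble>
<artwork><![CDATA[
```python
import re

# The alternatives of 'dec-octet', listed longest first so that
# the regular expression never stops early on a shorter match.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))


def is_ipv4address(text: str) -> bool:
    """True if the whole string matches 'IPv4address'."""
    return IPV4ADDRESS.fullmatch(text) is not None
```
]]></artwork>
<postamble>Note that 'dec-octet' does not allow leading zeros, so a
string such as "01.2.3.4" does not match 'IPv4address'.</postamble>
</figure>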

<t>This syntax does not support IPv6 scoped addressing zone identifiers.</t>

</section> <!-- abnf -->

</section> <!-- syntax -->

<section title="Processing IRIs and related protocol elements" anchor="processing">

<t>IRIs are meant to replace URIs in identifying resources within new
versions of protocols, formats, and software components that use a
UCS-based character repertoire.  Protocols and components may use and
process IRIs directly. However, there are still numerous systems and
protocols which only accept URIs or components of parsed URIs; that is,
they only accept sequences of characters within the subset of US-ASCII
characters allowed in URIs. </t>

<t>This section defines specific processing steps for IRI consumers
which establish the relationship between the string given and the
interpreted derivatives. These
processing steps apply to both IRIs and IRI references (i.e., absolute
or relative forms); for IRIs, some steps are scheme specific. </t>

<section title="Converting to UCS" anchor="ucsconv">

<t>Input that is already in a Unicode form (i.e., a sequence of Unicode
 characters or an octet-stream representing a Unicode-based character
 encoding such as UTF-8 or UTF-16) should be left as is and not
 normalized (see <xref target="normalization"/>).</t>

  <t>An IRI or IRI reference is a sequence of characters from the UCS.
    For IRIs that are not already in a Unicode form
    (as when written on paper, read aloud, or represented in a text stream
    using a legacy character encoding), convert the IRI to Unicode.
    Note that some character encodings or transcriptions can be converted
    to or represented by more than one sequence of Unicode characters.
    Ideally the resulting IRI would use a normalized form,
    such as Unicode Normalization Form C <xref target="UTR15"/>
    (see <xref target='ladder'/> Normalization and Comparison),
    since that ensures a stable, consistent representation
    that is most likely to produce the intended results.
    Implementers and users are cautioned that, while denormalized character sequences are valid,
    they might be difficult for other users or processes to reproduce
    and might lead to unexpected results.
  </t>

<t> In other cases (written on paper, read aloud, or otherwise
 represented independent of any character encoding) represent the IRI
 as a sequence of characters from the UCS normalized according to
 Unicode Normalization Form C (NFC, <xref target="UTR15"/>).</t>
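<figure>
<preamble>The conversion described in this section might be sketched
as follows in Python. This is only one possible reading: the names
is_unicode_charset and to_ucs are hypothetical, and the test for a
Unicode-based encoding relies on the assumption that Python's
canonical names for such encodings all start with "utf-".</preamble>
<artwork><![CDATA[
```python
import codecs
import unicodedata


def is_unicode_charset(charset: str) -> bool:
    # Assumption: canonical codec names such as "utf-8" and
    # "utf-16-le" identify the Unicode-based encodings.
    return codecs.lookup(charset).name.startswith("utf-")


def to_ucs(octets: bytes, charset: str) -> str:
    """Convert an octet sequence to a sequence of UCS characters."""
    text = octets.decode(charset)
    if is_unicode_charset(charset):
        return text            # already Unicode: left as is
    # Legacy encoding: normalize the converted result to NFC.
    return unicodedata.normalize("NFC", text)
```
]]></artwork>
<postamble>Under these assumptions, UTF-8 input containing "e"
followed by U+0301 is returned unchanged, whereas input in a legacy
encoding such as ISO-8859-1 is decoded and then normalized.</postamble>
</figure>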
</section> <!-- ucsconv -->

<section title="Parse the IRI into IRI components">

<t>Parse the IRI, either as a relative reference (no scheme)
or using scheme-specific processing (according to the scheme
given), resulting in a set of parsed IRI components.
(NOTE: FIX BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES
THAT USE GENERIC SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN
ONLY USE AUTHORITY FOR NAMES THAT FOLLOW PUNICODE.)
 </t>

<t>NOTE: Parsing into components results in a correspondence
between substrings of the IRI and the parts matched.
For example, in <xref target="HTML5"/>, the protocol
components of interest are SCHEME (scheme), HOST (ireg-name), PORT
(port), the PATH (ipath after the initial "/"), QUERY (iquery),
FRAGMENT (ifragment), and AUTHORITY (iauthority).
</t>
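<figure>
<preamble>For IRIs that follow the generic syntax, a first split into
components can be sketched with the regular expression given in
Appendix B of <xref target="RFC3986"/>, which applies unchanged
because all delimiters are US-ASCII characters. The function name
split_iri and the dictionary keys are hypothetical; scheme-specific
parsing is not covered by this sketch.</preamble>
<artwork><![CDATA[
```python
import re

# Regular expression from Appendix B of RFC 3986.
_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_iri(iri: str) -> dict:
    """Split an IRI reference into its five generic components.

    Components that are absent are reported as None; the path is
    always present but may be empty.
    """
    m = _SPLIT.match(iri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```
]]></artwork>
<postamble>A relative reference such as "../x" yields no scheme and no
authority, and the whole string as its path.</postamble>
</figure>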

<t>Subsequent processing rules are sometimes used to define other
syntactic components. For example, <xref target="HTML5"/> defines APIs
for IRI processing; in these APIs:

<list style="hanging">
<t hangText="HOSTSPECIFIC"> the substring that follows
the substring matched by the iauthority production, or the whole
string if the iauthority production wasn't matched.</t>
<t hangText="HOSTPORT"> if there is a scheme component and a port
component and the port given by the port component is different than
the default port defined for the protocol given by the scheme
component, then HOSTPORT is the substring that starts with the
substring matched by the host production and ends with the substring
matched by the port production, and includes the colon in between the
two. Otherwise, it is the same as the host component.
</t>
</list>
</t>
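<figure>
<preamble>The HOSTPORT rule above can be sketched as follows. The
table DEFAULT_PORTS and the function name hostport are hypothetical,
the table is deliberately incomplete, and treating an empty port like
an absent one is an assumption of this sketch.</preamble>
<artwork><![CDATA[
```python
# Illustrative subset of scheme default ports (assumption: a real
# implementation would cover every scheme it supports).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def hostport(scheme, host, port):
    """Return HOSTPORT for already parsed components.

    scheme and port are None when the component is absent; port is
    otherwise a string of decimal digits.
    """
    if scheme is not None and port:
        default = DEFAULT_PORTS.get(scheme.lower())
        if int(port) != default:
            return host + ":" + port
    return host
```
]]></artwork>
<postamble>For the scheme "http" and the host "example.org", a port of
"8080" gives "example.org:8080", whereas a port of "80" gives
"example.org".</postamble>
</figure>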
649</section> <!-- parse -->
650
651<section title="General percent-encoding of IRI components" anchor="compmapping">
652   
653<t>For most IRI components, it is possible to map the IRI component
654to an equivalent URI component by percent-encoding those characters
655not allowed in URIs. Previous processing steps will have removed
656some characters, and the interpretation of reserved characters will
657have already been done (with the syntactic reserved characters outside
658of the IRI component). This mapping is defined for all sequences
659of Unicode characters, whether or not they are valid for the component
660in question. </t>
661   
662<t>For each character which is not allowed in a valid URI (NOTE: WHAT
663IS THE RIGHT REFERENCE HERE), apply the following steps. </t>
664
665<t><list style="hanging">
666
667<t hangText="Convert to UTF-8">Convert the character to a sequence of
668  one or more octets using UTF-8 <xref target="RFC3629"/>.</t>
669
670<t hangText="Percent encode">Convert each octet of this sequence to %HH,
671   where HH is the hexadecimal notation of the octet value. The
672   hexadecimal notation SHOULD use uppercase letters. (This is the
673   general URI percent-encoding mechanism in Section 2.1 of <xref
674   target="RFC3986"/>.)</t>
675   
676</list></t>
677
678<t>Note that the mapping is an identity transformation for parsed URI
679components of valid URIs, and is idempotent: applying the mapping a
680second time will not change anything.</t>
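
<figure><preamble>The two steps above can be sketched in Python as
follows. This is an illustration under stated assumptions, not a
normative implementation: the name map_component is hypothetical, and
the set of characters treated as allowed in a URI (unreserved,
reserved, and "%") is one reading of RFC 3986, since the right
reference is still an open question in this section.</preamble>
<artwork><![CDATA[
```python
# Assumption: characters that may appear literally in a URI are the
# unreserved and reserved characters of RFC 3986, plus "%".
URI_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"   # unreserved
    ":/?#[]@"          # gen-delims
    "!$&'()*+,;="      # sub-delims
    "%"                # existing percent-encodings are left untouched
)


def map_component(component):
    """Percent-encode every character not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH
            # using uppercase hexadecimal digits.
            out.extend("%{:02X}".format(o) for o in ch.encode("utf-8"))
    return "".join(out)
```
]]></artwork>
<postamble>Because "%" and all characters already allowed in URIs pass
through unchanged, applying map_component to its own output leaves the
string as it is, which matches the idempotence noted
above.</postamble></figure>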
681</section> <!-- general conversion -->
682
683<section title="Mapping ireg-name" anchor="dnsmapping">
684
<t>Schemes that allow non-ASCII characters
in the reg-name (ireg-name) position MUST convert the ireg-name
component of an IRI as follows:</t>
688
689<t>Replace the ireg-name part of the IRI by the part converted using
690the ToASCII operation specified in Section 4.1 of <xref
691target="RFC3490"/> on each dot-separated label, and by using U+002E
692(FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set
693to FALSE, and with the flag AllowUnassigned set to FALSE.
The ToASCII operation may fail, which means that the IRI cannot be
resolved. If the domain name conversion fails, then the
entire IRI conversion fails. Processors that have no mechanism for
signalling a failure MAY instead substitute an otherwise
invalid host name, although such processing SHOULD be avoided.
700 </t>
701
702<t>For example, the IRI
703<vspace/>"http://r&amp;#xE9;sum&amp;#xE9;.example.org"<vspace/> MAY be
converted to <vspace/>"http://xn--rsum-bpad.example.org"<vspace/>;
705conversion to percent-encoded form, e.g.,
706 <vspace/>"http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed. </t>
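
<figure><preamble>As a rough illustration, the ireg-name mapping can be
tried out with the "idna" codec in the Python standard library, which
implements the RFC 3490 operations. The function name map_ireg_name is
hypothetical. How the codec treats the UseSTD3ASCIIRules and
AllowUnassigned flags is fixed by the codec and is not verified here,
so this sketch should not be read as a conforming
implementation.</preamble>
<artwork><![CDATA[
```python
def map_ireg_name(ireg_name):
    """Apply ToASCII to each dot-separated label of ireg_name.

    The codec splits the name into labels itself. A UnicodeError
    is raised if ToASCII fails; the caller should then treat the
    whole IRI conversion as failed.
    """
    return ireg_name.encode("idna").decode("ascii")
```
]]></artwork>
<postamble>Names that consist only of ASCII labels are returned
unchanged.</postamble></figure>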
707
708<t><list style="hanging"> 
709
710<t hangText="Note:">Domain Names may appear in parts of an IRI other
711than the ireg-name part.  It is the responsibility of scheme-specific
712implementations (if the Internationalized Domain Name is part of the
713scheme syntax) or of server-side implementations (if the
714Internationalized Domain Name is part of 'iquery') to apply the
715necessary conversions at the appropriate point. Example: Trying to
716validate the Web page at<vspace/>
717http://r&amp;#xE9;sum&amp;#xE9;.example.org would lead to an IRI of
718<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr&amp;#xE9;sum&amp;#xE9;.<vspace/>example.org,
719which would convert to a URI
720of<vspace/>http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.<vspace/>example.org.
721The server-side implementation is responsible for making the
722necessary conversions to be able to retrieve the Web page.</t>
723
724<t hangText="Note:">In this process, characters allowed in URI
725references and existing percent-encoded sequences are not encoded further.
726(This mapping is similar to, but different from, the encoding applied
727when arbitrary content is included in some part of a URI.)
728
729For example, an IRI of
730<vspace/>"http://www.example.org/red%09ros&amp;#xE9;#red"
731(in XML notation) is converted to
732<vspace/>"http://www.example.org/red%09ros%C3%A9#red", not to
733something like
734<vspace/>"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
735((DESIGN QUESTION: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get converted to punycode, or not?))
736
737</t>
738
739</list></t>
740</section> <!-- dnsmapping -->
741
742<section title="Mapping query components" anchor="querymapping">
743
744<t>((NOTE: SEE ISSUES LIST))
745
746For compatibility with existing deployed HTTP infrastructure,
747the following special case applies for schemes "http" and "https"
748and IRIs whose origin has a document charset other than one which
749is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query"
750component of an IRI is mapped into a URI by using the document
751charset rather than UTF-8 as the binary representation before
752pct-encoding. This mapping is not applied for any other scheme
753or component.</t>
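
<figure><preamble>The effect of this special case can be sketched with
urllib.parse.quote from the Python standard library. The function name
map_query and the set of characters left unencoded are assumptions for
this illustration; only the choice of character encoding before
percent-encoding is the point being shown.</preamble>
<artwork><![CDATA[
```python
from urllib.parse import quote

# Assumption: ASCII characters allowed in a query component, plus "%"
# so that existing percent-encodings are not encoded again.
QUERY_SAFE = "-._~!$&'()*+,;=:@/?%"


def map_query(iquery, document_charset="utf-8"):
    """Percent-encode a query component.

    The octets for non-ASCII characters are taken from
    document_charset instead of always using UTF-8.
    """
    return quote(iquery, safe=QUERY_SAFE, encoding=document_charset)
```
]]></artwork>
<postamble>The same query therefore maps to different URIs depending on
the charset of the originating document, which is why the rule is
limited to the "http" and "https" schemes and to the query
component.</postamble></figure>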
754
755</section> <!-- querymapping -->
756
757<section title="Mapping IRIs to URIs" anchor="mapping">
758
<t>The canonical mapping from an IRI to a URI is defined by applying the
mappings above (from IRI components to URI components) and then
reassembling a URI from the parsed URI components using the original
punctuation that delimited the IRI components. </t>
763
764</section> <!-- mapping -->
765
766<section title="Converting URIs to IRIs" anchor="URItoIRI">
767
768<t>In some situations, for presentation and further processing,
769it is desirable to convert a URI into an equivalent IRI in which
770natural characters are represented directly rather than
percent-encoded. Of course, every URI is already an IRI in
its own right without any conversion.
This section gives one possible procedure for this conversion.
774</t>
775
776<t>
777The conversion described in this section, if given a valid URI, will
778result in an IRI that maps back to the URI used as an input for the
779conversion (except for potential case differences in percent-encoding
780and for potential percent-encoded unreserved characters).
781
782However, the IRI resulting from this conversion may differ
783from the original IRI (if there ever was one).</t> 
784
785<t>URI-to-IRI conversion removes percent-encodings, but not all
786percent-encodings can be eliminated. There are several reasons for
787this:</t>
788
789<t><list style="hanging">
790
791<t hangText="1.">Some percent-encodings are necessary to distinguish
792    percent-encoded and unencoded uses of reserved characters.</t>
793
794<t hangText="2.">Some percent-encodings cannot be interpreted as sequences
795    of UTF-8 octets.<vspace blankLines="1"/>
796    (Note: The octet patterns of UTF-8 are highly regular.
797    Therefore, there is a very high probability, but no guarantee,
798    that percent-encodings that can be interpreted as sequences of UTF-8
799    octets actually originated from UTF-8. For a detailed discussion,
800    see <xref target="Duerst97"/>.)</t>
801
802<t hangText="3.">The conversion may result in a character that is not
803    appropriate in an IRI. See <xref target="abnf"/>, <xref target="visual"/>,
804      and <xref target="limitations"/> for further details.</t>
805
806<t hangText="4.">IRI to URI conversion has different rules for
807    dealing with domain names and query parameters.</t>
808
809</list></t>
810
811<t>Conversion from a URI to an IRI MAY be done by using the following
812steps:
813
814<list style="hanging">
815<t hangText="1.">Represent the URI as a sequence of octets in
816       US-ASCII.</t>
817
818<t hangText="2.">Convert all percent-encodings ("%" followed by two
819      hexadecimal digits) to the corresponding octets, except those
820      corresponding to "%", characters in "reserved", and characters
821      in US-ASCII not allowed in URIs.</t> 
822
823<t hangText="3.">Re-percent-encode any octet produced in step 2 that
824      is not part of a strictly legal UTF-8 octet sequence.</t>
825
826
827<t hangText="4.">Re-percent-encode all octets produced in step 3 that
828      in UTF-8 represent characters that are not appropriate according
829      to <xref target="abnf"/>, <xref target="visual"/>, and <xref
830      target="limitations"/>.</t> 
831
832<t hangText="5.">Interpret the resulting octet sequence as a sequence
833      of characters encoded in UTF-8.</t>
834
835<t hangText="6.">URIs known to contain domain names in the reg-name
836      component SHOULD convert punycode-encoded domain name labels to
837      the corresponding characters using the ToUnicode procedure. </t>
838</list></t>
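
<figure><preamble>Steps 1 through 5 above can be sketched in Python as
follows. This is an illustration, not a normative implementation. All
names are hypothetical. Step 4 is only approximated: the sketch keeps
characters below U+00A0 and the bidirectional formatting characters
percent-encoded, and ignores the other restrictions referenced in that
step. Step 6 (ToUnicode) is omitted.</preamble>
<artwork><![CDATA[
```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
# Assumption: stand-in for step 4. LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B,
                   0x202C, 0x202D, 0x202E}


def _pct(octets):
    return "".join("%{:02X}".format(o) for o in octets)


def _convert_run(octets):
    """Apply steps 2 to 5 to one run of percent-encoded octets."""
    out = []
    i = 0
    while i < len(octets):
        if octets[i] < 0x80:
            # Step 2: an ASCII octet is decoded only if unreserved;
            # "%", reserved characters, and characters not allowed
            # in URIs stay percent-encoded.
            ch = chr(octets[i])
            out.append(ch if ch in UNRESERVED else _pct(octets[i:i + 1]))
            i += 1
            continue
        for length in (2, 3, 4):
            chunk = octets[i:i + length]
            try:
                ch = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            if ord(ch) < 0xA0 or ord(ch) in BIDI_FORMATTING:
                out.append(_pct(chunk))     # step 4: keep encoded
            else:
                out.append(ch)              # step 5
            i += length
            break
        else:
            # Step 3: not part of a strictly legal UTF-8 sequence.
            out.append(_pct(octets[i:i + 1]))
            i += 1
    return "".join(out)


def uri_to_iri(uri):
    """Steps 1 to 5 of the conversion; step 6 is not performed."""
    return re.sub(
        r"(?:%[0-9A-Fa-f]{2})+",
        lambda m: _convert_run(bytes.fromhex(m.group(0).replace("%", ""))),
        uri,
    )
```
]]></artwork>
<postamble>Octets that remain percent-encoded are written with
uppercase hexadecimal digits, so the case of the original
percent-encoding is not necessarily preserved.</postamble></figure>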
839
840<t>This procedure will convert as many percent-encoded characters as
841possible to characters in an IRI. Because there are some choices when
842step 4 is applied (see <xref target="limitations"/>), results may
843vary.</t>
844
845<t>Conversions from URIs to IRIs MUST NOT use any character
846encoding other than UTF-8 in steps 3 and 4, even if it might be
847possible to guess from the context that another character encoding
848than UTF-8 was used in the URI.  For example, the URI
849"http://www.example.org/r%E9sum%E9.html" might with some guessing be
850interpreted to contain two e-acute characters encoded as
851iso-8859-1. It must not be converted to an IRI containing these
852e-acute characters. Otherwise, in the future the IRI will be mapped to
853"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
854URI from "http://www.example.org/r%E9sum%E9.html".</t>
855
856<section title="Examples">
857
858<t>This section shows various examples of converting URIs to IRIs.
859Each example shows the result after each of the steps 1 through 6 is
860applied. XML Notation is used for the final result.  Octets are
861denoted by "&lt;" followed by two hexadecimal digits followed by
862"&gt;".</t>
863
864<t>The following example contains the sequence "%C3%BC", which is a
865strictly legal UTF-8 sequence, and which is converted into the actual
866character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
867u-umlaut).
868
869<list style="hanging">
870<t hangText="1.">http://www.example.org/D%C3%BCrst</t>
871<t hangText="2.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
872<t hangText="3.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
873<t hangText="4.">http://www.example.org/D&lt;c3&gt;&lt;bc&gt;rst</t>
874<t hangText="5.">http://www.example.org/D&amp;#xFC;rst</t>
875<t hangText="6.">http://www.example.org/D&amp;#xFC;rst</t>
876</list>
877</t>
878
879<t>The following example contains the sequence "%FC", which might
880represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in
881the<vspace/>iso-8859-1 character encoding.  (It might represent other
882characters in other character encodings. For example, the octet
883&lt;fc&gt; in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER
884KJE.)  Because &lt;fc&gt; is not part of a strictly legal UTF-8
885sequence, it is re-percent-encoded in step 3.
886
887
888<list style="hanging">
889<t hangText="1.">http://www.example.org/D%FCrst</t>
890<t hangText="2.">http://www.example.org/D&lt;fc&gt;rst</t>
891<t hangText="3.">http://www.example.org/D%FCrst</t>
892<t hangText="4.">http://www.example.org/D%FCrst</t>
893<t hangText="5.">http://www.example.org/D%FCrst</t>
894<t hangText="6.">http://www.example.org/D%FCrst</t>
895</list>
896</t>
897
898<t>The following example contains "%e2%80%ae", which is the percent-encoded<vspace/>UTF-8
899character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. <xref target="visual"/>
900forbids the direct use of this character in an IRI. Therefore, the
901corresponding octets are re-percent-encoded in step 4. This example shows
902that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved.
The example also contains a punycode-encoded domain name label (xn--99zt52a),
which is converted to the corresponding characters in step 6.
905
906<list style="hanging">
907<t hangText="1.">http://xn--99zt52a.example.org/%e2%80%ae</t>
908<t hangText="2.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
909<t hangText="3.">http://xn--99zt52a.example.org/&lt;e2&gt;&lt;80&gt;&lt;ae&gt;</t>
910<t hangText="4.">http://xn--99zt52a.example.org/%E2%80%AE</t>
911<t hangText="5.">http://xn--99zt52a.example.org/%E2%80%AE</t>
912<t hangText="6.">http://&amp;#x7D0D;&amp;#x8C46;.example.org/%E2%80%AE</t>
913</list></t>
914
915<t>Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
916(Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this note.))</t>
917
918</section> <!-- examples -->
919</section> <!-- URItoIRI -->
920</section> <!-- processing -->
921<section title="Bidirectional IRIs for Right-to-Left Languages" anchor="Bidi">
922
923<t>Some UCS characters, such as those used in the Arabic and Hebrew
924scripts, have an inherent right-to-left (rtl) writing direction. IRIs
925containing these characters (called bidirectional IRIs or Bidi IRIs)
926require additional attention because of the non-trivial relation
927between logical representation (used for digital representation and
928for reading/spelling) and visual representation (used for
929display/printing).</t>
930
931<t>Because of the complex interaction between the logical representation,
932the visual representation, and the syntax of a Bidi IRI, a balance is
933needed between various requirements.
934The main requirements are<list style="hanging">
935<t hangText="1.">user-predictable conversion between visual and
936    logical representation;</t>
937<t hangText="2.">the ability to include a wide range of characters
938    in various parts of the IRI; and</t>
939<t hangText="3.">minor or no changes or restrictions for
940      implementations.</t>
941</list></t>
942
943<section title="Logical Storage and Visual Presentation" anchor="visual">
944
945<t>When stored or transmitted in digital representation, bidirectional
946IRIs MUST be in full logical order and MUST conform to the IRI syntax
947rules (which includes the rules relevant to their scheme). This
948ensures that bidirectional IRIs can be processed in the same way as
other IRIs.</t>

<t>Bidirectional IRIs MUST be rendered by using the
950Unicode Bidirectional Algorithm <xref target="UNIV4"/>, <xref
951target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
952as they would be if they were in a left-to-right embedding; i.e., as
953if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
954followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
955embedding direction can also be done in a higher-level protocol (e.g.,
956the dir='ltr' attribute in HTML).</t> 
957
958<t>There is no requirement to use the above embedding if the display
959is still the same without the embedding. For example, a bidirectional
960IRI in a text with left-to-right base directionality (such as used for
961English or Cyrillic) that is preceded and followed by whitespace and
962strong left-to-right characters does not need an embedding.  Also, a
963bidirectional relative IRI reference that only contains strong
964right-to-left characters and weak characters and that starts and ends
965with a strong right-to-left character and appears in a text with
966right-to-left base directionality (such as used for Arabic or Hebrew)
967and is preceded and followed by whitespace and strong characters does
968not need an embedding.</t>
969
970<t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
971sufficient to force the correct display behavior.  However, the
972details of the Unicode Bidirectional algorithm are not always easy to
973understand. Implementers are strongly advised to err on the side of
974caution and to use embedding in all cases where they are not
975completely sure that the display behavior is unaffected without the
976embedding.</t>
977
978<t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
9794.3) permits higher-level protocols to influence bidirectional
980rendering. Such changes by higher-level protocols MUST NOT be used if
981they change the rendering of IRIs.</t> 
982
983<t>The bidirectional formatting characters that may be used before or
984after the IRI to ensure correct display are not themselves part of the
985IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
986RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
987the IRI but do not appear themselves. It would therefore not be
988possible to input an IRI with such characters correctly.</t>
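
<figure><preamble>A minimal Python sketch of the two rules in this
section: the embedding characters are added around an IRI for display
only, and an IRI that itself contains bidirectional formatting
characters is rejected. The function names are hypothetical, and
raising ValueError is just one possible way to signal the
problem.</preamble>
<artwork><![CDATA[
```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_FORMATTING = "\u200E\u200F\u202A\u202B\u202C\u202D\u202E"


def contains_bidi_formatting(iri):
    """True if the IRI contains a bidi formatting character."""
    return any(ch in BIDI_FORMATTING for ch in iri)


def wrap_for_display(iri):
    """Return the IRI inside a left-to-right embedding.

    The added characters are used for rendering only and are
    not part of the IRI.
    """
    if contains_bidi_formatting(iri):
        raise ValueError("IRI contains bidi formatting characters")
    return LRE + iri + PDF
```
]]></artwork>
<postamble>Code that stores or transmits the IRI would keep using the
unwrapped string.</postamble></figure>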
989
990</section> <!-- visual -->
991<section title="Bidi IRI Structure" anchor="bidi-structure">
992
993<t>The Unicode Bidirectional Algorithm is designed mainly for running
994text.  To make sure that it does not affect the rendering of
995bidirectional IRIs too much, some restrictions on bidirectional IRIs
996are necessary. These restrictions are given in terms of delimiters
997(structural characters, mostly punctuation such as "@", ".", ":",
998and<vspace/>"/") and components (usually consisting mostly of letters
999and digits).</t>
1000
1001<t>The following syntax rules from <xref target="abnf"/> correspond to
1002components for the purpose of Bidi behavior: iuserinfo, ireg-name,
isegment, isegment-nz, isegment-nz-nc, iquery, and
1004ifragment.</t>
1005
1006<t>Specifications that define the syntax of any of the above
1007components MAY divide them further and define smaller parts to be
1008components according to this document. As an example, the restrictions
1009of <xref target="RFC3490"/> on bidirectional domain names correspond
1010to treating each label of a domain name as a component for schemes
1011with ireg-name as a domain name.  Even where the components are not
1012defined formally, it may be helpful to think about some syntax in
1013terms of components and to apply the relevant restrictions.  For
1014example, for the usual name/value syntax in query parts, it is
1015convenient to treat each name and each value as a component. As
1016another example, the extensions in a resource name can be treated as
1017separate components.</t>
1018
1019<t>For each component, the following restrictions apply:</t>
1020<t>
1021<list style="hanging">
1022
1023<t hangText="1.">A component SHOULD NOT use both right-to-left and
1024  left-to-right characters.</t>
1025
1026<t hangText="2.">A component using right-to-left characters SHOULD
1027  start and end with right-to-left characters.</t>
1028
1029</list></t>
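
<figure><preamble>The two restrictions can be checked mechanically, as
in the following Python sketch. It is illustrative only. It assumes
that "right-to-left characters" means characters whose Unicode
bidirectional class is R or AL, and "left-to-right characters" those of
class L; the function name is hypothetical.</preamble>
<artwork><![CDATA[
```python
import unicodedata


def check_bidi_component(component):
    """Return the numbers of the restrictions a component violates."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    is_rtl = [c in ("R", "AL") for c in classes]
    has_rtl = any(is_rtl)
    has_ltr = "L" in classes
    violations = []
    if has_rtl and has_ltr:
        # Restriction 1: both directions used in one component.
        violations.append(1)
    if has_rtl and not (is_rtl[0] and is_rtl[-1]):
        # Restriction 2: does not start and end with an rtl character.
        violations.append(2)
    return violations
```
]]></artwork>
<postamble>Digits have neither class L nor class R, so a right-to-left
component with digits in the middle passes, while one that starts or
ends with a digit violates restriction 2.</postamble></figure>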
1030
1031<t>The above restrictions are given as "SHOULD"s, rather than as
1032"MUST"s.  For IRIs that are never presented visually, they are not
1033relevant.  However, for IRIs in general, they are very important to
1034ensure consistent conversion between visual presentation and logical
1035representation, in both directions.</t>
1036
1037<t><list style="hanging">
1038
1039<t hangText="Note:">In some components, the above restrictions may
1040  actually be strictly enforced.  For example, <xref
1041  target="RFC3490"></xref> requires that these restrictions apply to
1042  the labels of a host name for those schemes where ireg-name is a
1043  host name.  In some other components (for example, path components)
1044  following these restrictions may not be too difficult.  For other
1045  components, such as parts of the query part, it may be very
1046  difficult to enforce the restrictions because the values of query
1047  parameters may be arbitrary character sequences.</t>
1048
1049</list></t>
1050
1051<t>If the above restrictions cannot be satisfied otherwise, the
1052affected component can always be mapped to URI notation as described
1053in <xref target="compmapping"/>. Please note that the whole component
1054has to be mapped (see also Example 9 below).</t>
1055
1056</section> <!-- bidi-structure -->
1057
1058<section title="Input of Bidi IRIs" anchor="bidiInput">
1059
1060<t>Bidi input methods MUST generate Bidi IRIs in logical order while
1061rendering them according to <xref target="visual"/>.  During input,
1062rendering SHOULD be updated after every new character is input to
1063avoid end-user confusion.</t>
1064
1065</section> <!-- bidiInput -->
1066
1067<section title="Examples">
1068
1069<t>This section gives examples of bidirectional IRIs, in Bidi
1070Notation.  It shows legal IRIs with the relationship between logical
1071and visual representation and explains how certain phenomena in this
1072relationship may look strange to somebody not familiar with
1073bidirectional behavior, but familiar to users of Arabic and Hebrew. It
1074also shows what happens if the restrictions given in <xref
1075target="bidi-structure"/> are not followed. The examples below can be
1076seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
1077variants.</t>
1078
1079<t>To read the bidi text in the examples, read the visual
1080representation from left to right until you encounter a block of rtl
1081text. Read the rtl block (including slashes and other special
1082characters) from right to left, then continue at the next unread ltr
1083character.</t>
1084
1085<t>Example 1: A single component with rtl characters is inverted:
1086<vspace/>Logical representation:
1087"http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
1088"http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
1089one by one, and each component can be read in its natural
1090direction.</t>
1091
1092<t>Example 2: More than one consecutive component with rtl characters
1093is inverted as a whole: <vspace/>Logical representation:
1094"http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
1095"http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
1096components is read rtl, in the same way as a sequence of rtl words is
1097read rtl in a bidi text.</t>
1098
1099<t>Example 3: All components of an IRI (except for the scheme) are
1100rtl.  All rtl components are inverted overall: <vspace/>Logical
1101representation:
1102"http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
1103representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
1104whole IRI (except the scheme) is read rtl. Delimiters between rtl
1105components stay between the respective components; delimiters between
1106ltr and rtl components don't move.</t>
1107
1108<t>Example 4: Each of several sequences of rtl components is inverted
1109on its own: <vspace/>Logical representation:
1110"http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
1111"http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
1112components is read rtl, in the same way as each sequence of rtl words
1113in an ltr text is read rtl.</t>
1114
1115<t>Example 5: Example 2, applied to components of different kinds:
1116<vspace/>Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
1117<vspace/>Visual representation:
1118"http://ab.cd.HG/FE/ij/kl.html"<vspace/> The inversion of the domain
1119name label and the path component may be unexpected, but it is
1120consistent with other bidi behavior.  For reassurance that the domain
1121component really is "ab.cd.EF", it may be helpful to read aloud the
1122visual representation following the bidi algorithm. After
1123"http://ab.cd." one reads the RTL block "E-F-slash-G-H", which
1124corresponds to the logical representation.
1125</t>
1126
1127<t>Example 6: Same as Example 5, with more rtl components:
1128<vspace/>Logical representation:
1129"http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
1130"http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
1131name labels and the path components may be easier to identify because
1132the delimiters also move.</t>
1133
1134<t>Example 7: A single rtl component includes digits: <vspace/>Logical
1135representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
1136representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
1137Numbers are written ltr in all cases but are treated as an additional
1138embedding inside a run of rtl characters. This is completely
1139consistent with usual bidirectional text.</t>
1140
1141<t>Example 8 (not allowed): Numbers are at the start or end of an rtl
1142component:<vspace/>Logical representation:
1143"http://ab.cd.ef/GH1/2IJ/KL.html"<vspace/>Visual representation:
1144"http://ab.cd.ef/LK/JI1/2HG.html"<vspace/> The sequence "1/2" is
1145interpreted by the bidi algorithm as a fraction, fragmenting the
1146components and leading to confusion. There are other characters that
1147are interpreted in a special way close to numbers; in particular, "+",
1148"-", "#", "$", "%", ",", ".", and ":".</t>
1149
1150<t>Example 9 (not allowed): The numbers in the previous example are
1151percent-encoded: <vspace/>Logical representation:
1152"http://ab.cd.ef/GH%31/%32IJ/KL.html",<vspace/>Visual representation:
1153"http://ab.cd.ef/LK/JI%32/%31HG.html"</t>
1154
1155<t>Example 10 (allowed but not recommended): <vspace/>Logical
1156representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
1157representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
1158Components consisting of only numbers are allowed (it would be rather
1159difficult to prohibit them), but these may interact with adjacent RTL
1160components in ways that are not easy to predict.</t>
1161
1162<t>Example 11 (allowed but not recommended): <vspace/>Logical
1163representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
1164representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
1165Components consisting of numbers and left-to-right characters are
1166allowed, but these may interact with adjacent RTL components in ways
1167that are not easy to predict.</t>
1168</section><!-- examples -->
1169</section><!-- bidi -->
1170
1171<section title="Normalization and Comparison" anchor="equivalence">
1172
1173<t><list style="hanging"><t hangText="Note:">The structure and much of
1174  the material for this section is taken from section 6 of <xref
1175  target="RFC3986"></xref>; the differences are due to the specifics
1176  of IRIs.</t></list></t>
1177
1178<t>One of the most common operations on IRIs is simple comparison:
1179Determining whether two IRIs are equivalent, without using the IRIs to
1180access their respective resource(s). A comparison is performed
1181whenever a response cache is accessed, a browser checks its history to
1182color a link, or an XML parser processes tags within a
1183namespace. Extensive normalization prior to comparison of IRIs may be
1184used by spiders and indexing engines to prune a search space or reduce
1185duplication of request actions and response storage.</t>
1186
1187<t>IRI comparison is performed for some particular purpose. Protocols
1188or implementations that compare IRIs for different purposes will often
be subject to differing design trade-offs with regard to how much
1190effort should be spent in reducing aliased identifiers. This section
1191describes various methods that may be used to compare IRIs, the
1192trade-offs between them, and the types of applications that might use
1193them.</t>
1194
1195<section title="Equivalence">
1196
1197<t>Because IRIs exist to identify resources, presumably they should be
1198considered equivalent when they identify the same resource. However,
1199this definition of equivalence is not of much practical use, as there
1200is no way for an implementation to compare two resources to determine
1201if they are "the same" unless it has full knowledge or control of
1202them. For this reason, determination of equivalence or difference of
1203IRIs is based on string comparison, perhaps augmented by reference to
1204additional rules provided by URI scheme definitions.  We use the terms
1205"different" and "equivalent" to describe the possible outcomes of such
1206comparisons, but there are many application-dependent versions of
1207equivalence.</t>
1208
1209<t>Even when it is possible to determine that two IRIs are equivalent,
1210IRI comparison is not sufficient to determine whether two IRIs
1211identify different resources. For example, an owner of two different
1212domain names could decide to serve the same resource from both,
1213resulting in two different IRIs. Therefore, comparison methods are
1214designed to minimize false negatives while strictly avoiding false
1215positives.</t>
1216
1217<t>In testing for equivalence, applications should not directly
1218compare relative references; the references should be converted to
1219their respective target IRIs before comparison. When IRIs are compared
1220to select (or avoid) a network action, such as retrieval of a
1221representation, fragment components (if any) should be excluded from
1222the comparison.</t>
1223
1224<t>Applications using IRIs as identity tokens with no relationship to
1225a protocol MUST use the Simple String Comparison (see <xref
1226target="stringcomp"></xref>).  All other applications MUST select one
of the comparison practices from the Comparison Ladder (see <xref
target="ladder"></xref>).</t>
1229</section> <!-- equivalence -->
1230
1231
1232<section title="Preparation for Comparison">
1233<t>Any kind of IRI comparison REQUIRES that any additional contextual
1234processing is first performed, including undoing higher-level
1235escapings or encodings in the protocol or format that carries an
1236IRI. This preprocessing is usually done when the protocol or format is
1237parsed.</t>
1238
1239<t>Examples of contextual preprocessing steps are described in <xref
1240target="LEIRIHREF"/>. </t>
1241
1242<t>Examples of such escapings or encodings are entities and
1243numeric character references in <xref target="HTML4"></xref> and <xref
1244target="XML1"></xref>. As an example,
1245"http://example.org/ros&amp;eacute;" (in HTML),
1246"http://example.org/ros&amp;#233;" (in HTML or XML), and
1247<vspace/>"http://example.org/ros&amp;#xE9;" (in HTML or XML) are all
1248resolved into what is denoted in this document (see <xref
1249target="sec-Notation"></xref>) as "http://example.org/ros&amp;#xE9;"
1250(the "&amp;#xE9;" here standing for the actual e-acute character, to
1251compensate for the fact that this document cannot contain non-ASCII
1252characters).</t>
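
<figure><preamble>For HTML, this kind of preprocessing can be
illustrated with html.unescape from the Python standard library, which
resolves named entities and numeric character references. The variable
names are chosen for this sketch only.</preamble>
<artwork><![CDATA[
```python
from html import unescape

# The three notations from the example above, as they would be
# carried in an HTML document.
carried = [
    "http://example.org/ros&eacute;",  # named entity
    "http://example.org/ros&#233;",    # decimal character reference
    "http://example.org/ros&#xE9;",    # hexadecimal character reference
]

# After unescaping, each string ends in the actual e-acute character.
resolved = [unescape(text) for text in carried]
```
]]></artwork>
<postamble>Only the resolved strings are suitable input for the
comparison methods described in the following
sections.</postamble></figure>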
1253
1254<t>Similar considerations apply to encodings such as Transfer Codings
1255in HTTP (see <xref target="RFC2616"></xref>) and Content Transfer
1256Encodings in MIME (<xref target="RFC2045"></xref>), although in these
1257cases, the encoding is based not on characters but on octets, and
1258additional care is required to make sure that characters, and not just
1259arbitrary octets, are compared (see <xref
1260target="stringcomp"></xref>).</t>
1261
1262</section> <!-- preparation -->
1263
1264<section title="Comparison Ladder" anchor="ladder">
1265
1266<t>In practice, a variety of methods are used to test IRI
1267equivalence. These methods fall into a range distinguished by the
1268amount of processing required and the degree to which the probability
1269of false negatives is reduced. As noted above, false negatives cannot
1270be eliminated. In practice, their probability can be reduced, but this
1271reduction requires more processing and is not cost-effective for all
1272applications.</t>
1273
1274
1275<t>If this range of comparison practices is considered as a ladder,
1276the following discussion will climb the ladder, starting with
1277practices that are cheap but have a relatively higher chance of
1278producing false negatives, and proceeding to those that have higher
1279computational cost and lower risk of false negatives.</t>
1280
1281<section title="Simple String Comparison" anchor="stringcomp">
1282
1283<t>If two IRIs, when considered as character strings, are identical,
1284then it is safe to conclude that they are equivalent.  This type of
1285equivalence test has very low computational cost and is in wide use in
1286a variety of applications, particularly in the domain of parsing. It
1287is also used when a definitive answer to the question of IRI
1288equivalence is needed that is independent of the scheme used and that
1289can be calculated quickly and without accessing a network. An example
1290of such a case is XML Namespaces (<xref
1291target="XMLNamespace"></xref>).</t>
1292
1293
1294<t>Testing strings for equivalence requires some basic precautions.
1295This procedure is often referred to as "bit-for-bit" or
1296"byte-for-byte" comparison, which is potentially misleading. Testing
1297strings for equality is normally based on pair comparison of the
1298characters that make up the strings, starting from the first and
1299proceeding until both strings are exhausted and all characters are
1300found to be equal, until a pair of characters compares unequal, or
1301until one of the strings is exhausted before the other.</t>
1302
1303<t>This character comparison requires that each pair of characters be
1304put in comparable encoding form. For example, should one IRI be stored
in a byte array in UTF-8 encoding form and the second in UTF-16
encoding form, bit-for-bit comparisons applied naively will produce
1307errors. It is better to speak of equality on a character-for-character
1308rather than on a byte-for-byte or bit-for-bit basis.  In practical
1309terms, character-by-character comparisons should be done codepoint by
1310codepoint after conversion to a common character encoding form.
1311
1312When comparing character by character, the comparison function MUST
1313NOT map IRIs to URIs, because such a mapping would create additional
1314spurious equivalences. It follows that an IRI SHOULD NOT be modified
1315when being transported if there is any chance that this IRI might be
1316used in a context that uses Simple String Comparison.</t>
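
<figure><preamble>As a non-normative illustration, the following Python
sketch compares two stored IRIs code point by code point after decoding
each from its own encoding form. The function name same_code_points and
the sample IRI are assumptions made for this example only; no
IRI-to-URI mapping and no normalization is applied.</preamble>
<artwork type="code"><![CDATA[
```python
def same_code_points(a: bytes, a_encoding: str,
                     b: bytes, b_encoding: str) -> bool:
    """Compare two stored IRIs code point by code point.

    Both byte sequences are decoded to str (a sequence of Unicode
    code points) before comparison; nothing else is changed.
    """
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://www.example.org/r\u00e9sum\u00e9.html"
utf8_bytes = iri.encode("utf-8")
utf16_bytes = iri.encode("utf-16-be")

# Naive byte-for-byte comparison of the two encoding forms.
print(utf8_bytes == utf16_bytes)
# Comparison after conversion to a common form.
print(same_code_points(utf8_bytes, "utf-8", utf16_bytes, "utf-16-be"))
```
]]></artwork>
<postamble>The byte-level comparison treats the two stored forms as
different even though they hold the same IRI; the comparison on decoded
strings does not.</postamble></figure>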
1317
1318
1319<t>False negatives are caused by the production and use of IRI
1320aliases. Unnecessary aliases can be reduced, regardless of the
1321comparison method, by consistently providing IRI references in an
1322already normalized form (i.e., a form identical to what would be
1323produced after normalization is applied, as described below).
1324Protocols and data formats often limit some IRI comparisons to simple
1325string comparison, based on the theory that people and implementations
1326will, in their own best interest, be consistent in providing IRI
1327references, or at least be consistent enough to negate any efficiency
1328that might be obtained from further normalization.</t>
1329</section> <!-- stringcomp -->
1330
1331<section title="Syntax-Based Normalization">
1332
1333<figure><preamble>Implementations may use logic based on the
1334definitions provided by this specification to reduce the probability
1335of false negatives. This processing is moderately higher in cost than
1336character-for-character string comparison. For example, an application
1337using this approach could reasonably consider the following two IRIs
1338equivalent:</preamble>
1339
1340<artwork>
1341   example://a/b/c/%7Bfoo%7D/ros&amp;#xE9;
1342   eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
1343</artwork></figure>
1344
1345<t>Web user agents, such as browsers, typically apply this type of IRI
1346normalization when determining whether a cached response is
1347available. Syntax-based normalization includes such techniques as case
1348normalization, character normalization, percent-encoding
1349normalization, and removal of dot-segments.</t>
1350
1351<section title="Case Normalization">
1352
1353<t>For all IRIs, the hexadecimal digits within a percent-encoding
1354triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
1355should be normalized to use uppercase letters for the digits A-F.</t>
1356
1357<t>When an IRI uses components of the generic syntax, the component
1358syntax equivalence rules always apply; namely, that the scheme and
1359US-ASCII only host are case insensitive and therefore should be
1360normalized to lowercase. For example, the URI
1361"HTTP://www.EXAMPLE.com/" is equivalent to
1362"http://www.example.com/". Case equivalence for non-ASCII characters
in IRI components that are IDNs is discussed in <xref
1364target="schemecomp"></xref>.  The other generic syntax components are
1365assumed to be case sensitive unless specifically defined otherwise by
1366the scheme.</t>
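
<figure><preamble>A minimal, non-normative Python sketch of these case
normalization rules follows. The name case_normalize and the simplified
authority parsing are assumptions of this example; hosts containing
non-ASCII characters are deliberately left untouched.</preamble>
<artwork type="code"><![CDATA[
```python
import re

# Component splitter adapted from Appendix B of RFC 3986.
_GENERIC = re.compile(
    r"(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(\?[^#]*)?(#.*)?",
    re.DOTALL)
_HOSTPORT = re.compile(r"(.*?)(:\d*)?")
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def case_normalize(iri: str) -> str:
    scheme, authority, path, query, fragment = \
        _GENERIC.fullmatch(iri).groups()
    out = []
    if scheme is not None:
        out.append(scheme.lower() + ":")
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        host, port = _HOSTPORT.fullmatch(hostport).groups()
        if host.isascii():        # leave non-ASCII (IDN) hosts alone
            host = host.lower()
        out.append("//" + userinfo + at + host + (port or ""))
    out.append(path + (query or "") + (fragment or ""))
    # Uppercase the hex digits of every percent-encoding triplet.
    return _PCT.sub(lambda m: m.group(0).upper(), "".join(out))


print(case_normalize("HTTP://www.EXAMPLE.com/"))
```
]]></artwork>
<postamble>The path, query, and fragment are copied unchanged apart from
the hexadecimal digits, reflecting the assumption that these components
are case sensitive unless a scheme says otherwise.</postamble></figure>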
1367
1368<t>Creating schemes that allow case-insensitive syntax components
1369containing non-ASCII characters should be avoided. Case normalization
1370of non-ASCII characters can be culturally dependent and is always a
1371complex operation. The only exception concerns non-ASCII host names
1372for which the character normalization includes a mapping step derived
1373from case folding.</t>
1374
1375</section> <!-- casenorm -->
1376
1377<section title="Character Normalization" anchor="normalization">
1378
1379<t>The Unicode Standard <xref target="UNIV4"></xref> defines various
1380equivalences between sequences of characters for various
1381purposes. Unicode Standard Annex #15 <xref target="UTR15"></xref>
1382defines various Normalization Forms for these equivalences, in
1383particular Normalization Form C (NFC, Canonical Decomposition,
1384followed by Canonical Composition) and Normalization Form KC (NFKC,
1385Compatibility Decomposition, followed by Canonical Composition).</t>
1386
1387<t> IRIs already in Unicode MUST NOT be normalized before parsing or
1388interpreting. In many non-Unicode character encodings, some text
1389cannot be represented directly. For example, the word "Vietnam" is
1390natively written "Vi&amp;#x1EC7;t Nam" (containing a LATIN SMALL
1391LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct
1392transcoding from the windows-1258 character encoding leads to
1393"Vi&amp;#xEA;&amp;#x323;t Nam" (containing a LATIN SMALL LETTER E WITH
1394CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct transcoding of
1395other 8-bit encodings of Vietnamese may lead to other
1396representations.</t>
1397
1398<t>Equivalence of IRIs MUST rely on the assumption that IRIs are
1399appropriately pre-character-normalized rather than apply character
1400normalization when comparing two IRIs. The exceptions are conversion
1401from a non-digital form, and conversion from a non-UCS-based character
1402encoding to a UCS-based character encoding. In these cases, NFC or a
1403normalizing transcoder using NFC MUST be used for interoperability. To
1404avoid false negatives and problems with transcoding, IRIs SHOULD be
1405created by using NFC. Using NFKC may avoid even more problems; for
1406example, by choosing half-width Latin letters instead of full-width
1407ones, and full-width instead of half-width Katakana.</t>
1408
1409
1410<t>As an example,
1411"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" (in XML
1412Notation) is in NFC. On the other hand,
1413"http://www.example.org/re&amp;#x301;sume&amp;#x301;.html" is not in
1414NFC.</t>
1415
1416<t>The former uses precombined e-acute characters, and the latter uses
1417"e" characters followed by combining acute accents. Both usages are
1418defined as canonically equivalent in <xref target="UNIV4"></xref>.</t>
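
<figure><preamble>The two example IRIs can be examined with a short,
non-normative Python sketch based on the standard unicodedata module.
The helper name is_nfc is an assumption of this example. The sketch is
meant for checking an IRI when it is created, not for normalizing IRIs
during comparison.</preamble>
<artwork type="code"><![CDATA[
```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"   # U+00E9
decomposed = "http://www.example.org/re\u0301sume\u0301.html"  # e + U+0301

print(is_nfc(precomposed), is_nfc(decomposed))
# The two strings differ under simple string comparison ...
print(precomposed == decomposed)
# ... although NFC maps the second onto the first.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```
]]></artwork></figure>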
1419
1420<t><list style="hanging">
1421
1422<t hangText="Note:">
1423Because it is unknown how a particular sequence of characters is being
1424treated with respect to character normalization, it would be
1425inappropriate to allow third parties to normalize an IRI
1426arbitrarily. This does not contradict the recommendation that when a
1427resource is created, its IRI should be as character normalized as
1428possible (i.e., NFC or even NFKC). This is similar to the
1429uppercase/lowercase problems.  Some parts of a URI are case
1430insensitive (for example, the domain name). For others, it is unclear
1431whether they are case sensitive, case insensitive, or something in
1432between (e.g., case sensitive, but with a multiple choice selection if
1433the wrong case is used, instead of a direct negative result).  The
1434best recipe is that the creator use a reasonable capitalization and,
1435when transferring the URI, capitalization never be
1436changed.</t></list></t>
1437
1438<t>Various IRI schemes may allow the usage of Internationalized Domain
1439Names (IDN) <xref target="RFC3490"></xref> either in the ireg-name
1440part or elsewhere. Character Normalization also applies to IDNs, as
1441discussed in <xref target="schemecomp"></xref>.</t>
1442</section> <!-- charnorm -->
1443
1444<section title="Percent-Encoding Normalization">
1445
1446<t>The percent-encoding mechanism (Section 2.1 of <xref
1447target="RFC3986"></xref>) is a frequent source of variance among
1448otherwise identical IRIs. In addition to the case normalization issue
1449noted above, some IRI producers percent-encode octets that do not
1450require percent-encoding, resulting in IRIs that are equivalent to
1451their nonencoded counterparts. These IRIs should be normalized by
1452decoding any percent-encoded octet sequence that corresponds to an
unreserved character, as described in Section 2.3 of <xref
1454target="RFC3986"></xref>.</t>
1455
1456<t>For actual resolution, differences in percent-encoding (except for
1457the percent-encoding of reserved characters) MUST always result in the
1458same resource.  For example, "http://example.org/~user",
"http://example.org/%7euser", and "http://example.org/%7Euser" must
1460resolve to the same resource.</t>
1461
1462<t>If this kind of equivalence is to be tested, the percent-encoding
1463of both IRIs to be compared has to be aligned; for example, by
1464converting both IRIs to URIs (see Section 3.1), eliminating escape
1465differences in the resulting URIs, and making sure that the case of
1466the hexadecimal characters in the percent-encoding is always the same
1467(preferably upper case). If the IRI is to be passed to another
1468application or used further in some other way, its original form MUST
1469be preserved.  The conversion described here should be performed only
1470for local comparison.</t>
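
<figure><preamble>The alignment described above might be sketched in
Python as follows. This is a non-normative illustration: the names
to_uri, pct_normalize, and pct_equivalent are assumptions of this
example, and to_uri is a simplified stand-in for the IRI-to-URI mapping
that omits any special handling of ireg-name.</preamble>
<artwork type="code"><![CDATA[
```python
import re

# "unreserved" characters per Section 2.3 of RFC 3986.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character."""
    return "".join(
        c if ord(c) < 128
        else "".join(f"%{b:02X}" for b in c.encode("utf-8"))
        for c in iri)


def pct_normalize(uri: str) -> str:
    """Decode encoded unreserved characters; uppercase other triplets."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return _PCT.sub(fix, uri)


def pct_equivalent(a: str, b: str) -> bool:
    """Local comparison only; the original IRIs are left unmodified."""
    return pct_normalize(to_uri(a)) == pct_normalize(to_uri(b))


print(pct_equivalent("http://example.org/~user",
                     "http://example.org/%7euser"))
```
]]></artwork>
<postamble>Percent-encoded reserved characters such as "%2f" are kept in
encoded form and only have their hexadecimal digits
uppercased.</postamble></figure>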
1471
1472</section> <!-- pctnorm -->
1473
1474<section title="Path Segment Normalization">
1475
1476<t>The complete path segments "." and ".." are intended only for use
1477within relative references (Section 4.1 of <xref
1478target="RFC3986"></xref>) and are removed as part of the reference
1479resolution process (Section 5.2 of <xref target="RFC3986"></xref>).
1480However, some implementations may incorrectly assume that reference
1481resolution is not necessary when the reference is already an IRI, and
1482thus fail to remove dot-segments when they occur in non-relative
1483paths.  IRI normalizers should remove dot-segments by applying the
1484remove_dot_segments algorithm to the path, as described in Section
14855.2.4 of <xref target="RFC3986"></xref>.</t>
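
<figure><preamble>For illustration, the remove_dot_segments algorithm of
Section 5.2.4 of RFC 3986 can be transcribed into Python roughly as
follows. This is a non-normative sketch; it operates on the path
component only and treats characters beyond US-ASCII like any other
path character.</preamble>
<artwork type="code"><![CDATA[
```python
def remove_dot_segments(path: str) -> str:
    inp, out = path, []
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../"):
            inp = inp[3:]
            if out:
                out.pop()
        elif inp == "/..":
            inp = "/"
            if out:
                out.pop()
        elif inp in (".", ".."):
            inp = ""
        else:
            # Move the first segment (with its leading "/", if any).
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out.append(inp[:end])
            inp = inp[end:]
    return "".join(out)


print(remove_dot_segments("/./b/../b/%63/%7bfoo%7d/ros%C3%A9"))
```
]]></artwork></figure>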
1486
1487</section> <!-- pathnorm -->
1488</section> <!-- ladder -->
1489
1490<section title="Scheme-Based Normalization" anchor="schemecomp">
1491
1492<t>The syntax and semantics of IRIs vary from scheme to scheme, as
1493described by the defining specification for each
1494scheme. Implementations may use scheme-specific rules, at further
1495processing cost, to reduce the probability of false negatives. For
1496example, because the "http" scheme makes use of an authority
1497component, has a default port of "80", and defines an empty path to be
1498equivalent to "/", the following four IRIs are equivalent:</t>
1499
1500<figure><artwork>
1501   http://example.com
1502   http://example.com/
1503   http://example.com:/
1504   http://example.com:80/</artwork></figure>
1505
1506<t>In general, an IRI that uses the generic syntax for authority with
1507an empty path should be normalized to a path of "/". Likewise, an
1508explicit ":port", for which the port is empty or the default for the
1509scheme, is equivalent to one where the port and its ":" delimiter are
1510elided and thus should be removed by scheme-based normalization. For
1511example, the second IRI above is the normal form for the "http"
1512scheme.</t>
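
<figure><preamble>The "http" rules above can be sketched in Python as
follows. This is a non-normative illustration under stated assumptions:
the name http_normalize is hypothetical, the scheme is expected to be
already lowercase, and authorities with userinfo or IP literals are
returned unchanged rather than parsed.</preamble>
<artwork type="code"><![CDATA[
```python
import re

_HTTP = re.compile(r"http://([^/?#]*)([^?#]*)(.*)", re.DOTALL)
_SIMPLE_AUTHORITY = re.compile(r"([^:@\[\]]*)(?::(\d*))?")


def http_normalize(iri: str) -> str:
    m = _HTTP.fullmatch(iri)
    a = m and _SIMPLE_AUTHORITY.fullmatch(m.group(1))
    if not a:
        return iri    # other scheme, userinfo, or IP literal: unchanged
    host, port = a.groups()
    path, rest = m.group(2), m.group(3)
    if port not in (None, "", "80"):
        host += ":" + port          # keep only a non-default port
    # An empty path becomes "/"; query and fragment are kept as they are.
    return "http://" + host + (path or "/") + rest


for iri in ("http://example.com", "http://example.com/",
            "http://example.com:/", "http://example.com:80/"):
    print(http_normalize(iri))
```
]]></artwork>
<postamble>Delimiters of other components are not touched, so an IRI
ending in "?" keeps its empty query.</postamble></figure>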
1513
1514<t>Another case where normalization varies by scheme is in the
1515handling of an empty authority component or empty host
1516subcomponent. For many scheme specifications, an empty authority or
1517host is considered an error; for others, it is considered equivalent
1518to "localhost" or the end-user's host. When a scheme defines a default
1519for authority and an IRI reference to that default is desired, the
1520reference should be normalized to an empty authority for the sake of
1521uniformity, brevity, and internationalization. If, however, either the
1522userinfo or port subcomponents are non-empty, then the host should be
1523given explicitly even if it matches the default.</t>
1524
1525<t>Normalization should not remove delimiters when their associated
1526component is empty unless it is licensed to do so by the scheme
1527specification. For example, the IRI "http://example.com/?" cannot be
1528assumed to be equivalent to any of the examples above. Likewise, the
1529presence or absence of delimiters within a userinfo subcomponent is
1530usually significant to its interpretation.  The fragment component is
1531not subject to any scheme-based normalization; thus, two IRIs that
1532differ only by the suffix "#" are considered different regardless of
1533the scheme.</t>
1534 
<t>Some IRI schemes allow the usage of Internationalized Domain
Names (IDN) <xref target='RFC5890'></xref> either in their ireg-name
part or elsewhere. When in use in IRIs, those names SHOULD
conform to the definition of U-Label in <xref
target='RFC5890'></xref>. An IRI containing an invalid IDN cannot
successfully be resolved. For legibility purposes, such names
SHOULD NOT be converted into ASCII Compatible Encoding (ACE).</t>
1542
<t>Scheme-based normalization may also consider IDN
components and their conversions to Punycode as equivalent. As an
example, "http://r&amp;#xE9;sum&amp;#xE9;.example.org" may be
considered equivalent to
"http://xn--rsum-bpad.example.org".</t>

<t>Other scheme-specific normalizations are possible.</t>
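
<figure><preamble>The IDN equivalence mentioned above can be illustrated
with Python's built-in "idna" codec. This is a non-normative sketch:
that codec implements the older IDNA definition of RFC 3490 rather than
RFC 5890, and the function names are assumptions of this
example.</preamble>
<artwork type="code"><![CDATA[
```python
def host_to_ace(host: str) -> str:
    """Convert a host name to its ASCII Compatible Encoding form."""
    return host.encode("idna").decode("ascii")


def same_host(a: str, b: str) -> bool:
    """Compare two host names after conversion to ACE (local use only)."""
    return host_to_ace(a) == host_to_ace(b)


print(host_to_ace("r\u00e9sum\u00e9.example.org"))
print(same_host("r\u00e9sum\u00e9.example.org",
                "xn--rsum-bpad.example.org"))
```
]]></artwork>
<postamble>The conversion is used here only to compare hosts; the IRI
itself keeps its original, legible form.</postamble></figure>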
1549
1550</section> <!-- schemenorm -->
1551
1552<section title="Protocol-Based Normalization">
1553
1554<t>Substantial effort to reduce the incidence of false negatives is
1555often cost-effective for web spiders. Consequently, they implement
1556even more aggressive techniques in IRI comparison. For example, if
1557they observe that an IRI such as</t>
1558
1559<figure><artwork>
1560   http://example.com/data</artwork></figure>
1561<t>redirects to an IRI differing only in the trailing slash</t>
1562<figure><artwork>
1563   http://example.com/data/</artwork></figure>
1564
1565<t>they will likely regard the two as equivalent in the future.  This
1566kind of technique is only appropriate when equivalence is clearly
1567indicated by both the result of accessing the resources and the common
1568conventions of their scheme's dereference algorithm (in this case, use
1569of redirection by HTTP origin servers to avoid problems with relative
1570references).</t>
1571
1572</section> <!-- protonorm -->
1573</section> <!-- equivalence -->
1574</section> 
1575
1576<section title="Use of IRIs" anchor="IRIuse">
1577
1578<section title="Limitations on UCS Characters Allowed in IRIs" anchor="limitations">
1579
1580<t>This section discusses limitations on characters and character
1581sequences usable for IRIs beyond those given in <xref target="abnf"/>
1582and <xref target="visual"/>. The considerations in this section are
1583relevant when IRIs are created and when URIs are converted to
1584IRIs.</t>
1585
1586<t>
1587
1588<list style="hanging"><t hangText="a.">The repertoire of characters allowed
1589    in each IRI component is limited by the definition of that component.
1590    For example, the definition of the scheme component does not allow
1591    characters beyond US-ASCII.
1592    <vspace blankLines="1"/>
1593    (Note: In accordance with URI practice, generic IRI
1594    software cannot and should not check for such limitations.)</t>
1595
1596<t hangText="b.">The UCS contains many areas of characters for which
1597    there are strong visual look-alikes. Because of the likelihood of
1598    transcription errors, these also should be avoided. This includes
1599    the full-width equivalents of Latin characters, half-width
1600    Katakana characters for Japanese, and many others. It also
1601    includes many look-alikes of "space", "delims", and "unwise",
1602    characters excluded in <xref target="RFC3491"/>.</t>
1603   
1604</list>
1605</t>
1606
1607<t>Additional information is available from <xref target="UNIXML"/>.
1608    <xref target="UNIXML"/> is written in the context of running text
1609    rather than in that of identifiers. Nevertheless, it discusses
1610    many of the categories of characters not appropriate for IRIs.</t>
1611</section> <!-- limitations -->
1612
1613<section title="Software Interfaces and Protocols">
1614
1615<t>Although an IRI is defined as a sequence of characters, software
1616interfaces for URIs typically function on sequences of octets or other
1617kinds of code units. Thus, software interfaces and protocols MUST
1618define which character encoding is used.</t>
1619
1620<t>Intermediate software interfaces between IRI-capable components and
1621URI-only components MUST map the IRIs per <xref target="mapping"/>,
1622when transferring from IRI-capable to URI-only components.
1623
1624This mapping SHOULD be applied as late as possible. It SHOULD NOT be
1625applied between components that are known to be able to handle IRIs.</t>
1626</section> <!-- software -->
1627
1628<section title="Format of URIs and IRIs in Documents and Protocols">
1629
1630<t>Document formats that transport URIs may have to be upgraded to allow
1631the transport of IRIs. In cases where the document as a whole
1632has a native character encoding, IRIs MUST also be encoded in this
1633character encoding and converted accordingly by a parser or interpreter.
1634
1635IRI characters not expressible in the native character encoding SHOULD
1636be escaped by using the escaping conventions of the document format if
1637such conventions are available. Alternatively, they MAY be
1638percent-encoded according to <xref target="mapping"/>. For example, in
1639HTML or XML, numeric character references SHOULD be used. If a
1640document as a whole has a native character encoding and that character
1641encoding is not UTF-8, then IRIs MUST NOT be placed into the document
1642in the UTF-8 character encoding.</t>
1643
1644<t>((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
1645although they use different terminology. HTML 4.0 <xref
1646target="HTML4"/> defines the conversion from IRIs to URIs as
1647error-avoiding behavior. XML 1.0 <xref target="XML1"/>, XLink <xref
1648target="XLink"/>, XML Schema <xref target="XMLSchema"/>, and
1649specifications based upon them allow IRIs. Also, it is expected that
1650all relevant new W3C formats and protocols will be required to handle
1651IRIs <xref target="CharMod"/>.</t>
1652
1653</section> <!-- format -->
1654
1655<section title="Use of UTF-8 for Encoding Original Characters" anchor="UTF8use">
1656
1657<t>This section discusses details and gives examples for point c) in
1658<xref target="Applicability"/>. To be able to use IRIs, the URI
1659corresponding to the IRI in question has to encode original characters
1660into octets by using UTF-8.  This can be specified for all URIs of a
1661URI scheme or can apply to individual URIs for schemes that do not
1662specify how to encode original characters.  It can apply to the whole
1663URI, or only to some part. For background information on encoding
1664characters into URIs, see also Section 2.5 of <xref
1665target="RFC3986"/>.</t>
1666
1667<t>For new URI schemes, using UTF-8 is recommended in <xref
1668target="RFC4395bis"/>.  Examples where UTF-8 is already used are the URN
1669syntax <xref target="RFC2141"/>, IMAP URLs <xref target="RFC2192"/>,
1670and POP URLs <xref target="RFC2384"/>.  On the other hand, because the
1671HTTP URI scheme does not specify how to encode original characters,
1672only some HTTP URLs can have corresponding but different IRIs.</t>
1673
1674<t>For example, for a document with a URI
1675of<vspace/>"http://www.example.org/r%C3%A9sum%C3%A9.html", it is
1676possible to construct a corresponding IRI (in XML notation, see <xref
1677target="sec-Notation"/>):
1678"http://www.example.org/r&amp;#xE9;sum&amp;#xE9;.html" ("&amp;#xE9;"
1679stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded
1680and percent-encoded representation of that character). On the other
1681hand, for a document with a URI of
"http://www.example.org/r%E9sum%E9.html", the percent-encoded octets
1683cannot be converted to actual characters in an IRI, as the
1684percent-encoding is not based on UTF-8.</t>
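
<figure><preamble>Whether the percent-encoded octets of a URI component
form valid UTF-8 can be checked with a short, non-normative Python
sketch. The function name pct_octets_are_utf8 is an assumption of this
example, and the check looks at the component as a whole rather than
performing the full URI-to-IRI conversion.</preamble>
<artwork type="code"><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def pct_octets_are_utf8(component: str) -> bool:
    """Return True if the decoded octets form a valid UTF-8 sequence."""
    try:
        unquote_to_bytes(component).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(pct_octets_are_utf8("r%C3%A9sum%C3%A9.html"))
print(pct_octets_are_utf8("r%E9sum%E9.html"))
```
]]></artwork></figure>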
1685
1686<t>For most URI schemes, there is no need to upgrade their scheme
1687definition in order for them to work with IRIs.  The main case where
1688upgrading makes sense is when a scheme definition, or a particular
1689component of a scheme, is strictly limited to the use of US-ASCII
1690characters with no provision to include non-ASCII characters/octets
1691via percent-encoding, or if a scheme definition currently uses highly
1692scheme-specific provisions for the encoding of non-ASCII characters.
1693An example of this is the mailto: scheme <xref target="RFC2368"/>.</t>
1694
1695<t>This specification updates the IANA registry of URI schemes to note
1696their applicability to IRIs, see <xref target="iana"/>.  All IRIs use
1697URI schemes, and all URIs with URI schemes can be used as IRIs, even
1698though in some cases only by using URIs directly as IRIs, without any
1699conversion.</t>
1700
1701<t>Scheme definitions can impose restrictions on the syntax of
1702scheme-specific URIs; i.e., URIs that are admissible under the generic
1703URI syntax <xref target="RFC3986"/> may not be admissible due to
1704narrower syntactic constraints imposed by a URI scheme
1705specification. URI scheme definitions cannot broaden the syntactic
1706restrictions of the generic URI syntax; otherwise, it would be
1707possible to generate URIs that satisfied the scheme-specific syntactic
1708constraints without satisfying the syntactic constraints of the
1709generic URI syntax. However, additional syntactic constraints imposed
1710by URI scheme specifications are applicable to IRI, as the
1711corresponding URI resulting from the mapping defined in <xref
1712target="mapping"/> MUST be a valid URI under the syntactic
1713restrictions of generic URI syntax and any narrower restrictions
1714imposed by the corresponding URI scheme specification.</t>
1715
1716<t>The requirement for the use of UTF-8 generally applies to all parts
1717of a URI.  However, it is possible that the capability of IRIs to
1718represent a wide range of characters directly is used just in some
1719parts of the IRI (or IRI reference). The other parts of the IRI may
1720only contain US-ASCII characters, or they may not be based on
1721UTF-8. They may be based on another character encoding, or they may
1722directly encode raw binary data (see also <xref
1723target="RFC2397"/>). </t>
1724
1725<t>For example, it is possible to have a URI reference
1726of<vspace/>"http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9",
1727where the document name is encoded in iso-8859-1 based on server
1728settings, but where the fragment identifier is encoded in UTF-8 according
1729to <xref target="XPointer"/>. The IRI corresponding to the above
1730URI would be (in XML notation)<vspace/>"http://www.example.org/r%E9sum%E9.xml#r&amp;#xE9;sum&amp;#xE9;".</t>
1731
1732<t>Similar considerations apply to query parts. The functionality
1733of IRIs (namely, to be able to include non-ASCII characters) can
1734only be used if the query part is encoded in UTF-8.</t>
1735
1736</section> <!-- utf8 -->
1737
1738<section title="Relative IRI References">
1739<t>Processing of relative IRI references against a base is handled
1740straightforwardly; the algorithms of <xref target="RFC3986"/> can
1741be applied directly, treating the characters additionally allowed
1742in IRI references in the same way that unreserved characters are in URI
1743references.</t>
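
<figure><preamble>As a non-normative illustration, Python's standard
urljoin function can resolve a relative IRI reference that contains
characters beyond US-ASCII without any prior conversion. The wrapper
name resolve and the sample IRIs are assumptions of this example, which
also assumes that urljoin follows Section 5.2 of RFC 3986 for the cases
shown.</preamble>
<artwork type="code"><![CDATA[
```python
from urllib.parse import urljoin


def resolve(base_iri: str, reference: str) -> str:
    """Resolve an IRI reference against a base IRI, unencoded."""
    return urljoin(base_iri, reference)


print(resolve("http://www.example.org/a/b/c", "../r\u00e9sum\u00e9.html"))
print(resolve("http://www.example.org/r\u00e9sum\u00e9/", "./notes"))
```
]]></artwork></figure>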
1744
1745</section> <!-- relative -->
1746</section> <!-- IRIuse -->
1747
1748<section title="Liberal handling of otherwise invalid IRIs" anchor="LEIRIHREF">
1749
1750<t>(EDITOR NOTE: This Section may move to an appendix.)
1751 
1752Some technical specifications and widely-deployed software have
1753allowed additional variations and extensions of IRIs to be used in
1754syntactic components. This section describes two widely-used
1755preprocessing agreements. Other technical specifications may wish to
1756reference a syntactic component which is "a valid IRI or a string that
1757will map to a valid IRI after this preprocessing algorithm". These two
1758variants are known as <xref target="LEIRI">Legacy Extended IRI or
LEIRI</xref>, and <xref target="HTML5">Web Address</xref>.
1760</t>
1761
1762<t>Future technical specifications SHOULD NOT allow conforming
1763producers to produce, or conforming content to contain, such forms,
1764as they are not interoperable with other IRI consuming software.</t>
1765
1766<section title="LEIRI processing"  anchor="LEIRIspec">
1767  <t>This section defines Legacy Extended IRIs (LEIRIs).
1768    The syntax of Legacy Extended IRIs is the same as that for IRIs,
1769    except that the ucschar production is replaced by the leiri-ucschar production:</t>
1770<figure>
1771
1772<artwork>
1773  leiri-ucschar  = " " / "&lt;" / "&gt;" / '"' / "{" / "}" / "|"
1774                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1775                   / %xE000-FFFD / %x10000-10FFFF
1776</artwork>
1777
1778<postamble>
1779  Among other extensions, processors based on this specification also
1780  did not enforce the restriction on bidirectional formatting
1781  characters in <xref target="visual"></xref>, and the iprivate
1782  production becomes redundant.</postamble>
1783</figure>
1784
1785<t>To convert a string allowed as a LEIRI to an IRI, each character
1786allowed in leiri-ucschar but not in ucschar must be percent-encoded
1787using <xref target="compmapping"/>.</t>
1788</section> <!-- leiriproc -->
1789
1790<section title="Web Address processing" anchor="webaddress">
1791
1792<t>Many popular web browsers have taken the approach of being quite
1793liberal in what is accepted as a "URL" or its relative
1794forms. This section describes their behavior in terms of a preprocessor
1795which maps strings into the IRI space for subsequent parsing and
1796interpretation as an IRI.</t>
1797
1798<t>In some situations, it might be appropriate to describe the syntax
1799that a liberal consumer implementation might accept as a "Web
1800Address" or "Hypertext Reference" or "HREF". However,
1801technical specifications SHOULD restrict the syntactic form allowed by compliant producers
1802to the IRI or IRI reference syntax defined in this document
1803even if they want to mandate this processing.</t>
1804
1805<t>
1806Summary:
1807<list style="symbols">
1808   <t>Leading and trailing whitespace is removed.</t>
1809   <t>Some additional characters are removed.</t>
1810   <t>Some additional characters are allowed and escaped (as with LEIRI).</t>
1811   <t>If interpreting an IRI as a URI, the pct-encoding of the query
1812   component of the parsed URI component depends on operational
1813   context.</t>
1814</list>
1815</t>
1816
<t>Each string provided may have an associated charset (called
the HRef-charset here); this defaults to UTF-8.
For web browsers interpreting HTML, the HRef-charset
of a string is determined as follows:
1821
1822<list style="hanging">
1823<t hangText="If the string came from a script (e.g. as an argument to
1824 a method)">The HRef-charset is the script's charset.</t>
1825
1826<t hangText="If the string came from a DOM node (e.g. from an
1827  element)">The node has a Document, and the HRef-charset is the
1828  Document's character encoding.</t>
1829
1830<t hangText="If the string had a HRef-charset defined when the string was
1831created or defined">The HRef-charset is as defined.</t>
1832
1833</list></t>
1834
<t>If the resulting HRef-charset is a Unicode-based character encoding
1836(e.g., UTF-16), then use UTF-8 instead.</t>
1837
1838
1839<figure>
1840<preamble>The syntax for Web Addresses is obtained by replacing the 'ucschar',
1841  pct-form, and path-sep rules with the href-ucschar, href-pct-form, and href-path-sep
1842  rules below. In addition, some characters are stripped.</preamble>
1843
1844<artwork type='abnf'>
1845  href-ucschar  = " " / "&lt;" / "&gt;" / DQUOTE / "{" / "}" / "|"
1846                   / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
1847                   / %xE000-FFFD / %x10000-10FFFF
1848  href-pct-form = pct-encoded / "%"
1849  href-path-sep = "/" / "\"
1850  href-strip    = &lt;to be done&gt;
1851</artwork>
1852
1853<postamble>
1854(NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT SENTENCE)
1855browsers did not enforce the restriction on bidirectional formatting
1856  characters in <xref target="visual"></xref>, and the iprivate
1857  production becomes redundant.</postamble>
1858</figure>
1859
1860<t>'Web Address processing' requires the following additional
1861preprocessing steps:
1862
1863<list style="numbers">
1864
<t>Leading and trailing instances of space (U+0020),
CR (U+000D), LF (U+000A), and TAB (U+0009) characters are removed.</t>

<t>Strip all characters in href-strip.</t>
1869  <t>Percent-encode all characters in href-ucschar not in ucschar.</t>
1870  <t>Replace occurrences of "%" not followed by two hexadecimal digits by "%25".</t>
1871  <t>Convert backslashes ('\') matching href-path-sep to forward slashes ('/').</t>
1872</list></t>
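
<figure><preamble>Part of the preprocessing listed above can be sketched
in Python as follows. This is a non-normative, partial illustration: the
name web_address_preprocess is hypothetical; the stripping of href-strip
characters is omitted because that rule is not yet defined, and the
percent-encoding of href-ucschar characters is omitted because it
depends on the ucschar production; every backslash is converted, which
is an assumption about the intended scope of href-path-sep.</preamble>
<artwork type="code"><![CDATA[
```python
import re

# A "%" that is not followed by two hexadecimal digits.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def web_address_preprocess(s: str) -> str:
    # Remove leading and trailing space, CR, LF, and TAB.
    s = s.strip(" \r\n\t")
    # Replace stray "%" characters by "%25".
    s = _STRAY_PERCENT.sub("%25", s)
    # Convert backslashes to forward slashes.
    return s.replace("\\", "/")


print(web_address_preprocess("  http:\\\\example.com\\a%zz%41 \n"))
```
]]></artwork></figure>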
1873</section> <!-- webaddress -->
1874
1875<section title="Characters not allowed in IRIs" anchor="notAllowed">
1876
1877<t>This section provides a list of the groups of characters and code
1878points that are allowed by LEIRI or HREF but are not allowed in IRIs or are
1879allowed in IRIs only in the query part. For each group of characters,
1880advice on the usage of these characters is also given, concentrating
1881on the reasons for why they are excluded from IRI use.</t>
1882
1883<t>
1884
1885<list><t>Space (U+0020): Some formats and applications use space as a
1886delimiter, e.g. for items in a list. Appendix C of <xref
1887target="RFC3986"></xref> also mentions that white space may have to be
1888added when displaying or printing long URIs; the same applies to long
1889IRIs. This means that spaces can disappear, or can make the what is
1890intended as a single IRI or IRI reference to be treated as two or more
1891separate IRIs.</t>
1892
1893<t>Delimiters "&lt;" (U+003C), "&gt;" (U+003E), and '"' (U+0022):
1894Appendix C of <xref target="RFC3986"></xref> suggests the use of
1895double-quotes ("http://example.com/") and angle brackets
1896(&lt;http://example.com/&gt;) as delimiters for URIs in plain
1897text. These conventions are often used, and also apply to IRIs.  Using
1898these characters in strings intended to be IRIs would result in the
1899IRIs being cut off at the wrong place.</t>
1900
1901<t>Unwise characters "\" (U+005C), "^" (U+005E), "`"
1902(U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These
characters were originally excluded from URIs because the
respective codepoints are assigned to different graphic characters in
some 7-bit or 8-bit encodings. Despite the move to Unicode, some of
1906these characters are still occasionally displayed differently on some
1907systems, e.g. U+005C may appear as a Japanese Yen symbol on some
1908systems. Also, the fact that these characters are not used in URIs or
1909IRIs has encouraged their use outside URIs or IRIs in contexts that
1910may include URIs or IRIs. If a string with such a character were used
1911as an IRI in such a context, it would likely be interpreted
1912piecemeal.</t>
1913
<t>The controls (C0 controls, DEL, and C1 controls, U+0000-001F and
U+007F-009F): There is generally no way to transmit these characters reliably
as text outside of a charset encoding.  Even when in encoded form,
many software components silently filter out some of these characters,
or may stop processing altogether when encountering some of
them. These characters may affect text display in subtle, unnoticeable
1920ways or in drastic, global, and irreversible ways depending on the
1921hardware and software involved. The use of some of these characters
1922would allow malicious users to manipulate the display of an IRI and
1923its context in many situations.</t>
1924
1925<t>Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
characters affect the display ordering of characters. If IRIs were
allowed to contain these characters and the resulting visual display
were transcribed, they could not be converted back to electronic form
(logical order) unambiguously. These characters, if allowed in IRIs,
might allow malicious users to manipulate the display of an IRI and its
context.</t>
1932
1933<t>Specials (U+FFF0-FFFD): These code points provide functionality
1934beyond that useful in an IRI, for example byte order identification,
1935annotation, and replacements for unknown characters and objects. Their
1936use and interpretation in an IRI would serve no purpose and might lead
1937to confusing display variations.</t>
1938
1939<t>Private use code points (U+E000-F8FF, U+F0000-FFFFD,
1940U+100000-10FFFD): Display and interpretation of these code points is
1941by definition undefined without private agreement. Therefore, these
1942code points are not suited for use on the Internet. They are not
1943interoperable and may have unpredictable effects.</t>
1944
1945<t>Tags (U+E0000-E0FFF): These characters provide a way to language
1946tag in Unicode plain text. They are not appropriate for IRIs because
1947language information in identifiers cannot reliably be input,
1948transmitted (e.g. on a visual medium such as paper), or
1949recognized.</t>
1950
1951<t>Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
1952U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
1953U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
1954U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
1955U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
1956non-characters. Applications may use some of them internally, but are
1957not prepared to interchange them.</t>
1958
1959</list></t>
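
<t>As an illustration only, the categories above from the delimiters
onward can be turned into a simple pre-check that reports which
characters of a candidate string fall into them. The following is a
minimal sketch in Python; the names FLAGGED_RANGES, classify, and
flag_code_points are hypothetical and are not defined by this
document, and the sketch does not cover spaces or surrogate code
units.</t>

<figure><artwork><![CDATA[
```python
# Ranges transcribed from the list above. The table and function names
# are illustrative assumptions, not part of this specification.
FLAGGED_RANGES = [
    (0x0000, 0x001F, "control"),
    (0x007F, 0x009F, "control"),
    (0x200E, 0x200F, "bidi formatting"),
    (0x202A, 0x202E, "bidi formatting"),
    (0xFFF0, 0xFFFD, "special"),
    (0xE000, 0xF8FF, "private use"),
    (0xF0000, 0xFFFFD, "private use"),
    (0x100000, 0x10FFFD, "private use"),
    (0xE0000, 0xE0FFF, "tag"),
    (0xFDD0, 0xFDEF, "non-character"),
]
DELIMITERS_AND_UNWISE = {
    "<": "delimiter", ">": "delimiter", '"': "delimiter",
    "\\": "unwise", "^": "unwise", "`": "unwise",
    "{": "unwise", "|": "unwise", "}": "unwise",
}


def classify(ch):
    """Return the category of ch from the list above, or None."""
    if ch in DELIMITERS_AND_UNWISE:
        return DELIMITERS_AND_UNWISE[ch]
    cp = ord(ch)
    # Last two code points of planes 1 through 16
    # (U+1FFFE-1FFFF up to U+10FFFE-10FFFF).
    if cp >= 0x10000 and (cp & 0xFFFF) >= 0xFFFE:
        return "non-character"
    for low, high, category in FLAGGED_RANGES:
        if low <= cp <= high:
            return category
    return None


def flag_code_points(text):
    """List (index, 'U+XXXX', category) for each flagged character."""
    findings = []
    for index, ch in enumerate(text):
        category = classify(ch)
        if category is not None:
            findings.append((index, "U+%04X" % ord(ch), category))
    return findings
```
]]></artwork></figure>

<t>A caller might reject the string, or percent-encode the reported
characters, depending on whether it is processing an IRI or a Legacy
Extended IRI.</t>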
1960
1961<t>LEIRI preprocessing disallowed some code points and
1962code units:
1963
1964<list><t>Surrogate code units (D800-DFFF): These do not represent
1965Unicode codepoints.</t></list></t>
1966</section> <!-- notallowed -->
1967</section> <!-- lieirihref -->
1968 
1969<section title="URI/IRI Processing Guidelines (Informative)" anchor="guidelines">
1970
1971<t>This informative section provides guidelines for supporting IRIs in
1972the same software components and operations that currently process
1973URIs: Software interfaces that handle URIs, software that allows users
1974to enter URIs, software that creates or generates URIs, software that
1975displays URIs, formats and protocols that transport URIs, and software
1976that interprets URIs. These may all require modification before
1977functioning properly with IRIs. The considerations in this section
1978also apply to URI references and IRI references.</t>
1979
1980<section title="URI/IRI Software Interfaces">
1981<t>Software interfaces that handle URIs, such as URI-handling APIs and
1982protocols transferring URIs, need interfaces and protocol elements
1983that are designed to carry IRIs.</t>
1984
1985<t>In case the current handling in an API or protocol is based on
1986US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
1987it is compatible with US-ASCII, is in accordance with the
1988recommendations of <xref target="RFC2277"/>, and makes converting to
1989URIs easy. In any case, the API or protocol definition must clearly
1990define the character encoding to be used.</t>
1991
1992<t>The transfer from URI-only to IRI-capable components requires no
1993mapping, although the conversion described in <xref
1994target="URItoIRI"/> above may be performed. It is preferable not to
1995perform this inverse conversion unless it is certain this can be done
1996correctly.</t>
1997</section>
1998
1999<section title="URI/IRI Entry">
2000
2001<t>Some components allow users to enter URIs into the system
2002by typing or dictation, for example. This software must be updated to allow
2003for IRI entry.</t>
2004
2005<t>A person viewing a visual representation of an IRI (as a sequence
2006of glyphs, in some order, in some visual display) or hearing an IRI
2007will use an entry method for characters in the user's language to
2008input the IRI. Depending on the script and the input method used, this
2009may be a more or less complicated process.</t>
2010
2011<t>The process of IRI entry must ensure, as much as possible, that the
2012restrictions defined in <xref target="abnf"/> are met. This may be
2013done by choosing appropriate input methods or variants/settings
2014thereof, by appropriately converting the characters being input, by
2015eliminating characters that cannot be converted, and/or by issuing a
2016warning or error message to the user.</t>
2017
2018<t>As an example of variant settings, input method editors for East
Asian languages usually allow the input of Latin letters and related
2020characters in full-width or half-width versions. For IRI input, the
2021input method editor should be set so that it produces half-width Latin
2022letters and punctuation and full-width Katakana.</t>
2023
<t>An input field primarily or solely used for the input of URIs/IRIs,
or any other place where the input of IRIs is frequent, might allow
the user to view an IRI as it is mapped to a URI. This will help users
when some of the software they use does not yet accept IRIs.</t>
2029
2030<t>An IRI input component interfacing to components that handle URIs,
2031but not IRIs, must map the IRI to a URI before passing it to these
2032components.</t>
2033
2034<t>For the input of IRIs with right-to-left characters, please see
2035<xref target="bidiInput"></xref>.</t>
2036</section>
2037
2038<section title="URI/IRI Transfer between Applications">
2039
2040<t>Many applications (for example, mail user agents) try to detect
2041URIs appearing in plain text. For this, they use some heuristics based
2042on URI syntax. They then allow the user to click on such URIs and
2043retrieve the corresponding resource in an appropriate (usually
2044scheme-dependent) application.</t>
2045
2046<t>Such applications would need to be upgraded, in order to use the
2047IRI syntax as a base for heuristics. In particular, a non-ASCII
2048character should not be taken as the indication of the end of an IRI.
2049Such applications also would need to make sure that they correctly
2050convert the detected IRI from the character encoding of the document
2051or application where the IRI appears, to the character encoding used
2052by the system-wide IRI invocation mechanism, or to a URI (according to
2053<xref target="mapping"/>) if the system-wide invocation mechanism only
2054accepts URIs.</t>
2055
2056<t>The clipboard is another frequently used way to transfer URIs and
2057IRIs from one application to another. On most platforms, the clipboard
2058is able to store and transfer text in many languages and scripts.
2059Correctly used, the clipboard transfers characters, not octets, which
2060will do the right thing with IRIs.</t>
2061</section>
2062
2063<section title="URI/IRI Generation">
2064
2065<t>Systems that offer resources through the Internet, where those
2066resources have logical names, sometimes automatically generate URIs
2067for the resources they offer. For example, some HTTP servers can
2068generate a directory listing for a file directory and then respond to
2069the generated URIs with the files.</t>
2070
2071<t>Many legacy character encodings are in use in various file systems.
2072Many currently deployed systems do not transform the local character
2073representation of the underlying system before generating URIs.</t>
2074
2075<t>For maximum interoperability, systems that generate resource
2076identifiers should make the appropriate transformations. For example,
2077if a file system contains a file named
2078"r&amp;#xE9;sum&amp;#xE9;.html", a server should expose this as
2079"r%C3%A9sum%C3%A9.html" in a URI, which allows use of
2080"r&amp;#xE9;sum&amp;#xE9;.html" in an IRI, even if locally the file
2081name is kept in a character encoding other than UTF-8.
2082</t>
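
<t>The transformation in this example can be sketched as follows,
assuming Python and its standard urllib.parse module. The function
name file_name_to_uri_segment and the choice of ISO-8859-1 as the
local encoding are assumptions made for illustration; a real server
would use whatever encoding its file system actually uses.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote


def file_name_to_uri_segment(raw_name, local_encoding):
    """Convert a file name stored as bytes in a legacy encoding
    into a percent-encoded URI path segment based on UTF-8."""
    # Legacy octets -> characters, using the file system's encoding.
    name = raw_name.decode(local_encoding)
    # Characters -> UTF-8 octets -> percent-encoding. With safe=""
    # even "/" is escaped, since this is a single path segment.
    return quote(name, safe="", encoding="utf-8")


# The file name from the example above, stored in ISO-8859-1.
segment = file_name_to_uri_segment(b"r\xe9sum\xe9.html", "iso-8859-1")
print(segment)
```
]]></artwork></figure>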
2083
2084<t>This recommendation particularly applies to HTTP servers. For FTP
2085servers, similar considerations apply; see <xref target="RFC2640"/>.</t>
2086</section>
2087
2088<section title="URI/IRI Selection" anchor="selection">
2089<t>In some cases, resource owners and publishers have control over the
IRIs used to identify their resources. This control is mostly
exercised by controlling the resource names, such as file names,
directly.</t>
2093
2094<t>In these cases, it is recommended to avoid choosing IRIs that are
2095easily confused. For example, for US-ASCII, the lower-case ell ("l") is
2096easily confused with the digit one ("1"), and the upper-case oh ("O") is
2097easily confused with the digit zero ("0"). Publishers should avoid
2098confusing users with "br0ken" or "1ame" identifiers.</t>
2099
2100<t>Outside the US-ASCII repertoire, there are many more opportunities for
2101confusion; a complete set of guidelines is too lengthy to include
2102here. As long as names are limited to characters from a single script,
2103native writers of a given script or language will know best when
2104ambiguities can appear, and how they can be avoided. What may look
2105ambiguous to a stranger may be completely obvious to the average
2106native user. On the other hand, in some cases, the UCS contains
2107variants for compatibility reasons; for example, for typographic purposes.
2108These should be avoided wherever possible. Although there may be exceptions,
2109newly created resource names should generally be in NFKC
2110<xref target="UTR15"></xref> (which means that they are also in NFC).</t>
2111
2112<t>As an example, the UCS contains the "fi" ligature at U+FB01
2113for compatibility reasons.
2114Wherever possible, IRIs should use the two letters "f" and "i" rather
2115than the "fi" ligature. An example where the latter may be used is
in the query part of an IRI for an explicit search for a word written
with the "fi" ligature.</t>
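
<t>A publisher could check proposed resource names against NFKC
before creating them. The following is a minimal sketch, assuming
Python and its standard unicodedata module; the function name
nfkc_report is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata


def nfkc_report(name):
    """Return the NFKC form of name and whether name was already
    in that form."""
    normalized = unicodedata.normalize("NFKC", name)
    return normalized, normalized == name


# U+FB01 is the "fi" ligature; NFKC replaces it with "f" and "i".
print(nfkc_report("\ufb01le.html"))
print(nfkc_report("file.html"))
```
]]></artwork></figure>

<t>Where a name is reported as not normalized, the normalized form
is usually the better choice for a newly created resource, subject
to the exceptions noted above.</t>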
2118
2119<t>In certain cases, there is a chance that characters from different
2120scripts look the same. The best known example is the similarity of the
2121Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid such
2122cases, IRIs should only be created where all the characters in a
2123single component are used together in a given language. This usually
2124means that all of these characters will be from the same script, but
2125there are languages that mix characters from different scripts (such
2126as Japanese).  This is similar to the heuristics used to distinguish
2127between letters and numbers in the examples above. Also, for Latin,
2128Greek, and Cyrillic, using lowercase letters results in fewer
2129ambiguities than using uppercase letters would.</t>
2130</section>
2131
2132<section title="Display of URIs/IRIs" anchor="display">
2133<t>
2134In situations where the rendering software is not expected to display
2135non-ASCII parts of the IRI correctly using the available layout and font
2136resources, these parts should be percent-encoded before being displayed.</t>
2137
2138<t>For display of Bidi IRIs, please see <xref target="visual"/>.</t>
2139</section>
2140
2141<section title="Interpretation of URIs and IRIs">
2142<t>Software that interprets IRIs as the names of local resources should
2143accept IRIs in multiple forms and convert and match them with the
2144appropriate local resource names.</t>
2145
<t>First, these multiple forms include both IRIs in the native
character encoding of the protocol and also their URI counterparts.</t>
2148
<t>Second, they may include URIs constructed based on character
2150encodings other than UTF-8. These URIs may be produced by user agents that do
2151not conform to this specification and that use legacy character encodings to
2152convert non-ASCII characters to URIs. Whether this is necessary, and what
2153character encodings to cover, depends on a number of factors, such as
2154the legacy character encodings used locally and the distribution of
2155various versions of user agents. For example, software for Japanese
2156may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.</t>
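
<t>One possible way to accept such URIs is to try UTF-8 first and
fall back to the locally relevant legacy encodings only when the
octets are not valid UTF-8. The following is a sketch under that
assumption, in Python; the function name decode_segment and the
default fallback list are illustrative only. Because some octet
sequences are valid in more than one encoding, the result is a
guess, and the security considerations on spoofing apply.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def decode_segment(segment, fallbacks=("shift_jis", "euc_jp")):
    """Percent-decode segment and interpret the octets as UTF-8,
    or failing that as the first fallback encoding that fits.
    Returns (text, encoding), or (None, None) if nothing fits."""
    raw = unquote_to_bytes(segment)
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# The same two-character Japanese name, percent-encoded from UTF-8
# and from Shift_JIS respectively.
print(decode_segment("%E6%97%A5%E6%9C%AC"))
print(decode_segment("%93%FA%96%7B"))
```
]]></artwork></figure>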
2157
<t>Third, they may include additional mappings to be more user-friendly
2159and robust against transmission errors. These would be similar to how
2160some servers currently treat URIs as case insensitive or perform
2161additional matching to account for spelling errors. For characters
2162beyond the US-ASCII repertoire, this may, for example, include
2163ignoring the accents on received IRIs or resource names. Please note
2164that such mappings, including case mappings, are language
2165dependent.</t>
2166
2167<t>It can be difficult to identify a resource unambiguously if too
2168many mappings are taken into consideration. However, percent-encoded
2169and not percent-encoded parts of IRIs can always be clearly distinguished.
2170Also, the regularity of UTF-8 (see <xref target="Duerst97"/>) makes the
2171potential for collisions lower than it may seem at first.</t>
2172</section>
2173
2174<section title="Upgrading Strategy">
2175<t>Where this recommendation places further constraints on software
2176for which many instances are already deployed, it is important to
2177introduce upgrades carefully and to be aware of the various
2178interdependencies.</t>
2179
2180<t>If IRIs cannot be interpreted correctly, they should not be created,
2181generated, or transported. This suggests that upgrading URI interpreting
2182software to accept IRIs should have highest priority.</t>
2183
<t>On the other hand, a single IRI is interpreted only by one or
very few interpreters that are known in advance, although it may be
2186entered and transported very widely.</t>
2187
2188<t>Therefore, IRIs benefit most from a broad upgrade of software to be
2189able to enter and transport IRIs. However, before an
2190individual IRI is published, care should be taken to upgrade the corresponding
2191interpreting software in order to cover the forms expected to be
2192received by various versions of entry and transport software.</t>
2193
2194<t>The upgrade of generating software to generate IRIs instead of using a
2195local character encoding should happen only after the service is upgraded
2196to accept IRIs. Similarly, IRIs should only be generated when the service
accepts IRIs and the intervening infrastructure and protocol are known
2198to transport them safely.</t>
2199
2200<t>Software converting from URIs to IRIs for display should be upgraded
2201only after upgraded entry software has been widely deployed to the
2202population that will see the displayed result.</t>
2203
2204
2205<t>Where there is a free choice of character encodings, it is often
2206possible to reduce the effort and dependencies for upgrading to IRIs
2207by using UTF-8 rather than another encoding. For example, when a new
2208file-based Web server is set up, using UTF-8 as the character encoding
2209for file names will make the transition to IRIs easier. Likewise, when
2210a new Web form is set up using UTF-8 as the character encoding of the
2211form page, the returned query URIs will use UTF-8 as the character
2212encoding (unless the user, for whatever reason, changes the character
2213encoding) and will therefore be compatible with IRIs.</t>
2214
2215
2216<t>These recommendations, when taken together, will allow for the
2217extension from URIs to IRIs in order to handle characters other than
2218US-ASCII while minimizing interoperability problems. For
2219considerations regarding the upgrade of URI scheme definitions, see
2220<xref target="UTF8use"/>.</t>
2221
2222</section>
2223</section> <!-- guidelines -->
2224
2225<section title="IANA Considerations" anchor="iana">
2226
<t>RFC Editor and IANA note: Please replace RFC XXXX with the
number of this document when it is issued as an RFC. </t>

<t>IANA maintains a registry of "URI schemes". A "URI scheme" also
serves as an "IRI scheme". </t>
2232
2233<t>To clarify that the URI scheme registration process also applies to
2234IRIs, change the description of the "URI schemes" registry
2235header to say "[RFC4395] defines an IANA-maintained registry of URI
2236Schemes. These registries include the Permanent and Provisional URI
2237Schemes.  RFC XXXX updates this registry to designate that schemes may
also indicate their usability as IRI schemes."</t>
2239
2240<t> Update "per RFC 4395" to "per RFC 4395 and RFC XXXX".
2241</t>
2242
2243</section> <!-- IANA -->
2244   
2245<section title="Security Considerations" anchor="security">
2246<t>The security considerations discussed in <xref target="RFC3986"/>
2247also apply to IRIs. In addition, the following issues require
2248particular care for IRIs.</t>
2249<t>Incorrect encoding or decoding can lead to security problems.
2250In particular, some UTF-8 decoders do not check against overlong
2251byte sequences. As an example, a "/" is encoded with the byte 0x2F
2252both in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
2253interpret the sequence 0xC0 0xAF as a "/". A sequence such as "%C0%AF.."
2254may pass some security tests and then be interpreted
2255as "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
2256and checking are not done in the right order, and/or if reserved
2257characters and unreserved characters are not clearly distinguished.</t>
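
<t>The ordering issue can be illustrated with a short sketch in
Python. It assumes a strict UTF-8 decoder (Python's built-in decoder
rejects overlong sequences) and splits the path on "/" before
percent-decoding, so that an escaped delimiter is never confused
with a real one. The function name decode_path_segments is
hypothetical, and the checks shown are examples rather than a
complete set of security tests.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def decode_path_segments(encoded_path):
    """Split first, then decode each segment strictly, then check."""
    segments = []
    for encoded in encoded_path.split("/"):
        raw = unquote_to_bytes(encoded)
        # Strict decoding raises UnicodeDecodeError for invalid
        # input such as the overlong sequence 0xC0 0xAF.
        text = raw.decode("utf-8", errors="strict")
        # Checks run on the decoded text, after conversion.
        if text in (".", "..") or "/" in text:
            raise ValueError("suspicious path segment: %r" % text)
        segments.append(text)
    return segments


print(decode_path_segments("/docs/r%C3%A9sum%C3%A9.html"))
```
]]></artwork></figure>

<t>With this ordering, a path containing "%C0%AF.." fails at the
decoding step instead of being interpreted as "/..".</t>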
2258
2259<t>There are various ways in which "spoofing" can occur with IRIs.
2260"Spoofing" means that somebody may add a resource name that looks the
2261same or similar to the user, but that points to a different resource.
2262The added resource may pretend to be the real resource by looking
2263very similar but may contain all kinds of changes that may be
2264difficult to spot and that can cause all kinds of problems.
2265Most spoofing possibilities for IRIs are extensions of those for URIs.</t>
2266
<t>Spoofing can occur for various reasons. First, a user's normalization expectations or actual normalization
when entering an IRI or transcoding an IRI from a legacy character
encoding may not match the normalization used on the
server side. Conceptually, this is no different from the problems
2271surrounding the use of case-insensitive web servers. For example,
2272a popular web page with a mixed-case name ("http://big.example.com/PopularPage.html")
2273might be "spoofed" by someone who is able to create "http://big.example.com/popularpage.html".
2274However, the use of unnormalized character sequences, and of additional
2275mappings for user convenience, may increase the chance for spoofing.
2276Protocols and servers that allow the creation of resources with
2277names that are not normalized are particularly vulnerable to such
2278attacks. This is an inherent
2279security problem of the relevant protocol, server, or resource
2280and is not specific to IRIs, but it is mentioned here for completeness.</t>
2281
2282<t>Spoofing can occur in various IRI components, such as the
2283domain name part or a path part. For considerations specific
2284to the domain name part, see <xref target="RFC3491"/>.
2285For the path part, administrators of sites that allow independent
2286users to create resources in the same sub area may have to be careful
2287to check for spoofing.</t>
2288
2289<t>Spoofing can occur because in the UCS many characters look very similar. Details are discussed in <xref target="selection"/>.
Again, this is very similar to spoofing possibilities in US-ASCII,
2291e.g., using "br0ken" or "1ame" URIs.</t>
2292
<t>Spoofing can occur when URIs with percent-encodings based on various
character encodings are accepted to deal with older user agents. In some
cases, particularly for Latin-based resource names, this is easy to
detect because UTF-8-encoded names, when interpreted and viewed as
legacy character encodings, produce mostly garbage.</t>

<t>When concurrently used character encodings have a similar structure but there
are no characters that have exactly the same encoding, detection is more
difficult.</t>
2301
2302<t>Spoofing can occur with bidirectional IRIs, if the restrictions
2303in <xref target="bidi-structure"/> are not followed. The same visual
2304representation may be interpreted as different logical representations,
2305and vice versa. It is also very important that a correct Unicode bidirectional
implementation be used.</t>

<t>The use of Legacy Extended IRIs introduces additional security issues.</t>
2307</section><!-- security -->
2308
2309<section title="Acknowledgements">
<t>For contributions to this update, we would like to thank Ian Hickson, Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin, Henry S. Thompson, and the XML Core Working Group of the W3C.</t>
2311
2312<t>The discussion on the issue addressed here started a long time
2313ago. There was a thread in the HTML working
2314group in August 1995 (under the topic of "Globalizing URIs") and in the
2315www-international mailing list in July 1996 (under the topic of
2316"Internationalization and URLs"), and there were ad-hoc meetings at the Unicode
2317conferences in September 1995 and September 1997.</t>
2318
2319<t>For contributions to the previous version of this document, RFC 3987, many thanks go to
2320Francois Yergeau, Matitiahu Allouche,
2321Roy Fielding, Tim Berners-Lee, Mark Davis,
2322M.T. Carrasco Benitez, James Clark, Tim Bray, Chris Wendt, Yaron Goland,
2323Andrea Vine, Misha Wolf, Leslie Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman,
2324Russ Housley, Makoto MURATA, Steven Atkin,
2325Ryan Stansifer, Tex Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs,
2326Adam Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown,
2327Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio,
2328Chris Haynes, Walter Underwood, and many others.</t>
<t>A definition of HyperText Reference was initially produced by Ian Hickson,
and further edited by Dan Connolly and C. M. Sperberg-McQueen.</t>
2331<t>Thanks to the Internationalization Working
2332Group (I18N WG) of the World Wide Web Consortium (W3C),
2333and the members of the W3C
2334I18N Working Group and Interest Group for their contributions and their
2335work on <xref target="CharMod"/>. Thanks also go
2336to the members of many other W3C Working Groups for adopting IRIs, and to
2337the members of the Montreal IAB Workshop on Internationalization and
2338Localization for their review.</t>
2339</section>
2340
2341
2342<section title="Change Log">
2343
2344<t>Note to RFC Editor: Please completely remove this section before publication.</t>
2345
2346<section title='Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00'>
2347     <t>Changed draft name, date, last paragraph of abstract, and titles in change log, and added this section
2348     in moving from draft-duerst-iri-bis-07 (personal submission) to draft-ietf-iri-3987bis-00 (WG document).</t>
2349</section>
2350
2351<section title="Changes from -06 to -07 of draft-duerst-iri-bis" anchor="forkChanges"><t>
2352
2353Major restructuring of IRI processing model to make scheme-specific translation necessary to handle IDNA requirements and for consistency with web implementations. </t>
2354<t>Starting with IRI, you want one of:
2355<list style="hanging">
2356<t hangText="a"> IRI components (IRI parsed into UTF8 pieces)</t>
2357<t hangText="b"> URI components (URI parsed into ASCII pieces, encoded correctly) </t>
2358<t hangText="c"> whole URI  (for passing on to some other system that wants whole URIs) </t>
2359</list></t>
2360
2361<section title="OLD WAY">
2362<t><list style="numbers">
2363
2364 <t>Pct-encoding on the whole thing to a URI.
2365 (c1) If you want a (maybe broken) whole URI, you might
2366        stop here.</t>
2367
2368 <t>Parsing the URI into URI components.
2369   (b1) If you want (maybe broken) URI components, stop here.</t>
2370
2371 <t> Decode the components (undoing the pct-encoding).
2372   (a) if you want IRI components, stop here.</t>
2373
 <t> Reencode: either using a different encoding for some components
2375   (for domain names, and query components in web pages, which
2376   depends on the component, scheme and context), and otherwise
2377   using pct-encoding.
2378   (b2) if you want (good) URI components, stop here.</t>
2379
2380 <t> reassemble the reencoded components.
2381   (c2) if you want a (*good*) whole URI stop here.</t>
2382</list>
2383
2384</t>
2385
2386</section>
2387
2388<section title="NEW WAY">
2389<t>
2390<list style="numbers">
2391
2392<t> Parse the IRI into IRI components using the generic syntax.
2393   (a) if you want IRI components, stop here.</t>
2394
<t> Encode each component, using pct-encoding, IDN encoding, or
         special query part encoding depending on the component,
         scheme, or context. (b) If you want URI components, stop here.</t>
<t> Reassemble a whole URI from URI components.
   (c) if you want a whole URI stop here.</t>
2400</list></t>
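
<t>The three steps above can be sketched as follows, assuming Python
and its standard urllib.parse module. This is an illustration, not
the normative mapping: the function name iri_to_uri and the two
"safe" character sets are assumptions, Python's "idna" codec
implements IDNA2003, the query is simply encoded as UTF-8 rather
than with any context-dependent encoding, and userinfo and IP
literals are not handled.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in paths and queries. "%" is included so
# that existing percent-escapes are not encoded a second time.
PATH_SAFE = "/:@!$&'()*+,;=%"
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri):
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component with the encoding that suits it.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")  # IDN encoding
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)
    path = quote(parts.path, safe=PATH_SAFE)        # pct-encoding
    query = quote(parts.query, safe=QUERY_SAFE)     # UTF-8 assumed
    fragment = quote(parts.fragment, safe=QUERY_SAFE)

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.example/r\u00e9sum\u00e9.html"))
```
]]></artwork></figure>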
2401</section>
2402</section>
2403
2404<section title='Changes from -00 to -01'><t><list style="symbols">
2405  <t>Removed 'mailto:' before mail addresses of authors.</t>
2406  <t>Added "&lt;to be done&gt;" as right side of 'href-strip' rule. Fixed '|' to '/' for
2407    alternatives.</t>
2408</list></t>
2409</section>
2410
2411<section title="Changes from -05 to -06 of draft-duerst-iri-bis-00"><t><list style="symbols">
2412<t>Add HyperText Reference, change abstract, acks and references for it</t>
2413<t>Add Masinter back as another editor.</t>
2414<t>Masinter integrates HRef material from HTML5 spec.</t>
2415<t>Rewrite introduction sections to modernize.</t>
2416</list></t>
2417</section>
2418
2419<section title="Changes from -04 to -05 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated references.</t><t>Changed IPR text to pre5378Trust200902.</t></list></t>
2420</section>
2421
2422<section title="Changes from -03 to -04 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added explicit abbreviation for LEIRIs.</t><t>Mentioned LEIRI references.</t><t>Completed text in LEIRI section about tag characters and about specials.</t></list></t>
2423</section>
2424
<section title="Changes from -02 to -03 of draft-duerst-iri-bis"><t><list style="symbols"><t>Updated some references.</t><t>Updated Michel Suignard's coordinates.</t></list></t>
2426</section>
2427
2428<section title="Changes from -01 to -02 of draft-duerst-iri-bis"><t><list style="symbols"><t>Added tag range to iprivate (issue private-include-tags-115).</t><t>Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.</t></list></t>
2429</section>
<section title="Changes from -00 to -01 of draft-duerst-iri-bis"><t><list style="symbols"><t>Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" based on input from the W3C XML Core WG. Moved the relevant subsections to the back and promoted them to a section.</t><t>Added some text re. Legacy Extended IRIs to the security section.</t><t>Added an IANA Considerations Section.</t><t>Added this Change Log Section.</t><t>Added a section about "IRIs with Spaces/Controls" (converting from a Note in RFC 3987).</t></list></t>
2431</section>
2432<section title="Changes from RFC 3987 to -00 of draft-duerst-iri-bis"><t><list><t>Fixed errata (see http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).</t></list></t>
2433</section>
2434</section>
2435</middle>
2436
2437<back>
2438<references title="Normative References">
2439
2440<reference anchor="ASCII">
2441<front>
2442<title>Coded Character Set -- 7-bit American Standard Code for Information
2443Interchange</title>
2444<author>
2445<organization>American National Standards Institute</organization>
2446</author>
2447<date year="1986"/>
2448</front>
2449<seriesInfo name="ANSI" value="X3.4"/>
2450</reference>
2451
2452<reference anchor="ISO10646">
2453<front>
2454<title>ISO/IEC 10646:2003: Information Technology -
2455Universal Multiple-Octet Coded Character Set (UCS)</title>
2456<author>
2457<organization>International Organization for Standardization</organization>
2458</author>
2459<date month="December" year="2003"/>
2460</front>
2461<seriesInfo name="ISO" value="Standard 10646"/>
2462</reference>
2463
2464&rfc2119;
2465&rfc3490;
2466&rfc3491;
2467&rfc3629;
2468&rfc3986;
2469
2470<reference anchor="STD68">
2471<front>
2472<title abbrev="ABNF">Augmented BNF for Syntax Specifications: ABNF</title>
2473<author initials="D." surname="Crocker" fullname="Dave Crocker"><organization/></author>
2474<author initials="P." surname="Overell" fullname="Paul Overell"><organization/></author>
2475<date month="January" year="2008"/></front>
2476<seriesInfo name="STD" value="68"/><seriesInfo name="RFC" value="5234"/>
2477</reference>
2478 
2479&rfc5890;
2480&rfc5891;
2481
2482<reference anchor="UNIV4">
2483<front>
2484<title>The Unicode Standard, Version 5.1.0, defined by: The Unicode Standard,
2485Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0),
as amended by Unicode 5.1.0 (http://www.unicode.org/versions/Unicode5.1.0/)</title>
2487<author><organization>The Unicode Consortium</organization></author>
2488<date year="2008" month="April"/>
2489</front>
2490</reference>
2491
2492<reference anchor="UNI9" target="http://www.unicode.org/reports/tr9/tr9-13.html">
2493<front>
2494<title>The Bidirectional Algorithm</title>
2495<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2496<date year="2004" month="March"/>
2497</front>
2498<seriesInfo name="Unicode Standard Annex" value="#9"/>
2499</reference>
2500
2501<reference anchor="UTR15" target="http://www.unicode.org/unicode/reports/tr15/tr15-23.html">
2502<front>
2503<title>Unicode Normalization Forms</title>
2504<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
2505<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
2506<date year="2008" month="March"/>
2507</front>
2508<seriesInfo name="Unicode Standard Annex" value="#15"/>
2509</reference>
2510
2511</references>
2512
<references title="Informative References">

<reference anchor="BidiEx" target="http://www.w3.org/International/iri-edit/BidiExamples">
<front>
<title>Examples of bidirectional IRIs</title>
<author><organization/></author>
<date year="" month=""/>
</front>
</reference>

<reference anchor="CharMod" target="http://www.w3.org/TR/charmod-resid">
<front>
<title>Character Model for the World Wide Web: Resource Identifiers</title>
<author initials="M." surname="Duerst" fullname="Martin Duerst"><organization/></author>
<author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
<author initials="R." surname="Ishida" fullname="Richard Ishida"><organization/></author>
<author initials="M." surname="Wolf" fullname="Misha Wolf"><organization/></author>
<author initials="T." surname="Texin" fullname="Tex Texin"><organization/></author>
<date year="2004" month="November" day="25"/>
</front>
<seriesInfo name="World Wide Web Consortium" value="Candidate Recommendation"/>
</reference>

<reference anchor="Duerst97" target="http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf">
<front>
<title>The Properties and Promises of UTF-8</title>
<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
<date year="1997" month="September"/>
</front>
<seriesInfo name="Proc. 11th International Unicode Conference, San Jose" value=""/>
</reference>

<reference anchor="Gettys" target="http://www.w3.org/DesignIssues/ModelConsequences">
<front>
<title>URI Model Consequences</title>
<author initials="J." surname="Gettys" fullname="Jim Gettys"><organization/></author>
<date month="" year=""/>
</front>
</reference>

<reference anchor="HTML4" target="http://www.w3.org/TR/html401/appendix/notes.html#h-B.2">
<front>
<title>HTML 4.01 Specification</title>
<author initials="D." surname="Raggett" fullname="Dave Raggett"><organization/></author>
<author initials="A." surname="Le Hors" fullname="Arnaud Le Hors"><organization/></author>
<author initials="I." surname="Jacobs" fullname="Ian Jacobs"><organization/></author>
<date year="1999" month="December" day="24"/>
</front>
<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
</reference>

<reference anchor="LEIRI" target="http://www.w3.org/TR/leiri/">
<front>
<title>Legacy extended IRIs for XML resource identification</title>
<author initials="H." surname="Thompson" fullname="Henry Thompson"><organization/></author>
<author initials="R." surname="Tobin" fullname="Richard Tobin"><organization/></author>
<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
<date year="2008" month="November" day="3"/>
</front>
<seriesInfo name="World Wide Web Consortium" value="Note"/>
</reference>

&rfc2045;
&rfc2130;
&rfc2141;
&rfc2192;
&rfc2277;
&rfc2368;
&rfc2384;
&rfc2396;
&rfc2397;
&rfc2616;
&rfc1738;
&rfc2640;
<reference anchor='RFC4395bis'>
  <front>
    <title>Guidelines and Registration Procedures for New URI/IRI Schemes</title>
    <author initials='T.' surname='Hansen' fullname="Tony Hansen"><organization/></author>
    <author initials='T.' surname='Hardie' fullname="Ted Hardie"><organization/></author>
    <author initials='L.' surname='Masinter' fullname="Larry Masinter"><organization/></author>
    <date year="2010" month='September' day="30"/>
    <workgroup>IRI</workgroup>
  </front>
  <seriesInfo name="Internet-Draft" value="draft-hansen-iri-4395bis-irireg-00"/>
</reference>

<reference anchor="UNIXML" target="http://www.w3.org/TR/unicode-xml/">
<front>
<title>Unicode in XML and other Markup Languages</title>
<author initials="M.J." surname="Duerst" fullname="Martin Duerst"><organization/></author>
<author initials="A." surname="Freytag" fullname="Asmus Freytag"><organization/></author>
<date year="2003" month="June" day="18"/>
</front>
<seriesInfo name="Unicode Technical Report" value="#20"/>
<seriesInfo name="World Wide Web Consortium" value="Note"/>
</reference>

<reference anchor="UTR36" target="http://unicode.org/reports/tr36/">
<front>
<title>Unicode Security Considerations</title>
<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
<author initials="M." surname="Suignard" fullname="Michel Suignard"><organization/></author>
<date year="2010" month="August" day="4"/>
</front>
<seriesInfo name="Unicode Technical Report" value="#36"/>
</reference>

<reference anchor="XLink" target="http://www.w3.org/TR/xlink/#link-locators">
<front>
<title>XML Linking Language (XLink) Version 1.0</title>
<author initials="S." surname="DeRose" fullname="Steve DeRose"><organization/></author>
<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
<author initials="D." surname="Orchard" fullname="David Orchard"><organization/></author>
<date year="2001" month="June" day="27"/>
</front>
<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
</reference>

<reference anchor="XML1" target="http://www.w3.org/TR/REC-xml">
  <front>
    <title>Extensible Markup Language (XML) 1.0 (Fourth Edition)</title>
    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
    <author initials="J." surname="Paoli" fullname="Jean Paoli"><organization/></author>
    <author initials="C.M." surname="Sperberg-McQueen" fullname="C. M. Sperberg-McQueen">
      <organization/></author>
    <author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
    <author initials="F." surname="Yergeau" fullname="Francois Yergeau"><organization/></author>
    <date day="16" month="August" year="2006"/>
  </front>
  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
</reference>

<reference anchor="XMLNamespace" target="http://www.w3.org/TR/REC-xml-names">
  <front>
    <title>Namespaces in XML (Second Edition)</title>
    <author initials="T." surname="Bray" fullname="Tim Bray"><organization/></author>
    <author initials="D." surname="Hollander" fullname="Dave Hollander"><organization/></author>
    <author initials="A." surname="Layman" fullname="Andrew Layman"><organization/></author>
    <author initials="R." surname="Tobin" fullname="Richard Tobin"><organization/></author>
    <date day="16" month="August" year="2006"/>
  </front>
  <seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
</reference>

<reference anchor="XMLSchema" target="http://www.w3.org/TR/xmlschema-2/#anyURI">
<front>
<title>XML Schema Part 2: Datatypes</title>
<author initials="P." surname="Biron" fullname="Paul Biron"><organization/></author>
<author initials="A." surname="Malhotra" fullname="Ashok Malhotra"><organization/></author>
<date year="2001" month="May" day="2"/>
</front>
<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
</reference>

<reference anchor="XPointer" target="http://www.w3.org/TR/xptr-framework/#escaping">
<front>
<title>XPointer Framework</title>
<author initials="P." surname="Grosso" fullname="Paul Grosso"><organization/></author>
<author initials="E." surname="Maler" fullname="Eve Maler"><organization/></author>
<author initials="J." surname="Marsh" fullname="Jonathan Marsh"><organization/></author>
<author initials="N." surname="Walsh" fullname="Norman Walsh"><organization/></author>
<date year="2003" month="March" day="25"/>
</front>
<seriesInfo name="World Wide Web Consortium" value="Recommendation"/>
</reference>

<reference anchor="HTML5" target="http://www.w3.org/TR/2009/WD-html5-20090423/">
<front>
<title>A vocabulary and associated APIs for HTML and XHTML</title>
<author initials="I." surname="Hickson" fullname="Ian Hickson"><organization>Google, Inc.</organization></author>
<author initials="D." surname="Hyatt" fullname="David Hyatt"><organization>Apple, Inc.</organization></author>
<date year="2009" month="April" day="23"/>
</front>
<seriesInfo name="World Wide Web Consortium" value="Working Draft"/>
</reference>

</references>

<section title="Design Alternatives">
<t>This section briefly summarizes some design alternatives
considered earlier and the reasons why they were not chosen.</t>
<section title="New Scheme(s)">
<t>Introducing new schemes (for example, httpi:, ftpi:,...) or a
new metascheme (e.g., i:, leading to URI/IRI prefixes such as
i:http:, i:ftp:,...) was proposed to make IRI-to-URI conversion
scheme-dependent or to distinguish between percent-encodings
resulting from IRI-to-URI conversion and percent-encodings from
legacy character encodings.</t>

<t>New schemes are not needed to distinguish URIs from true IRIs (i.e.,
  IRIs that contain non-ASCII characters). The benefit of being able
  to detect the origin of percent-encodings is marginal, as UTF-8
  can be detected with very high reliability. Deploying new schemes is
  extremely hard, so not requiring new schemes for IRIs makes
  deployment of IRIs vastly easier. Making conversion scheme-dependent
  is highly inadvisable, and separate schemes for IRIs would encourage it.
  Using a uniform convention for conversion from IRIs to URIs makes
  IRI implementation orthogonal to the introduction of actual new
  schemes.</t>
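<t>The claim that UTF-8 can be detected reliably can be illustrated with a
  minimal sketch. It assumes Python and its standard urllib.parse module;
  the function name decodes_as_utf8 and the sample strings are
  illustrative assumptions, not part of this specification.</t>
<figure><artwork><![CDATA[
```python
from urllib.parse import unquote_to_bytes


def decodes_as_utf8(percent_encoded: str) -> bool:
    """Report whether the percent-decoded octets are well-formed UTF-8.

    Pure US-ASCII input is trivially well-formed, so it is reported as
    True as well; the check is only informative for non-ASCII octets.
    """
    octets = unquote_to_bytes(percent_encoded)
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# U+00E9 encoded as UTF-8 (C3 A9) versus as iso-8859-1 (E9).
print(decodes_as_utf8("r%C3%A9sum%C3%A9"))  # UTF-8 octets
print(decodes_as_utf8("r%E9sum%E9"))        # legacy octets
```
]]></artwork></figure>
<t>In the second sample, the octet E9 would have to start a multi-octet
  UTF-8 sequence but is followed by an ASCII letter, so strict decoding
  fails. Legacy-encoded text rarely happens to form valid UTF-8 sequences,
  which is why such a check can stand in for a separate scheme.</t>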
</section>
<section title="Character Encodings Other Than UTF-8">
<t>At an early stage, UTF-7 was considered as an alternative to
UTF-8 when IRIs are converted to URIs. UTF-7 would not have needed
percent-encoding and in most cases would have been shorter than
percent-encoded UTF-8.</t>
<t>Using UTF-8 avoids a double layering and overloading of the use of
   the "+" character. UTF-8 is fully compatible with US-ASCII and has
   therefore been recommended by the IETF, and is being used widely.</t>

  <t>UTF-7 has never been used much and is now clearly being
   discouraged. Requiring implementations to convert from UTF-8
   to UTF-7 and back would be an additional implementation burden.</t>
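<t>The trade-off between the two encodings can be made concrete with a
   small sketch. It assumes Python, whose standard library provides a
   "utf-7" codec and urllib.parse.quote; the helper name compare_encodings
   and the sample strings are illustrative assumptions.</t>
<figure><artwork><![CDATA[
```python
from typing import Tuple
from urllib.parse import quote


def compare_encodings(text: str) -> Tuple[str, str]:
    """Return (UTF-7 form, percent-encoded UTF-8 form) of the same text."""
    utf7_form = text.encode("utf-7").decode("ascii")
    utf8_form = quote(text, safe="")  # quote() uses UTF-8 by default
    return utf7_form, utf8_form


# One sample with a non-ASCII letter, one with a literal "+".
for sample in ("r\u00e9sum\u00e9", "a+b"):
    utf7_form, utf8_form = compare_encodings(sample)
    print(len(utf7_form), utf7_form, len(utf8_form), utf8_form)
```
]]></artwork></figure>
<t>The UTF-7 form of the first sample is shorter, but it uses "+" as a
   shift character, so a literal "+" in the second sample has to be
   escaped as well. This is the overloading of "+" referred to above.</t>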
</section> <!-- notutf8 -->
<section title="New Encoding Convention">
<t>Instead of using the existing percent-encoding convention
of URIs, which is based on octets, the idea was to create a new
encoding convention; for example, to use "%u" to introduce
UCS code points.</t>
<t>Using the existing octet-based percent-encoding mechanism
requires neither an upgrade of the URI syntax nor
corresponding server upgrades.</t>
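<t>The difference between the two conventions can be sketched as follows,
assuming Python and its standard urllib.parse module. The function
percent_u_encode is a hypothetical rendering of the rejected "%u"
proposal, written only for comparison; its four-digit format is an
assumption.</t>
<figure><artwork><![CDATA[
```python
from urllib.parse import quote, unquote

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def percent_u_encode(text: str) -> str:
    """Hypothetical "%u" convention: one "%uXXXX" escape per code point."""
    return "".join(
        ch if ch in UNRESERVED else "%u{:04X}".format(ord(ch)) for ch in text
    )


sample = "r\u00e9sum\u00e9"
octet_based = quote(sample, safe="")         # UTF-8 octets, then %HH
code_point_based = percent_u_encode(sample)  # hypothetical %uXXXX form

# An existing octet-based decoder recovers the text from the first form
# but leaves the "%u" escapes untouched, because "u0" is not a hex pair.
print(unquote(octet_based) == sample)
print(unquote(code_point_based) == code_point_based)
```
]]></artwork></figure>
<t>The octet-based form passes through an unmodified decoder, whereas the
"%u" form would require every such decoder to be upgraded.</t>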
</section> <!-- new encoding -->
<section title="Indicating Character Encodings in the URI/IRI">
<t>Some proposals suggested indicating the character encodings used
in a URI or IRI with some new syntactic convention in the URI itself,
similar to the "charset" parameter for e-mails and Web pages.
As an example, the label in square brackets in
"http://www.example.org/ros[iso-8859-1]&amp;#xE9;" indicated that
the following "&amp;#xE9;" had to be interpreted as iso-8859-1.</t>
<t>If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed.
Exclusive use of UTF-8 also avoids potentially multiple labels that would
have to be copied correctly in all cases, even on the side of a bus or on
a napkin; such labels would lead to usability problems (and would be
prohibitively annoying).
Exclusively using UTF-8 also reduces transcoding errors and confusion.</t>
</section> <!-- indicating -->
</section>
</back>
</rfc>