<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY rfc3490 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3490.xml">
<!ENTITY rfc3491 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3491.xml">
<!ENTITY rfc3987 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3987.xml">
]>
<?rfc strict='yes'?>

<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc symrefs='yes'?>
<?rfc sortrefs='yes'?>
<?rfc iprnotified="no" ?>
<?rfc toc='yes'?>
<?rfc compact='yes'?>
<?rfc subcompact='no'?>
<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-bidi-guidelines-00" category="bcp" xml:lang="en">
<front>
<title abbrev="Bidi IRI Guidelines">Guidelines for Internationalized Resource Identifiers with Bi-directional Characters (Bidi IRIs)</title>

  <author initials="M.J." surname="Duerst" fullname='Martin Duerst'>
    <!-- (Note: Please write "Duerst" with u-umlaut wherever
      possible, for example as "D&#252;rst" in XML and HTML) -->
  <organization abbrev="Aoyama Gakuin University">Aoyama Gakuin University</organization>
  <address>
  <postal>
  <street>5-10-1 Fuchinobe</street>
  <city>Sagamihara</city>
  <region>Kanagawa</region>
  <code>229-8558</code>
  <country>Japan</country>
  </postal>
  <phone>+81 42 759 6329</phone>
  <facsimile>+81 42 759 6495</facsimile>
  <email>duerst@it.aoyama.ac.jp</email>
  <uri>http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/<!-- (Note: This is the percent-encoded form of an IRI)--></uri>
  </address>
</author>

<author initials="L." surname="Masinter" fullname="Larry Masinter">
   <organization>Adobe</organization>
   <address>
   <postal>
   <street>345 Park Ave</street>
   <city>San Jose</city>
   <region>CA</region>
   <code>95110</code>
   <country>U.S.A.</country>
   </postal>
   <phone>+1-408-536-3024</phone>
   <email>masinter@adobe.com</email>
   <uri>http://larry.masinter.net</uri>
   </address>
</author>

<author initials="A." surname="Allawi" fullname="Adil Allawi">
  <organization>Diwan Software Limited</organization>
</author>

<date year="2011" month="August" day="14" />

<area>Applications</area>
<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
<keyword>IRI</keyword>
<keyword>Internationalized Resource Identifier</keyword>
<keyword>BIDI</keyword>
<keyword>URI</keyword>
<keyword>URL</keyword>
<keyword>IDN</keyword>

<abstract>

<t>This specification gives guidelines for the selection, use, and
presentation of Internationalized Resource Identifiers (IRIs) that
include characters with an inherent right-to-left (rtl) writing
direction.</t>
</abstract>

</front>
<middle>

<section title="Introduction">

<t>Some UCS characters, such as those used in the Arabic and Hebrew
scripts, have an inherent right-to-left (rtl) writing direction. IRIs
containing these characters (called bidirectional IRIs or Bidi IRIs)
require additional attention because of the non-trivial relationship
between logical representation (used for digital representation and
for reading/spelling) and visual representation (used for
display/printing).</t>

<t>Because of the complex interaction between the logical representation,
the visual representation, and the syntax of a Bidi IRI, a balance is
needed between various requirements.
The main requirements are<list style="hanging">
<t hangText="1.">user-predictable conversion between visual and
    logical representation;</t>
<t hangText="2.">the ability to include a wide range of characters
    in various parts of the IRI; and</t>
<t hangText="3.">minor or no changes or restrictions for
      implementations.</t>
</list></t>
<section title="Notation">

<t>In this document, Bidi Notation is used for bidirectional examples:
lowercase letters stand for Latin letters or other letters that are
written left to right, whereas uppercase letters represent Arabic or
Hebrew letters that are written right to left.</t>

<t>In this document, the key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" are to be interpreted as described in <xref
target="RFC2119"/>.</t>

</section> <!-- Notation -->

</section> <!-- Introduction -->

<section title="Logical Storage and Visual Presentation" anchor="visual">

<t>When stored or transmitted in digital representation, bidirectional
IRIs MUST be in full logical order and MUST conform to the IRI syntax
rules (which include the rules relevant to their scheme). This
ensures that bidirectional IRIs can be processed in the same way as
other IRIs.</t>

<t>Bidirectional IRIs MUST be rendered by using the
Unicode Bidirectional Algorithm <xref target="UNIV6"/>, <xref
target="UNI9"/>.  Bidirectional IRIs MUST be rendered in the same way
as they would be if they were in a left-to-right embedding; i.e., as
if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and
followed by U+202C, POP DIRECTIONAL FORMATTING (PDF).  Setting the
embedding direction can also be done in a higher-level protocol (e.g.,
the dir='ltr' attribute in HTML).</t>

<t>There is no requirement to use the above embedding if the display
is still the same without the embedding. For example, a bidirectional
IRI in a text with left-to-right base directionality (such as used for
English or Cyrillic) that is preceded and followed by whitespace and
strong left-to-right characters does not need an embedding.  Also, a
bidirectional relative IRI reference that only contains strong
right-to-left characters and weak characters and that starts and ends
with a strong right-to-left character and appears in a text with
right-to-left base directionality (such as used for Arabic or Hebrew)
and is preceded and followed by whitespace and strong characters does
not need an embedding.</t>

<t>In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
sufficient to force the correct display behavior.  However, the
details of the Unicode Bidirectional Algorithm are not always easy to
understand. Implementers are strongly advised to err on the side of
caution and to use embedding in all cases where they are not
completely sure that the display behavior is unaffected without the
embedding.</t>

<t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
4.3) permits higher-level protocols to influence bidirectional
rendering. Such changes by higher-level protocols MUST NOT be used if
they change the rendering of IRIs.</t>

<t>The bidirectional formatting characters that may be used before or
after the IRI to ensure correct display are not themselves part of the
IRI.  IRIs MUST NOT contain bidirectional formatting characters (LRM,
RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
the IRI but do not appear themselves. It would therefore not be
possible to input an IRI with such characters correctly.</t>
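
<t>As an informal illustration of the embedding and of the prohibition
on formatting characters described in this section, the following
Python sketch wraps an IRI in LRE and PDF for display and rejects IRIs
that contain bidirectional formatting characters. It is not part of
the guidelines themselves; the names wrap_for_display and
contains_bidi_formatting are hypothetical and chosen for this example
only. The sketch always adds the embedding, which follows the advice
above to err on the side of caution rather than trying to detect the
cases where the embedding is unnecessary.</t>

<figure><artwork><![CDATA[
```python
# Illustrative sketch only; function names are assumptions,
# not part of any specification or library.

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_FORMATTING = {
    "\u200E", "\u200F", "\u202A", "\u202B",
    "\u202C", "\u202D", "\u202E",
}


def contains_bidi_formatting(iri):
    """Return True if the IRI contains a bidi formatting character."""
    return any(ch in BIDI_FORMATTING for ch in iri)


def wrap_for_display(iri):
    """Return the IRI surrounded by LRE ... PDF for rendering.

    The formatting characters are added around the IRI for display
    only; they are never stored as part of the IRI itself.
    """
    if contains_bidi_formatting(iri):
        raise ValueError("IRI contains bidi formatting characters")
    return LRE + iri + PDF
```
]]></artwork></figure>

<t>A caller would store and transmit the unwrapped IRI in logical
order and apply wrap_for_display only at the point where the IRI is
handed to a text renderer.</t>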

</section> <!-- visual -->
<section title="Bidi IRI Structure" anchor="bidi-structure">

<t>The Unicode Bidirectional Algorithm is designed mainly for running
text.  To make sure that it does not affect the rendering of
bidirectional IRIs too much, some restrictions on bidirectional IRIs
are necessary. These restrictions are given in terms of delimiters
(structural characters, mostly punctuation such as "@", ".", ":",
and<vspace/>"/") and components (usually consisting mostly of letters
and digits).</t>

<t>The following syntax rules from the ABNF of <xref target="RFC3987bis"/>
correspond to components for the purpose of Bidi behavior: iuserinfo,
ireg-name, isegment, isegment-nz, isegment-nz-nc, iquery, and
ifragment.</t>

<t>Specifications that define the syntax of any of the above
components MAY divide them further and define smaller parts to be
components according to this document. As an example, the restrictions
of <xref target="RFC3490"/> on bidirectional domain names correspond
to treating each label of a domain name as a component for schemes
with ireg-name as a domain name.  Even where the components are not
defined formally, it may be helpful to think about some syntax in
terms of components and to apply the relevant restrictions.  For
example, for the usual name/value syntax in query parts, it is
convenient to treat each name and each value as a component. As
another example, the extensions in a resource name can be treated as
separate components.</t>

<t>For each component, the following restrictions apply:</t>
<t>
<list style="hanging">

<t hangText="1.">A component SHOULD NOT use both right-to-left and
  left-to-right characters.</t>

<t hangText="2.">A component using right-to-left characters SHOULD
  start and end with right-to-left characters.</t>

</list></t>

<t>The above restrictions are given as "SHOULD"s, rather than as
"MUST"s.  For IRIs that are never presented visually, they are not
relevant.  However, for IRIs in general, they are very important to
ensure consistent conversion between visual presentation and logical
representation, in both directions.</t>
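
<t>The two restrictions above can be checked mechanically. The
following Python sketch is one possible way to do so and is not part
of the guidelines; it assumes that the caller has already split the
IRI into components, that right-to-left characters are those with
Unicode bidi class R or AL, and that left-to-right characters are
those with class L. The name component_problems is hypothetical.</t>

<figure><artwork><![CDATA[
```python
import unicodedata

# Assumption for this sketch: "right-to-left characters" are the
# strong classes R and AL; "left-to-right characters" are class L.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def component_problems(component):
    """Return the restrictions that a single component violates.

    An empty list means that both restrictions are satisfied.
    """
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c in LTR_CLASSES for c in classes)

    problems = []
    # Restriction 1: do not mix rtl and ltr characters.
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    # Restriction 2: an rtl component starts and ends with rtl characters.
    if has_rtl and not (
        classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
    ):
        problems.append("does not start and end with right-to-left characters")
    return problems
```
]]></artwork></figure>

<t>Because the restrictions are "SHOULD"s, a caller would typically
treat a non-empty result as a warning rather than as a hard
error.</t>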

<t><list style="hanging">

<t hangText="Note:">In some components, the above restrictions may
  actually be strictly enforced.  For example, <xref
  target="RFC3490"></xref> requires that these restrictions apply to
  the labels of a host name for those schemes where ireg-name is a
  host name.  In some other components (for example, path components)
  following these restrictions may not be too difficult.  For other
  components, such as parts of the query part, it may be very
  difficult to enforce the restrictions because the values of query
  parameters may be arbitrary character sequences.</t>

</list></t>

<t>If the above restrictions cannot be satisfied otherwise, the
affected component can always be mapped to URI notation using the
general percent-encoding of IRI components, as described
in <xref target="RFC3987bis"/>. Please note that the whole component
has to be mapped (see also Example 9 below).</t>
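
<t>As a rough illustration of mapping a whole component to URI
notation, the following Python sketch percent-encodes the UTF-8 form
of every non-ASCII character in a component and leaves ASCII
characters untouched. It is a simplification for this example, not a
complete implementation of the mapping defined in <xref
target="RFC3987bis"/>; the name map_component_to_uri is
hypothetical.</t>

<figure><artwork><![CDATA[
```python
from urllib.parse import quote


def map_component_to_uri(component):
    """Percent-encode every non-ASCII character of one component.

    ASCII characters, including digits and any existing
    percent-escapes, are left as they are.  Simplified sketch only.
    """
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="")
        for ch in component
    )
```
]]></artwork></figure>

<t>After this mapping, the component no longer contains any
right-to-left characters, so it is displayed in logical order.</t>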

</section> <!-- bidi-structure -->

<section title="Input of Bidi IRIs" anchor="bidiInput">

<t>Bidi input methods MUST generate Bidi IRIs in logical order while
rendering them according to <xref target="visual"/>.  During input,
rendering SHOULD be updated after every new character is input to
avoid end-user confusion.</t>

</section> <!-- bidiInput -->

<section title="Examples">

<t>This section gives examples of bidirectional IRIs, in Bidi
Notation.  It shows legal IRIs with the relationship between logical
and visual representation and explains how certain phenomena in this
relationship may look strange to somebody not familiar with
bidirectional behavior, but familiar to users of Arabic and Hebrew. It
also shows what happens if the restrictions given in <xref
target="bidi-structure"/> are not followed. The examples below can be
seen at <xref target="BidiEx"/>, in Arabic, Hebrew, and Bidi Notation
variants.</t>

<t>To read the bidi text in the examples, read the visual
representation from left to right until you encounter a block of rtl
text. Read the rtl block (including slashes and other special
characters) from right to left, then continue at the next unread ltr
character.</t>

<t>Example 1: A single component with rtl characters is inverted:
<vspace/>Logical representation:
"http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
"http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/> Components can be read
one by one, and each component can be read in its natural
direction.</t>

<t>Example 2: More than one consecutive component with rtl characters
is inverted as a whole: <vspace/>Logical representation:
"http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
"http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
components is read rtl, in the same way as a sequence of rtl words is
read rtl in a bidi text.</t>

<t>Example 3: All components of an IRI (except for the scheme) are
rtl.  All rtl components are inverted overall: <vspace/>Logical
representation:
"http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
whole IRI (except the scheme) is read rtl. Delimiters between rtl
components stay between the respective components; delimiters between
ltr and rtl components don't move.</t>

<t>Example 4: Each of several sequences of rtl components is inverted
on its own: <vspace/>Logical representation:
"http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
"http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl
components is read rtl, in the same way as each sequence of rtl words
in an ltr text is read rtl.</t>

<t>Example 5: Example 2, applied to components of different kinds:
<vspace/>Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
<vspace/>Visual representation:
"http://ab.cd.HG/FE/ij/kl.html"<vspace/> The inversion of the domain
name label and the path component may be unexpected, but it is
consistent with other bidi behavior.  For reassurance that the domain
component really is "ab.cd.EF", it may be helpful to read aloud the
visual representation following the bidi algorithm. After
"http://ab.cd." one reads the rtl block "E-F-slash-G-H", which
corresponds to the logical representation.
</t>

<t>Example 6: Same as Example 5, with more rtl components:
<vspace/>Logical representation:
"http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
"http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
name labels and the path components may be easier to identify because
the delimiters also move.</t>

<t>Example 7: A single rtl component includes digits: <vspace/>Logical
representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/>
Numbers are written ltr in all cases but are treated as an additional
embedding inside a run of rtl characters. This is completely
consistent with usual bidirectional text.</t>

<t>Example 8 (not allowed): Numbers are at the start or end of an rtl
component:<vspace/>Logical representation:
"http://ab.cd.ef/GH1/2IJ/KL.html"<vspace/>Visual representation:
"http://ab.cd.ef/LK/JI1/2HG.html"<vspace/> The sequence "1/2" is
interpreted by the bidi algorithm as a fraction, fragmenting the
components and leading to confusion. There are other characters that
are interpreted in a special way close to numbers; in particular, "+",
"-", "#", "$", "%", ",", ".", and ":".</t>

<t>Example 9 (not allowed): The numbers in the previous example are
percent-encoded: <vspace/>Logical representation:
"http://ab.cd.ef/GH%31/%32IJ/KL.html"<vspace/>Visual representation:
"http://ab.cd.ef/LK/JI%32/%31HG.html"<vspace/> Percent-encoding only
the numbers does not help, because the digits of the percent-encoded
sequences still interact with the adjacent rtl characters. The whole
component has to be mapped, as described in <xref
target="bidi-structure"/>.</t>

<t>Example 10 (allowed but not recommended): <vspace/>Logical
representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/>
Components consisting of only numbers are allowed (it would be rather
difficult to prohibit them), but these may interact with adjacent rtl
components in ways that are not easy to predict.</t>

<t>Example 11 (allowed but not recommended): <vspace/>Logical
representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
Components consisting of numbers and left-to-right characters are
allowed, but these may interact with adjacent rtl components in ways
that are not easy to predict.</t>
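
<t>The relationship between logical and visual representation in the
examples above can be reproduced with a much-simplified rendition of
the Unicode Bidirectional Algorithm. The following Python sketch is
not part of the guidelines and is not a conforming implementation: it
assumes ASCII input in Bidi Notation, a left-to-right embedding, and
no explicit formatting characters, and it handles only a subset of the
weak, neutral, and reordering rules. The names bidi_class and
visual_order are hypothetical.</t>

<figure><artwork><![CDATA[
```python
def bidi_class(ch):
    """Bidi class of an ASCII character under Bidi Notation."""
    if ch.isupper():
        return "R"   # stands for an Arabic or Hebrew letter
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "EN"
    if ch in "+-":
        return "ES"
    if ch in "#$%":
        return "ET"
    if ch in ",.:/":
        return "CS"
    return "ON"


def _resolve_runs(t, kind, decide):
    """Replace each maximal run of class `kind` using decide(i, j)."""
    i, n = 0, len(t)
    while i < n:
        if t[i] != kind:
            i += 1
            continue
        j = i
        while j < n and t[j] == kind:
            j += 1
        t[i:j] = [decide(i, j)] * (j - i)
        i = j


def visual_order(logical):
    """Simplified logical-to-visual reordering, ltr embedding."""
    t = [bidi_class(c) for c in logical]
    n = len(t)
    # W4: a single separator between two numbers joins the number.
    for i in range(1, n - 1):
        if t[i] in ("ES", "CS") and t[i - 1] == "EN" and t[i + 1] == "EN":
            t[i] = "EN"
    # W5: terminators adjacent to a number join the number.
    _resolve_runs(t, "ET", lambda i, j: "EN" if (
        (i > 0 and t[i - 1] == "EN") or (j < n and t[j] == "EN")
    ) else "ET")
    # W6: remaining separators and terminators become neutral.
    t = ["ON" if c in ("ES", "ET", "CS") else c for c in t]
    # W7: numbers after a strong L (or at the start) behave as L.
    strong = "L"
    for i, c in enumerate(t):
        if c in ("L", "R"):
            strong = c
        elif c == "EN" and strong == "L":
            t[i] = "L"
    # N1/N2: neutrals take the surrounding direction if both sides
    # agree (numbers count as R), else the embedding direction (L).
    def side(k):
        if k < 0 or k >= n:
            return "L"
        return "R" if t[k] in ("R", "EN") else "L"
    _resolve_runs(t, "ON", lambda i, j: (
        side(i - 1) if side(i - 1) == side(j) else "L"))
    # I1: levels in an ltr embedding.
    levels = [{"L": 0, "R": 1, "EN": 2}[c] for c in t]
    # L2: reverse runs at each level, highest level first.
    chars = list(logical)
    for level in (2, 1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)
```
]]></artwork></figure>

<t>Applying visual_order to the logical representation of one of the
examples above is expected to return the corresponding visual
representation; for real text, a complete implementation of <xref
target="UNI9"/> should be used instead.</t>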
</section><!-- examples -->

<section title="IANA Considerations" anchor="iana">
<t>This document makes no changes to IANA registries.</t>
</section> <!-- IANA -->

<section title="Security Considerations" anchor="security">
<t>Confusion can occur with bidirectional IRIs if the restrictions
in <xref target="bidi-structure"/> are not followed. The same visual
representation may be interpreted as different logical representations,
and vice versa. It is also very important that a correct Unicode bidirectional
implementation be used.</t>
</section><!-- security -->

<section title="Acknowledgements">
<t>This document was derived from <xref target="RFC3987"/> and
<xref target="RFC3987bis"/>, and the acknowledgements of those
documents apply.</t>
</section><!-- acknowledgements -->
</middle>

<back>
<references title="Normative References">

      <reference anchor="RFC3987bis"
         target="http://tools.ietf.org/id/draft-ietf-iri-3987bis">

          <front>
            <title>Internationalized Resource Identifiers (IRIs)</title>
          <author initials="M." surname="Duerst"/>
          <author initials="L." surname="Masinter" fullname="Larry Masinter"/>
          <author initials="M." surname="Suignard"/>
          <date year="2011" month="August" day="14"/>
          </front>
      </reference>


<reference anchor="ASCII">
<front>
<title>Coded Character Set -- 7-bit American Standard Code for Information
Interchange</title>
<author>
<organization>American National Standards Institute</organization>
</author>
<date year="1986"/>
</front>
<seriesInfo name="ANSI" value="X3.4"/>
</reference>

<reference anchor="ISO10646">
<front>
<title>ISO/IEC 10646:2003: Information Technology -
Universal Multiple-Octet Coded Character Set (UCS)</title>
<author>
<organization>International Organization for Standardization</organization>
</author>
<date month="December" year="2003"/>
</front>
<seriesInfo name="ISO" value="Standard 10646"/>
</reference>

&rfc2119;
&rfc3490;
&rfc3491;


<reference anchor="UNIV6">
<front>
<title>The Unicode Standard, Version 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, ISBN 978-1-936213-01-6)</title>
<author><organization>The Unicode Consortium</organization></author>
<date year="2010" month="October"/>
</front>
</reference>

<reference anchor="UNI9" target="http://www.unicode.org/reports/tr9/tr9-13.html">
<front>
<title>The Bidirectional Algorithm</title>
<author initials="M." surname="Davis" fullname="Mark Davis"><organization/></author>
<date year="2004" month="March"/>
</front>
<seriesInfo name="Unicode Standard Annex" value="#9"/>
</reference>

</references>

<references title="Informative References">

<reference anchor="BidiEx" target="http://www.w3.org/International/iri-edit/BidiExamples">
<front>
<title>Examples of bidirectional IRIs</title>
<author><organization/></author>
<date year="" month=""/>
</front>
</reference>


&rfc3987;


</references>

</back>
</rfc>