source: draft-ietf-iri-3987bis/draft-ietf-iri-bidi-guidelines.xml @ 139

Last change on this file since 139 was 139, checked in by duerst@…, 7 years ago

added a section documenting the major changes

1<?xml version="1.0"?>
2<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
3<!ENTITY rfc2119 SYSTEM "">
4<!ENTITY rfc3490 SYSTEM "">
5<!ENTITY rfc3987 SYSTEM "">
6<!ENTITY DRAFT "draft-ietf-iri-bidi-guidelines-03">
7<!ENTITY YEAR "2012">
8]>
9<?rfc strict='yes'?>
11<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
12<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
13<?rfc symrefs='yes'?>
14<?rfc sortrefs='yes'?>
15<?rfc iprnotified="no" ?>
16<?rfc toc='yes'?>
17<?rfc compact='yes'?>
18<?rfc subcompact='no'?>
19<rfc ipr="pre5378Trust200902" docName="&DRAFT;"
20  category="bcp" xml:lang="en">
21  <front>
22    <title abbrev="Bidi IRI Guidelines">Guidelines for Internationalized
23      Resource Identifiers with Bi-directional Characters (Bidi IRIs)</title>
24    <author initials="M.J." isurname="Dürst" surname="Duerst" ifullname="Martin J. Dürst"
25      fullname="Martin J. Duerst (Note: Please write &quot;Duerst&quot; with u-umlaut wherever possible, for example as &quot;D&amp;#252;rst&quot; in XML and HTML.)">
26      <organization>Aoyama Gakuin University<ionly> (青山学院大学)</ionly> </organization>
27      <address>
28        <postal>
29          <street>5-10-1 Fuchinobe</street>
30          <street>Chuo-ku</street>
31          <city>Sagamihara</city>
32          <region>Kanagawa</region>
33          <code>252-5258</code>
34          <country>Japan</country>
35        </postal>
36        <phone>+81 42 759 6329</phone>
37        <facsimile>+81 42 759 6495</facsimile>
38        <email></email>
39        <uri><aonly> (Note: This is the percent-encoded form of an IRI)</aonly><ionly>ürst/</ionly></uri>
40      </address>
41    </author>
42    <author initials="L." surname="Masinter" fullname="Larry Masinter">
43      <organization>Adobe</organization>
44      <address>
45        <postal>
46          <street>345 Park Ave</street>
47          <city>San Jose</city>
48          <region>CA</region>
49          <code>95110</code>
50          <country>U.S.A.</country>
51        </postal>
52        <phone>+1-408-536-3024</phone>
53        <email></email>
54        <uri></uri>
55      </address>
56    </author>
57    <author initials="A." isurname="Allawi (عادل علاوي)" surname="Allawi"
58      ifullname="Adil Allawi (عادل علاوي)" fullname="Adil Allawi">
59    <organization>Diwan Software Limited</organization>
60      <address>
61        <postal>
62          <street>37-39 Peckham Road</street>
63          <city>London</city>
64          <code>SE5 8UH</code>
65          <country>United Kingdom</country>
66        </postal>
67        <phone>+44 7718 785850</phone>
68        <facsimile>+44 20 72525444</facsimile>
69        <email></email>
70        <uri></uri>
71      </address>
72    </author>
73    <date year="&YEAR;" month="October" />
74    <area>Applications</area>
75    <workgroup>Internationalized Resource Identifiers (iri)</workgroup>
76    <keyword>IRI</keyword>
77    <keyword>Internationalized Resource Identifier</keyword>
78    <keyword>BIDI</keyword>
79    <keyword>URI</keyword>
80    <keyword>URL</keyword>
81    <keyword>IDN</keyword>
82    <abstract>
83      <t>This specification gives guidelines for selection, use, and
84        presentation of Internationalized Resource Identifiers (IRIs) which include
85        characters with inherent right-to-left (rtl) writing direction. </t>
86    </abstract>
87  </front>
88  <middle>
89    <section title="Introduction">
90      <section title='Overview'>
91      <t>Some UCS characters, such as those used in the Arabic and Hebrew
92        scripts, have an inherent right-to-left (rtl) writing direction as
93        opposed to characters, such as those in the Latin script, that have an
94        inherent left-to-right (ltr) direction. IRIs containing rtl characters
95        (called bidirectional IRIs or Bidi IRIs) require additional attention
96        because of the non-trivial relation between their logical and visual
97        ordering. The logical order represents the order in which characters are
98        stored on computers and read by people. The visual order is the order in
99        which the characters appear (or are expected to appear) on a computer
100        display or printout.</t>
101      <t>Generally, alphabetic characters in scripts like Arabic and Hebrew are
102        drawn rtl while numbers are drawn ltr. Symbols such as slash ('/') and
103        period ('.') take their visual direction from the surrounding characters.</t>
104      <t>Because of this complex interaction between the logical representation,
105        the visual representation, and the syntax of a Bidi IRI, a balance is
106        needed between various requirements. The main requirements are: <list
107        style="hanging">
108        <t hangText="1.">user-predictable conversion between visual and logical
109          representation;</t>
110        <t hangText="2.">the ability to include a wide range of characters in
111          various parts of the IRI; and</t>
112        <t hangText="3.">minor or no changes or restrictions for
113          implementations.</t>
114        </list></t>
115        </section>
116      <section title='Availability'>
117        <t>This document is available in (line-printer ready) plaintext ASCII and in PDF.
118          It is also available in HTML from
119          <vspace/><eref target=";/pub/&DRAFT;.html"
120            >;/pub/&DRAFT;.html</eref>,
121          and in UTF-8 plaintext from
122          <vspace/><eref target=";/pub/&DRAFT;.utf8.txt"
123            >;/pub/&DRAFT;.utf8.txt</eref>.
124          While all these versions are identical in their technical content,
125          the HTML, PDF, and UTF-8 plaintext versions show non-ASCII characters directly.
126          This often makes it easier to understand examples, and readers are therefore strongly advised
127          to consult these versions in preference or as a supplement to the ASCII version.</t>
128      </section>
129      <section title="Notation">
130        <t>In this document, "Bidi Notation", abbreviated "BN", is used for the given Bidi IRI
131          examples as follows: Lowercase letters a-z stand for characters that
132          are written with a left-to-right ordering (such as Latin characters),
133          whereas uppercase letters A-Z represent characters that are written
134          right to left (such as Arabic or Hebrew characters). Digits and
135          symbols stand for themselves.</t>
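        <t>As an informal illustration of this notation, the following Python sketch
          classifies a single character either as Bidi Notation or as real text.
          The function names bn_class and real_class are hypothetical and not part of
          this specification; the second function assumes that the Unicode bidirectional
          classes L, R, and AL are an adequate stand-in for "ltr" and "rtl" letters.</t>
        <figure><artwork><![CDATA[
```python
import unicodedata


def bn_class(ch: str) -> str:
    """Classify one Bidi Notation character as 'ltr', 'rtl', or 'same'."""
    if "a" <= ch <= "z":
        return "ltr"   # lowercase stands for a left-to-right character
    if "A" <= ch <= "Z":
        return "rtl"   # uppercase stands for a right-to-left character
    return "same"      # digits and symbols stand for themselves


def real_class(ch: str) -> str:
    """Classify a real character using its Unicode bidirectional class.

    Assumption: L counts as ltr, R and AL count as rtl, and everything
    else (digits, punctuation, ...) is reported as 'same'.
    """
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "ltr"
    if bidi in ("R", "AL"):
        return "rtl"
    return "same"
```
]]></artwork></figure>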
136        <t> In this document, the key words "MUST", "MUST NOT", "REQUIRED",
137          "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
138          and "OPTIONAL" are to be interpreted as described in <xref
139            target="RFC2119"/>.</t>
140      </section>
141      <!-- Notation -->
142    </section>
143    <!-- Introduction -->
144    <section title="Logical Storage and Visual Presentation" anchor="visual">
145      <t>When stored or transmitted in digital representation, Bidi IRIs MUST be
146        in full logical order and MUST conform to the IRI syntax rules (which
147        include the rules relevant to their scheme). This ensures that
148        Bidi IRIs can be processed in the same way as other IRIs.</t>
149      <t>Bidi IRIs MUST be visually ordered by the Unicode Bidirectional
150        Algorithm <xref target="UNIV6"/>, <xref target="UNI9"/>. Bidi IRIs MUST
151        be rendered in the same way as they would be if they were in a
152        left-to-right embedding. </t>
153      <t>In conformance with the Unicode Bidirectional Algorithm, embedding MAY
154        be done in one of two ways: <list style="hanging">
155        <t hangText="1.">precede the IRI with U+202A, LEFT-TO-RIGHT EMBEDDING
156          (LRE), and follow with U+202C, POP DIRECTIONAL FORMATTING (PDF);
157          or</t>
158        <t hangText="2.">use a higher-level protocol (e.g., the dir='ltr'
159          attribute in HTML).</t>
160        </list></t>
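      <t>The first option can be sketched in Python as follows. This is only a
        minimal illustration: the function name embed_ltr is hypothetical, and the
        sketch assumes that the surrounding text is plain Unicode text with no
        higher-level directionality markup.</t>
      <figure><artwork><![CDATA[
```python
LRE = "\u202A"  # U+202A LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # U+202C POP DIRECTIONAL FORMATTING


def embed_ltr(iri: str) -> str:
    """Wrap an IRI in a left-to-right embedding for display purposes.

    The two formatting characters surround the IRI; they are not part of it
    and have to be removed again before the IRI is stored or resolved.
    """
    return f"{LRE}{iri}{PDF}"
```
]]></artwork></figure>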
161      <t>Preceding and following the Bidi IRI with U+200E, LEFT-TO-RIGHT MARK
162        (LRM), is NOT RECOMMENDED, as there are cases where this may not be
163        sufficient to match full left-to-right embedding.</t>
164      <t>There is no requirement to use embedding if the display is still the
165        same without the embedding. For example, a Bidi IRI in a text
166        with left-to-right base directionality (such as used for English or
167        Cyrillic) that is preceded and followed by whitespace and strong
168        left-to-right characters does not need an embedding. Also, a
169        bidirectional relative IRI reference that only contains strong
170        right-to-left characters and weak characters (such as symbols) and that
171        starts and ends with a strong right-to-left character and appears in a
172        text with right-to-left base directionality (such as used for Arabic or
173        Hebrew) and is preceded and followed by whitespace and strong characters
174        does not need an embedding.</t>
175      <t>However, implementers are RECOMMENDED to use embedding in all cases
176        where they are not completely sure that the display behavior is
177        unaffected without the embedding.</t>
178      <t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
179        4.3) permits higher-level protocols to influence bidirectional
180        rendering. Such changes by higher-level protocols MUST NOT be used if
181        they change the rendering of IRIs.</t>
182      <t>The bidirectional formatting characters that may be used before or
183        after the IRI to ensure correct display are not themselves part of the
184        IRI. IRIs MUST NOT contain bidirectional formatting characters (LRM,
185        RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
186        the IRI but do not appear themselves. It would therefore not be possible
187        to input an IRI with such characters correctly.</t>
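      <t>A check for the seven formatting characters listed above might look like
        the following Python sketch. The function name find_bidi_formatting is
        hypothetical; the code points are those assigned to these characters by
        Unicode.</t>
      <figure><artwork><![CDATA[
```python
# The seven bidirectional formatting characters named in the text above.
BIDI_FORMATTING = {
    "\u200E": "LRM",
    "\u200F": "RLM",
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202C": "PDF",
    "\u202D": "LRO",
    "\u202E": "RLO",
}


def find_bidi_formatting(iri: str) -> list:
    """Return (index, name) pairs for every formatting character found."""
    return [(i, BIDI_FORMATTING[ch])
            for i, ch in enumerate(iri)
            if ch in BIDI_FORMATTING]
```
]]></artwork></figure>
      <t>An empty result means that the string contains none of these characters.</t>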
188    </section>
189    <!-- visual -->
190    <section title="Bidi IRI Structure" anchor="bidi-structure">
191      <t>The Unicode Bidirectional Algorithm is designed for general purpose
192        text. To make sure that it does not affect the rendering of Bidi IRIs
193        outside of the requirements of this document, some restrictions on Bidi
194        IRIs are necessary. These restrictions are given in terms of delimiters
195        (structural characters, mostly punctuation such as "@", ".", ":", and
196        "/") and components (usually consisting mostly of letters and
197        digits).</t>
198      <t>The following syntax rules from the ABNF of <xref target="RFC3987bis"/>
199        correspond to components for the purpose of Bidi behavior: iuserinfo,
200        ireg-name, isegment, isegment-nz, isegment-nz-nc, iquery, and
201        ifragment.</t>
202      <t>Specifications that define the syntax of any of the above components
203        MAY divide them further and define smaller parts to be components
204        according to this document. As an example, the restrictions of <xref
205          target="RFC3490"/> on bidirectional domain names correspond to treating
206        each label of a domain name as a component for schemes with ireg-name as
207        a domain name. Even where the components are not defined formally, it
208        may be helpful to think about some syntax in terms of components and to
209        apply the relevant restrictions. For example, for the usual name/value
210        syntax in query parts, it is convenient to treat each name and each
211        value as a component. As another example, the extensions in a resource
212        name can be treated as separate components.</t>
213      <t>For each component, the following restrictions apply:</t>
214      <t> <list style="hanging">
215        <t hangText="1.">A component SHOULD NOT use both right-to-left and
216          left-to-right characters.</t>
217        <t hangText="2.">A component using right-to-left characters SHOULD start
218          and end with right-to-left characters.</t>
219      </list></t>
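      <t>The two restrictions can be checked mechanically. The following Python
        sketch is one possible reading, not a normative algorithm: the function name
        check_component is hypothetical, and the sketch assumes that "left-to-right
        characters" are those with Unicode bidirectional class L and "right-to-left
        characters" are those with class R or AL. Digits and symbols belong to
        neither group.</t>
      <figure><artwork><![CDATA[
```python
import unicodedata

RTL_CLASSES = ("R", "AL")  # assumption: strong right-to-left classes


def check_component(component: str) -> list:
    """Return a list of restriction violations for one component."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)

    problems = []
    # Restriction 1: do not mix rtl and ltr characters in one component.
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    # Restriction 2: an rtl component starts and ends with rtl characters.
    if has_rtl and (classes[0] not in RTL_CLASSES
                    or classes[-1] not in RTL_CLASSES):
        problems.append("does not start and end with right-to-left characters")
    return problems
```
]]></artwork></figure>
      <t>With this reading, a component made of Hebrew letters followed by a digit
        violates only the second restriction, because the digit is neither an rtl
        nor an ltr character.</t>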
220      <t>The above restrictions are given as "SHOULD"s, rather than as "MUST"s.
221        For IRIs that are never presented visually, they are not relevant.
222        However, for IRIs in general, they are very important to ensure
223        consistent conversion between visual presentation and logical
224        representation, in both directions.</t>
225      <t><list style="hanging">
226        <t hangText="Note:">In some components, the above restrictions may
227          actually be strictly enforced. For example, <xref target="RFC3490"/>
228          requires that these restrictions apply to the labels of a host name
229          for those schemes where ireg-name is a host name. In some other
230          components (for example, path components) following these restrictions
231          may not be too difficult. For other components, such as parts of the
232          query part, it may be very difficult to enforce the restrictions
233          because the values of query parameters may be arbitrary character
234          sequences.</t>
235      </list></t>
236      <t>If the above restrictions cannot be satisfied otherwise, the affected
237        component can always be mapped to URI notation using the general
238        percent-encoding of IRI components, as described in <xref
239          target="RFC3987bis"/>. Please note that the whole component has to be
240        mapped (see also Example 9 below).</t>
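      <t>As a rough illustration, the mapping of a whole component can be sketched
        in Python with the standard urllib.parse.quote function, which
        percent-encodes the UTF-8 bytes of non-ASCII characters. The function name
        component_to_uri_notation and the choice of characters left untouched are
        assumptions made for this sketch; the authoritative mapping is the one
        defined in the referenced IRI specification.</t>
      <figure><artwork><![CDATA[
```python
from urllib.parse import quote

# Assumption: ASCII characters that may appear literally in a component,
# plus "%" so that existing percent-escapes are not encoded a second time.
KEEP = "!$&'()*+,;=:@%"


def component_to_uri_notation(component: str) -> str:
    """Percent-encode every non-ASCII character of the component at once."""
    return quote(component, safe=KEEP, encoding="utf-8")
```
]]></artwork></figure>
      <t>Because every non-ASCII character of the component is converted, the
        result contains no right-to-left characters, and the component is
        displayed left to right.</t>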
241    </section>
242    <!-- bidi-structure -->
243    <section title="Input of Bidi IRIs" anchor="bidiInput">
244      <t>Bidi input methods MUST generate Bidi IRIs in logical order while
245        rendering them according to <xref target="visual"/>. During input,
246        rendering SHOULD be updated after every new character is input to avoid
247        end-user confusion.</t>
248    </section>
249    <!-- bidiInput -->
250    <section title="Examples">
251      <t>This section gives examples of Bidi IRIs in Bidi Notation. It shows
252        legal IRIs with the relationship between their logical and visual
253        representation and explains how certain phenomena in this relationship
254        may look strange to somebody not familiar with bidirectional behavior,
255        but familiar to users of Arabic and Hebrew. It also shows what happens
256        if the restrictions given in <xref target="bidi-structure"/> are not
257        followed. The examples below can be seen at <xref target="BidiEx"/>, in
258        Arabic, Hebrew, and Bidi Notation variants.</t>
259      <t>To read the bidi text in the examples, read the visual representation
260        from left to right until you encounter a block of rtl text. Read the rtl
261        block (including slashes and other special characters) from right to
262        left, then continue at the next unread ltr character.</t>
263      <t>Please note that "BN" stands for "Bidi Notation" (see the Notation section);
264        "AR" stands for Arabic and "HE" for Hebrew.</t>
266      <t>Example 1: A single component with rtl characters is inverted:
268        <vspace/>Logical representation (BN): "http://ab.CDEFGH.ij/kl/mn/op.html"
269        <vspace/>Visual representation (BN): "http://ab.HGFEDC.ij/kl/mn/op.html"
270        <ionly>
271        <vspace/>Visual representation (AR): "<span dir='ltr'>http://ab.تثجحخد.ij/kl/mn/op.html</span>"
272        <vspace/>Visual representation (HE): "<span dir='ltr'>http://ab.גדהוזח.ij/kl/mn/op.html</span>"
273        </ionly>
274        <vspace/>Components can be read one
275        by one, and each component can be read in its natural direction.</t>
277      <t>Example 2: More than one consecutive component with rtl characters is
278        inverted as a whole:
280        <vspace/>Logical representation (BN): "http://ab.CDE.FGH/ij/kl/mn/op.html"
281        <vspace/>Visual representation (BN): "http://ab.HGF.EDC/ij/kl/mn/op.html"
282        <ionly>
283          <vspace/>Visual representation (AR): "<span dir='ltr'>http://ab.تثج.حخد/ij/kl/mn/op.html</span>"
284          <vspace/>Visual representation (HE): "<span dir='ltr'>http://ab.גדה.וזח/ij/kl/mn/op.html</span>"
285        </ionly>
287        <vspace/> A sequence of rtl
288        components is read rtl, in the same way as a sequence of rtl words is
289        read rtl in a bidi text.</t>
291      <t>Example 3: All components of an IRI (except for the scheme) are rtl.
292        All rtl components are inverted overall:
294        <vspace/>Logical representation (BN): "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
295        <vspace/>Visual representation (BN): "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
296        <ionly>
297          <vspace/>Visual representation (AR): "<span dir='ltr'>http://اب.تث.جح/خد/ذر/زس?شص=ضط;ظع=غف#قك</span>"
298          <vspace/>Visual representation (HE): "<span dir='ltr'>http://אב.גד.הו/זח/טי/כל?מן=סע;פץ=קר#שת</span>"
299        </ionly>
301        <vspace/> The
302        whole IRI (except the scheme) is read rtl. Delimiters between rtl
303        components stay between the respective components; delimiters between
304        ltr and rtl components don't move.</t>
306      <t>Example 4: Each of several sequences of rtl components is inverted on
307        its own:
309        <vspace/>Logical representation (BN): "http://AB.CD.ef/gh/IJ/KL.html"
310        <vspace/>Visual representation (BN): "http://DC.BA.ef/gh/LK/JI.html"
311        <ionly>
312          <vspace/>Visual representation (AR): "<span dir='ltr'>http://اب.تث.ef/gh/ذر/زس.html</span>"
313          <vspace/>Visual representation (HE): "<span dir='ltr'>http://אב.גד.ef/gh/טי/כל.html</span>"
314        </ionly>
316        <vspace/> Each sequence of rtl components
317        is read rtl, in the same way as each sequence of rtl words in an ltr
318        text is read rtl.</t>
320      <t>Example 5: Example 2, applied to components of different kinds:
322        <vspace/>Logical representation (BN): ""
323        <vspace/>Visual representation (BN): ""
324        <ionly>
325          <vspace/>Visual representation (AR): "<span dir='ltr'>جح/خد/ij/kl.html</span>"
326          <vspace/>Visual representation (HE): "<span dir='ltr'>הו/זח/ij/kl.html</span>"
327        </ionly>
329        <vspace/>
330        The inversion of the domain name label and the path component may be
331        unexpected, but it is consistent with other bidi behavior. For
332        reassurance that the domain component really is "", it may be
333        helpful to read aloud the visual representation following the Unicode
334        Bidirectional Algorithm. After "" one reads the RTL block
335        "E-F-slash-G-H", which corresponds to the logical representation. </t>
337      <t>Example 6: Same as Example 5, with more rtl components:
339        <vspace/>Logical representation (BN): "http://ab.CD.EF/GH/IJ/kl.html"
340        <vspace/>Visual representation (BN): "http://ab.JI/HG/FE.DC/kl.html"
341        <ionly>
342          <vspace/>Visual representation (AR): "<span dir='ltr'>http://ab.تث.جح/خد/ذر/kl.html</span>"
343          <vspace/>Visual representation (HE): "<span dir='ltr'>http://ab.גד.הו/זח/טי/kl.html</span>"
344        </ionly>
346        <vspace/> The inversion of the domain
347        name labels and the path components may be easier to identify because
348        the delimiters also move.</t>
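      <t>The reordering seen in the digit-free examples of this section can be
        reproduced with a small Python sketch that works on Bidi Notation strings.
        It is a deliberately simplified model, not an implementation of the Unicode
        Bidirectional Algorithm: it assumes a left-to-right embedding, rejects
        digits, ignores character mirroring, treats every character other than a
        letter as neutral, and the function name bn_visual is hypothetical.</t>
      <figure><artwork><![CDATA[
```python
def bn_visual(logical: str) -> str:
    """Visual order of a digit-free Bidi Notation string (simplified)."""

    def classify(ch: str) -> str:
        if ch.isdigit():
            raise ValueError("digits are outside this simplified model")
        if "a" <= ch <= "z":
            return "L"
        if "A" <= ch <= "Z":
            return "R"
        return "N"  # delimiters and other symbols are neutral

    classes = [classify(ch) for ch in logical]
    n = len(classes)

    # Neutrals take the rtl direction only when both neighbours are rtl;
    # otherwise they take the ltr direction of the embedding.
    i = 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else "L"
        after = classes[j] if j < n else "L"
        resolved = "R" if before == "R" and after == "R" else "L"
        classes[i:j] = [resolved] * (j - i)
        i = j

    # Reverse each maximal rtl run; ltr runs keep their order.
    out = []
    i = 0
    while i < n:
        j = i
        while j < n and classes[j] == classes[i]:
            j += 1
        run = logical[i:j]
        out.append(run[::-1] if classes[i] == "R" else run)
        i = j
    return "".join(out)
```
]]></artwork></figure>
      <t>Applied to the logical representations of Examples 1, 2, 3, 4, and 6, this
        sketch yields the visual representations given there. Examples with digits
        need the full algorithm.</t>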
350      <t>Example 7: A single rtl component includes digits:
352        <vspace/>Logical representation (BN): "http://ab.CDE123FGH.ij/kl/mn/op.html"
353        <vspace/>Visual representation (BN): "http://ab.HGF123EDC.ij/kl/mn/op.html"
354        <ionly>
355          <vspace/>Visual representation (AR): "<span dir='ltr'>http://ab.تثج123حخد.ij/kl/mn/op.html</span>"
356          <vspace/>Visual representation (HE): "<span dir='ltr'>http://ab.גדה123וזח.ij/kl/mn/op.html</span>"
357        </ionly>
359        <vspace/> Numbers
360        are written ltr in all cases but are treated as an additional embedding
361        inside a run of rtl characters. This is completely consistent with usual
362        bidirectional text.</t>
364      <t>Example 8 (not allowed): Numbers are at the start or end of an rtl
365        component:
367        <vspace/>Logical representation (BN): ""
368        <vspace/>Visual representation (BN): ""
369        <ionly>
370          <vspace/>Visual representation (AR): "<span dir='ltr'>خد1/2ذر/زس.html</span>"
371          <vspace/>Visual representation (HE): "<span dir='ltr'>זח1/2טי/כל.html</span>"
372        </ionly>
374        <vspace/> The sequence "1/2" is
375        interpreted by the Bidirectional Algorithm as a fraction, fragmenting the
376        components and leading to confusion. There are other characters that are
377        interpreted in a special way close to numbers; in particular, "+", "-",
378        "#", "$", "%", ",", ".", and ":".</t>
380      <t>Example 9 (not allowed): The numbers in the previous example are
381        percent-encoded:
383        <vspace/>Logical representation (BN): ""
384        <vspace/>Visual representation (BN): ""
385        <ionly>
386          <vspace/>Visual representation (AR): "<span dir='ltr'>خد%31/%32ذر/زس.html</span>"
387          <vspace/>Visual representation (HE): "<span dir='ltr'>זח%31/%32טי/כל.html</span>"
388        </ionly>
390      </t>
392      <t>Example 10 (allowed but not recommended):
394        <vspace/>Logical representation (BN): "http://ab.CDEFGH.123/kl/mn/op.html"
395        <vspace/>Visual representation (BN): "http://ab.123.HGFEDC/kl/mn/op.html"
396        <ionly>
397          <vspace/>Visual representation (AR): "<span dir='ltr'>http://ab.تثجحخد.123/kl/mn/op.html</span>"
398          <vspace/>Visual representation (HE): "<span dir='ltr'>http://ab.גדהוזח.123/kl/mn/op.html</span>"
399        </ionly>
401        <vspace/> Components
402        consisting of only numbers are allowed (it would be rather difficult to
403        prohibit them), but these may interact with adjacent RTL components in
404        ways that are not easy to predict.</t>
406      <t>Example 11 (allowed but not recommended):
408        <vspace/>Logical representation (BN): "http://ab.CDEFGH.123ij/kl/mn/op.html"
409        <vspace/>Visual representation (BN): "http://ab.123.HGFEDCij/kl/mn/op.html"
410        <ionly>
411          <vspace/>Visual representation (AR): "<span dir='ltr'>http://ab.تثجحخد.123ij/kl/mn/op.html</span>"
412          <vspace/>Visual representation (HE): "<span dir='ltr'>http://ab.גדהוזח.123ij/kl/mn/op.html</span>"
413        </ionly>
415        <vspace/>
416        Components consisting of numbers and left-to-right characters are
417        allowed, but these may interact with adjacent RTL components in ways
418        that are not easy to predict.</t>
419    </section>
420    <!-- examples -->
421    <section title="IANA Considerations" anchor="iana">
422      <t>This document makes no changes to IANA registries.</t>
423    </section>
424    <!-- IANA -->
425    <section title="Security Considerations" anchor="security">
426      <t>Confusion can occur with bidirectional IRIs if the restrictions in
427        <xref target="bidi-structure"/> are not followed. The same visual
428        representation may be interpreted as different logical representations,
429        and vice versa. It is also very important that a correct Unicode
430        bidirectional implementation be used.</t>
431    </section>
432    <!-- security -->
433    <section title="Acknowledgements">
434      <t>This document was derived from <xref target="RFC3987"/> and <xref
435        target="RFC3987bis"/> and the acknowledgments of those documents
436        apply.</t>
437    </section>
438    <!-- acknowledgements -->
439    <section title="Main Changes Since RFC 3987">
440      <t>This section describes the main changes since <xref target="RFC3987"></xref>.</t>       
441        <t>Note to RFC Editor: Please remove this paragraph before publication.
442          Detailed change logs are available in the IETF tools subversion repository at
445      <t><list style="symbols">
446        <t>Separated out the section on bidi in <xref target="RFC3987"/> to this document.</t>
447        <t>Added examples in Arabic and Hebrew, which can be seen in the HTML, PDF, and UTF-8 plaintext versions.</t>
448        <t>TODO: check for major changes between RFC3987 and draft -02.</t>
449      </list>
450      </t>
451    </section>
452  </middle>
453  <back>
454    <references title="Normative References">
455      <reference anchor="RFC3987bis"
456        target="">
457        <front>
458          <title>Internationalized Resource Identifiers (IRIs)</title>
459          <author initials="M." surname="Duerst"/>
460          <author initials="L." surname="Masinter" fullname="Larry Masinter"/>
461          <author initials="M." surname="Suignard"/>
462          <date year="2011" month="August" day="14"/>
463        </front>
464      </reference>
465      &rfc2119;
466      &rfc3490;
467      <reference anchor="UNIV6">
468        <front>
469          <title>The Unicode Standard, Version 6.0.0 (Mountain View, CA, The
470            Unicode Consortium, 2011, ISBN 978-1-936213-01-6)</title>
471          <author>
472            <organization>The Unicode Consortium</organization>
473          </author>
474          <date year="2010" month="October"/>
475        </front>
476      </reference>
477      <reference anchor="UNI9"
478        target="">
479        <front>
480          <title>The Unicode Bidirectional Algorithm</title>
481          <author initials="M." surname="Davis" fullname="Mark Davis">
482            <organization/>
483          </author>
484          <date year="2004" month="March"/>
485        </front>
486        <seriesInfo name="Unicode Standard Annex" value="#9"/>
487      </reference>
488    </references>
489    <references title="Informative References">
490      <reference anchor="BidiEx"
491        target="">
492        <front>
493          <title>Examples of Bidi IRIs</title>
494          <author>
495            <organization/>
496          </author>
497          <date year="" month=""/>
498        </front>
499      </reference> &rfc3987; </references>
500  </back>