Changeset 70

Aug 12, 2011, 8:51:06 AM
  • moved section on normalization and comparison to separate document
  • published as draft -06
1 edited


  • draft-ietf-iri-3987bis/draft-ietf-iri-3987bis.xml

    r69 r70  
    3333<?rfc compact='yes'?>
    3434<?rfc subcompact='no'?>
    35 <rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-05" category="std" xml:lang="en" obsoletes="3987">
     35<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-3987bis-06" category="std" xml:lang="en" obsoletes="3987">
    3737<title abbrev="IRIs">Internationalized Resource Identifiers (IRIs)</title>
    90 <date year="2011" month="March" day="29"/>
     90<date year="2011" />
    9292<workgroup>Internationalized Resource Identifiers (iri)</workgroup>
    180180difficulties. <xref target="Bidi"/> discusses the special case of
    181181bidirectional IRIs using characters from scripts written
    182 right-to-left.  <xref target="equivalence"/> discusses various forms
    183 of equivalence between IRIs. <xref target="IRIuse"/> discusses the use
     182right-to-left.  <xref target="IRIuse"/> discusses the use
    184183of IRIs in different situations.  <xref target="guidelines"/> gives
    185184additional informative guidelines.  <xref target="security"/>
    190189  For some additional background on the design of URIs and IRIs, please also see
    191190    <xref target="Gettys"/>.</t>
    192192</section> <!-- overview -->
    200200IRIs is accomplished by extending the URI syntax while retaining (and
    201201not expanding) the set of "reserved" characters, such that the syntax
    202 for any URI scheme may be uniformly extended to allow non-ASCII
     202for any URI scheme may be extended to allow non-ASCII
    203203characters. In addition, following parsing of an IRI, it is possible
    204204to construct a corresponding URI by first encoding characters outside
    403403<section title="Summary of IRI Syntax" anchor="summary">
    405 <t>IRIs are defined by extending the URI syntax in <xref
    406 target="RFC3986"/>, but extending the class of unreserved characters
    407 by adding the characters of the UCS (Universal Character Set, <xref
     405<t>The IRI syntax extends the URI syntax in <xref
     406target="RFC3986"/> by extending the class of unreserved characters,
     407primarily by adding the characters of the UCS (Universal Character Set, <xref
    408408target="ISO10646"/>) beyond U+007F, subject to the limitations given
    409409in the syntax rules below and in <xref target="limitations"/>.</t>
    596596 characters or an octet-stream representing a Unicode-based character
    597597 encoding such as UTF-8 or UTF-16) should be left as is and not
    598  normalized (see <xref target="normalization"/>).</t>
     598 normalized or changed.</t>
    600600  <t>An IRI or IRI reference is a sequence of characters from the UCS.
    601     For IRIs that are not already in a Unicode form
     601    For resource identifiers that are not already in a Unicode form
    602602    (as when written on paper, read aloud, or represented in a text stream
    603603    using a legacy character encoding), convert the IRI to Unicode.
    604604    Note that some character encodings or transcriptions can be converted
    605605    to or represented by more than one sequence of Unicode characters.
    606607    Ideally the resulting IRI would use a normalized form,
    607     such as Unicode Normalization Form C <xref target="UTR15"/>
    608     (see <xref target='ladder'/> Normalization and Comparison),
     608    such as Unicode Normalization Form C <xref target="UTR15"/>,
    609609    since that ensures a stable, consistent representation
    610610    that is most likely to produce the intended results.
    665665in question. </t>
    667 <t>For each character which is not allowed anywhere in a valid URI, apply the following steps. </t>
     667<t>For each character which is not allowed anywhere in a valid URI
     668 apply the following steps. </t>
    669669<t><list style="hanging">
    11801180</section><!-- examples -->
    11811181</section><!-- bidi -->
    1183 <section title="Normalization and Comparison" anchor="equivalence">
    1185 <t><list style="hanging"><t hangText="Note:">The structure and much of
    1186   the material for this section is taken from section 6 of <xref
    1187   target="RFC3986"></xref>; the differences are due to the specifics
    1188   of IRIs.</t></list></t>
    1190 <t>One of the most common operations on IRIs is simple comparison:
    1191 Determining whether two IRIs are equivalent, without using the IRIs to
    1192 access their respective resource(s). A comparison is performed
    1193 whenever a response cache is accessed, a browser checks its history to
    1194 color a link, or an XML parser processes tags within a
    1195 namespace. Extensive normalization prior to comparison of IRIs may be
    1196 used by spiders and indexing engines to prune a search space or reduce
    1197 duplication of request actions and response storage.</t>
    1199 <t>IRI comparison is performed for some particular purpose. Protocols
    1200 or implementations that compare IRIs for different purposes will often
    1201 be subject to differing design trade-offs in regards to how much
    1202 effort should be spent in reducing aliased identifiers. This section
    1203 describes various methods that may be used to compare IRIs, the
    1204 trade-offs between them, and the types of applications that might use
    1205 them.</t>
    1207 <section title="Equivalence">
    1209 <t>Because IRIs exist to identify resources, presumably they should be
    1210 considered equivalent when they identify the same resource. However,
    1211 this definition of equivalence is not of much practical use, as there
    1212 is no way for an implementation to compare two resources to determine
    1213 if they are "the same" unless it has full knowledge or control of
    1214 them. For this reason, determination of equivalence or difference of
    1215 IRIs is based on string comparison, perhaps augmented by reference to
    1216 additional rules provided by URI scheme definitions.  We use the terms
    1217 "different" and "equivalent" to describe the possible outcomes of such
    1218 comparisons, but there are many application-dependent versions of
    1219 equivalence.</t>
    1221 <t>Even when it is possible to determine that two IRIs are equivalent,
    1222 IRI comparison is not sufficient to determine whether two IRIs
    1223 identify different resources. For example, an owner of two different
    1224 domain names could decide to serve the same resource from both,
    1225 resulting in two different IRIs. Therefore, comparison methods are
    1226 designed to minimize false negatives while strictly avoiding false
    1227 positives.</t>
    1229 <t>In testing for equivalence, applications should not directly
    1230 compare relative references; the references should be converted to
    1231 their respective target IRIs before comparison. When IRIs are compared
    1232 to select (or avoid) a network action, such as retrieval of a
    1233 representation, fragment components (if any) should be excluded from
    1234 the comparison.</t>
    1236 <t>Applications using IRIs as identity tokens with no relationship to
    1237 a protocol MUST use the Simple String Comparison (see <xref
    1238 target="stringcomp"></xref>).  All other applications MUST select one
    1239 of the comparison practices from the Comparison Ladder (see <xref
    1240 target="ladder"></xref>).</t>
    1241 </section> <!-- equivalence -->
    1244 <section title="Preparation for Comparison">
    1245 <t>Any kind of IRI comparison REQUIRES that any additional contextual
    1246 processing is first performed, including undoing higher-level
    1247 escapings or encodings in the protocol or format that carries an
    1248 IRI. This preprocessing is usually done when the protocol or format is
    1249 parsed.</t>
    1251 <t>Examples of contextual preprocessing steps are described in <xref
    1252 target="LEIRIHREF"/>. </t>
    1254 <t>Examples of such escapings or encodings are entities and
    1255 numeric character references in <xref target="HTML4"></xref> and <xref
    1256 target="XML1"></xref>. As an example,
    1257 "&amp;eacute;" (in HTML),
    1258 "&amp;#233;" (in HTML or XML), and
    1259 <vspace/>"&amp;#xE9;" (in HTML or XML) are all
    1260 resolved into what is denoted in this document (see <xref
    1261 target="sec-Notation"></xref>) as "&amp;#xE9;"
    1262 (the "&amp;#xE9;" here standing for the actual e-acute character, to
    1263 compensate for the fact that this document cannot contain non-ASCII
    1264 characters).</t>
    1266 <t>Similar considerations apply to encodings such as Transfer Codings
    1267 in HTTP (see <xref target="RFC2616"></xref>) and Content Transfer
    1268 Encodings in MIME (<xref target="RFC2045"></xref>), although in these
    1269 cases, the encoding is based not on characters but on octets, and
    1270 additional care is required to make sure that characters, and not just
    1271 arbitrary octets, are compared (see <xref
    1272 target="stringcomp"></xref>).</t>
    1274 </section> <!-- preparation -->
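The preprocessing described in this section can be sketched in Python. This is only an illustration, assuming the carrying format is HTML; the host and path `example.org/ros...` and the helper name `undo_markup_escaping` are hypothetical and do not come from the draft.

```python
import html

# Three spellings of the same (hypothetical) IRI as they might appear in markup.
escaped_forms = [
    "http://example.org/ros&eacute;",  # named entity (HTML)
    "http://example.org/ros&#233;",    # decimal character reference
    "http://example.org/ros&#xE9;",    # hexadecimal character reference
]


def undo_markup_escaping(text: str) -> str:
    """Resolve named entities and numeric character references."""
    return html.unescape(text)


# After preprocessing, all three forms collapse to one character string.
resolved = {undo_markup_escaping(form) for form in escaped_forms}
```

Only after this step does it make sense to apply any of the comparison methods to the resulting strings.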
    1276 <section title="Comparison Ladder" anchor="ladder">
    1278 <t>In practice, a variety of methods are used to test IRI
    1279 equivalence. These methods fall into a range distinguished by the
    1280 amount of processing required and the degree to which the probability
    1281 of false negatives is reduced. As noted above, false negatives cannot
    1282 be eliminated. In practice, their probability can be reduced, but this
    1283 reduction requires more processing and is not cost-effective for all
    1284 applications.</t>
    1287 <t>If this range of comparison practices is considered as a ladder,
    1288 the following discussion will climb the ladder, starting with
    1289 practices that are cheap but have a relatively higher chance of
    1290 producing false negatives, and proceeding to those that have higher
    1291 computational cost and lower risk of false negatives.</t>
    1293 <section title="Simple String Comparison" anchor="stringcomp">
    1295 <t>If two IRIs, when considered as character strings, are identical,
    1296 then it is safe to conclude that they are equivalent.  This type of
    1297 equivalence test has very low computational cost and is in wide use in
    1298 a variety of applications, particularly in the domain of parsing. It
    1299 is also used when a definitive answer to the question of IRI
    1300 equivalence is needed that is independent of the scheme used and that
    1301 can be calculated quickly and without accessing a network. An example
    1302 of such a case is XML Namespaces (<xref
    1303 target="XMLNamespace"></xref>).</t>
    1306 <t>Testing strings for equivalence requires some basic precautions.
    1307 This procedure is often referred to as "bit-for-bit" or
    1308 "byte-for-byte" comparison, which is potentially misleading. Testing
    1309 strings for equality is normally based on pair comparison of the
    1310 characters that make up the strings, starting from the first and
    1311 proceeding until both strings are exhausted and all characters are
    1312 found to be equal, until a pair of characters compares unequal, or
    1313 until one of the strings is exhausted before the other.</t>
    1315 <t>This character comparison requires that each pair of characters be
    1316 put in comparable encoding form. For example, should one IRI be stored
    1317 in a byte array in UTF-8 encoding form and the second in a UTF-16
    1318 encoding form, bit-for-bit comparisons applied naively will produce
    1319 errors. It is better to speak of equality on a character-for-character
    1320 rather than on a byte-for-byte or bit-for-bit basis.  In practical
    1321 terms, character-by-character comparisons should be done codepoint by
    1322 codepoint after conversion to a common character encoding form.
    1324 When comparing character by character, the comparison function MUST
    1325 NOT map IRIs to URIs, because such a mapping would create additional
    1326 spurious equivalences. It follows that an IRI SHOULD NOT be modified
    1327 when being transported if there is any chance that this IRI might be
    1328 used in a context that uses Simple String Comparison.</t>
    1331 <t>False negatives are caused by the production and use of IRI
    1332 aliases. Unnecessary aliases can be reduced, regardless of the
    1333 comparison method, by consistently providing IRI references in an
    1334 already normalized form (i.e., a form identical to what would be
    1335 produced after normalization is applied, as described below).
    1336 Protocols and data formats often limit some IRI comparisons to simple
    1337 string comparison, based on the theory that people and implementations
    1338 will, in their own best interest, be consistent in providing IRI
    1339 references, or at least be consistent enough to negate any efficiency
    1340 that might be obtained from further normalization.</t>
    1341 </section> <!-- stringcomp -->
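A minimal Python sketch of the character-for-character comparison described in this section, assuming each IRI is stored as bytes together with a known encoding name. The function name `iris_equal_simple` and the sample IRI are hypothetical.

```python
def iris_equal_simple(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Compare two stored IRIs code point by code point.

    Both byte sequences are first decoded into Python strings, which are
    sequences of Unicode code points, so the storage encoding no longer matters.
    No IRI-to-URI mapping and no normalization is applied.
    """
    return a.decode(a_encoding) == b.decode(b_encoding)


# The same (hypothetical) IRI stored in two different encoding forms.
iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

# A URI-mapped variant; Simple String Comparison must treat it as different.
uri_mapped = "http://example.org/ros%C3%A9".encode("utf-8")
```

Here `as_utf8` and `as_utf16` differ byte for byte but compare equal as characters, while `uri_mapped` compares as different because percent-encoding is deliberately not undone at this rung of the ladder.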
    1343 <section title="Syntax-Based Normalization">
    1345 <figure><preamble>Implementations may use logic based on the
    1346 definitions provided by this specification to reduce the probability
    1347 of false negatives. This processing is moderately higher in cost than
    1348 character-for-character string comparison. For example, an application
    1349 using this approach could reasonably consider the following two IRIs
    1350 equivalent:</preamble>
    1352 <artwork>
    1353    example://a/b/c/%7Bfoo%7D/ros&amp;#xE9;
    1354    eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
    1355 </artwork></figure>
    1357 <t>Web user agents, such as browsers, typically apply this type of IRI
    1358 normalization when determining whether a cached response is
    1359 available. Syntax-based normalization includes such techniques as case
    1360 normalization, character normalization, percent-encoding
    1361 normalization, and removal of dot-segments.</t>
    1363 <section title="Case Normalization">
    1365 <t>For all IRIs, the hexadecimal digits within a percent-encoding
    1366 triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
    1367 should be normalized to use uppercase letters for the digits A-F.</t>
    1369 <t>When an IRI uses components of the generic syntax, the component
    1370 syntax equivalence rules always apply; namely, that the scheme and
    1371 US-ASCII only host are case insensitive and therefore should be
    1372 normalized to lowercase. For example, the URI
    1373 "HTTP://" is equivalent to
    1374 "". Case equivalence for non-ASCII characters
    1375 in IRI components that are IDNs are discussed in <xref
    1376 target="schemecomp"></xref>.  The other generic syntax components are
    1377 assumed to be case sensitive unless specifically defined otherwise by
    1378 the scheme.</t>
    1380 <t>Creating schemes that allow case-insensitive syntax components
    1381 containing non-ASCII characters should be avoided. Case normalization
    1382 of non-ASCII characters can be culturally dependent and is always a
    1383 complex operation. The only exception concerns non-ASCII host names
    1384 for which the character normalization includes a mapping step derived
    1385 from case folding.</t>
    1387 </section> <!-- casenorm -->
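The case rules above could be sketched as follows. This is an assumption-laden illustration, not part of the draft: it lowercases the scheme, lowercases the host only when the host is US-ASCII, and uppercases the hexadecimal digits of percent-encoding triplets. The helper name `normalize_case` and the `example.com` hosts are hypothetical.

```python
import re

_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")
# Scheme, optionally followed by "//" and an authority component.
_SCHEME_AND_AUTHORITY = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):(?://([^/?#]*))?")


def normalize_case(iri: str) -> str:
    m = _SCHEME_AND_AUTHORITY.match(iri)
    if m is not None:
        scheme = m.group(1).lower()
        authority = m.group(2)
        rest = iri[m.end():]
        if authority is None:
            iri = scheme + ":" + rest
        else:
            # Keep userinfo untouched; it may be case sensitive.
            userinfo, at, hostport = authority.rpartition("@")
            if hostport.isascii():
                # Only US-ASCII hosts are lowercased; the port is digits only.
                hostport = hostport.lower()
            iri = scheme + "://" + userinfo + at + hostport + rest
    # Uppercase hex digits last so that host lowercasing cannot undo it.
    return _PCT_TRIPLET.sub(lambda t: t.group(0).upper(), iri)
```

Path, query, and fragment are left as they are, since those components are assumed to be case sensitive unless the scheme says otherwise.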
    1389 <section title="Character Normalization" anchor="normalization">
    1391 <t>The Unicode Standard <xref target="UNIV6"></xref> defines various
    1392 equivalences between sequences of characters for various
    1393 purposes. Unicode Standard Annex #15 <xref target="UTR15"></xref>
    1394 defines various Normalization Forms for these equivalences, in
    1395 particular Normalization Form C (NFC, Canonical Decomposition,
    1396 followed by Canonical Composition) and Normalization Form KC (NFKC,
    1397 Compatibility Decomposition, followed by Canonical Composition).</t>
    1399 <t> IRIs already in Unicode MUST NOT be normalized before parsing or
    1400 interpreting. In many non-Unicode character encodings, some text
    1401 cannot be represented directly. For example, the word "Vietnam" is
    1402 natively written "Vi&amp;#x1EC7;t Nam" (containing a LATIN SMALL
    1403 LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct
    1404 transcoding from the windows-1258 character encoding leads to
    1405 "Vi&amp;#xEA;&amp;#x323;t Nam" (containing a LATIN SMALL LETTER E WITH
    1406 CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct transcoding of
    1407 other 8-bit encodings of Vietnamese may lead to other
    1408 representations.</t>
    1410 <t>Equivalence of IRIs MUST rely on the assumption that IRIs are
    1411 appropriately pre-character-normalized rather than apply character
    1412 normalization when comparing two IRIs. The exceptions are conversion
    1413 from a non-digital form, and conversion from a non-UCS-based character
    1414 encoding to a UCS-based character encoding. In these cases, NFC or a
    1415 normalizing transcoder using NFC MUST be used for interoperability. To
    1416 avoid false negatives and problems with transcoding, IRIs SHOULD be
    1417 created by using NFC. Using NFKC may avoid even more problems; for
    1418 example, by choosing half-width Latin letters instead of full-width
    1419 ones, and full-width instead of half-width Katakana.</t>
    1422 <t>As an example,
    1423 "r&amp;#xE9;sum&amp;#xE9;.html" (in XML
    1424 Notation) is in NFC. On the other hand,
    1425 "re&amp;#x301;sume&amp;#x301;.html" is not in
    1426 NFC.</t>
    1428 <t>The former uses precombined e-acute characters, and the latter uses
    1429 "e" characters followed by combining acute accents. Both usages are
    1430 defined as canonically equivalent in <xref target="UNIV6"></xref>.</t>
    1432 <t><list style="hanging">
    1434 <t hangText="Note:">
    1435 Because it is unknown how a particular sequence of characters is being
    1436 treated with respect to character normalization, it would be
    1437 inappropriate to allow third parties to normalize an IRI
    1438 arbitrarily. This does not contradict the recommendation that when a
    1439 resource is created, its IRI should be as character normalized as
    1440 possible (i.e., NFC or even NFKC). This is similar to the
    1441 uppercase/lowercase problems.  Some parts of a URI are case
    1442 insensitive (for example, the domain name). For others, it is unclear
    1443 whether they are case sensitive, case insensitive, or something in
    1444 between (e.g., case sensitive, but with a multiple choice selection if
    1445 the wrong case is used, instead of a direct negative result).  The
    1446 best recipe is that the creator use a reasonable capitalization and,
    1447 when transferring the URI, capitalization never be
    1448 changed.</t></list></t>
    1450 <t>Various IRI schemes may allow the usage of Internationalized Domain
    1451 Names (IDN) <xref target="RFC5890"/> either in the ireg-name
    1452 part or elsewhere. Character Normalization also applies to IDNs, as
    1453 discussed in <xref target="schemecomp"/>.</t>
    1454 </section> <!-- charnorm -->
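Whether a string is already in NFC can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only; the helper name `is_nfc` is hypothetical, and the sample strings mirror the precomposed and decomposed spellings discussed in this section.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if the text is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", text) == text


# Precomposed e-acute (U+00E9) versus "e" followed by COMBINING ACUTE ACCENT (U+0301).
precomposed = "r\u00e9sum\u00e9.html"
decomposed = "re\u0301sume\u0301.html"

# A direct transcoding from windows-1258 yields e-circumflex plus COMBINING DOT BELOW.
transcoded = "Vi\u00ea\u0323t Nam"
vietnam_nfc = unicodedata.normalize("NFC", transcoded)
```

A check like `is_nfc` suits the point where an IRI is created or transcoded. It is not meant to be applied by third parties to IRIs that are already in Unicode.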
    1456 <section title="Percent-Encoding Normalization">
    1458 <t>The percent-encoding mechanism (Section 2.1 of <xref
    1459 target="RFC3986"></xref>) is a frequent source of variance among
    1460 otherwise identical IRIs. In addition to the case normalization issue
    1461 noted above, some IRI producers percent-encode octets that do not
    1462 require percent-encoding, resulting in IRIs that are equivalent to
    1463 their nonencoded counterparts. These IRIs should be normalized by
    1464 decoding any percent-encoded octet sequence that corresponds to an
    1465 unreserved character, as described in section 2.3 of <xref
    1466 target="RFC3986"></xref>.</t>
    1468 <t>For actual resolution, differences in percent-encoding (except for
    1469 the percent-encoding of reserved characters) MUST always result in the
    1470 same resource.  For example, "",
    1471 "", and "", must
    1472 resolve to the same resource.</t>
    1474 <t>If this kind of equivalence is to be tested, the percent-encoding
    1475 of both IRIs to be compared has to be aligned; for example, by
    1476 converting both IRIs to URIs (see Section 3.1), eliminating escape
    1477 differences in the resulting URIs, and making sure that the case of
    1478 the hexadecimal characters in the percent-encoding is always the same
    1479 (preferably upper case). If the IRI is to be passed to another
    1480 application or used further in some other way, its original form MUST
    1481 be preserved.  The conversion described here should be performed only
    1482 for local comparison.</t>
    1484 </section> <!-- pctnorm -->
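A hedged sketch of percent-encoding normalization for local comparison. It decodes only triplets that correspond to the US-ASCII unreserved characters of RFC 3986 Section 2.3 and uppercases the hex digits of every other triplet. It does not attempt to decode percent-encoded UTF-8 sequences for non-ASCII characters. The function name is hypothetical.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986, Section 2.3)
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(iri: str) -> str:
    def replace(match: "re.Match[str]") -> str:
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            # The encoding was unnecessary: use the literal character.
            return char
        # Keep the triplet but normalize its hex digits to uppercase.
        return match.group(0).upper()

    return _PCT_TRIPLET.sub(replace, iri)
```

The result is meant for comparison only. The original string should be kept if the IRI is passed on to another application.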
    1486 <section title="Path Segment Normalization">
    1488 <t>The complete path segments "." and ".." are intended only for use
    1489 within relative references (Section 4.1 of <xref
    1490 target="RFC3986"></xref>) and are removed as part of the reference
    1491 resolution process (Section 5.2 of <xref target="RFC3986"></xref>).
    1492 However, some implementations may incorrectly assume that reference
    1493 resolution is not necessary when the reference is already an IRI, and
    1494 thus fail to remove dot-segments when they occur in non-relative
    1495 paths.  IRI normalizers should remove dot-segments by applying the
    1496 remove_dot_segments algorithm to the path, as described in Section
    1497 5.2.4 of <xref target="RFC3986"></xref>.</t>
    1499 </section> <!-- pathnorm -->
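The remove_dot_segments algorithm of RFC 3986 Section 5.2.4 can be transcribed into Python roughly as follows. This is a sketch that operates on the path component only; splitting the IRI into components is assumed to have happened already.

```python
def remove_dot_segments(path: str) -> str:
    output = []  # completed segments, each with its leading "/" if any
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()  # ".." removes the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

Applied to the path of the second example IRI in the Syntax-Based Normalization figure, `/./b/../b/%63/%7bfoo%7d/ros%C3%A9`, this yields `/b/%63/%7bfoo%7d/ros%C3%A9`.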
    1500 </section> <!-- ladder -->
    1502 <section title="Scheme-Based Normalization" anchor="schemecomp">
    1504 <t>The syntax and semantics of IRIs vary from scheme to scheme, as
    1505 described by the defining specification for each
    1506 scheme. Implementations may use scheme-specific rules, at further
    1507 processing cost, to reduce the probability of false negatives. For
    1508 example, because the "http" scheme makes use of an authority
    1509 component, has a default port of "80", and defines an empty path to be
    1510 equivalent to "/", the following four IRIs are equivalent:</t>
    1512 <figure><artwork>
    1518 <t>In general, an IRI that uses the generic syntax for authority with
    1519 an empty path should be normalized to a path of "/". Likewise, an
    1520 explicit ":port", for which the port is empty or the default for the
    1521 scheme, is equivalent to one where the port and its ":" delimiter are
    1522 elided and thus should be removed by scheme-based normalization. For
    1523 example, the second IRI above is the normal form for the "http"
    1524 scheme.</t>
    1526 <t>Another case where normalization varies by scheme is in the
    1527 handling of an empty authority component or empty host
    1528 subcomponent. For many scheme specifications, an empty authority or
    1529 host is considered an error; for others, it is considered equivalent
    1530 to "localhost" or the end-user's host. When a scheme defines a default
    1531 for authority and an IRI reference to that default is desired, the
    1532 reference should be normalized to an empty authority for the sake of
    1533 uniformity, brevity, and internationalization. If, however, either the
    1534 userinfo or port subcomponents are non-empty, then the host should be
    1535 given explicitly even if it matches the default.</t>
    1537 <t>Normalization should not remove delimiters when their associated
    1538 component is empty unless it is licensed to do so by the scheme
    1539 specification. For example, the IRI "" cannot be
    1540 assumed to be equivalent to any of the examples above. Likewise, the
    1541 presence or absence of delimiters within a userinfo subcomponent is
    1542 usually significant to its interpretation.  The fragment component is
    1543 not subject to any scheme-based normalization; thus, two IRIs that
    1544 differ only by the suffix "#" are considered different regardless of
    1545 the scheme.</t>
    1547 <t>Some IRI schemes allow the usage of Internationalized Domain
    1548 Names (IDN) <xref target='RFC5890'></xref> either in their ireg-name
    1549 part or elsewhere. When in use in IRIs, those names SHOULD
    1550 conform to the definition of U-Label in <xref
    1551 target='RFC5890'></xref>. An IRI containing an invalid IDN cannot
    1552 successfully be resolved. For legibility purposes, they
    1553 SHOULD NOT be converted into ASCII Compatible Encoding (ACE).</t>
    1555 <t>Scheme-based normalization may also consider IDN
    1556 components and their conversions to punycode as equivalent. As an
    1557 example, "http://r&amp;#xE9;sum&amp;#xE9;" may be
    1558 considered equivalent to
    1559 "".</t><t>Other scheme-specific
    1560 normalizations are possible.</t>
    1562 </section> <!-- schemenorm -->
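For the "http" rules described in this section (default port "80", empty path equivalent to "/"), a scheme-based normalizer might look like the sketch below. It relies on `urllib.parse` from the Python standard library. The function name `normalize_http` and the `example.com` IRIs are hypothetical, and IDN handling is deliberately left out.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_http(iri: str) -> str:
    parts = urlsplit(iri)
    if parts.scheme != "http":  # urlsplit already lowercases the scheme
        return iri
    netloc = parts.netloc
    host, colon, port = netloc.rpartition(":")
    if colon and port in ("", "80"):
        # An empty or default port is dropped together with its ":" delimiter.
        netloc = host
    # An empty path with an authority is normalized to "/".
    path = parts.path or "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


equivalent_forms = [
    "http://example.com",
    "http://example.com/",
    "http://example.com:/",
    "http://example.com:80/",
]
normalized = {normalize_http(form) for form in equivalent_forms}
```

Under these assumptions all four forms normalize to `http://example.com/`, while a non-default port such as `:8080` is preserved.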
    1564 <section title="Protocol-Based Normalization">
    1566 <t>Substantial effort to reduce the incidence of false negatives is
    1567 often cost-effective for web spiders. Consequently, they implement
    1568 even more aggressive techniques in IRI comparison. For example, if
    1569 they observe that an IRI such as</t>
    1571 <figure><artwork>
    1573 <t>redirects to an IRI differing only in the trailing slash</t>
    1574 <figure><artwork>
    1577 <t>they will likely regard the two as equivalent in the future.  This
    1578 kind of technique is only appropriate when equivalence is clearly
    1579 indicated by both the result of accessing the resources and the common
    1580 conventions of their scheme's dereference algorithm (in this case, use
    1581 of redirection by HTTP origin servers to avoid problems with relative
    1582 references).</t>
    1584 </section> <!-- protonorm -->
    1585 </section> <!-- equivalence -->
    1586 </section>
    15881183<section title="Use of IRIs" anchor="IRIuse">