Opened 9 months ago

Last modified 7 months ago

#634 under_review defect

sub/sup regression in 3.7.0 TXT

Reported by: cabo@tzi.org
Owned by:
Priority: medium
Milestone:
Component: Version_3_cli_txt
Version:
Keywords:
Cc: lars@eggert.org, martin.thomson@gmail.com

Description

Foo<sub>bar</sub>baz

renders correctly in HTML

had a recognizable surrogate in TXT in 3.5.0: Foo_(bar)baz

is munched up in TXT in 3.7.0: Foo_barbaz

Change History (14)

comment:1 Changed 9 months ago by cabo@tzi.org

RFC 8949 says:

   superscript notation denotes exponentiation.  For example, 2 to the
   power of 64 is notated: 2^(64).  In the plain-text version of this
   specification, superscript notation is not available and therefore is
   rendered by a surrogate notation.  That notation is not optimized for
   this RFC; it is unfortunately ambiguous with C's exclusive-or (which
   is only used in the appendices, which in turn do not use
   exponentiation) and requires circumspection from the reader of the
   plain-text version.

which would now change in a re-rendering.

E.g.,

   *  an integer in the range -2^(64)..2^(64)-1 inclusive

would become

   *  an integer in the range -2^64..2^64-1 inclusive

which may or may not mean the same thing to readers.

comment:2 Changed 8 months ago by lars@eggert.org

  • Cc lars@eggert.org added

comment:3 Changed 8 months ago by cabo@tzi.org

  • Component changed from v3 vocabulary to Version_3_cli_txt

(fixed component not to be vocabulary, which doesn't seem to get worked on)

comment:4 Changed 8 months ago by rjsparks@nostrum.com

  • Cc martin.thomson@gmail.com added
  • Status changed from new to under_review

Can you provide a real example where this has been a problem? This is a result of a requested change, driven by Martin and Lars, which was accepted by the CMT to simplify the text rendering. See #590. The reaction to this change has been positive.

This may be an edge case for which we need to consider creating different behavior (when text immediately follows the sub).

comment:5 Changed 8 months ago by cabo@tzi.org

The problem is that we have a canonical form that makes the structure perfectly clear, and an HTML/PDF rendering that is not much worse.

So the TXT form is always going to be an afterthought, and each time you tweak it in one direction, it gets worse for something else.

Changing the canonical form/HTML/PDF to fix the TXT is a non-starter.
The only real solution will be adding a capability for the author to tweak the .TXT. That is against current RFCXML ideology.

comment:6 Changed 8 months ago by martin.thomson@gmail.com

That ideology has not been strictly adhered to. Why bother pretending that it needs to be? There are a few other places where XML contains instructions that are only executed for the text rendering. The same can apply here. <sup paren="true">...</sup> or equivalent seems totally inoffensive next to <ul indent="5">.

comment:7 Changed 8 months ago by cabo@tzi.org

Nice. Maybe more like <sup txtl="**" txtr=""> or <sup txtl="^(" txtr=")"> -- better give the author full control over the weirdness they need. Default could stay "^" and "" (new) or "^(" and ")" (old).

comment:8 Changed 8 months ago by martin.thomson@gmail.com

I would not include the caret in txtl, or move it to a different attribute (with a default value).

Otherwise this seems like a good direction to me, though I caution that the defaults are not as simple as that. 2<sup>64</sup> renders as 2^64, but 2<sup>n+r</sup> renders as 2^(n+r). That means there can't be a fixed default for any attributes.
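
A minimal sketch of the content-dependent default described above. This is not the xml2rfc code: the function name render_sup and the rule "content made only of letters and digits is simple" are assumptions made for illustration.

```python
import re

# Assumption: "simple" content consists only of ASCII letters and digits.
_SIMPLE = re.compile(r"[A-Za-z0-9]+")


def render_sup(base: str, content: str) -> str:
    """Render base<sup>content</sup> as plain text (hypothetical helper)."""
    if _SIMPLE.fullmatch(content):
        # Simple content: no parentheses needed.
        return f"{base}^{content}"
    # Anything else (operators, spaces, ...) is wrapped in parentheses.
    return f"{base}^({content})"


print(render_sup("2", "64"))   # 2^64
print(render_sup("2", "n+r"))  # 2^(n+r)
```

Because the parentheses depend on the content, a single fixed default string for an attribute such as txtl or txtr cannot reproduce both outputs.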

comment:9 Changed 8 months ago by cabo@tzi.org

OK, <sup op="^" l="(" r=")">, where the default of op is "^" (for sup, and "_" for sub) and the default of l and r is #implied (i.e., to be computed in the preptool based on the complexity of the element content).
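
A rough sketch of how the proposed op/l/r attributes could be resolved, assuming None stands in for #implied. The function render_script and the complexity rule are hypothetical; only the attribute names and the "^" and "_" defaults come from the proposal above.

```python
import re
from typing import Optional

# Defaults for op as proposed above: "^" for sup, "_" for sub.
DEFAULT_OP = {"sup": "^", "sub": "_"}

# Assumption: content made only of letters and digits counts as simple.
_SIMPLE = re.compile(r"[A-Za-z0-9]+")


def render_script(tag: str, content: str,
                  op: Optional[str] = None,
                  l: Optional[str] = None,
                  r: Optional[str] = None) -> str:
    """Render a <sup>/<sub> element as text; None means '#implied'."""
    if op is None:
        op = DEFAULT_OP[tag]
    if l is None or r is None:
        # Compute the implied delimiters from the content complexity.
        simple = _SIMPLE.fullmatch(content) is not None
        implied_l, implied_r = ("", "") if simple else ("(", ")")
        l = implied_l if l is None else l
        r = implied_r if r is None else r
    return f"{op}{l}{content}{r}"


print("2" + render_script("sup", "64"))                       # 2^64
print("2" + render_script("sup", "64", l="(", r=")"))         # 2^(64)
print("2" + render_script("sup", "64", op="**", l="", r=""))  # 2**64
```

Explicit attribute values always win; the guess is only used for attributes the author left unset.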

comment:10 Changed 8 months ago by cabo@tzi.org

The brokenness of #590 is already being discussed there under "space".
It is not just the element content of the sub/sup that can cause a need for paren packing, it is also the incompatibility between the content and the right context.
That seems extremely simple to fix in the code that guesses the default values for l/r, without the (also desirable) vocabulary fixes that are being discussed here.
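
A sketch of what such a guess could look like when it also inspects the right context. The names needs_parens and render_sub are hypothetical, and the rule "parenthesize when the next character is a letter or digit" is an assumption, not the actual xml2rfc logic.

```python
import re

# Assumption: content made only of letters and digits counts as simple.
_SIMPLE = re.compile(r"[A-Za-z0-9]+")


def needs_parens(content: str, following: str) -> bool:
    """Decide whether a sub/sup needs parentheses in the text rendering."""
    if not _SIMPLE.fullmatch(content):
        # Complex content always gets parentheses.
        return True
    # Simple content still needs them if the text that follows would
    # otherwise run into it (e.g. "bar" directly followed by "baz").
    return bool(following) and following[0].isalnum()


def render_sub(base: str, content: str, following: str = "") -> str:
    """Render base<sub>content</sub>following as plain text."""
    l, r = ("(", ")") if needs_parens(content, following) else ("", "")
    return f"{base}_{l}{content}{r}{following}"


print(render_sub("Foo", "bar", "baz"))  # Foo_(bar)baz
print(render_sub("a", "b", " is"))      # a_b is
print(render_sub("x", "i+1"))           # x_(i+1)
```

With a rule of this kind, the example from the description keeps its 3.5.0 rendering while isolated simple subscripts stay unparenthesized.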

comment:11 Changed 8 months ago by rjsparks@nostrum.com

Per CMT discussion today, we'll pursue the simple paren="true" mod on the existing implementation, but not try to go down the path of finer-grained control.

comment:12 Changed 8 months ago by mahoney@nostrum.com

RFC 9043 (currently in AUTH48) would benefit from this paren="true" enhancement.

In RFC 9043, a_b is constructed as a<sub>b</sub> and represents the value of a sequence, whereas slice_x is a variable name. It is unclear in the text file which is a subscript and which is not. (I also mentioned this in ticket #574.)

https://www.rfc-editor.org/v3test/rfc9043.xml
https://www.rfc-editor.org/v3test/rfc9043.txt

comment:13 Changed 7 months ago by cabo@tzi.org

Re the CMT result (comment 11):

op="**" would have completely resolved the ugliness in RFC 8949.

I cannot agree with the decision not to provide that.

comment:14 Changed 7 months ago by cabo@tzi.org

Re RFC-to-be 9043:
You undo the paren regression by inserting a U+2009 (zero-width space) into the subscript.
No paren= needed.
