Opened 3 years ago

Last modified 3 years ago

#22 new defect

Deployment feasibility

Reported by: wes@… Owned by: draft-ietf-tsvwg-l4s-arch@…
Priority: major Milestone: L4S Suite - WGLC Preparation
Component: l4s-arch Version:
Severity: - Keywords:
Cc:

Description (last modified by wes@…)

There are general concerns with deployment feasibility.

Incremental deployment possibilities are not clear.

This was discussed at IETF 105 (issue G). Jonathan Morton specifically mentioned it as a concern at the open microphone.

This is related to (at least) issues A, B, and H.

It might be helpful to hear more about this from operators, vendors, etc. who have interest and/or plans for L4S.

Change History (5)

comment:1 Changed 3 years ago by wes@…

  • Description modified (diff)

comment:2 Changed 3 years ago by chromatix99@…

Resolution of this issue is dependent on resolving issues #16, #20, #21 at minimum.

comment:3 Changed 3 years ago by ietf@…

Issue #16 is the only identified incremental deployment issue. The phrase "general concerns with deployment feasibility" adds nothing. Unless these other concerns are articulated, this issue needs to be closed.

Similarly, I have just added comments to issues #20 and #21 that they are also redundant given issue #16.

Please, let's focus on the real issue (#16).

comment:4 Changed 3 years ago by jholland@…

Bob has pointed out that the semantic change to CE can cause reordering of CE packets in a dualq, and also that reordering for non-RACK stacks causes problems because of the dupacks. This seems like it could have a bearing on the incremental deployability question until RACK is more ubiquitous. (Consider especially an upload from a dualq-fq_codel-home topology with an asymmetric link.)

I don't recall seeing test results that examine the issue or try to determine the scope of the problem. Were there any?

comment:5 Changed 3 years ago by moeller0@…

NOTE: this has been moved to https://trac.ietf.org/trac/tsvwg/ticket/28#ticket as a separate issue. I will leave this here just to maintain history, but I would appreciate it if any further discussion could happen at #28, thanks.

IMHO the fact that the L4S reference scheduler does not manage to guarantee robust equal sharing between L4S' two traffic categories/queues is a deployment issue.

Both the L4S team and the SCE team have independently confirmed that at low RTTs the dual queue coupled AQM fails to share equitably between the two queues, and instead gives a clear latency and bandwidth advantage to L4S traffic at the expense of traffic in the non-L4S queue.*

L4S team data: https://l4s.cablelabs.com/l4s-testing/key_plots/batch-l4s-s1-2-cubic-vs-prague-50Mbit-0ms_var.png (ratio L4S/non-L4S ~ 7:1)

SCE team data: https://www.heistp.net/downloads/sce-l4s-bakeoff/bakeoff-2019-11-11T090559-r2/l4s-s1-2/batch-l4s-s1-2-cubic-vs-prague-50Mbit-0ms_fixed.png (ratio L4S/non-L4S ~ 7:1)
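To put the ~7:1 ratio in absolute terms, here is a minimal sketch of the arithmetic. It assumes that the "50Mbit" in the plot file names is the bottleneck rate and that the two flows together fill the link completely; the function name split_capacity is purely illustrative and not taken from either test suite.

```python
def split_capacity(link_mbit, ratio_l4s_to_classic):
    """Split a fully used link between two flows, given their rate ratio.

    Assumption: the two flows together use exactly the link capacity.
    """
    classic = link_mbit / (ratio_l4s_to_classic + 1.0)
    l4s = link_mbit - classic
    return l4s, classic


# Assumed inputs: 50 Mbit/s bottleneck, L4S/non-L4S ratio of roughly 7:1.
l4s, classic = split_capacity(50.0, 7.0)
print(f"L4S flow:     {l4s:.2f} Mbit/s ({100 * l4s / 50.0:.1f}% of the link)")
print(f"non-L4S flow: {classic:.2f} Mbit/s ({100 * classic / 50.0:.1f}% of the link)")
```

Under these assumptions the non-L4S flow gets about 6.25 Mbit/s, against the 25 Mbit/s an equal split would give it.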

This issue is especially relevant given the internet-wide scope of the desired dual queue AQM roll-out, which means that affected end-users cannot solve it by unilaterally choosing not to use L4S/ECT(1) flows, and given that end-user to CDN RTTs are getting shorter and shorter.

I would like to see this addressed in the L4S arch draft and in the dualq draft and implementation, please. Either this failure is accepted as a bug and hopefully fixed soon, or the drafts need to make explicitly clear that L4S/dualq do not aim for equitable sharing (which IMHO should pretty much rule out deploying dualq into the wider internet). As it stands, the draft text might be technically correct but it is also misleading: it gives the impression that equitable sharing is one of the goals, while hedging with "under roughly equal conditions" (without mentioning that dualq will make sure that (minimum RTT) L4S and non-L4S traffic will never see roughly equal conditions).

Addendum: this RTT-dependent failure to share equitably between the two queues is also documented in http://bobbriscoe.net/projects/latency/dctth_journal_draft20190726.pdf figure 8, in the "Normalized rate per flow" panels. Note how DUALPI2 (DCTCP+Cubic) does a considerably worse job than FQCODEL (DCTCP+Cubic) across a wide range of RTT differences; for my point, comparing the data points where both traffic types have a 5ms RTT is sufficient. Although this paper uses DCTCP (which is not in scope for internet-wide roll-out), the issue really is independent of the precise flows in the two queues, as it is the job of the dual queue system to properly share bandwidth even when adversarial/non-responsive flows enter the mix (and all examples given already show a catastrophic failure with responsive non-adversarial traffic). According to members of the L4S team this failure has long been known, but I reject the notion that a long documented bug is a "feature" and hence would like to hear plans for how to address the issue inside L4S. (Expecting all TCPs to be exchanged to fix this issue is not an option, especially not for moving this experiment into the wider internet, where basically 100% of TCP endpoints will not be aware of L4S.)

*) This issue is also not helped by the default choice of 1 ms and 15 ms AQM target delays (the acceptable standing queue). Theory predicts that for 1/sqrt(p) traffic at the targeted ~100ms internet-scale RTT a target of 5ms will be sufficient, and for the considerably shorter RTTs in the situation that highlights dualq's failure the 1/sqrt(p) target should be well below 1ms.
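As a rough illustration of that footnote, the sketch below scales the target delay in proportion to RTT, anchored at the 5ms target for a 100ms RTT mentioned above. The linear scaling is my own simplifying assumption, not something stated in the drafts, and the function and constant names are made up for this example.

```python
# Anchor point taken from the footnote: 5 ms target at ~100 ms RTT.
REFERENCE_RTT_MS = 100.0
REFERENCE_TARGET_MS = 5.0


def scaled_target_ms(rtt_ms):
    """Target delay for a given RTT, assuming the target scales linearly with RTT.

    Assumption: linear scaling is a simplification chosen for illustration only.
    """
    return REFERENCE_TARGET_MS * rtt_ms / REFERENCE_RTT_MS


for rtt_ms in (100.0, 20.0, 10.0, 5.0):
    print(f"RTT {rtt_ms:5.1f} ms -> target {scaled_target_ms(rtt_ms):.2f} ms")
```

Under that assumption, any RTT below 20ms calls for a target below 1ms, which is far from the 15 ms default applied to the non-L4S queue.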

Last edited 3 years ago by moeller0@…