Opened 2 years ago

#28 new defect

the dual queue coupled AQM implementation or design is not fit for its intended purpose

Reported by: moeller0@… Owned by: draft-ietf-tsvwg-aqm-dualq-coupled@…
Priority: major Milestone:
Component: aqm-dualq-coupled Version:
Severity: - Keywords: dualq L4S


The current L4S reference scheduler, the dual queue coupled AQM, as implemented and tested, does not seem to guarantee robust equal sharing between L4S's two traffic categories/queues.

Both the L4S team and the SCE team have independently confirmed that at low RTTs the dual queue coupled AQM fails to share equitably between the two queues, and instead gives a clear latency and bandwidth advantage to L4S traffic at the expense of traffic in the non-L4S queue.*

L4S team data: ratio L4S/non-L4S ~ 7:1
SCE team data: ratio L4S/non-L4S ~ 7:1
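To put the ~7:1 figure in perspective, here is a minimal sketch that turns a rate ratio into per-queue bandwidth shares and a fairness score. The helper names are mine, and Jain's fairness index is my choice of metric for illustration; neither team's report necessarily uses it.

```python
def shares(l4s_rate, classic_rate):
    """Return the fraction of total throughput each queue receives."""
    total = l4s_rate + classic_rate
    return l4s_rate / total, classic_rate / total


def jain_index(rates):
    """Jain's fairness index: 1.0 is perfectly equal, 1/n is maximally unfair."""
    n = len(rates)
    return sum(rates) ** 2 / (n * sum(r * r for r in rates))


if __name__ == "__main__":
    # The 7:1 ratio is taken from the reports cited in this ticket.
    l4s_share, classic_share = shares(7.0, 1.0)
    print(f"L4S share: {l4s_share:.1%}, non-L4S share: {classic_share:.1%}")
    print(f"Jain index at 7:1: {jain_index([7.0, 1.0]):.2f}")
    print(f"Jain index at 1:1: {jain_index([1.0, 1.0]):.2f}")
```

With a 7:1 ratio the non-L4S queue is left with one eighth of the link, and the index drops from 1.0 to 0.64.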

This issue is especially relevant given the internet-wide scope of the desired dual queue AQM roll-out, which means affected end-users cannot solve it unilaterally by simply not using L4S/ECT(1) flows, and given that end-user to CDN RTTs are getting shorter and shorter.

I would like to see this addressed in the L4S arch draft and in the dualq draft and implementation, please. Either this failure is accepted as a bug and hopefully fixed soon, or the drafts need to make explicitly clear that L4S/dualq do not aim for equitable sharing (which IMHO should pretty much rule out deploying dualq in the wider internet). As it stands, the draft text might be technically correct but is also misleading: it gives the impression that equitable sharing is one of the goals, while hedging with "under roughly equal conditions" (without mentioning that dualq will make sure that (minimum RTT) L4S and non-L4S traffic will never see roughly equal conditions).

This RTT-dependent failure to share equitably between the two queues is also documented in figure 8, in the "Normalized rate per flow" panels. Note how DUALPI2 (DCTCP+Cubic) does a considerably worse job than FQCODEL (DCTCP+Cubic) across a wide range of RTT differences; for my point, though, comparing the data points where both traffic types have a 5ms RTT is sufficient. I note that this paper uses DCTCP (although that is not in scope for internet-wide roll-out), but the issue really is independent of the precise flows in the two queues, as it is the job of the dual queue system to share bandwidth properly even when adversarial/non-responsive flows enter the mix (and all examples given already show a catastrophic failure with responsive, non-adversarial traffic).

I also question the proposal to describe this AQM failure as a consequence of the lack of RTT-independence of the flows in at least the L4S queue, and to work around the issue by increasing the RTT-independence of traffic qualifying for the L4S queue (especially given that the L4S team seems to reject any notion that the L4S queue should be admission controlled/policed based on whether flows in the queue behave according to the L4S traffic requirements). I believe it to be self-evident that a traffic-class isolator that will only isolate well-behaved traffic is a waste of time and effort. I therefore argue for changing the L4S requirements for L4S-compliant AQMs so that such an AQM MUST assure equitable sharing at least between the two traffic classes, or if desired between individual flows, and I argue for fixing the dual queue coupled AQM to do exactly that.

According to members of the L4S team this failure has long been known, but I reject the notion that a long-documented bug is a "feature" and hence would like to hear plans for how to address the issue inside L4S (that is, expecting all TCPs to be replaced to fix this issue is not an option, especially not for moving this experiment into the wider internet, where basically 100% of TCP endpoints will be unaware of L4S).

This point was initially raised as a comment on issue #22, but it really is a separate issue which IMHO affects the deployment feasibility but is not about deployment feasibility per se; hence this new issue, to focus discussion on the problem as well as on remedies.

*) This issue is also not helped by the default choice of 1 ms and 15 ms AQM target delays (acceptable standing queue), since theory predicts that for the targeted ~100ms internet-scale RTT a target of 5ms will be sufficient for 1/sqrt(p) traffic, and for the considerably shorter RTTs in the situation that highlights dualq's failure the 1/sqrt(p) target should be well below 1ms.
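As a back-of-the-envelope illustration of the footnote above, the sketch below scales the target delay linearly with RTT, anchored at the 5ms-for-100ms pair quoted there. The linear scaling rule and the names used are my assumptions for illustration, not something taken from the drafts or the dualq implementation.

```python
# Anchor point quoted in the footnote: ~100 ms RTT -> 5 ms target.
REFERENCE_RTT_MS = 100.0
REFERENCE_TARGET_MS = 5.0


def scaled_target_ms(rtt_ms):
    """Target delay for a given RTT, ASSUMING it scales linearly with RTT."""
    return REFERENCE_TARGET_MS * rtt_ms / REFERENCE_RTT_MS


if __name__ == "__main__":
    # 5 ms is the RTT of the data points highlighted in this ticket;
    # 20 ms is just an arbitrary intermediate value.
    for rtt in (100.0, 20.0, 5.0):
        print(f"RTT {rtt:5.1f} ms -> target {scaled_target_ms(rtt):.2f} ms")
```

Under this assumption a 5ms RTT calls for a target of about 0.25ms, well below the default 1ms target and far below the 15ms one.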

