Opened 10 years ago

Last modified 9 years ago

#54 reopened technical

ITR failures and implications on connectivity restoration time (review by Y. Rekhter)

Reported by: wmhaddad@… Owned by:
Priority: major Component: draft-ietf-lisp
Severity: - Keywords:
Cc:

Description

A scenario where an ITR fails would result in shifting a large fraction, or all, of the flows that used to go through one ITR to another ITR. Note that in this scenario the arrival rate
of the flows at the new ITR is determined by the total number of
flows being shifted (e.g., in the case of an ITR failure it is the
number of flows that used to go through the old ITR), and not by
the steady state arrival rate of flows. Therefore, in this scenario
the rate with which the new ITR has to resolve EID-RLOC mapping is
determined not by the arrival rate of flows in the steady state,
but by the total number of flows being moved to the new ITR, and
the latter may be significantly higher than the former.
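
To make the comparison above concrete, here is a minimal sketch. It assumes the worst case in which every flow needs its own EID-RLOC lookup; the function name and all numbers are illustrative assumptions, not measurements and not part of the draft.

```python
def lookups_in_window(steady_new_flows_per_s, shifted_flows, window_s):
    """Return (steady_state, after_failover) mapping lookups needed in the window.

    Assumption: each flow triggers one EID-RLOC lookup (no shared prefixes,
    nothing already in the map-cache of the new ITR).
    """
    steady_state = steady_new_flows_per_s * window_s
    # After a failover, the shifted flows arrive on top of the steady-state load.
    after_failover = steady_state + shifted_flows
    return steady_state, after_failover


# Illustrative numbers only (assumptions).
steady, failover = lookups_in_window(
    steady_new_flows_per_s=100, shifted_flows=50_000, window_s=1
)
print(steady, failover)
```

With these assumed inputs the lookup load in the first second after the failure is dominated by the number of shifted flows rather than by the steady-state arrival rate.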

Even if we assume that all flows are TCP-based (no UDP-based flows
at all), then while the arrival rate of the first few packets at
the beginning of such flows is determined by the SYN retransmission
time (and thus could be fairly low), this is not the case for the
arrival rate of the packets in the middle of such flows, which could
be Mb/sec, or even Gb/sec. Thus the amount of dropped packets (if
ITR drops packets while resolving EID-RLOC mapping) or buffered
packets (if ITR buffers packets while resolving EID-RLOC mapping)
in a situation where a failure results in a whole bunch of existing
flows being rerouted to a new ITR is going to be much higher than
in the steady state. Note that none of this occurs with today's
routing. That means that in the presence of ITR failure LISP would
negatively impact connectivity restoration time relative to what
we have today, as well as the service availability.

An approach in which an implementation queues packets instead of
dropping them would require additional memory on the router,
which at the minimum would increase the cost (and therefore the
price) of the router. Moreover, in the scenario where a failure of
one ITR would result in shifting all the flows that used to go
through that ITR to another ITR, if the number of flows is large
and/or the flows are of high bandwidth, it may be infeasible to queue
all these packets on the new ITR.
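
A rough sizing sketch for the queueing case follows. It assumes every shifted flow keeps sending at a constant average rate while its mapping is being resolved, and that all mappings take the same time to resolve; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def buffer_bytes_needed(shifted_flows, avg_flow_rate_bps, resolution_time_ms):
    """Bytes the new ITR would have to queue while mappings are resolved.

    Assumptions: constant per-flow rate in bits per second, identical
    resolution time for every flow, and no packets dropped.
    """
    total_bits = shifted_flows * avg_flow_rate_bps * resolution_time_ms // 1000
    return total_bits // 8


# Illustrative numbers only: 50,000 flows at 1 Mb/s each, 100 ms to resolve.
needed = buffer_bytes_needed(
    shifted_flows=50_000, avg_flow_rate_bps=1_000_000, resolution_time_ms=100
)
print(needed)
```

Under these assumed inputs the estimate is 625,000,000 bytes, and it grows linearly with the number of flows, the per-flow rate, and the resolution time.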

One cannot claim that the problem of how to handle data while
resolving EID-RLOC mapping is a local problem that can be solved by
local incremental fixes, as there are no practical solutions to
this problem on the table at all. Since this
problem is likely to manifest itself most visibly at large scale
deployment, it would be inappropriate to wait with solving this
problem until large scale deployment - this problem needs to be
solved before any large scale deployment of LISP, or any large scale
deployment of LISP should be put on hold until a practical solution
to this problem is developed.

Change History (3)

comment:1 Changed 9 years ago by luigi@…

  • Resolution set to fixed
  • Status changed from new to resolved

The issue has been fixed in version -12 of the draft as described in section B.1:

  • Tracker item 54. Added text to the new section titled "Packets Egressing a LISP Site" to describe the implications when two or more ITRs exist at a site where only one ITR is used for egress traffic and when there is a shift of traffic to the others, how the map-cache will need to be populated in those new egress ITRs.

comment:2 Changed 9 years ago by luigi@…

  • Status changed from resolved to closed

comment:3 Changed 9 years ago by yakov@…

  • Resolution fixed deleted
  • Status changed from closed to reopened

I disagree with the claim that the issue raised in this ticket "is not anticipated ... be a problem". To the contrary, ITR failure could negatively impact connectivity restoration time relative to what we have today, as well as the service availability. Thus I propose to replace

While this is not anticipated this will be a problem, the deployment
and experimentation will determine if there is an issue requiring
more attention.

with the following:

ITR failure could negatively impact connectivity restoration time, as well as the
service availability relative to what we have with the current routing system.
Procedures for addressing this are for further study.
