Over the past few years, Equal Cost Multi Path (ECMP) has seen a significant increase in popularity. More importantly, the move to spine-and-leaf designs in the data center relies heavily on ECMP for increased bandwidth. @mbushong earlier discussed the problem of dealing with aggregates, and ECMP is a tool of aggregates.

ECMP, as the name suggests, is a mechanism to distribute traffic across multiple paths of the same "network" cost. Its best-known use is at L3, using IP paths of equal cost towards a destination. The concept is simple: a router performs a routing table lookup on the destination IP address of an IP packet. The result of this lookup will be a next hop router or the destination itself. In the case of ECMP, the result of this lookup may be a set of routers, each one of them a legitimate equal cost next hop towards the destination. The router then performs a calculation on a part of the IP packet header to determine which of these multiple next hops to send the packet to. The result of the calculation is normalized to the number of next hops available, and the packet is forwarded to the router that the normalized result points to. This process is then repeated at every router along the way.

The key to ECMP is the calculation that determines which of the next hops to use. It is also its Achilles heel. ECMP assumes that the best utilization of multiple links towards a destination is the most even spread of flows across all of them. The most basic calculation takes the source and destination IP address from the packet, performs some hash calculation (very simple, because it needs to be performed for every packet at extremely high packet rates), normalizes the result to the number of next hops, then uses that next hop. This ensures that traffic traveling between routers (yes, old school style, I am going to insist on calling them routers) is balanced for each source and destination IP address pair. A fairly coarse definition of a flow.
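
To make the mechanics concrete, here is a minimal sketch in Python of the hash-and-normalize step described above. It is purely illustrative: the addresses, hash function, and next-hop list are invented, and real routers do this in ASIC hardware, not software.

```python
import zlib

def pick_next_hop(src_ip: str, dst_ip: str, next_hops: list) -> str:
    """Pick one of the equal-cost next hops for a packet.

    Hash the source/destination address pair (the coarse "flow"),
    then normalize the result to the number of available next hops.
    Every packet of the same pair ends up on the same next hop.
    """
    key = f"{src_ip}->{dst_ip}".encode()
    bucket = zlib.crc32(key) % len(next_hops)   # normalize to the number of next hops
    return next_hops[bucket]

# Four equal-cost next hops towards the same destination prefix
next_hops = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
print(pick_next_hop("192.0.2.10", "198.51.100.20", next_hops))
```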

It is easy to see what happens if there is a significant discrepancy in the amount of bandwidth used between certain source/destination pairs versus others. And this is where the first issue appears: an assumption of uniformity, or at least a normal distribution of bandwidth utilization across all ECMP flows. To improve the chances of a normal distribution, a more granular definition of a flow can be created by adding fields from the packet header to the calculation, most notably the source and destination TCP or UDP port numbers. After this, individual application flows between the same two devices are treated separately in ECMP path placement.
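
As a sketch of that finer-grained definition (same caveats as above: illustrative only, names and values invented), the hash key simply grows to the full five-tuple, so two TCP connections between the same pair of hosts can land on different links:

```python
import zlib

def pick_next_hop_5tuple(src_ip, dst_ip, protocol, src_port, dst_port, next_hops):
    """ECMP selection over the five-tuple instead of just the address pair."""
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

next_hops = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
# Two connections between the same hosts, differing only in source port,
# may now hash to different next hops:
print(pick_next_hop_5tuple("192.0.2.10", "198.51.100.20", "tcp", 49152, 80, next_hops))
print(pick_next_hop_5tuple("192.0.2.10", "198.51.100.20", "tcp", 49153, 80, next_hops))
```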

(In the above I am totally ignoring the possibility of sending traffic across all equal cost links in a packet-by-packet round-robin fashion. It creates a pretty good distribution, but many TCP implementations and applications are extremely sensitive to out-of-order packets, and as a result anything that can result in re-ordering of packets is a big no-no. Still. In 2013.)

All of this still assumes that the totality of the flows has a normal bandwidth distribution, so that every ECMP path gets about an even share of high-bandwidth and low-bandwidth flows, and everything in between. The increased use of network-based storage has dramatically increased the number of high-bandwidth flows. Mix those with flows that are typically short lived or low bandwidth (keepalives, SIP signaling, trade transactions, you name it) and these large and sometimes persistent flows (sometimes referred to as elephant flows) can become quite the boat anchor, stuck to specific links by the ECMP calculation.

Clever as we engineers are, we have found ways around this. We split big elephant flows up into smaller flows. We run them as parallel TCP connections in iSCSI. We implement proprietary mechanisms to exchange traffic between multiple source and destination addresses. We design extensions to tunnel protocols to create entropy, specifically so that the network can spread the traffic. Yes, we have taught the application and its underlying connection infrastructure that there is such a thing called ECMP. We have brought a specific limitation of the network into the servers and applications.
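
As an example of the entropy trick mentioned above (a sketch of the general idea, not any particular protocol's specification), an encapsulating endpoint can derive the outer UDP source port from a hash of the inner headers, so transit routers that only see the outer header still spread the inner flows:

```python
import zlib

def outer_udp_source_port(inner_src_ip, inner_dst_ip, inner_src_port, inner_dst_port):
    """Derive the outer (tunnel) UDP source port from the inner flow.

    Transit routers hash only the outer header. By folding the inner flow
    into the outer source port, different inner flows between the same two
    tunnel endpoints look different to ECMP and get spread across paths.
    """
    inner_key = f"{inner_src_ip}|{inner_dst_ip}|{inner_src_port}|{inner_dst_port}".encode()
    return 49152 + zlib.crc32(inner_key) % 16384   # keep it in the ephemeral port range

print(outer_udp_source_port("192.0.2.10", "198.51.100.20", 49152, 3260))
```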

ECMP is another example of the incrementalism that has held us back for so long. We have increased the number of ECMP next hops in our forwarding ASICs. We have created the ability to do weighted ECMP, giving preference to certain next hops. We are specifically constructing networks that create optimal topologies for ECMP. There is absolutely a place for ECMP in our networks, but it should be a tool in the toolbox, not the key to our network.
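
For illustration, one common way to approximate weighted ECMP (again a sketch; real ASICs differ in the details) is to give heavier next hops more slots in the selection table, so the same hash-and-normalize step sends them proportionally more flows:

```python
import zlib

def build_weighted_table(weights):
    """Expand a {next_hop: weight} map into a flat selection table.

    A next hop with weight 2 gets twice as many slots as one with weight 1.
    """
    table = []
    for next_hop, weight in weights.items():
        table.extend([next_hop] * weight)
    return table

def pick_weighted(src_ip, dst_ip, table):
    # Same hash-and-normalize as before, but over the weighted table
    return table[zlib.crc32(f"{src_ip}->{dst_ip}".encode()) % len(table)]

table = build_weighted_table({"10.0.0.1": 2, "10.0.0.2": 1, "10.0.0.3": 1})
print(pick_weighted("192.0.2.10", "198.51.100.20", table))
```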

We are teaching our applications how to behave uniformly. Or normally. And that's not normal. We should be teaching the network how to serve the applications instead, however demanding or quirky they decide to be.

Showing 4 comments
  • Christophe

    ECMP has been a great tool so far, but as you mentioned, over the years it has tried to adapt, to guess. The ASIC must have the capability and functionality, but I agree it should serve the application, and for this the application should be able to inform the network of the service it is looking for.

    • Marten Terpstra

      Christophe, thank you for your comments. I think we agree that there is a need to evolve from where we are…

  • nic

    Great post. Let's figure out whose fault this is! As you point out, ECMP has deficiencies, which you also point out are the result of TCP's inability to accept out-of-order packet delivery. However, a review of TCP will show that its deficiency is caused by the underlying assumptions about the link-level flow control of the network, namely drop-based flow control ("Ah, my buffer is full… oh well, drop the packet").

    My conjecture is that the ideal mechanism for ECMP is packet-based dynamic load balancing. It is easy to get there if we design the system from the ground up.

    First we start by creating a network with proper link-level flow control (e.g. credits, on/off, etc.). Under this design, packet drops are extremely rare (only under hardware failures). Second, we design our transport mechanism to be tolerant to out-of-order packet delivery. This is fairly easy because we know that all packets will eventually arrive (except under extremely rare hardware failure conditions). Third, we can now implement ECMP on a packet-by-packet basis and we can choose the least congested path on each decision.

    In theory, this would work great. ECMP is used mainly within single administrative domains like data centers. There is no good reason that these systems can't be built with a different network technology than the one that exists outside of the administrative domain.

    • Marten Terpstra

      Nic, thank you for your elaborate response. There is no question that TCP functionality (which in turn is based on the assumption that the underlying layers are unreliable) is part of the problem. Credit-based networks exist (Fibre Channel, InfiniBand and other less popular ones) but are fading fast in favor of Ethernet, mostly due to market dynamics, not superiority of technology. Ethernet of course has added a rather clumsy layer of lossless capabilities to ease some of this, but the network is and will be unreliable for many moons to come.

      One of our other issues with ECMP is the EC part of its name. Why can I only use equal-cost paths? In our solution, we can use unequal paths, because the constraints I have expressed may value total bandwidth more than the shortest path. As I concluded in my article, ECMP is of course a useful tool, but we should be careful about making it the foundational technology for our network architectures of the future.

      Thanks again for your thoughtful response!
