Over the past few years, Equal-Cost Multi-Path (ECMP) routing has seen a significant increase in popularity. More importantly, the move to spine-and-leaf designs in the data center relies heavily on ECMP for increased bandwidth. @mbushong earlier discussed the problem of dealing with aggregates, and ECMP is a tool of aggregates.
ECMP, as the name suggests, is a mechanism to distribute traffic across multiple paths of the same "network" cost. Its best-known use is at L3, using IP paths of equal cost towards a destination. The concept is simple: a router performs a routing table lookup on the destination IP address of a packet. The result of this lookup is a next hop router, or the destination itself. With ECMP, the result may be a set of routers, each of them a legitimate equal-cost next hop towards the destination. The router then performs a calculation on part of the IP packet header to determine which of these multiple next hops to send this packet to. The result of the calculation is normalized to the number of next hops available, and the packet is on its way to the router selected by that result. This process is repeated at every router along the way.
The key to ECMP is the calculation that determines which of the next hops to use. It is also its Achilles heel. ECMP assumes that the best utilization of multiple links towards a destination is the most even spread of flows across all of them. The most basic calculation takes the source and destination IP address from the packet, performs some hash calculation (a very simple one, because it needs to be performed for every packet at extremely high packet rates), normalizes the result to the number of next hops, then uses that next hop. This ensures that traffic traveling between routers (yes, old school style, I am going to insist on calling them routers) will be balanced for each source and destination IP address pair. A fairly coarse definition of a flow.
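The basic hash-and-normalize step can be sketched in a few lines of Python. This is purely illustrative: real forwarding ASICs use much simpler hardware hash functions (CRC or XOR folds, not SHA-256), and the function and hop names here are made up.

```python
import hashlib

def pick_next_hop(src_ip: str, dst_ip: str, next_hops: list) -> str:
    """Hash the source/destination address pair and normalize the
    result to the number of available next hops (modulo)."""
    key = f"{src_ip}->{dst_ip}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return next_hops[digest % len(next_hops)]

hops = ["router-a", "router-b", "router-c", "router-d"]
# Every packet between the same src/dst pair maps to the same next hop,
# so the flow stays in order -- but also stays pinned to one path.
chosen = pick_next_hop("10.0.0.1", "10.0.1.1", hops)
```

The determinism is the point: all packets of a flow follow one path, which preserves ordering but also means one source/destination pair can never use more than one link's worth of bandwidth.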
It is easy to see what happens if there is a significant discrepancy in the bandwidth used by certain source/destination pairs versus others. And this is where the first issue appears: an assumption of uniformity, or at least a normal distribution of bandwidth utilization across all ECMP flows. To improve the chances of a normal distribution, a more granular definition of a flow can be created by adding fields from the packet header to the calculation, most notably the source and destination TCP or UDP port numbers. After this, individual application flows between the same two devices will be treated separately in ECMP path placement.
(In the above I am totally ignoring the possibility of sending traffic across all equal-cost links in a packet-by-packet round-robin fashion. It creates a pretty good distribution, but many TCP implementations and applications are extremely sensitive to out-of-order packets, and as a result anything that can result in re-ordering of packets is a big no-no. Still. In 2013.)
All of this still assumes that the totality of the flows has a normal bandwidth distribution, so that every ECMP path gets about an even share of high-bandwidth and low-bandwidth flows, and everything in between. The increased use of network-based storage has dramatically increased the number of high-bandwidth flows. Mix those with flows that are typically short-lived or low-bandwidth (keepalives, SIP signaling, trade transactions, you name it) and these large and sometimes persistent flows (sometimes referred to as elephant flows) may become quite the boat anchor stuck to specific links due to the ECMP calculations.
Clever as we engineers are, we have found ways around this. We split big elephant flows up into smaller flows. We run iSCSI over parallel TCP connections. We implement proprietary mechanisms to exchange traffic between multiple source and destination addresses. We design extensions to tunnel protocols to create entropy, specifically so that the network can spread the traffic. Yes, we have taught the application and its underlying connection infrastructure that there is such a thing called ECMP. We have brought a specific limitation of the network into the servers and applications.
ECMP is another example of the incrementalism that has held us back for so long. We have increased the number of ECMP next hops in our forwarding ASICs. We have created the ability to do weighted ECMP, giving preference to some next hops over others. We are specifically constructing networks that create optimal topologies for ECMP. There is absolutely a place for ECMP in our networks, but it should be a tool in the toolbox, not the key to our network.
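Weighted ECMP, mentioned above, can be sketched as replicating each next hop in the selection table in proportion to its weight before hashing into it. The hop names and weights here are illustrative, not any vendor's actual implementation:

```python
import hashlib

def weighted_next_hop(flow_key: str, weighted_hops: dict) -> str:
    """Weighted ECMP sketch: build a selection table with each next hop
    repeated according to its weight, then hash the flow into the table."""
    table = [hop for hop, weight in weighted_hops.items()
             for _ in range(weight)]
    digest = int.from_bytes(hashlib.sha256(flow_key.encode()).digest()[:4], "big")
    return table[digest % len(table)]

# Hypothetical: give a 40G uplink twice the share of each 10G uplink.
hops = {"uplink-40g": 2, "uplink-10g-a": 1, "uplink-10g-b": 1}
selected = weighted_next_hop("10.0.0.1->10.0.1.1:tcp:51000:80", hops)
```

Note that the weights only shift the expected share of *flows*; a single elephant flow still lands whole on one link, so the underlying limitation remains.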
We are teaching our applications how to behave uniformly. Or normally. And that's not normal. We should be teaching the network how to serve the applications instead. However demanding or quirky they decide to be.