RSS

Equal-Cost Multipath (ECMP)

ECMP is coming to Ouroboros (finally)

Some recent news – Multi-Path TCP (MPTCP) implementation is landing in mainstream Linux kernel 5.6 – finally got me to integrate the equal-cost multipath (ECMP) implementation from Nick Aerts’s master thesis into Ouroboros. And working on the ECMP implementation in gives me an excuse to rant a little bit about MPTCP.

The first question that comes to mind is: Why is it called multi-path TCP? IP is routing packets, not TCP, and there are equal-cost multipath options for IP in both IS-IS and OSPF. Maybe multi-flow TCP would be a better name? This would also be more transparent to the fact that running MPTCP over longer hops will make less sense, since the paths are more likely to converge over the same link.

So why is there a need for multi-path TCP? The answer, of course, is that the Internet Protocol routes packets between IP endpoints, which are interfaces, not hosts. So, if a server is connected over 4 interfaces, ECMP routing will not be of any help if one of them goes down. The TCP connections will time out. Multipath TCP, however, is actually establishing 4 subflows, each over a different interface. If an interface goes down, MPTCP will still have 3 subflows ready. The application is listening the the main TCP connection, and will not notice a TCP-subflow timing out1.

This brings us, of course, to the crux of the problem. IP names the point of attachment; IP addresses are assigned to interfaces. Another commonly used workaround is a virtual IP interface on the loopback, but then you need a lot of additional configuration (and if that were the perfect solution, one wouldn’t need MPTCP!). MPTCP avoids the network configuration mess, but does require direct modification in the application using additions to the sockets API in the form of a bunch of (ugly) setsockopts.

Now this is a far from ideal situation, but given its constraints, MPTCP is a workable engineering solution that will surely see its uses. It’s strange that it took years for MPTCP to get to this stage.

Now, of course, Ouroboros does not assign addresses to points-of-attachments ( flow endpoints). It doesn’t even assign addresses to hosts/nodes! Instead, the address is derived from the forwarding protocol machines inside each node. (For the details, see the article). The net effect is that an ECMP routing algorithm can cleanly handle hosts with multiple interfaces. Details about the routing algorithm are not exposed to application APIs. Instead, Ouroboros applications request an implementation-independent service.

The ECMP patch for Ouroboros is coming soon. Once it’s available I will also add a couple of tutorials on it.

Peace.

Dimitri


  1. Question: Why are the subflows not UDP? That would avoid a lot of duplicated overhead (sequence numbers etc)… Would it be too messy on the socket API side? [return]