aboutsummaryrefslogtreecommitdiff
path: root/content/en/blog/20200212-ecmp.md
blob: 019b40df759d5ab2ac258eca68fde82eafec9aa5 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
---
date: 2020-02-12
title: "Equal-Cost Multipath (ECMP)"
linkTitle: "Adding Equal-Cost multipath (ECMP)"
description: "ECMP is coming to Ouroboros (finally)"
author: Dimitri Staessens
---

Some recent news -- Multi-Path TCP (MPTCP) implementation is [landing
in mainstream Linux kernel
5.6](https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-Starts-Multipath-TCP)
-- finally got me to integrate the equal-cost multipath (ECMP)
implementation from [Nick Aerts's master
thesis](https://lib.ugent.be/nl/catalog/rug01:002494958) into
Ouroboros. And working on the ECMP implementation in gives me an
excuse to rant a little bit about MPTCP.

The first question that comes to mind is: _Why is it called
multi-**path** TCP_? IP is routing packets, not TCP, and there are
equal-cost multipath options for IP in both [IS-IS and
OSPF](https://tools.ietf.org/html/rfc2991). Maybe _multi-flow TCP_
would be a better name? This would also be more transparent to the
fact that running MPTCP over longer hops will make less sense, since
the paths are more likely to converge over the same link.

So _why is there a need for multi-path TCP_? The answer, of course, is
that the Internet Protocol routes packets between IP endpoints, which
are _interfaces_, not _hosts_. So, if a server is connected over 4
interfaces, ECMP routing will not be of any help if one of them goes
down. The TCP connections will time out. Multipath TCP, however, is
actually establishing 4 subflows, each over a different interface. If
an interface goes down, MPTCP will still have 3 subflows ready. The
application is listening the the main TCP connection, and will not
notice a TCP-subflow timing out[^1].

This brings us, of course, to the crux of the problem. IP names the
[point of attachment](https://tools.ietf.org/html/rfc1498); IP
addresses are assigned to interfaces. Another commonly used workaround
is a virtual IP interface on the loopback, but then you need a lot of
additional configuration (and if that were the perfect solution, one
wouldn't need MPTCP!). MPTCP avoids the network configuration mess,
but does require direct modification in the application using
[additions to the sockets
API](https://tools.ietf.org/html/draft-hesmans-mptcp-socket-03) in the
form of a bunch of (ugly) setsockopts.

Now this is a far from ideal situation, but given its constraints,
MPTCP is a workable engineering solution that will surely see its
uses. It's strange that it took years for MPTCP to get to this stage.

Now, of course, Ouroboros does not assign addresses to
points-of-attachments ( _flow endpoints_). It doesn't even assign
addresses to hosts/nodes! Instead, the address is derived from the
forwarding protocol machines inside each node. (For the details, see
the [article](https://arxiv.org/pdf/2001.09707.pdf)). The net effect
is that an ECMP routing algorithm can cleanly handle hosts with
multiple interfaces. Details about the routing algorithm are not
exposed to application APIs. Instead, Ouroboros applications request
an implementation-independent _service_.

The ECMP patch for Ouroboros is coming _soon_. Once it's available I
will also add a couple of tutorials on it.

Peace.

Dimitri

[^1]: Question: Why are the subflows not UDP? That would avoid a lot of duplicated overhead (sequence numbers etc)... Would it be too messy on the socket API side?