aboutsummaryrefslogtreecommitdiff
path: root/content/en/blog/20210402-multicast.md
blob: cc868a220023c75cab4719ebae70be861a88b575 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
---
date: 2021-04-02
title: "How does Ouroboros do anycast and multicast?"
linkTitle: "Does Ouroboros do (any,multi)-cast?"
description: >
     Searching for the answer to the question: why do packet networks work?
author: Dimitri Staessens
---

```
Nothing is as practical as a good theory
  -- Kurt Lewin
```

How does Ouroboros handle routing and how is it different from the
Internet? How does it do multicast? That's a good subject for a blog
post! I assume the reader to be a bit knowledgeable about the Internet
Protocol (IP) suite. I limit this discussion to IPv4, but generally
speaking it's also applicable to IPv6. Hope you enjoy the read.

Network communications is commonly split up into four classes based on
the delivery model of the message. If it is a single source sending to
a single receiver, it is called _unicast_. This is the way most of the
traffic on the Internet works: a packet is forwarded to a specific
destination IP address. This process is then called _unicast routing_.
If a sender is transmitting a message to _all_ nodes in a network,
it's called _broadcast_. To do this efficiently, the network will run
a bunch of protocols to construct some form of _spanning tree_ between
the nodes in the network a process referred to as _broadcast
routing_. If the destination is a specific set of receivers, it's
called _multicast_. Broadcast routing is augmented with a protocol to
create groups of nodes, the so-called multicast group, to again create
some form of a spanning tree between the group members, called
_multicast routing_. The last class is _anycast_, when the destination
of the communication is _any_ single member of a group, usually the
closest.

Usually these concepts are explained in an Internet/IP setting where
the destinations are (groups of) IP addresses, but the concepts can
also be generalized towards the naming system: resolving a (domain)
name to a set of addresses, for instance, which can then be used in a
multicast implementation called _multidestination
routing_. Multidestination routing (i.e. specifying a bunch of
destination addresses in a packet) doesn't scale well.

Can we contemplate other classes? Randomcast (sending to a random
destination)?  Or stupidcast (sending to all destinations that don't
need to receive the message)? All kidding aside, the 4 classes above
are likely to be all the _sensible_ delivery models.

### Conundrum, part I

During the development of Ouroboros, it became clearer and clearer to
us that the distinction based on the delivery model is not a
_fundamental_ one. If I have to make a -- definitely imperfect --
analogy, it's a bit like classifying animals by the number of eyes
they have. Where two eyes is unicast, more is multicast and composite
eyes broadcast. Now, it will tell you _something useful_ about the
animals if they are in the 2, multi or composite-eye class, but it's
not how biologists classify animals. Some animal orders -- spiders --
have members with 2, 4, 6 and 8 eyes. There are deeper, more
meaningful distinctions that can be made on more fundamental grounds,
such as whether the species has a backbone or not, that traces back
their evolution. What are those fundamental differences for networks?

Take a minute to contemplate the following little conundrum. Take a
network that is linear, e.g.

```
source - node - node - node - node - destination
```

and imagine observing a packet traveling over every _link_ on this
linear network, from source to destination. Was that communication
anycast, unicast, multicast or broadcast? Now this may seem like a
silly question, but it should become clearer why it's relevant, and --
in fact -- fundamental. I will come back to this at the end of this
post.

But first, let's have a look at how it's done, shall we?

### Unicast

This is the basics. IP routers will forward packets based on the
destination IP address (not in any special range) in their header to
the host (in effect: an interface) that has been assigned that IP
address. The forwarding is based on a forwarding table that is
constructed using some routing protocol (OSPF/IS-IS/BGP/...).  I'll
assume you know how this works, and if not, there are plenty of online
resources on these protocols.

On unicast in Ouroboros, I will be pretty brief: it operates in a
similar way as unicast IP: packets are forwarded to a destination
address, and the layer uses some algorithm to build a forwarding table
(or more general, a _forwarding function_). In the current
implementation, unicast is based on IS-IS link-state routing with
support for ECMP (equal-cost multipath). The core difference with IP
is that there are _no_ special case addresses: an address is _always_
uniquely assigned to a single network node. To scale the layer, there
can be different _levels_ of (link-state) routing in a layer. It's
very interesting in its own right, but I'll focus on the _modus
operandi_ in this post, which is: packets get forwarded based on an
address. I'll take a more in-depth look into Ouroboros addressing in
(maybe the next) post (or you can find it in the
[paper](https://arxiv.org/abs/2001.09707).

### Anycast

IP anycast is a funny thing. It's pretty simple: it's just unicast,
but multiple nodes (interfaces) in the network have the same address,
and the shortest path algorithm used in the routing protocol will
forward the packet to the nearest node (interface) with that
address. The network is otherwise completely oblivious; there is no
such thing as an _anycast address_, it's a concept in the mind of
network engineers.

Now, if your reaction is _that can't possibly work_, you're absolutely
right! Equal length paths can lead to _route flapping_, where some
packets would be delivered _over here_ and other packets _over
there_. That's why IP anycast is not something that _anyone_ can do. I
can't run this server somewhere in Germany and a clone of it in
Denver, and yet another clone in Singapore, and give them the same IP
address. IP anycast is therefore highly restricted to some select
cases, most notably DNS, NTP and some big Content Delivery Networks
(CDNs). There is a certain level of trust needed between BGP peers,
and BGP routers are monitored to remove routes that exhibit
flapping. In addition, NTP and DNS use protocols that are UDP-based
with a simple request-response mechanism, so sending subsequent
packets to a different server isn't a big problem. CDN providers go to
great _traffic engineering_ lengths to configure their peering
relations in such a way that the anycast routes are stable. IP anycast
"works" because there are a lot of assumptions and it's engineered --
mostly through human interactions -- into a safe zone of
operations[^1]. In the case of DNS in particular, IP anycast is an
essential part of the Internet. Being close to a root DNS server
impacts response times! The alternative would be to specify a bunch of
alternate servers to try. But it's easier to remember
[9.9.9.9](https://www.quad9.net/) than a whole list of random IP
addresses where you have to figure out where they are! IP anycast also
offers some protection against network failures in case the closest
server becomes unreachable, but this benefit is relatively small as
the convergence times of the routing protocols (OSPF/BGP) are on the
order of minutes (and should be). That's why most hosts usually have 2
DNS servers configured, because relying on anycast could mean a couple
of minutes without DNS.

Now, about anycast in Ouroboros, I can again be brief: I won't allow
multiple nodes with the same address in a layer in the prototype, as
this doesn't _scale_. Anycast is supported by name resolution. A
service can be registered at different locations (addresses) and
resolving such a name will return a (subset of) address(es) from the
locations. If a flow allocation fails to the closest address, it can
be repeated to the next one. Name resolution is an inherent function
of a _unicast layer_, and currently implemented as a Distributed Hash
Table (DHT). When joining a network (we call this operation
_enrolment_, Kademlia calls it _join_), a list of known DHT node
addresses is passed. The DHT stores its <name, address> entries
in multiple locations (in the prototype this number is 8) and the
randomness of the DHT hash assignment in combination with caching
ensures the _proximity_ of the most popular lookups with reasonable
probability.

### Broadcast

IP broadcast is also a funny thing, as it's not really IP that's doing
the broadcasting. It's a coating of varnish on top of _layer 2_
broadcast. So let's first look at Ethernet broadcast.

Ethernet broadcast does what you would expect from a broadcasting
solution. Note that switched Ethernets are confined to a loop-free
topology by grace of the (Rapid) Spanning-Tree Protocol. A message to
the reserved MAC address FF:FF:FF:FF:FF:FF will be _broadcasted_ by
every intermediate Layer 2 node to all nodes (interfaces) that are
connected on that Ethernet segment.  If VLANs are defined, the
broadcast is confined to all nodes (interfaces) on that
VLAN. Quite nice, no objections _your honor_!

The semantics of IP broadcast are related to the scope of the
underlying _layer 2_ network. An IP broadcast address is the last "IP
address" in a _subnet_. So, for instance, in the 192.168.0.0/255
subnet, the IP broadcast address is 192.168.0.255. When sending a
datagram to that IP broadcast destination, the Ethernet layer will be
sending it to FF:FF:FF:FF:FF:FF, and every node _on that Ethernet_
which has an IP address in the 192.168.0.0/24 network will receive it.
You'd be forgiven to think that an IP broadcast to 255.255.255.255
should be spread to every host on the Internet, but for obvious
reasons that's not the case. The semantic of 0.0.0.0/0 is to mean your
own local IP subnet on that specific interface. The DHCP protocol, for
instance, makes use of this. A last thing to mention is that, in
theory, you could send IP broadcast messages to a _different_ subnet,
but few routers allows this, because it invites some very obvious
[smurf attacks](https://en.wikipedia.org/wiki/Smurf_attack).
Excuse me for finding it more than mildly amusing that standards
originally
[_required_](https://tools.ietf.org/html/rfc2644)
routers to forward directed broadcast packets!

So, in practice,IP broadcast is a _passthrough_ interface towards
layer 2 (Ethernet, Wi-Fi, ...) broadcast.

In Ouroboros -- like in Ethernet -- broadcast is a rather simple
affair. It is facilitated by the _broadcast layer_, for which each
node implements a _broadcast function_: what comes in on one flow,
goes out on all others. The implementation is a stateless layer that
-- also like Ethernet -- requires the graph to be a tree. But it has
no addresses -- in fact, it doesn't even have a _header_ at all!
Access control is part of _enrolment_, where participants in the
broadcast layer get read and/or write access to the broadcast layer
based on credentials. Every message on a broadcast layer is actually
broadcast. This is the only way -- at least that I know of -- to make
a broadcast layer _scaleable_ to billions of receivers![^2]

So here is the first clue to the answer to the little conundrum at the
beginning of this post. The Ouroboros model makes a distinction
between _unicast layers_ and _broadcast layers_, based on the _nature
of the algorithm_ applied to the message. If it's based on a
_destination address_ in the message, we call the algorithm
_FORWARDING_, and if it's sending on all interfaces except the
incoming one, we call the algorithm _FLOODING_.

An application like 'ping', where one broadcasts a message to a bunch
of remotes, and each one responds back requires _both_ a broadcast
layer and a unicast layer of (at least) the same size, with the 'ping'
application using both[^3]. Tools like _ping_ and _traceroute_ and
_nmap_ are administrative tools which reveal network information. They
should only be available to _administrators_.

It's not prohibited to implement an IPCP that does both broadcast (to
the complete layer) and unicast. In fact, the unicast IPCP in the
prototype in some sense does it, as we only figured out broadcast
layers _after_ we implemented the link-state protocol, which is
effectively broadcasting link-state messages within the unicast
layer. All it would take is to implement the _flow\_join()_ API in the
unicast IPCP and send those packets like we send Link-State
Messages. But I won't do it, for a number of reasons: the first is
that it is rare to have broadcast layers and unicast layers to be
exactly the same size. Usually broadcast layers will be much
smaller. The second is that, in the current implementation, the
link-state messages are stateful: they have the source address and a
sequence number to be able to deal with loops in the meshed
topology. This doesn't scale to the _full_ unicast layer. To create a
scaleable _do-both-unicast-and-broadcast_ layer, it would require to
create a "virtual tree-topology-network" within the unicast layer,
which is adjacency management. This would require an adjacency
management module (running something like a hypothetical RSTP that is
able to scale to billions of nodes) as part of the unicast
IPCP. Adjacency management is functionality that was removed -- we
called it the _graph adjaceny manager_ and the logic put _outside_ of
the IPCP and replaced with a _connect_ API so it could be scripted as
part of network management.  And the third, and most important, is
that we like the prototype to reflect the _model_, as it is more
educational. Unicast layers and broadcast layers _are_ different
layers. Always have been, and always will be. Combining them in an
implementation only obfuscates this fact. To make a long (and probably
confusing) story short, combining unicast and broadcast in a single
IPCP _can_ be done, but at present I don't see any real benefit in
doing it, and I'm pretty sure it will be hard to avoid
[_making a mess_](https://www.cs.utexas.edu/users/EWD/transcriptions/EWD13xx/EWD1304.html)
out of it.

This transitions us nicely into multicast. Because combining unicast
and multicast in the same layer is exactly what IP tries to do.

### Multicast

Before looking at IP, let's first have a look at how one would do
multicast in Ethernet, because it's simpler.

The simplest way to achieve multicast within Ethernet 802.1Q is using
a VLAN: Configure VLANs on the end hosts and switch, and then just
broadcast packets on that VLAN. The Ethernet II (MAC) frame will look
like this:

```
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| FF:FF:FF:FF:FF:FF | SRC | 0x8100 | TCI | Ethertype | ..
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
```

The 0x8100 _en lieu_ of the Ethertype specifies that it's a VLAN, the
Tag Control Information (TCI) has 12 bits that specify a VLAN ID, so
one can have 4096 parallel VLANs. There are two fields needed to
achieve the multicast nature of the communications: The destination
set to the broadcast address, and a VLAN ID that will only be assigned
to members of the group.

Now, it won't come as a surprise to you, but IP multicast _is_ a funny
thing. The gist of it is that there are protocols that do group
management and protocols that assist in building a (spanning) tree.
There is a range of IP addresses, 224.0.0.0 -- 239.255.255.255 (or in
binary: starting with the 1110), called _class D_, which are only
allowed as destination addresses. This _class D_ is further subdivided
in different ranges for different functionalities, such as
source-specific multicast. An IPv4 multicast packet can be
distinguished by a single field: a destination address in the _class
D_ range.

If we compare this with Ethernet above, the _class D_ IP address is
behaving more like the VLAN ID than the destination MAC _address_. The
reason IP doesn't need an extra destination address is that the
_broadcast_ functionality is _implied_ by the _class D_ address range,
whereas a VLAN also supports unicast via MAC learning.

Ethernet actually also has a special address range for multicast,
01:00:5E:00:00:00 to 01:00:5E:7F:FF:FF, that copies the last 23 bits
of the IP multicast address when that host joins the multicast
tree. The reasoning behind it is this: if there are multiple endpoints
for an IP multicast tree on the _same_ Ethernet segment, instead of
the IP router sending individual unicast messages to each of them,
that last "hop" can use a single Ethernet broadcast message.

Next Ouroboros. From the discussion of the Ouroboros broadcast layer,
you probably already figured out how Ouroboros does multicast. The
same as broadcast! There is literally _zero_ difference. The only
difference between multicast and broadcast is in the eye of the
beholder when comparing a unicast layer and a broadcast layer.

There is something else to remember about (Ouroboros) broadcast
layers: the broadcast function is _stateless_, and _all_ broadcast
IPCPs are _identical_ in function. The reason I mention this, is in
relation the problem that I just mentioned above. What if I have a
broadcast layer, for which a number of endpoints are also connected
over a _lower_ broadcast layer? Can we, like IP/Ethernet, leverage
this? And the answer is: no, there is no sharing of information
between layers, and broadcast layers have no state. But we don't
really need to!  If there is a device with a broadcast IPCP in a lower
broadcast layer, just add a broadcast IPCP to the higher level
broadcast layer! It's not a matter of functionality, since the
functionality for the higher level broadcast layer is _exactly_ the
same as the lower one.

While I am not eager to mix broadcast and unicast in a single IPCP
program, I have few objections for creating a single program that
behaves like multiple IPCPs of the same type. Especially for the
stateless broadcast IPCP it would be rather trivial to make a single
program that implements parallel broadcast IPCPs. And allowing
something like _sublayers_ (like VLANs, with a single tag) is also
something that can be considered for optimization purposes.

### Conundrum, part II

Now, let's look back at our little riddle, with a packet observed to
move from source to destination over a linear network.

```
source - node - node - node - node - destination
```

Now, if we pedantically apply the definition of one-to-one
communication given in most textbooks, it is unicast, since it has
only a single source and a single destination. But to know what's
going on at the routing level, can not be known. But I hope you gave
it some thought about what information you'd need to be _able to
tell_.

Let's start with Ethernet. The Ethernet standard says that all MAC
addresses are unique, so it's not anycast, and there is no _real_
difference between multicast and broadcast. So, if the address is not
the broadcast address or in some special range, it's _unicast_, else
it's multi/broadcast. But really? What if the nodes were hubs instead
of switches?

What about IP? Bit harder. If it was anycast, it wouldn't have reached
the destination if there was another node with the same address in
this particular setup. But in a general IP network, it's not really
possible to tell the difference between unicast and anycast without
looking at all reachable node addresses. To know if it was broadcast
or multicast, it would suffice to know the destination address in the
packet.

For Ouroboros, all you'd need to know what was going on is the type of
layer. To detect anycast, one would need to query the directory to
check if it returns a single or multiple destination addresses (since
we don't allow _anycast routing_), and, like Ethernet in a way, it
makes the distinction between multicast and broadcast rather moot.

### The Ouroboros model

In a nutshell, what does the Ouroboros model say?

First, all communications is composed of either unicast or broadcast,
and these two types of communications are fundamentally different and
give rise to distinct types of layers. In a _unicast_ layer, nodes
implement _FORWARDING_, which moves packets based on a destination
address. In a _broadcast_ layer, nodes implement _FLOODING_, which
sends incoming packets out on all links except for the incoming link.

If we leave out the physical layer (wiring, spectrum tuning etc),
constructing a layer goes through 2 phases: adding a node to a network
layer (enrolment) and adding links (by which I mean allowed
adjacencies) between that node and other members of the layer
(adjacency management). After this the node becomes active in the
network. During enrolment, nodes can be authenticated and permissions
are acquired such as read/write access. Both types of layers go
through this phase. A unicast layer, may, in addition, periodically
disseminate information that enables _FORWARDING_. We call this
dissemination function _ROUTING_, but if you know a better word that
avoids confusion, we'll take it. _ROUTING_ is distinct from adjacency
management, in the sense that adjacency management is administrative,
and tells the networks which links it is allowed to use, which links
_exist_. _ROUTING- will make use of these links and make decisions
when they are unavailable, for instance due to failures.

Let's apply the Ouroboros model to Ethernet. Ethernet implements both
types of layers. Enrolment and topology management are taken care of
by the _spanning tree protocol_. It might be tempting to think that
STP does _only_ topology management, but that's not really the
case. Just add a new _root bridge_ to an Ethernet network: at some
point, that network will go down completely. The default operation of
Ethernet is as a _broadcast layer_: the default function applied to a
packet is _FLOODING_. To allow unicast, Ethernet implements _ROUTING_
via _MAC learning_. MAC learning is thus a highly specific routing
protocol, that piggybacks its dissemination information on user
frames, and calculates the forwarding tables based on the knowledge
that the underlying topology is a _tree_. This brings a caveat: it
only detects sending interfaces. If there are receivers on an Ethernet
that never send a frame (but for which the senders know the MAC
address), that traffic will always be broadcast. And in any case, the
_first_ packet will always be broadcast.

Next, VLAN. In Ouroboros speak, a VLAN is an implementation detail
(important, of course) to logically combine parallel Ethernets. VLANs
are independent layers, and indeed, must be enrolled (VLAN IDs set on
bridge interfaces) and will keep their own states of (R)STP and MAC
learning. Without VLAN, Ethernet is thus a single broadcast layer and
a single multicast layer. With VLAN, Ethernet is potentially 4096
broadcast layers and 4096 unicast layers.

If we apply the Ouroboros model to IP, we again see that IP tries to
implement both unicast and broadcast. A lot is configured
manually. Enrolment and adjacency management are basically assigning
IP addresses and, in addition, adding BGP routes and rules. IP has two
levels of _ROUTING_, one is inside an autonomous system using
link-state protocols such as OSPF, and on top there is BGP, which is
disseminating routes as path vectors. Multicast in IP is building a
broadcast layer, which is identified using a "multicast address",
which is really the name of that broadcast layer. Enrolment into this
virtual broadcast layer is handled via protocols such as IGMP, with
adjacencies managed in many possible ways that involve calculating
spanning trees based on internal topology information from OSPF or
other means. The tree is then grafted into the routing table by
labeling outgoing interfaces with the name of the broadcast
layer. Yes, _that_ is what adding a multicast destination address to
an IP forwarding table is _really_ doing! It's just hidden in plain
sight!

Now, my claim is that the Ouroboros model can be applied to _any_
packet network technology. To conclude this post, let's take a real
tricky one: Multi-Protocol Label Switching (MPLS).

MPLS looks very different from Ethernet and IP. It doesn't have
addresses at all, but uses _labels_, and it can swap and stack labels.

Now, surely, MPLS doesn't fit the unicast layer, which says that every
node gets an address, and forwards based on the destination address.
Here's the solution to MPLS: it is a set of broadcast layers! The
labels are a distributed way of identifying the layer _by its links_,
instead of a single identifier for the whole layer, like a VLAN or a
multicast IP address. RSVP / LDP (and their traffic engineering -TE
cousins) are protocols that do enrolment and adjacency management.

I hope this gave you a bit of an insight into the Ouroboros view of
the world. Course materials on computer networks consist of a
compendium of technologies and _how_ they work. The Ouroboros model is
an attempt to figure out _why_ they work.

Stay curious.

Dimitri


[^1]: I'm sure someone has or will propose some AI to solve it.

[^2]: Individual links on a broadcast layer can be protected with
      retransmission and using multi-path routing in the underlying
      unicast layer.

[^3]: Now that I'm writing this, I put it on my todo list to
      implement this into the oping application.