aboutsummaryrefslogtreecommitdiff
path: root/content/en/docs/Concepts/ouroboros-model.md
blob: 65f0ea030b4695ed9fc4be278d960ba0a2592a2e (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
---
title: "The Ouroboros model"
author: "Dimitri Staessens"

date:  2020-04-07
weight: 2
description: >
   A conceptual approach to packet networking fundamentals
---

```
Computer science is as much about computers as astronomy is
about telescopes.
  -- Edsger Wybe Dijkstra
```

The model for computer networks underlying the Ouroboros prototype is
the result of a long process of gradual increases in my understanding
of the core principles that underly computer networks, starting from
my work on traffic engineering packet-over-optical networks using
Generalized Multi-Protocol Label Switching (G/MPLS) and Path
Computation Element (PCE), then Software Defined Networks (SDN), the
work with Sander investigating the Recursive InterNetwork Architecture
(RINA) and finally our implementation of what would become the
Ouroboros Prototype. The way it is presented here is not a reflection
of this long process, but a crystalization of my current understanding
of the Ouroboros model.

I'll start with the very basics, assuming no delay on links and
infinite capacity, and then gradually add delay, link capacity,
failures, etc to assess their impact and derive _what_ needs to be
added _where_ in order to come to the complete Ouroboros model.

The main objective of the definitions -- and the Ouroboros model as a
whole -- is to __separate mechanism__ (the _what_) __from policy__
(the _how_) so that we have objective definitions and a _consistent_
framework for _reasoning_ about functions and protocols in computer
networks.

### The importance of first principles

One word of caution, because this model might read like I'm
"reinventing the wheel" and we already know how to do everything that
is written here. Of course we do! The point is that the model reduces
networking to its _essence_, to its fundamental parts.

After studying most courses on Computer Networks, I could name the 7
layers of the OSI model, I know how to draw TCP 3-way handshakes,
could detail 5 different TCP congestion control mechanisms, calculate
optimal IP subnets given a set of underlying Local Area Networks, draw
UDP headers, chain firewall rules in iptables, calculate CRC
checksums, and derive spanning trees given MAC addresses of Ethernet
bridges. But after all that, I still feel such courses teach about as
much about computer networks as cookbooks teach about chemistry. I
wanted to go beyond technology and the rote knowledge of _how things
work_ to establish a thorough understanding of _why they work_.

During most of my PhD work at the engineering department, I spent my
research time on modeling telecommunications networks and computer
networks as _graphs_. The nodes represented some switch or router --
either physical or virtual --, the links represented a cable or wire
-- again either physical or virtual -- and then the behaviour of
various technologies were simulated on those graphs to develop
algorithms that analyze some behaviour or optimize some or other _key
performance indicator_ (KPI). This line of reasoning, starting from
_networked devices_ is how a lot of research on computer networks is
conducted. But what happens if we turn this upside down, and develop a
_universal_ model for computer networks starting from _first
principles_?

This sums up my problem with computer networks today: not everything
in their workings can be fully derived from first principles. It also
sums up why I was attracted to RINA: it was the first time I saw a
network architecture as the result of a solid attempt to derive
everything from first principles. And it’s also why Ouroboros is not
RINA: RINA still contains things that can’t be derived from first
principles.

### Two types of layers

The Ouroboros model postulates that there are only 2 scalable methods
of distributing packets in a network layer: _FORWARDING_ packets based
on some label [^1], or _FLOODING_ packets on all links but the
incoming link.

We call an element that forwards a __forwarding element__,
implementing a _packet forwarding function_ (PFF). The PFF has as
input a destination name for another forwarding element (represented
as a _vertex_), and as output a set of output links (represented
as _arcs_) on which the incoming packet with that label is to be
forwarded on. The destination name needs to be in a packet header.

We call an element that floods a __flooding element__, and it
implements a packet flooding function. The flooding element is
completely stateless, and has a input the incoming arc, and as output
all non-incoming arcs. Packets on a broadcast layer do not need a
header at all.

Forwarding elements are _equal_, and need to be named, flooding
elements are _identical_ and do not need to be named.

{{<figure width="40%" src="/docs/concepts/model_elements.png">}}

Peering relationships are only allowed between forwarding elements, or
between flooding elements, but never between a forwarding element and
a flooding element. We call a connected graph consisting of nodes that
hold forwarding elements a __unicast layer__, and similary we call a
connected _tree_[^2] consisting of nodes that house a flooding element
a __broadcast layer__.

The objective for the Ouroboros model is to hold for _all_ packet
networks; our __conjecture__ is that __all scalable packet-switched
network technologies can be decomposed into finite sets of unicast and
broadcast layers__. Implementations of unicast and broadcast layers
can be easily found in TCP/IP, Recursive InterNetworking Architecture
(RINA), Delay Tolerant Networks (DTN), Ethernet, VLANs, Loc/Id split
(LISP),...  [^3]. The Ouroboros _model_ by itself is not
recursive. What is known as _recursive networking_ is a choice to use
a single standard API for interacting with all the implementatations
of unicast layers and a single standard API for interacting with all
implementations of broadcast layers[^4].

### The unicast layer

A unicast layer is a collection of interconnected forwarding
elements. A unicast layer provides a best-effort unicast packet
service between two endpoints in the layer. We call the abstraction of
this point-to-point unicast service a flow. A flow in itself has no
guarantees in terms of reliability [^5].

{{<figure width="70%" src="/docs/concepts/unicast_layer.png">}}

A representation of a unicast layer is drawn above, with a flow
between the _green_ (bottom left) and _red_ (top right) forwarding
elements.

The forwarding function operates in such a way that, given the label
of the destination forwarding element (in the case of the figure, a
_red_ label), the packet will move to the destination forwarding
element (_red_) in a _deliberate_ manner. The paper has a precise
mathematical definition, but qualitatively, our definition of
_FORWARDING_ ensures that the trajectory that packets follow through a
network layer between source and destination

* doesn't need to use the 'shortest' path
* can use multiple paths
* can use different paths for different packets between the same
  source-destination pair
* can involve packet duplication
* will not have non-transient loops[^6] [^7]

The first question is: _what information does that forwarding function
need in order to work?_ Mathematically, the answer is that all
forwarding elements needs to know the values of a valid __distance
function__[^8] between themselves and the destination forwarding
element, and between all of their neighbors and the destination
forwarding element. The PFF can then select a (set of) link(s) to any
of its neighbors that is closer to the destination forwarding element
according to the chosen distance function and send the packet on these
link(s).  Thus, while the __forwarding elements need to be _named___,
the __links between them need to be _measured___. This can be either
explicit by assigning a certain weight to a link, or implicit and
inferred from the distance function itself.

The second question is: _how will that forwarding function know this
distance information_? There are a couple of different possible
answers, which are all well understood. I'll briefly summarize them
here.

A first approach is to use a coordinate space for the names of the
forwarding elements. For instance, if we use the GPS coordinates of
the machine in which they reside as a name, then we can apply some
basic geometry to _calculate_ the distances based on this name
only. This simple GPS example has pitfalls, but it has been proven
that any connected finite graph has a greedy embedding in the
hyperbolic plane. The obvious benefit of such so-called _geometric
routing_ approaches is that they don't require any dissemination of
information beyond the mathematical function to calculate distances,
the coordinate (name) and the set of neighboring forwarding
elements. In such networks, this information is disseminated during
initial exchanges when a new forwarding element joins a unicast layer
(see below).

A second approach is to disseminate the values of the distance
function to all destinations directly, and constantly updating your
own (shortest) distances from these values received from other
forwarding elements. This is a very well-known mechanism and is
implemented by what is known as _distance vector_ protocols. It is
also well-known that the naive approach of only disseminating the
distances to neighbors can run into a _count to infinity_ issue when
links go down. To alleviate this, _path vector_ protocols include a
full path to every destination (making them a bit less scaleable), or
distance vector protocols are augmented with mechanisms to avoid
transient loops and the resulting count-to-infinity (e.g. Babel).

The third approach is to disseminate the link weights of neighboring
links. From this information, each forwarding element can build a view
of the network graph and again calculate the necessary distances that
the forwarding function needs. This mechanism is implemented in
so-called _link-state_ protocols.

I will also mention MAC learning here. MAC learning is a bit
different, in that it is using piggybacked information from the actual
traffic (the source MAC address) and the knowledge that the adjacency
graph is a _tree_ as input for the forwarding function.

There is plenty more to say about this, and I will, but first, I will
need to introduce some other concepts, most notably the broadcast
layer.

### The broadcast layer

A broadcast layer is a collection of interconnected nodes that house
flooding elements. The node can have either, both or neither of the
sender and receiver role. A broadcast layer provides a best-effort
broadcast packet service from sender nodes to all (receiver) nodes in
the layer.

{{<figure width="70%" src="/docs/concepts/broadcast_layer.png">}}

Our simple definition of _FLOODING_ -- given a set of adjacent links,
send packets received on a link in the set on all other links in the
set -- has a huge implication the properties of a fundamental
broadcast layer: the graph always is a _tree_, or packets could travel
along infinite trajectories with loops [^9].

### Building layers

We now define 2 fundamental operations for constructing packet network
layers: __enrollment__ and __adjacency management__. These operations
are very broadly defined, and can be implemented in a myriad of
ways. These operations can be implemented through manual configuration
or automated protocol interactions. They can be skipped (no-operation,
(nop)) or involve complex operations such as authentication. The main
objective here is just to establish some common terminology for these
operations.

The first mechanism, enrollment, adds a (forwarding or flooding)
element to a layer; it prepares a node to act as a functioning element
of the layer, establishes its name (in case of a unicast layer). In
addition, it may exchange some key parameters (for instance a distance
function for a unicast layer) it can involve authentication, and
setting roles and permissions. __Bootstrapping__ is a special case of
enrollment for the _first_ node in a layer. The inverse operation is
called _unenrollment_.

After enrollment, we may add peering relationships by _creating
adjacencies_ between forwarding elements in a unicast layer or between
flooding elements in a broadcast layer. This will establish neighbors
and in case of a unicast layer, may addinitionally define link
weights. The inverse operations is called _tearing down adjacencies_
between elements. Together, these operations will be referred to as
_adjacency management_.

Operations such as merging and splitting layers can be decomposed into
these two operations. This doesn't mean that merge operations
shouldn't be researched. To the contrary, optimizing this will be
instrumental for creating networks on a global scale.

For the broadcast layer, we already have most ingredients in
place. Now we will focus on the unicast layer.

### Scaling the unicast layer

Let's look at how to scale implementations of the packet forwarding
function (PFF). On the one hand, in distance vector, path vector and
link state, the PFF is implemented as a _table_. We call it the packet
forwarding table (PFT). On the other hand, geometric routing doesn't
need a table and can implement the PFF as a mathematical equation
operating on the _forwarding element names_. In this respect,
geometric routing looks like a magic bullet to network scalability --
the space complexity is O(1) -- but there are many challenges relating
to the complexity of calculating greedy embeddings of graphs that are
not static (a changing network where nodes enter and leave and links
can fail) that currently make these solutions impractical at scale. We
will focus on the solutions that use a PFT.

They way the unicast layer is now defined, the PFT scales _linearly_
with the number of forwarding elements (n) in the layer, its space
complexity is O(n)[^10]. The obvious solution to any student of
computer networks is to use a scheme like IP and Classless InterDomain
Routing (CIDR) where the hosts _addresses_ are subnetted, allowing for
entries in the PFT to be aggregated, drastically reducing its space
complexity, in theory at least, to O(log(n)).

Sure, that _is_ the solution, but not so fast! When building a model,
each element in the model should be well-defined and named at most
once. Synonyms for human use are allowed and useful, but they are
conveniences, not part of the functioning of the model. If we
subdivide the name of the forwarding element in different subnames, we
have to ask ourselves what element in the model each subname of that
name is naming! In the geographic routing example above, we dragged
the Earth into the model, and used GPS coordinates (latitude and
longitude) in the name. But where do subnets come from? What do we
drag into our model, if anything, to create them?

{{<figure width="70%" src="/docs/concepts/unicast_layer_bc_pft.png">}}

Let's recap what a simple unicast layer that uses forwarding elements
with packet forwarding table looks like in the model.  First we have
the unicast layer itself, consisting of a set of forwarding elements
with defined adjacencies. Recall that the necessary and sufficient
condition for the unicast layer to be able to forward packets between
any (source, sink)-pair is that all forwarding engines can deduce the
values of a distance function between themselves and the sink, and
between each of their neighbors and the sink. This means that such a
unicast layer requires an additional (optional) element that
distributes this routing information. Let's call it the __Routing
Element__, and assume that it implements a simple link-state routing
protocol. The RE is drawn as a turquoise element accompanying each
forwarding element in the figure above. Now, each routing element
needs to disseminate information to _all_ other nodes in the layer, in
other words, it needs to _broadcast_ link state information. The RE is
inside of a unicast layer, and unicast layers don't do that, so the
REs will need the help of a broadcast layer. That is what is drawn in
the figure above. Now, at first this may look weird, but it is
_exactly_ what an IP network does (within a subnet)! With the Open
Shortest Path First (OSPF) protocol, it uses IP multicast between all
OSPF routers in the domain, and with IS-IS, it leverages the layer 2
network as the broadcast layer. I will refer to my blog post on
[multicast](/blog/2021/04/02/how-does-ouroboros-do-anycast-and-multicast/)
if you need a bit more elaboration on this.


[UNDER CONSTRUCTION...]


[^1]: This identifier can be thought of as an address, the identified
      node is a _forwarding element_.

[^2]: A tree is a connected graph with N vertices and N-1 edges.

[^3]: I've already explored how some technologies map to the Ouroboros
      model in my blog post on
      [unicast vs multicast](/blog/2021/04/02/how-does-ouroboros-do-anycast-and-multicast/).

[^4]: Of course, once the model is properly understood and a
      green-field scenario is considered, recursive networking is the
      obvious choice, and so the Ouroboros prototype _is_ a recursive
      network.

[^5]: This is where Ouroboros is similar to IP, and differs from RINA.
      RINA layers (DIFs) aim to provide reliability as part of the
      service (flow). We found this approach in RINA to be severely
      flawed, preventing RINA to be a _universal_ model for all
      networking and IPC. RINA can be modeled as an Ouroboros network,
      but Ouroboros cannot be modeled as a RINA network. I've written
      about this in more detail about this in my blog post on
      [Ouroboros vs RINA](/blog/2021/03/20/how-does-ouroboros-relate-to-rina-the-recursive-internetwork-architecture/).

[^6]: Transient loops are loops that occur due to forwarding functions
      momentarily having different views of the network graph, for
      instance due to delays in disseminating information on
      unavailable links.

[^7]: Some may think that it's possible to build a network layer that
      forwards packets in a way that _deliberately_ takes a couple of
      loops between a set of nodes and then continues forwarding to
      the destination, violating the definition of _FORWARDING_. It's
      not possible, because based on the destination address alone,
      there is no way to know whether that packet came from the loop
      or not. _"But if I add a token/identifier/cookie to the packet
      header"_ -- yes, that is possible, and it may _look like that
      packet is traversing a loop_ in the network, but it doesn't
      violate the definition. The question is: what is that
      token/identifier/cookie naming? It can be only one of a couple
      of things: a forwarding element, a link or the complete
      layer. Adding a token and the associated logic to process it,
      will be equivalent to adding nodes to the layer (modifying the
      node name space to include that token) or adding another
      layer. In essence, the implementation of the nodes on the loop
      will be doing something like this:

      ```
      if logic_based_on_token:
          # behave like node (token, X)
      else if logic_based_on_token:
          # behave like node (token, Y)
      else  # and so on
      ```

      When taking the transformation into account the resulting
      layer(s) will follow the fundamental model as it is presented
      above. Also observe that adding such tokens may drastically
      increase the address space in the fundemental representation.

[^8]: For the mathematically inclined, the exact formulation is in the
      [paper](https://arxiv.org/pdf/2001.09707.pdf) section 2.4

[^9]: Is it possible to broadcast on a non-tree graph by pruning in
      some way, shape or form? There are some things to
      consider. First, if the pruning is done to eliminate links in
      the graph, let's say in a way that STP prunes links on an
      Ethernet or VLAN, then this is operation is equivalent creating
      a new broadcast layer. We call this enrollment and adjacency
      management. This will be explained in the next sections. Second
      is trying to get around loops by adding the name of the (source)
      node plus a token/identifier/cookie as a packet header in order
      to detect packets that have traveled in a loop, and dropping
      them when they do. This kind of network doesn't fit neither the
      broadcast layer nor the unicast layer. But the thing is: it also
      _doesn't scale_, as all packets need to be tracked, at least in
      theory, forever. Assuming packet ordering is preserved inside a
      layer a big no-no. Another line of thinking may be to add a
      decreasing counter to avoid loops, but it goes down a similar
      rabbit hole. How large to set the counter? This also doesn't
      scale. Such things may work for some use cases, but they
      don't work _in general_.

[^10]:In addition to the size of the packet forwarding tables, link
      state, path vector and distance vector protocols are also
      limited in size because of time delays in disseminating link
      state information between the nodes, and the amount to be
      disseminated. We will address this a bit later in the discourse.