The Ouroboros model
Computer science is as much about computers as astronomy is about telescopes. -- Edsger Wybe Dijkstra
The model for computer networks underlying the Ouroboros prototype is the result of a long process of gradual increases in my understanding of the core principles that underlie computer networks, starting from my work on traffic engineering packet-over-optical networks using Generalized Multi-Protocol Label Switching (G/MPLS) and the Path Computation Element (PCE), then Software Defined Networks (SDN), the work with Sander investigating the Recursive InterNetwork Architecture (RINA), and finally our implementation of what would become the Ouroboros Prototype. The way it is presented here is not a reflection of this long process, but a crystallization of my current understanding of the Ouroboros model.
I’ll start with the very basics, assuming no delay on links and infinite capacity, and then gradually add delay, link capacity, failures, and so on, to assess their impact and derive what needs to be added where in order to arrive at the complete Ouroboros model.
The main objective of the definitions – and the Ouroboros model as a whole – is to separate mechanism (the what) from policy (the how) so that we have objective definitions and a consistent framework for reasoning about functions and protocols in computer networks.
The importance of first principles
One word of caution, because this model might read like I’m “reinventing the wheel” and we already know how to do everything that is written here. Of course we do! The point is that the model reduces networking to its essence, to its fundamental parts.
After studying most courses on Computer Networks, I could name the 7 layers of the OSI model, draw TCP 3-way handshakes, detail 5 different TCP congestion control mechanisms, calculate optimal IP subnets given a set of underlying Local Area Networks, draw UDP headers, chain firewall rules in iptables, calculate CRC checksums, and derive spanning trees given the MAC addresses of Ethernet bridges. But after all that, I still feel such courses teach about as much about computer networks as cookbooks teach about chemistry. I wanted to go beyond technology and the rote knowledge of how things work to establish a thorough understanding of why they work.
During most of my PhD work at the engineering department, I spent my research time on modeling telecommunications networks and computer networks as graphs. The nodes represented some switch or router – either physical or virtual –, the links represented a cable or wire – again either physical or virtual – and then the behaviour of various technologies was simulated on those graphs to develop algorithms that analyze some behaviour or optimize some key performance indicator (KPI). This line of reasoning, starting from networked devices, is how a lot of research on computer networks is conducted. But what happens if we turn this upside down, and develop a universal model for computer networks starting from first principles?
This sums up my problem with computer networks today: not everything in their workings can be fully derived from first principles. It also sums up why I was attracted to RINA: it was the first time I saw a network architecture as the result of a solid attempt to derive everything from first principles. And it’s also why Ouroboros is not RINA: RINA still contains things that can’t be derived from first principles.
Two types of layers
The Ouroboros model postulates that there are only 2 scalable methods of distributing packets in a network layer: FORWARDING packets based on some label, or FLOODING packets on all links but the incoming link.
We call an element that forwards a forwarding element, implementing a packet forwarding function (PFF). The PFF has as input a destination name for another forwarding element (represented as a vertex), and as output a set of output links (represented as arcs) on which the incoming packet with that label is to be forwarded. The destination name needs to be in a packet header.
We call an element that floods a flooding element, and it implements a packet flooding function. The flooding element is completely stateless: it has as input the incoming arc, and as output all non-incoming arcs. Packets on a broadcast layer do not need a header at all.
Forwarding elements are equal and need to be named; flooding elements are identical and do not need to be named [1].
Peering relationships are only allowed between forwarding elements, or between flooding elements, but never between a forwarding element and a flooding element. We call a connected graph consisting of nodes that hold forwarding elements a unicast layer, and similarly we call a connected tree [2] consisting of nodes that house a flooding element a broadcast layer.
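The two element types can be captured in a minimal sketch. This is illustrative Python, not part of any Ouroboros API; all names are my own.

```python
# Illustrative sketch of the two element types; names are mine, not
# taken from the Ouroboros prototype.

def flood(in_link, links):
    """Flooding element: stateless; a packet arriving on one link is
    sent out on every other link."""
    return [l for l in links if l != in_link]

class ForwardingElement:
    """Forwarding element: maps the destination name found in a packet
    header to a set of output links (here backed by a simple dict)."""
    def __init__(self, pff):
        self.pff = pff  # destination name -> set of output links

    def forward(self, dst):
        return self.pff.get(dst, set())
```

Note that the flooding element needs no state and no packet header, while the forwarding element needs a name space and per-destination information.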
The objective is for the Ouroboros model to hold for all packet networks; our conjecture is that all scalable packet-switched network technologies can be decomposed into finite sets of unicast and broadcast layers. Implementations of unicast and broadcast layers can easily be found in TCP/IP, the Recursive InterNetwork Architecture (RINA), Delay Tolerant Networks (DTN), Ethernet, VLANs, Loc/Id split (LISP),… [3]. The Ouroboros model by itself is not recursive. What is known as recursive networking is a choice to use a single standard API for interacting with all implementations of unicast layers and a single standard API for interacting with all implementations of broadcast layers [4].
The unicast layer
A unicast layer is a collection of interconnected nodes that implement forwarding elements. A unicast layer provides a best-effort unicast packet service between two endpoints in the layer. We call the abstraction of this point-to-point unicast service a flow. A flow in itself has no guarantees in terms of reliability [5].
A representation of a very simple unicast layer is drawn above, with a flow between the green (bottom left) and red (top right) forwarding elements.
The forwarding function operates in such a way that, given the label of the destination forwarding element (in the case of the figure, a red label), the packet will move to the destination forwarding element (red) in a deliberate manner. The paper has a precise mathematical definition, but qualitatively, our definition of FORWARDING ensures that the trajectory that packets follow through a network layer between source and destination
- doesn’t need to use the ‘shortest’ path
- can use multiple paths
- can use different paths for different packets between the same source-destination pair
- can involve packet duplication
- will not have non-transient loops [6][7]
The first question is: what information does that forwarding function need in order to work? Mathematically, the answer is that each forwarding element needs to know the values of a valid distance function [8] between itself and the destination forwarding element, and between each of its neighbors and the destination forwarding element. The PFF can then select a (set of) link(s) to any of its neighbors that are closer to the destination forwarding element according to the chosen distance function and send the packet on these link(s). Thus, while the forwarding elements need to be _named_, the links between them need to be _measured_. This can be either explicit, by assigning a certain weight to a link, or implicit, inferred from the distance function itself.
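As a sketch, a PFF could select output links like this, given some distance function d. This is a hedged illustration of the selection rule, not the prototype’s implementation:

```python
def select_links(d, self_name, neighbours, dst):
    """Forward on links to neighbours that are strictly closer to the
    destination than this element is, according to distance function d.
    Always moving strictly closer rules out non-transient loops."""
    my_dist = d(self_name, dst)
    return {n for n in neighbours if d(n, dst) < my_dist}
```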
The second question is: how will that forwarding function know this distance information? There are a couple of different possible answers, which are all well understood. I’ll briefly summarize them here.
A first approach is to use a coordinate space for the names of the forwarding elements. For instance, if we use the GPS coordinates of the machine in which they reside as a name, then we can apply some basic geometry to calculate the distances based on this name alone. This simple GPS example has pitfalls, but it has been proven that any connected finite graph has a greedy embedding in the hyperbolic plane. The obvious benefit of such so-called geometric routing approaches is that they don’t require any dissemination of information beyond the mathematical function to calculate distances, the coordinate (name), and the set of neighboring forwarding elements. In such networks, this information is disseminated during initial exchanges when a new forwarding element joins a unicast layer (see below).
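A toy version of geometric forwarding, using the Euclidean plane for simplicity (the hyperbolic plane is what makes greedy embeddings work for any connected finite graph):

```python
import math

def greedy_next_hops(coord, self_name, neighbours, dst):
    """Greedy geometric forwarding: distances are computed from the
    names (coordinates) themselves, so no routing table is needed.
    coord maps an element name to an (x, y) coordinate."""
    def d(a, b):
        (x1, y1), (x2, y2) = coord[a], coord[b]
        return math.hypot(x1 - x2, y1 - y2)
    my_dist = d(self_name, dst)
    return {n for n in neighbours if d(n, dst) < my_dist}
```

In the Euclidean plane, greedy forwarding can get stuck in a local minimum where no neighbour is closer; the hyperbolic embedding result is exactly what guarantees this never happens.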
A second approach is to disseminate the values of the distance function to all destinations directly, and to constantly update your own (shortest) distances from the values received from other forwarding elements. This is a very well-known mechanism and is implemented by what are known as distance vector protocols. It is also well known that the naive approach of only disseminating the distances to neighbors can run into a count-to-infinity issue when links go down. To alleviate this, path vector protocols include a full path to every destination (making them a bit less scalable), or distance vector protocols are augmented with mechanisms to avoid transient loops and the resulting count-to-infinity (e.g. Babel).
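The core of a distance vector protocol is a single relaxation step, sketched here under the usual Bellman-Ford assumptions (names are illustrative):

```python
def dv_update(my_dist, link_cost, advertised):
    """One distance-vector step: a neighbour reachable over a link of
    cost link_cost advertises its own distance table; adopt any route
    via that neighbour that beats what we already have."""
    updated = dict(my_dist)
    for dst, d in advertised.items():
        via = link_cost + d
        if dst not in updated or via < updated[dst]:
            updated[dst] = via
    return updated
```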
The third approach is to disseminate the link weights of neighboring links. From this information, each forwarding element can build a view of the network graph and again calculate the necessary distances that the forwarding function needs. This mechanism is implemented in so-called link-state protocols.
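With a full link-state view, each forwarding element can compute the required distances itself, classically with Dijkstra’s algorithm; a minimal sketch:

```python
import heapq

def shortest_distances(adj, src):
    """Dijkstra on the disseminated link-state graph. adj maps a node
    to a dict of {neighbour: link weight}; the result is exactly the
    distance information the forwarding function needs."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```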
I will also mention MAC learning here. MAC learning is a bit different, in that it uses piggybacked information from the actual traffic (the source MAC address) and the knowledge that the adjacency graph is a tree as input for the forwarding function.
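A sketch of MAC learning: the source address of live traffic tells the bridge which link leads to that source, and unknown destinations fall back to flooding, which is only safe because the graph is a tree. Again, illustrative Python:

```python
def mac_learn(table, src, dst, in_link, links):
    """Learn the link on which src was seen; forward to the learned
    link for dst, or flood when dst is still unknown."""
    table[src] = in_link  # piggybacked on the actual traffic
    if dst in table:
        return [table[dst]]
    return [l for l in links if l != in_link]  # fall back to flooding
```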
There is plenty more to say about this, and I will, but first, I will need to introduce some other concepts, most notably the broadcast layer.
The broadcast layer
A broadcast layer is a collection of interconnected nodes that house flooding elements. A node can have either, both, or neither of the sender and receiver roles. A broadcast layer provides a best-effort broadcast packet service from sender nodes to all (receiver) nodes in the layer.
Our simple definition of FLOODING – given a set of adjacent links, send packets received on a link in the set on all other links in the set – has a huge implication for the properties of a fundamental broadcast layer: the graph must always be a tree, or packets could travel along infinite trajectories with loops [9].
We now define 2 fundamental operations for constructing packet network layers: enrollment and adjacency management. These operations are very broadly defined and can be implemented in a myriad of ways: through manual configuration or automated protocol interactions; they can be skipped (a no-operation, nop) or involve complex procedures such as authentication. The main objective here is just to establish some common terminology for these operations.
The first mechanism, enrollment, adds a (forwarding or flooding) element to a layer: it prepares a node to act as a functioning element of the layer and, in the case of a unicast layer, establishes its name. In addition, it may exchange some key parameters (for instance, the distance function of a unicast layer), and it can involve authentication and the setting of roles and permissions. Bootstrapping is a special case of enrollment for the first node in a layer. The inverse operation is called unenrollment.
After enrollment, we may add peering relationships by creating adjacencies between forwarding elements in a unicast layer or between flooding elements in a broadcast layer. This establishes neighbors and, in the case of a unicast layer, may additionally define link weights. The inverse operation is tearing down adjacencies between elements. Together, these operations will be referred to as adjacency management.
Operations such as merging and splitting layers can be decomposed into these two operations. This doesn’t mean that merge operations shouldn’t be researched. On the contrary, optimizing them will be instrumental for creating networks on a global scale.
For the broadcast layer, we already have most ingredients in place. Now we will focus on the unicast layer.
Scaling the unicast layer
Let’s look at how to scale implementations of the packet forwarding function (PFF). On the one hand, in distance vector, path vector and link state, the PFF is implemented as a table, which we call the packet forwarding table (PFT). On the other hand, geometric routing doesn’t need a table and can implement the PFF as a mathematical equation operating on the forwarding element names. In this respect, geometric routing looks like a magic bullet for routing table scalability – it doesn’t need a table – but there are many challenges relating to the complexity of calculating greedy embeddings of graphs that are not static (a changing network where routers and end-hosts enter and leave, and links can fail and return after repair) that currently make these solutions impractical at scale. We will focus on the solutions that use a PFT.
The way the unicast layer is now defined, the PFT scales linearly with the number of forwarding elements (n) in the layer; its space complexity is O(n) [10]. The obvious solution to any student of computer networks is to use a scheme like IP and Classless InterDomain Routing (CIDR), where the host addresses are subnetted, allowing entries in the PFT to be aggregated, drastically reducing its space complexity – in theory at least – to O(log(n)). So we should not use arbitrary names for the forwarding elements, but give them an address!
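To illustrate the aggregation, here is a toy longest-prefix-match lookup over bit-string addresses. This is illustrative only; real CIDR operates on 32-bit IP addresses with netmasks:

```python
def lookup(pft, addr):
    """Longest-prefix match: the PFT maps bit-string prefixes to output
    links, so many destinations can share a single table entry."""
    best = None
    for prefix, link in pft.items():
        if addr.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return pft[best] if best is not None else None

# Four-bit addresses: each two-bit prefix entry covers four destinations.
pft = {'00': 'link-left', '01': 'link-right'}
```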
Sure, that is the solution, but not so fast! When building a model, each element in the model should be well-defined and named at most once. Synonyms for human use are allowed and useful, but they are conveniences, not part of the functioning of the model. If we subdivide the name of the forwarding element into different subnames, as is done in hierarchical addressing, we have to ask ourselves what element in the model each subname is naming! In the geometric routing example above, we dragged the Earth into the model, and used GPS coordinates (latitude and longitude) in the name. But where do subnets come from, and what are addresses? What do we drag into our model, if anything, to create them?
A quick recap
Let’s recap what a simple unicast layer that uses forwarding elements with a packet forwarding table looks like in the model. First, we have the unicast layer itself, consisting of a set of forwarding elements with defined adjacencies. Recall that the necessary and sufficient condition for the unicast layer to be able to forward packets between any (source, sink)-pair is that all forwarding elements can deduce the values of a distance function between themselves and the sink, and between each of their neighbors and the sink. This means that such a unicast layer requires an additional (optional) element that distributes this routing information. Let’s call it the Routing Element [11], and assume that it implements a simple link-state routing protocol. The RE is drawn as a turquoise element accompanying each forwarding element in the figure above. Now, each routing element needs to disseminate information to all other nodes in the layer; in other words, it needs to broadcast link-state information. The RE is inside a unicast layer, and unicast layers don’t do that, so the REs will need the help of a broadcast layer. That is what is drawn in the figure above. At first this may look weird, but an IP network does this too! For instance, the Open Shortest Path First (OSPF) protocol uses IP multicast between OSPF routers. I will refer to my blog post on multicast if you would like a bit more elaboration on how this maps to the IP world.
Subdividing the unicast layer
1. In the paper we call these elements data transfer protocol machines, but I think this terminology is clearer.
2. A tree is a connected graph with N vertices and N-1 edges.
3. I’ve already explored how some technologies map to the Ouroboros model in my blog post on unicast vs multicast.
4. Of course, once the model is properly understood and a green-field scenario is considered, recursive networking is the obvious choice, and so the Ouroboros prototype is a recursive network.
5. This is where Ouroboros is similar to IP, and differs from RINA. RINA layers (DIFs) aim to provide reliability as part of the service (flow). We found this approach in RINA to be severely flawed, preventing RINA from being a universal model for all networking and IPC. RINA can be modeled as an Ouroboros network, but Ouroboros cannot be modeled as a RINA network. I’ve written about this in more detail in my blog post on Ouroboros vs RINA.
6. Transient loops are loops that occur due to forwarding functions momentarily having different views of the network graph, for instance due to delays in disseminating information on unavailable links.
7. Some may think it’s possible to build a network layer that forwards packets in a way that deliberately takes a couple of loops between a set of nodes and then continues forwarding to the destination, violating the definition of FORWARDING. It’s not possible, because based on the destination address alone, there is no way to know whether that packet came from the loop or not. “But if I add a token/identifier/cookie to the packet header” – yes, that is possible, and it may look like that packet is traversing a loop in the network, but it doesn’t violate the definition. The question is: what is that token/identifier/cookie naming? It can only be one of a couple of things: a forwarding element, a link, or the complete layer. Adding a token and the associated logic to process it is equivalent to adding nodes to the layer (modifying the node name space to include that token) or adding another layer. In essence, the implementation of the nodes on the loop will be doing something like this:
```python
if logic_based_on_token:
    pass  # behave like node (token, X)
elif other_logic_based_on_token:
    pass  # behave like node (token, Y)
else:
    pass  # and so on
```
When taking the transformation into account, the resulting layer(s) will follow the fundamental model as it is presented above. Also observe that adding such tokens may drastically increase the address space in the Ouroboros representation.
8. For the mathematically inclined, the exact formulation is in the paper, section 2.4.
9. Is it possible to broadcast on a non-tree graph by pruning in some way, shape or form? There are some things to consider. First, if the pruning is done to eliminate links in the graph – say, in the way STP prunes links on an Ethernet or VLAN – then this operation is equivalent to creating a new broadcast layer. We call this enrollment and adjacency management; it will be explained in the next sections. Second is trying to get around loops by adding the name of the (source) node plus a token/identifier/cookie as a packet header in order to detect packets that have traveled in a loop, and dropping them when they do. This kind of network fits neither the broadcast layer nor the unicast layer. But the thing is: it also doesn’t scale, as all packets need to be tracked, at least in theory, forever; and assuming packet ordering is preserved inside a layer is a big no-no. Another line of thinking may be to add a decreasing counter to avoid loops, but that goes down a similar rabbit hole. How large to set the counter? This also doesn’t scale. Such things may work for some use cases, but they don’t work in general.
10. In addition to the size of the packet forwarding tables, link state, path vector and distance vector protocols are also limited in scale because of the time delays in disseminating link state information between the nodes, and the amount of information to be disseminated. We will address this a bit later in the discourse.
11. The functionality of the Routing Element is often implemented by an unfortunate human engineer who has to subject himself to one of the most inhuman ordeals imaginable: manually calculating and typing IP destinations and netmasks into the routing tables of a wonky piece of hardware using the most ill-designed command line interface seen this side of 1974.