RSS

How does Ouroboros do anycast and multicast?

Searching for the answer to the question: why do packet networks work?
Nothing is as practical as a good theory
  -- Kurt Lewin

How does Ouroboros handle routing and how is it different from the Internet? How does it do multicast? That’s a good subject for a blog post! I assume the reader to be a bit knowledgeable about the Internet Protocol (IP) suite. I limit this discussion to IPv4, but generally speaking it’s also applicable to IPv6. Hope you enjoy the read.

Network communications is commonly split up into four classes based on the delivery model of the message. If it is a single source sending to a single receiver, it is called unicast. This is the way most of the traffic on the Internet works: a packet is forwarded to a specific destination IP address. This process is then called unicast routing. If a sender is transmitting a message to all nodes in a network, it’s called broadcast. To do this efficiently, the network will run a bunch of protocols to construct some form of spanning tree between the nodes in the network a process referred to as broadcast routing. If the destination is a specific set of receivers, it’s called multicast. Broadcast routing is augmented with a protocol to create groups of nodes, the so-called multicast group, to again create some form of a spanning tree between the group members, called multicast routing. The last class is anycast, when the destination of the communication is any single member of a group, usually the closest.

Usually these concepts are explained in an Internet/IP setting where the destinations are (groups of) IP addresses, but the concepts can also be generalized towards the naming system: resolving a (domain) name to a set of addresses, for instance, which can then be used in a multicast implementation called multidestination routing. Multidestination routing (i.e. specifying a bunch of destination addresses in a packet) doesn’t scale well.

Can we contemplate other classes? Randomcast (sending to a random destination)? Or stupidcast (sending to all destinations that don’t need to receive the message)? All kidding aside, the 4 classes above are likely to be all the sensible delivery models.

Conundrum, part I

During the development of Ouroboros, it became clearer and clearer to us that the distinction based on the delivery model is not a fundamental one. If I have to make a – definitely imperfect – analogy, it’s a bit like classifying animals by the number of eyes they have. Where two eyes is unicast, more is multicast and composite eyes broadcast. Now, it will tell you something useful about the animals if they are in the 2, multi or composite-eye class, but it’s not how biologists classify animals. Some animal orders – spiders – have members with 2, 4, 6 and 8 eyes. There are deeper, more meaningful distinctions that can be made on more fundamental grounds, such as whether the species has a backbone or not, that traces back their evolution. What are those fundamental differences for networks?

Take a minute to contemplate the following little conundrum. Take a network that is linear, e.g.

source - node - node - node - node - destination

and imagine observing a packet traveling over every link on this linear network, from source to destination. Was that communication anycast, unicast, multicast or broadcast? Now this may seem like a silly question, but it should become clearer why it’s relevant, and – in fact – fundamental. I will come back to this at the end of this post.

But first, let’s have a look at how it’s done, shall we?

Unicast

This is the basics. IP routers will forward packets based on the destination IP address (not in any special range) in their header to the host (in effect: an interface) that has been assigned that IP address. The forwarding is based on a forwarding table that is constructed using some routing protocol (OSPF/IS-IS/BGP/…). I’ll assume you know how this works, and if not, there are plenty of online resources on these protocols.

On unicast in Ouroboros, I will be pretty brief: it operates in a similar way as unicast IP: packets are forwarded to a destination address, and the layer uses some algorithm to build a forwarding table (or more general, a forwarding function). In the current implementation, unicast is based on IS-IS link-state routing with support for ECMP (equal-cost multipath). The core difference with IP is that there are no special case addresses: an address is always uniquely assigned to a single network node. To scale the layer, there can be different levels of (link-state) routing in a layer. It’s very interesting in its own right, but I’ll focus on the modus operandi in this post, which is: packets get forwarded based on an address. I’ll take a more in-depth look into Ouroboros addressing in (maybe the next) post (or you can find it in the paper.

Anycast

IP anycast is a funny thing. It’s pretty simple: it’s just unicast, but multiple nodes (interfaces) in the network have the same address, and the shortest path algorithm used in the routing protocol will forward the packet to the nearest node (interface) with that address. The network is otherwise completely oblivious; there is no such thing as an anycast address, it’s a concept in the mind of network engineers.

Now, if your reaction is that can’t possibly work, you’re absolutely right! Equal length paths can lead to route flapping, where some packets would be delivered over here and other packets over there. That’s why IP anycast is not something that anyone can do. I can’t run this server somewhere in Germany and a clone of it in Denver, and yet another clone in Singapore, and give them the same IP address. IP anycast is therefore highly restricted to some select cases, most notably DNS, NTP and some big Content Delivery Networks (CDNs). There is a certain level of trust needed between BGP peers, and BGP routers are monitored to remove routes that exhibit flapping. In addition, NTP and DNS use protocols that are UDP-based with a simple request-response mechanism, so sending subsequent packets to a different server isn’t a big problem. CDN providers go to great traffic engineering lengths to configure their peering relations in such a way that the anycast routes are stable. IP anycast “works” because there are a lot of assumptions and it’s engineered – mostly through human interactions – into a safe zone of operations1. In the case of DNS in particular, IP anycast is an essential part of the Internet. Being close to a root DNS server impacts response times! The alternative would be to specify a bunch of alternate servers to try. But it’s easier to remember 9.9.9.9 than a whole list of random IP addresses where you have to figure out where they are! IP anycast also offers some protection against network failures in case the closest server becomes unreachable, but this benefit is relatively small as the convergence times of the routing protocols (OSPF/BGP) are on the order of minutes (and should be). That’s why most hosts usually have 2 DNS servers configured, because relying on anycast could mean a couple of minutes without DNS.

Now, about anycast in Ouroboros, I can again be brief: I won’t allow multiple nodes with the same address in a layer in the prototype, as this doesn’t scale. Anycast is supported by name resolution. A service can be registered at different locations (addresses) and resolving such a name will return a (subset of) address(es) from the locations. If a flow allocation fails to the closest address, it can be repeated to the next one. Name resolution is an inherent function of a unicast layer, and currently implemented as a Distributed Hash Table (DHT). When joining a network (we call this operation enrolment, Kademlia calls it join), a list of known DHT node addresses is passed. The DHT stores its <name, address> entries in multiple locations (in the prototype this number is 8) and the randomness of the DHT hash assignment in combination with caching ensures the proximity of the most popular lookups with reasonable probability.

Broadcast

IP broadcast is also a funny thing, as it’s not really IP that’s doing the broadcasting. It’s a coating of varnish on top of layer 2 broadcast. So let’s first look at Ethernet broadcast.

Ethernet broadcast does what you would expect from a broadcasting solution. Note that switched Ethernets are confined to a loop-free topology by grace of the (Rapid) Spanning-Tree Protocol. A message to the reserved MAC address FF:FF:FF:FF:FF:FF will be broadcasted by every intermediate Layer 2 node to all nodes (interfaces) that are connected on that Ethernet segment. If VLANs are defined, the broadcast is confined to all nodes (interfaces) on that VLAN. Quite nice, no objections your honor!

The semantics of IP broadcast are related to the scope of the underlying layer 2 network. An IP broadcast address is the last “IP address” in a subnet. So, for instance, in the 192.168.0.0/24 subnet, the IP broadcast address is 192.168.0.255. When sending a datagram to that IP broadcast destination, the Ethernet layer will be sending it to FF:FF:FF:FF:FF:FF, and every node on that Ethernet which has an IP address in the 192.168.0.0/24 network will receive it. You’d be forgiven to think that an IP broadcast to 255.255.255.255 should be spread to every host on the Internet, but for obvious reasons that’s not the case. The semantic of 0.0.0.0/0 is to mean your own local IP subnet on that specific interface. The DHCP protocol, for instance, makes use of this. A last thing to mention is that, in theory, you could send IP broadcast messages to a different subnet, but few routers allows this, because it invites some very obvious smurf attacks. Excuse me for finding it more than mildly amusing that standards originally required routers to forward directed broadcast packets!

So, in practice,IP broadcast is a passthrough interface towards layer 2 (Ethernet, Wi-Fi, …) broadcast.

In Ouroboros – like in Ethernet – broadcast is a rather simple affair. It is facilitated by the broadcast layer, for which each node implements a broadcast function: what comes in on one flow, goes out on all others. The implementation is a stateless layer that – also like Ethernet – requires the graph to be a tree. But it has no addresses – in fact, it doesn’t even have a header at all! Access control is part of enrolment, where participants in the broadcast layer get read and/or write access to the broadcast layer based on credentials. Every message on a broadcast layer is actually broadcast. This is the only way – at least that I know of – to make a broadcast layer scaleable to billions of receivers!2

So here is the first clue to the answer to the little conundrum at the beginning of this post. The Ouroboros model makes a distinction between unicast layers and broadcast layers, based on the nature of the algorithm applied to the message. If it’s based on a destination address in the message, we call the algorithm FORWARDING, and if it’s sending on all interfaces except the incoming one, we call the algorithm FLOODING.

An application like ‘ping’, where one broadcasts a message to a bunch of remotes, and each one responds back requires both a broadcast layer and a unicast layer of (at least) the same size, with the ‘ping’ application using both3. Tools like ping and traceroute and nmap are administrative tools which reveal network information. They should only be available to administrators.

It’s not prohibited to implement an IPCP that does both broadcast (to the complete layer) and unicast. In fact, the unicast IPCP in the prototype in some sense does it, as we only figured out broadcast layers after we implemented the link-state protocol, which is effectively broadcasting link-state messages within the unicast layer. All it would take is to implement the flow_join() API in the unicast IPCP and send those packets like we send Link-State Messages. But I won’t do it, for a number of reasons: the first is that it is rare to have broadcast layers and unicast layers to be exactly the same size. Usually broadcast layers will be much smaller. The second is that, in the current implementation, the link-state messages are stateful: they have the source address and a sequence number to be able to deal with loops in the meshed topology. This doesn’t scale to the full unicast layer. To create a scaleable do-both-unicast-and-broadcast layer, it would require to create a “virtual tree-topology-network” within the unicast layer, which is adjacency management. This would require an adjacency management module (running something like a hypothetical RSTP that is able to scale to billions of nodes) as part of the unicast IPCP. Adjacency management is functionality that was removed – we called it the graph adjaceny manager and the logic put outside of the IPCP and replaced with a connect API so it could be scripted as part of network management. And the third, and most important, is that we like the prototype to reflect the model, as it is more educational. Unicast layers and broadcast layers are different layers. Always have been, and always will be. Combining them in an implementation only obfuscates this fact. To make a long (and probably confusing) story short, combining unicast and broadcast in a single IPCP can be done, but at present I don’t see any real benefit in doing it, and I’m pretty sure it will be hard to avoid making a mess out of it.

This transitions us nicely into multicast. Because combining unicast and multicast in the same layer is exactly what IP tries to do.

Multicast

Before looking at IP, let’s first have a look at how one would do multicast in Ethernet, because it’s simpler.

The simplest way to achieve multicast within Ethernet 802.1Q is using a VLAN: Configure VLANs on the end hosts and switch, and then just broadcast packets on that VLAN. The Ethernet II (MAC) frame will look like this:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| FF:FF:FF:FF:FF:FF | SRC | 0x8100 | TCI | Ethertype | ..
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The 0x8100 en lieu of the Ethertype specifies that it’s a VLAN, the Tag Control Information (TCI) has 12 bits that specify a VLAN ID, so one can have 4096 parallel VLANs. There are two fields needed to achieve the multicast nature of the communications: The destination set to the broadcast address, and a VLAN ID that will only be assigned to members of the group.

Now, it won’t come as a surprise to you, but IP multicast is a funny thing. The gist of it is that there are protocols that do group management and protocols that assist in building a (spanning) tree. There is a range of IP addresses, 224.0.0.0 – 239.255.255.255 (or in binary: starting with the 1110), called class D, which are only allowed as destination addresses. This class D is further subdivided in different ranges for different functionalities, such as source-specific multicast. An IPv4 multicast packet can be distinguished by a single field: a destination address in the class D range.

If we compare this with Ethernet above, the class D IP address is behaving more like the VLAN ID than the destination MAC address. The reason IP doesn’t need an extra destination address is that the broadcast functionality is implied by the class D address range, whereas a VLAN also supports unicast via MAC learning.

Ethernet actually also has a special address range for multicast, 01:00:5E:00:00:00 to 01:00:5E:7F:FF:FF, that copies the last 23 bits of the IP multicast address when that host joins the multicast tree. The reasoning behind it is this: if there are multiple endpoints for an IP multicast tree on the same Ethernet segment, instead of the IP router sending individual unicast messages to each of them, that last “hop” can use a single Ethernet broadcast message.

Next Ouroboros. From the discussion of the Ouroboros broadcast layer, you probably already figured out how Ouroboros does multicast. The same as broadcast! There is literally zero difference. The only difference between multicast and broadcast is in the eye of the beholder when comparing a unicast layer and a broadcast layer.

There is something else to remember about (Ouroboros) broadcast layers: the broadcast function is stateless, and all broadcast IPCPs are identical in function. The reason I mention this, is in relation the problem that I just mentioned above. What if I have a broadcast layer, for which a number of endpoints are also connected over a lower broadcast layer? Can we, like IP/Ethernet, leverage this? And the answer is: no, there is no sharing of information between layers, and broadcast layers have no state. But we don’t really need to! If there is a device with a broadcast IPCP in a lower broadcast layer, just add a broadcast IPCP to the higher level broadcast layer! It’s not a matter of functionality, since the functionality for the higher level broadcast layer is exactly the same as the lower one.

While I am not eager to mix broadcast and unicast in a single IPCP program, I have few objections for creating a single program that behaves like multiple IPCPs of the same type. Especially for the stateless broadcast IPCP it would be rather trivial to make a single program that implements parallel broadcast IPCPs. And allowing something like sublayers (like VLANs, with a single tag) is also something that can be considered for optimization purposes.

Conundrum, part II

Now, let’s look back at our little riddle, with a packet observed to move from source to destination over a linear network.

source - node - node - node - node - destination

Now, if we pedantically apply the definition of one-to-one communication given in most textbooks, it is unicast, since it has only a single source and a single destination. But to know what’s going on at the routing level, can not be known. But I hope you gave it some thought about what information you’d need to be able to tell.

Let’s start with Ethernet. The Ethernet standard says that all MAC addresses are unique, so it’s not anycast, and there is no real difference between multicast and broadcast. So, if the address is not the broadcast address or in some special range, it’s unicast, else it’s multi/broadcast. But really? What if the nodes were hubs instead of switches?

What about IP? Bit harder. If it was anycast, it wouldn’t have reached the destination if there was another node with the same address in this particular setup. But in a general IP network, it’s not really possible to tell the difference between unicast and anycast without looking at all reachable node addresses. To know if it was broadcast or multicast, it would suffice to know the destination address in the packet.

For Ouroboros, all you’d need to know what was going on is the type of layer. To detect anycast, one would need to query the directory to check if it returns a single or multiple destination addresses (since we don’t allow anycast routing), and, like Ethernet in a way, it makes the distinction between multicast and broadcast rather moot.

The Ouroboros model

In a nutshell, what does the Ouroboros model say?

First, all communications is composed of either unicast or broadcast, and these two types of communications are fundamentally different and give rise to distinct types of layers. In a unicast layer, nodes implement FORWARDING, which moves packets based on a destination address. In a broadcast layer, nodes implement FLOODING, which sends incoming packets out on all links except for the incoming link.

If we leave out the physical layer (wiring, spectrum tuning etc), constructing a layer goes through 2 phases: adding a node to a network layer (enrolment) and adding links (by which I mean allowed adjacencies) between that node and other members of the layer (adjacency management). After this the node becomes active in the network. During enrolment, nodes can be authenticated and permissions are acquired such as read/write access. Both types of layers go through this phase. A unicast layer, may, in addition, periodically disseminate information that enables FORWARDING. We call this dissemination function ROUTING, but if you know a better word that avoids confusion, we’ll take it. ROUTING is distinct from adjacency management, in the sense that adjacency management is administrative, and tells the networks which links it is allowed to use, which links exist. ROUTING will make use of these links and make decisions when they are unavailable, for instance due to failures.

Let’s apply the Ouroboros model to Ethernet. Ethernet implements both types of layers. Enrolment and topology management are taken care of by the spanning tree protocol. It might be tempting to think that STP does only topology management, but that’s not really the case. Just add a new root bridge to an Ethernet network: at some point, that network will go down completely. The default operation of Ethernet is as a broadcast layer: the default function applied to a packet is FLOODING. To allow unicast, Ethernet implements ROUTING via MAC learning. MAC learning is thus a highly specific routing protocol, that piggybacks its dissemination information on user frames, and calculates the forwarding tables based on the knowledge that the underlying topology is a tree. This brings a caveat: it only detects sending interfaces. If there are receivers on an Ethernet that never send a frame (but for which the senders know the MAC address), that traffic will always be broadcast. And in any case, the first packet will always be broadcast.

Next, VLAN. In Ouroboros speak, a VLAN is an implementation detail (important, of course) to logically combine parallel Ethernets. VLANs are independent layers, and indeed, must be enrolled (VLAN IDs set on bridge interfaces) and will keep their own states of ®STP and MAC learning. Without VLAN, Ethernet is thus a single broadcast layer and a single multicast layer. With VLAN, Ethernet is potentially 4096 broadcast layers and 4096 unicast layers.

If we apply the Ouroboros model to IP, we again see that IP tries to implement both unicast and broadcast. A lot is configured manually. Enrolment and adjacency management are basically assigning IP addresses and, in addition, adding BGP routes and rules. IP has two levels of ROUTING, one is inside an autonomous system using link-state protocols such as OSPF, and on top there is BGP, which is disseminating routes as path vectors. Multicast in IP is building a broadcast layer, which is identified using a “multicast address”, which is really the name of that broadcast layer. Enrolment into this virtual broadcast layer is handled via protocols such as IGMP, with adjacencies managed in many possible ways that involve calculating spanning trees based on internal topology information from OSPF or other means. The tree is then grafted into the routing table by labeling outgoing interfaces with the name of the broadcast layer. Yes, that is what adding a multicast destination address to an IP forwarding table is really doing! It’s just hidden in plain sight!

Now, my claim is that the Ouroboros model can be applied to any packet network technology. To conclude this post, let’s take a real tricky one: Multi-Protocol Label Switching (MPLS).

MPLS looks very different from Ethernet and IP. It doesn’t have addresses at all, but uses labels, and it can swap and stack labels.

Now, surely, MPLS doesn’t fit the unicast layer, which says that every node gets an address, and forwards based on the destination address. Here’s the solution to MPLS: it is a set of broadcast layers! The labels are a distributed way of identifying the layer by its links, instead of a single identifier for the whole layer, like a VLAN or a multicast IP address. RSVP / LDP (and their traffic engineering -TE cousins) are protocols that do enrolment and adjacency management.

I hope this gave you a bit of an insight into the Ouroboros view of the world. Course materials on computer networks consist of a compendium of technologies and how they work. The Ouroboros model is an attempt to figure out why they work.

Stay curious.

Dimitri


  1. I’m sure someone has or will propose some AI to solve it. [return]
  2. Individual links on a broadcast layer can be protected with retransmission and using multi-path routing in the underlying unicast layer.

    [return]
  3. Now that I’m writing this, I put it on my todo list to implement this into the oping application.

    [return]