---
date: 2020-12-19
title: "Exploring Ouroboros with Wireshark"
linkTitle: "Exploring Ouroboros with Wireshark"
description: ""
author: Dimitri Staessens
---

I recently did some
[quick tests](/blog/2020/12/12/congestion-avoidance-in-ouroboros/#mb-ecn-in-action)
with the new congestion avoidance implementation, and thought to
myself that it was a shame that Wireshark could not identify the
Ouroboros flows, as that could give me some nicer graphs.

Just to be clear, I think generic network tools like tcpdump and
Wireshark -- however informative and nice-to-use they are -- are a
symptom of a lack of network security. The whole point of Ouroboros is
that it is _intentionally_ designed to make it hard to analyze network
traffic. Ouroboros is not a _network stack_[^1]: one can't simply dump
a packet from the wire and derive the packet contents all the way up
to the application by following identifiers for protocols and
well-known ports. Using encryption to hide the network structure in
the packet is closing the stable door after the horse has bolted.

To write an Ouroboros dissector, one needs to know the layered
structure of the network at the capturing point at that specific point
in time: it requires information from the Ouroboros runtime on the
capturing machine, at the exact time of the capture, to correctly
analyze the traffic flows. I just wrote a dissector that works for my
specific setup[^2].

## Congestion avoidance test

First, a quick refresher on the experiment layout: it's the same
4-node experiment as in the
[previous post](/blog/2020/12/12/congestion-avoidance-in-ouroboros/#mb-ecn-in-action).

{{<figure width="80%" src="/blog/news/20201219-exp.svg">}}

I tried to draw the setup as best I could in the figure above.

There are 4 rack-mounted 1U servers, connected over Gigabit Ethernet
(GbE). Physically there is a big switch connecting all of them, but
each "link" is separated as a port-based VLAN, so there are 3
independent Ethernet segments. We create 3 Ethernet _layers_, drawn
in a lighter gray, with a single unicast layer -- consisting of 4
unicast IPC processes (IPCPs) -- on top, drawn in a darker shade of
gray. The link between the router and server has been capped to 100
megabit/s using ```ethtool```[^3], and traffic is captured on the
Ethernet NIC at the "Server" node using ```tcpdump```. All traffic is
generated with our _constant bit rate_ ```ocbr``` tool trying to send
about 80 Mbit/s of application-level throughput over the unicast
layer.

{{<figure width="80%" src="/blog/news/20201219-congestion.png">}}

The graph above shows the bandwidth, as captured on the congested
100 Mbit/s Ethernet link, separated per traffic flow, from the same
pcap capture as in my previous post. A flow can be identified by a
(destination address, endpoint ID) pair, and since the destination is
all the same, I could filter out the flows by simply selecting them
based on the (64-bit) endpoint identifier.

What you're looking at: first, a flow (green) starts; at around
T=14s, a new flow (red) enters that stops at around T=24s. At around
T=44s, another flow (blue) enters for about 14 seconds, and finally, a
fourth flow (orange) enters at T=63s. The first (green) flow exits at
around T=70s, leaving all the available bandwidth for the orange flow.

The most important thing that I wanted to check was, when there are
multiple flows, _if_ and _how fast_ they would converge to the same
bandwidth. I'm not dissatisfied with the initial result: the answers
seem to be _yes_ and _pretty fast_, with no observable oscillation to
boot[^4].
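For the curious: splitting the capture into the per-flow graph above
is exactly what the dissector enables, and hooking a protocol like
this into Wireshark is not much code. Below is a minimal sketch of the
Ethernet-side half in C, against the Wireshark epan API. The
field-registration boilerplate (`proto_register_field_array` and
friends) is omitted, the byte order is an assumption, and all names
here are invented -- for the real thing, see the branch in footnote 2.

```c
#include <epan/packet.h>

static int  proto_oeth;   /* Ouroboros over Ethernet II (ipcpd-eth-dix) */
static int  hf_oeth_eid;  /* 16-bit N+1 endpoint ID                     */
static int  hf_oeth_len;  /* 16-bit payload length                      */
static gint ett_oeth;

/* Called for frames carrying the configured Ouroboros Ethertype.
   The tvb starts right after the Ethernet MAC header. */
static int dissect_oeth(tvbuff_t *tvb, packet_info *pinfo,
                        proto_tree *tree, void *data _U_)
{
    proto_item *ti;
    proto_tree *oeth_tree;

    col_set_str(pinfo->cinfo, COL_PROTOCOL, "O7S-ETH");

    ti        = proto_tree_add_item(tree, proto_oeth, tvb, 0, 4, ENC_NA);
    oeth_tree = proto_item_add_subtree(ti, ett_oeth);

    /* The two 16-bit fields added after the MAC header. */
    proto_tree_add_item(oeth_tree, hf_oeth_eid, tvb, 0, 2, ENC_BIG_ENDIAN);
    proto_tree_add_item(oeth_tree, hf_oeth_len, tvb, 2, 2, ENC_BIG_ENDIAN);

    /* ... hand the remaining bytes to the unicast-layer dissector ... */

    return tvb_captured_length(tvb);
}

void proto_reg_handoff_oeth(void)
{
    /* 0xA000 is the Ouroboros default Ethertype used in this setup. */
    dissector_add_uint("ethertype", 0xA000,
                       create_dissector_handle(dissect_oeth, proto_oeth));
}
```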
## Protocol overview

Now, the Wireshark dissector can be used to present some more details
about the Ouroboros protocols in a familiar setting -- making it more
accessible to some -- so let's have a quick look.

The Ouroboros network protocol has
[5 fields](/docs/concepts/protocols/#network-protocol):

```
| DST | TTL | QOS | ECN | EID |
```

which we had to map to the Ethernet II protocol for our ipcpd-eth-dix
implementation. The basic Ethernet II MAC (layer-2) header is pretty
simple: it has two 6-byte addresses (dst, src) and a 2-byte Ethertype.

Since Ethernet doesn't do QoS or congestion, the main missing field
here is the EID. We could have mapped it to the Ethertype, but we
noticed that a lot of routers and switches drop unknown Ethertypes
(and, for the purposes of this blog post: it would have all but
prevented writing the dissector). So we made the Ethertype
configurable per layer (so it can be set to a value that is not
blocked by the network), and added two 16-bit fields after the
Ethernet MAC header for an Ouroboros layer:

* Endpoint ID **eid**, which works just like in the unicast layer, to
  identify the N+1 application (in our case: a data transfer flow and
  a management flow for a unicast IPC process).

* A length field **len**, which is needed because Ethernet NICs pad
  frames that are smaller than 64 bytes in length with trailing zeros
  (and we receive these zeros in our code). A length field is present
  in IEEE 802.3 framing, but since most "layer 3" protocols also had a
  length field, Ethernet II uses those bytes for the Ethertype
  instead. The value of the **len** field is the length of the
  **data** payload.

The Ethernet layer that spans that 100 Mbit/s link has Ethertype
0xA000 set (which is the Ouroboros default); the dissector hooks into
that Ethertype.

On top of the Ethernet layer, we have a unicast layer with the 5
fields specified above. The dissector also shows the contents of the
flow allocation messages, which are (currently) sent to EID = 0.

So, the protocol headers as analysed in the experiment are, starting
from the "wire":

```
+---------+---------+-----------+-----+-----+------
| dst MAC | src MAC | Ethertype | eid | len | data       /* ETH LAYER */
+---------+---------+-----------+-----+-----+------

  <IF eid != 0>            /* eid == 0 -> ipcpd-eth flow allocator, */
                           /* this is not analysed                  */

+-----+-----+-----+-----+-----+------
| DST | QOS | TTL | ECN | EID | DATA                 /* UNICAST LAYER */
+-----+-----+-----+-----+-----+------

  <IF EID == 0>            /* EID == 0 -> flow allocator */

+-----+-------+-------+------+------+-----+-------------+
| SRC | R_EID | S_EID | CODE | RESP | ECE | ... QOS ....|      /* FA */
+-----+-------+-------+------+------+-----+-------------+
```
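To make the layout concrete, here is the same header stack as C
structs -- purely as an illustration for this particular setup. Field
order and widths follow the capture breakdown in the next section;
remember that in Ouroboros these widths (the 32-bit DST, notably) are
layer configuration, not protocol constants, and the type names are
mine:

```c
#include <stdint.h>

/* Ouroboros eth-dix header, directly after the MAC header. */
struct oeth_hdr {
        uint16_t eid;   /* N+1 endpoint (data transfer / mgmt flow)  */
        uint16_t len;   /* payload length, survives Ethernet padding */
} __attribute__((packed));

/* Unicast layer header as configured in this experiment. */
struct uni_hdr {
        uint32_t dst;   /* destination address (32 bits here)        */
        uint8_t  qos;   /* QoS cube, 0 = best effort                 */
        uint8_t  ttl;   /* starts at 60 in this layer                */
        uint8_t  ecn;   /* congestion marking                        */
        uint64_t eid;   /* 64-bit flow endpoint ID                   */
} __attribute__((packed));

/* 4 + 15 = 19 bytes of Ouroboros overhead on top of the 14-byte
   Ethernet II header: the 33 bytes measured below. */
```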
## The network protocol

{{<figure width="80%" src="/blog/news/20201219-ws-0.png">}}

We will first have a look at packets captured around the point in time
where the second (red) flow enters the network, about 14 seconds into
the capture. The "N+1 Data" packets in the image above all belong to
the green flow. The ```ocbr``` tool that we use sends 1000-byte data
units that are zeroed-out. The packet captured on the wire is 1033
bytes in length, so we have a protocol overhead of 33 bytes[^5]. We
can break this down as follows:

```
 ETHERNET II HEADER                 / 14 /
   6 bytes Ethernet II dst
   6 bytes Ethernet II src
   2 bytes Ethernet II Ethertype
 OUROBOROS ETH-DIX HEADER           /  4 /
   2 bytes eid
   2 bytes len
 OUROBOROS UNICAST NETWORK HEADER   / 15 /
   4 bytes DST
   1 byte  QOS
   1 byte  TTL
   1 byte  ECN
   8 bytes EID
 --- TOTAL                          / 33 /
  33 bytes
```

The **Data (1019 bytes)** reported by Wireshark is what Ethernet II
sees as data, and thus includes the 19 bytes for the two Ouroboros
headers. Note that the DST length is configurable, currently up to 64
bits.

Now, let's have a brief look at the values for these fields. The
**eid** is 65: this means that the _data-transfer flow_ established
between the unicast IPCPs on the router and the server (_uni-r_ and
_uni-s_ in our experiment figure) is identified by endpoint ID 65 in
the eth-dix IPCP on the Server machine. The **len** is 1015. Again, no
surprises: this is the length of the Ouroboros unicast network header
(15 bytes) plus the 1000-byte payload.

**DST**, the destination address, is 4135366193, a 32-bit address
that was randomly assigned to the _uni-s_ IPCP. The **QOS** cube is 0,
which is the default best-effort QoS class. **TTL** is 59: the
starting TTL is configurable for a layer, the default is 60, and it
was decremented by 1 in the _uni-r_ process on the router node. The
packet experienced no congestion (**ECN** is 0), and the endpoint ID
is a 64-bit random number, 475...56. This endpoint ID identifies the
flow endpoint for the ```ocbr``` server.

## The flow request

{{<figure width="80%" src="/blog/news/20201219-ws-1.png">}}

The first "red" packet that was captured is the one for the flow
allocation request, **FLOW REQUEST**[^6]. As mentioned before, the
endpoint ID for the flow allocator is 0.

A rather important remark is in order here: Ouroboros does not allow a
UDP-like _datagram service_ from a layer, by which I mean: fabricating
a packet with the correct destination address and some known EID and
dumping it into the network. All traffic that is offered to an
Ouroboros layer requires a _flow_ to be allocated. This keeps the
network layer in control of its resources; the protocol details inside
a layer are a secret to that layer.

Now, what about that well-known EID=0 for the flow allocator (FA)? And
the directory (Distributed Hash Table, DHT) for that matter, which is
currently on EID=1? Doesn't that contradict the "no datagram service"
statement above? Well, no. These components are part of the layer and
are thus inside the layer. The DHT and FA are internal components;
they are direct clients of the Data Transfer component. The globally
known EID for these components is an absolute necessity, since they
need to be able to reach endpoints more than a hop (i.e. a flow in a
lower layer) away.

Let's now look inside that **FLOW REQUEST** message. We know it is a
request from the **msg code** field[^7].

This is the **only** packet that contains the source (and destination)
address for this flow. There is a small twist: this value is decoded
with a different _endianness_ than the address in the DT protocol
output (probably a bug in my dissector). The source address 232373199
in the FA message corresponds to the address 3485194509 in the DT
protocol (and in the experiment image at the top): the source of our
red flow is the "Client 2" node. Since this is a **FLOW REQUEST**, the
remote endpoint ID is not yet known, and set to 0[^8]. The source
endpoint ID -- a 64-bit randomly generated value unique to the source
IPC process[^9] -- is sent to the remote. The other fields are not
relevant for this message.
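Those two address values are in fact exact 32-bit byte swaps of one
another (232373199 is 0x0DD9BBCF, 3485194509 is 0xCFBBD90D), which is
easy to verify with a few lines of C:

```c
#include <stdint.h>
#include <stdio.h>

/* Swap the four bytes of a 32-bit value. */
static uint32_t bswap32(uint32_t x)
{
        return  (x >> 24)               |
               ((x >>  8) & 0x0000FF00) |
               ((x <<  8) & 0x00FF0000) |
                (x << 24);
}

int main(void)
{
        uint32_t fa_addr = 232373199;     /* address in the FA message  */

        printf("%u\n", bswap32(fa_addr)); /* prints 3485194509, the     */
                                          /* address in the DT header   */
        return 0;
}
```

Which supports the "probably a bug in my dissector" suspicion: the FA
address field is simply being rendered in the wrong byte order.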
## The flow reply

{{<figure width="80%" src="/blog/news/20201219-ws-2.png">}}

Now, the **FLOW REPLY** message for our request. It originates from
our machine, so you will notice that the TTL is the starting value of
60. The destination address is what we sent in our original **FLOW
REQUEST** -- modulo the same endianness shenanigans. The **FLOW
REPLY** message sends back the newly generated source endpoint[^10]
ID, and this packet is the **only** packet that contains both endpoint
IDs for this flow.

## Congestion / flow update

{{<figure width="80%" src="/blog/news/20201219-ws-3.png">}}

Now a quick look at the congestion avoidance mechanisms. The
information for the Additive Increase / Multiplicative Decrease
algorithm is gathered from the **ECN** field in the packets. When both
flows are active, they experience congestion, since the requested
bandwidth from the two ```ocbr``` clients (about 160 Mbit/s combined)
exceeds the 100 Mbit/s link, and the figure above shows a packet
marked with an ECN value of 11.

{{<figure width="80%" src="/blog/news/20201219-ws-4.png">}}

When the packets on a flow experience congestion, the flow allocator
at the endpoint (the one in our _uni-s_ IPCP) will update the sender
with an **ECE** _Explicit Congestion Experienced_ value; in this case,
297. The higher this value, the quicker the sender will decrease its
sending rate. The algorithm is explained a bit in my previous
post.
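As a mental model of what the sender does with that **ECE** value, a
toy AIMD loop is sketched below. To be clear, this is not the
Ouroboros implementation: the constants, scaling, and update cadence
are invented for illustration; only the shape -- additive increase
when unmarked, a decrease that deepens with ECE -- follows the
description above.

```c
#include <stdint.h>

/* Toy sender state: a single sending-rate budget in bits/s. */
struct ca_state {
        uint64_t rate;
};

#define CA_STEP  (1UL << 20)  /* additive increase per update (made up) */
#define CA_SCALE 65536        /* assumed full-scale ECE value (made up) */

/* Called once per flow-update period with the latest reported ECE. */
static void ca_update(struct ca_state *s, uint32_t ece)
{
        if (ece == 0) {
                s->rate += CA_STEP;          /* additive increase */
                return;
        }

        /* Multiplicative decrease: a higher ECE cuts the rate deeper,
           matching "the higher this value, the quicker the sender
           will decrease its sending rate". */
        s->rate -= s->rate * ece / (2 * CA_SCALE);
}
```

Scaling the decrease with the reported congestion level is what lets
competing flows back off harder on a more congested path, which is
consistent with the quick convergence seen in the first graph.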
That's it for today's post. I hope it provides some new insights into
how Ouroboros works. As always, stay curious.

Dimitri

[^1]: Neither is RINA, for that matter.

[^2]: This quick-and-dirty dissector is available in the
      ouroboros-eth-uni branch on my
      [github](https://github.com/dstaesse/wireshark/).

[^3]: The prototype is able to handle Gigabit Ethernet; the cap is
      mostly to keep the size of the capture files somewhat
      manageable.

[^4]: Of course, this needs more thorough evaluation with more
      clients, distributions on the latency, different configurations
      for the FRCP protocol in the N+1 and all that jazz. I have,
      however, limited amounts of time to spare and am currently
      focusing on building and documenting the prototype and tools so
      that more thorough evaluations can be done if someone feels like
      doing them.

[^5]: A 4-byte Ethernet Frame Check Sequence (FCS) is not included in
      the 'bytes on the wire'. As a reference, the minimum overhead
      for this kind of setup using UDP/IPv4 is 14 bytes Ethernet + 20
      bytes IPv4 + 8 bytes UDP = 42 bytes.

[^6]: Actually, in a larger network there could be some DHT traffic
      related to resolving the address, but in such a small network,
      the DHT is basically a replicated database between all 4 nodes.

[^7]: The reason it's not the first field in the protocol has to do
      with the performance of memory alignment on x86 architectures.

[^8]: We haven't optimised the FA protocol not to send fields it
      doesn't need for that particular message type -- yet.

[^9]: Not the host machine, but that particular IPCP on the host
      machine. You can have multiple IPCPs for the same layer on the
      same machine, but in this case, expect correlation between their
      addresses. 64 bits per IPCP should provide some security against
      remotes trying to hack into another service on the same host by
      guessing EIDs.

[^10]: This marks the point in space-time where I notice the
       misspelling in the dissector.