From 20df52a54fc03ef067cb4bce3e176e19129b4a84 Mon Sep 17 00:00:00 2001 From: Dimitri Staessens Date: Sun, 14 Feb 2021 17:42:22 +0100 Subject: content: Move releases to docs and add 0.18 notes --- content/en/blog/20191006-new-site.md | 7 + content/en/blog/20200116-hn.md | 30 ++ content/en/blog/20200212-ecmp.md | 68 ++++ content/en/blog/20200216-ecmp.md | 118 +++++++ content/en/blog/20200502-frcp.md | 236 ++++++++++++++ content/en/blog/20200507-python-lb.png | Bin 0 -> 218383 bytes content/en/blog/20200507-python.md | 74 +++++ content/en/blog/20201212-congestion-avoidance.md | 358 +++++++++++++++++++++ content/en/blog/20201212-congestion.png | Bin 0 -> 54172 bytes content/en/blog/20201219-congestion-avoidance.md | 313 ++++++++++++++++++ content/en/blog/20201219-congestion.png | Bin 0 -> 189977 bytes content/en/blog/20201219-exp.svg | 1 + content/en/blog/20201219-ws-0.png | Bin 0 -> 419135 bytes content/en/blog/20201219-ws-1.png | Bin 0 -> 432812 bytes content/en/blog/20201219-ws-2.png | Bin 0 -> 428663 bytes content/en/blog/20201219-ws-3.png | Bin 0 -> 417961 bytes content/en/blog/20201219-ws-4.png | Bin 0 -> 423835 bytes content/en/blog/_index.md | 8 +- content/en/blog/news/20191006-new-site.md | 7 - content/en/blog/news/20200116-hn.md | 30 -- content/en/blog/news/20200212-ecmp.md | 68 ---- content/en/blog/news/20200216-ecmp.md | 118 ------- content/en/blog/news/20200502-frcp.md | 236 -------------- content/en/blog/news/20200507-python-lb.png | Bin 218383 -> 0 bytes content/en/blog/news/20200507-python.md | 74 ----- .../en/blog/news/20201212-congestion-avoidance.md | 358 --------------------- content/en/blog/news/20201212-congestion.png | Bin 54172 -> 0 bytes .../en/blog/news/20201219-congestion-avoidance.md | 313 ------------------ content/en/blog/news/20201219-congestion.png | Bin 189977 -> 0 bytes content/en/blog/news/20201219-exp.svg | 1 - content/en/blog/news/20201219-ws-0.png | Bin 419135 -> 0 bytes content/en/blog/news/20201219-ws-1.png | Bin 432812 -> 0 bytes content/en/blog/news/20201219-ws-2.png | Bin 428663 -> 0 bytes content/en/blog/news/20201219-ws-3.png | Bin 417961 -> 0 bytes content/en/blog/news/20201219-ws-4.png | Bin 423835 -> 0 bytes content/en/blog/news/_index.md | 5 - content/en/blog/releases/_index.md | 8 - content/en/blog/releases/upcoming.md | 7 - content/en/docs/Releases/0_18.md | 109 +++++++ content/en/docs/Releases/_index.md | 6 + 40 files changed, 1321 insertions(+), 1232 deletions(-) create mode 100644 content/en/blog/20191006-new-site.md create mode 100644 content/en/blog/20200116-hn.md create mode 100644 content/en/blog/20200212-ecmp.md create mode 100644 content/en/blog/20200216-ecmp.md create mode 100644 content/en/blog/20200502-frcp.md create mode 100644 content/en/blog/20200507-python-lb.png create mode 100644 content/en/blog/20200507-python.md create mode 100644 content/en/blog/20201212-congestion-avoidance.md create mode 100644 content/en/blog/20201212-congestion.png create mode 100644 content/en/blog/20201219-congestion-avoidance.md create mode 100644 content/en/blog/20201219-congestion.png create mode 100644 content/en/blog/20201219-exp.svg create mode 100644 content/en/blog/20201219-ws-0.png create mode 100644 content/en/blog/20201219-ws-1.png create mode 100644 content/en/blog/20201219-ws-2.png create mode 100644 content/en/blog/20201219-ws-3.png create mode 100644 content/en/blog/20201219-ws-4.png delete mode 100644 content/en/blog/news/20191006-new-site.md delete mode 100644 content/en/blog/news/20200116-hn.md delete mode 100644 
content/en/blog/news/20200212-ecmp.md delete mode 100644 content/en/blog/news/20200216-ecmp.md delete mode 100644 content/en/blog/news/20200502-frcp.md delete mode 100644 content/en/blog/news/20200507-python-lb.png delete mode 100644 content/en/blog/news/20200507-python.md delete mode 100644 content/en/blog/news/20201212-congestion-avoidance.md delete mode 100644 content/en/blog/news/20201212-congestion.png delete mode 100644 content/en/blog/news/20201219-congestion-avoidance.md delete mode 100644 content/en/blog/news/20201219-congestion.png delete mode 100644 content/en/blog/news/20201219-exp.svg delete mode 100644 content/en/blog/news/20201219-ws-0.png delete mode 100644 content/en/blog/news/20201219-ws-1.png delete mode 100644 content/en/blog/news/20201219-ws-2.png delete mode 100644 content/en/blog/news/20201219-ws-3.png delete mode 100644 content/en/blog/news/20201219-ws-4.png delete mode 100644 content/en/blog/news/_index.md delete mode 100644 content/en/blog/releases/_index.md delete mode 100644 content/en/blog/releases/upcoming.md create mode 100644 content/en/docs/Releases/0_18.md create mode 100644 content/en/docs/Releases/_index.md diff --git a/content/en/blog/20191006-new-site.md b/content/en/blog/20191006-new-site.md new file mode 100644 index 0000000..c04ff2d --- /dev/null +++ b/content/en/blog/20191006-new-site.md @@ -0,0 +1,7 @@ +--- +date: 2019-10-06 +title: "New Website" +linkTitle: "New Ouroboros website" +description: "Announcing the new website" +author: Dimitri Staessens +--- diff --git a/content/en/blog/20200116-hn.md b/content/en/blog/20200116-hn.md new file mode 100644 index 0000000..b80a7bd --- /dev/null +++ b/content/en/blog/20200116-hn.md @@ -0,0 +1,30 @@ +--- +date: 2020-01-16 +title: "Getting back to work" +linkTitle: "Getting back to work" +description: "Show HN - Ouroboros" +author: Dimitri Staessens +--- + +Yesterday there was a bit of an unexpected spike in interest in +Ouroboros following a [post on +HN](https://news.ycombinator.com/item?id=22052416). I'm really +humbled by the response and grateful to all the people that show +genuine interest in this project. + +I fully understand that people would like to know a lot more details +about Ouroboros than the current site provides. It was the top +priority on the todo list, and this new interest gives me some +additional motivation to get to it. There's a lot to Ouroboros that's +not so trivial, which makes writing clear documentation a tricky +thing to do. + +I will also tackle some of the questions from the HN in a series of +blog posts in the next few days, replacing the (very old and outdated) +FAQ section. I hope these will be useful. + +Again thank you for your interest. + +Sincerely, + +Dimitri diff --git a/content/en/blog/20200212-ecmp.md b/content/en/blog/20200212-ecmp.md new file mode 100644 index 0000000..019b40d --- /dev/null +++ b/content/en/blog/20200212-ecmp.md @@ -0,0 +1,68 @@ +--- +date: 2020-02-12 +title: "Equal-Cost Multipath (ECMP)" +linkTitle: "Adding Equal-Cost multipath (ECMP)" +description: "ECMP is coming to Ouroboros (finally)" +author: Dimitri Staessens +--- + +Some recent news -- Multi-Path TCP (MPTCP) implementation is [landing +in mainstream Linux kernel +5.6](https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-Starts-Multipath-TCP) +-- finally got me to integrate the equal-cost multipath (ECMP) +implementation from [Nick Aerts's master +thesis](https://lib.ugent.be/nl/catalog/rug01:002494958) into +Ouroboros. 
And working on the ECMP implementation gives me an +excuse to rant a little bit about MPTCP. + +The first question that comes to mind is: _Why is it called +multi-**path** TCP_? IP is routing packets, not TCP, and there are +equal-cost multipath options for IP in both [IS-IS and +OSPF](https://tools.ietf.org/html/rfc2991). Maybe _multi-flow TCP_ +would be a better name? This would also be more transparent about the +fact that running MPTCP over longer hops will make less sense, since +the paths are more likely to converge over the same link. + +So _why is there a need for multi-path TCP_? The answer, of course, is +that the Internet Protocol routes packets between IP endpoints, which +are _interfaces_, not _hosts_. So, if a server is connected over 4 +interfaces, ECMP routing will not be of any help if one of them goes +down. The TCP connections will time out. Multipath TCP, however, is +actually establishing 4 subflows, each over a different interface. If +an interface goes down, MPTCP will still have 3 subflows ready. The +application is listening on the main TCP connection, and will not +notice a TCP-subflow timing out[^1]. + +This brings us, of course, to the crux of the problem. IP names the +[point of attachment](https://tools.ietf.org/html/rfc1498); IP +addresses are assigned to interfaces. Another commonly used workaround +is a virtual IP interface on the loopback, but then you need a lot of +additional configuration (and if that were the perfect solution, one +wouldn't need MPTCP!). MPTCP avoids the network configuration mess, +but does require direct modification in the application using +[additions to the sockets +API](https://tools.ietf.org/html/draft-hesmans-mptcp-socket-03) in the +form of a bunch of (ugly) setsockopts. + +Now this is a far from ideal situation, but given its constraints, +MPTCP is a workable engineering solution that will surely see its +uses. It's strange that it took years for MPTCP to get to this stage. + +Now, of course, Ouroboros does not assign addresses to +points-of-attachments ( _flow endpoints_). It doesn't even assign +addresses to hosts/nodes! Instead, the address is derived from the +forwarding protocol machines inside each node. (For the details, see +the [article](https://arxiv.org/pdf/2001.09707.pdf)). The net effect +is that an ECMP routing algorithm can cleanly handle hosts with +multiple interfaces. Details about the routing algorithm are not +exposed to application APIs. Instead, Ouroboros applications request +an implementation-independent _service_. + +The ECMP patch for Ouroboros is coming _soon_. Once it's available I +will also add a couple of tutorials on it. + +Peace. + +Dimitri + +[^1]: Question: Why are the subflows not UDP? That would avoid a lot of duplicated overhead (sequence numbers etc)... Would it be too messy on the socket API side? \ No newline at end of file diff --git a/content/en/blog/20200216-ecmp.md b/content/en/blog/20200216-ecmp.md new file mode 100644 index 0000000..ce632c9 --- /dev/null +++ b/content/en/blog/20200216-ecmp.md @@ -0,0 +1,118 @@ +--- +date: 2020-02-16 +title: "Equal-Cost Multipath (ECMP) routing" +linkTitle: "Equal-Cost multipath (ECMP) example" +description: "A very quick example of ECMP" +author: Dimitri Staessens +--- + +As promised, I added equal cost multipath routing to the Ouroboros +unicast IPCP. I will add some more explanations later when it's fully +tested and merged into the master branch, but you can already try it. +You will need to pull the _be_ branch.
You will also need to have +_fuse_ installed to monitor the flows from _/tmp/ouroboros/_. The +following script will bootstrap a 4-node unicast network on your +machine that routes using ECMP: + +```bash +#!/bin/bash + +# create a local IPCP. This emulates the "Internet" +irm i b t local n local l local + +#create the first unicast IPCP with ecmp +irm i b t unicast n uni.a l net routing ecmp + +#bind the unicast IPCP to the names net and uni.a +irm b i uni.a n net +irm b i uni.a n uni.a + +#register these 2 names in the local IPCP +irm n r net l local +irm n r uni.a l local + +#create 3 more unicast IPCPs, and enroll them with the first +irm i e t unicast n uni.b l net +irm b i uni.b n net +irm b i uni.b n uni.b +irm n r uni.b l local + +irm i e t unicast n uni.c l net +irm b i uni.c n net +irm b i uni.c n uni.c +irm n r uni.c l local + +irm i e t unicast n uni.d l net +irm b i uni.d n net +irm b i uni.d n uni.d +irm n r uni.d l local + +#connect uni.b to uni.a this creates a DT flow and a mgmt flow +irm i conn name uni.b dst uni.a + +#now do the same for the others, creating a square +irm i conn name uni.c dst uni.b +irm i conn name uni.d dst uni.c +irm i conn name uni.d dst uni.a + +#register the oping application at 4 different locations +#this allows us to check the multipath implementation +irm n r oping.a i uni.a +irm n r oping.b i uni.b +irm n r oping.c i uni.c +irm n r oping.d i uni.d + +#bind oping program to oping names +irm b prog oping n oping.a +irm b prog oping n oping.b +irm b prog oping n oping.c +irm b prog oping n oping.d + +#good to go! +``` + +In order to test the setup, start an irmd (preferably in a terminal so +you can see what's going on). In another terminal, run the above +script and then start an oping server: + +```bash +$ ./ecmpscript +$ oping -l +Ouroboros ping server started. +``` + +This single server program will accept all flows for oping from any of +the unicast IPCPs. Ouroboros _multi-homing_ in action. + +Open another terminal, and type the following command: + +```bash +$ watch -n 1 'grep "sent (packets)" /tmp/ouroboros/uni.a/dt.*/6* | sed -n -e 1p -e 7p' +``` + +This will show you the packet statistics from the 2 data transfer +flows from the first IPCP (uni.a). + +On my machine it looks like this: + +``` +Every 1,0s: grep "sent (packets)" /tmp/ouroboros/uni.a/dt.*/6* | sed -n -e 1p -e 7p + +/tmp/ouroboros/uni.a/dt.1896199821/65: sent (packets): 10 +/tmp/ouroboros/uni.a/dt.1896199821/67: sent (packets): 6 +``` + +Now, from yet another terminal, run connect an oping client to oping.c +(the client should attach to the first IPCP, so oping.c should be the +one with 2 equal cost paths) and watch both counters increase: + +```bash +oping -n oping.c -i 100ms +``` + +When you do this to the other destinations (oping.b and oping.d) you +should see only one of the flow counters increasing. + +Hope you enjoyed this little demo! 
+ +Dimitri diff --git a/content/en/blog/20200502-frcp.md b/content/en/blog/20200502-frcp.md new file mode 100644 index 0000000..28c5794 --- /dev/null +++ b/content/en/blog/20200502-frcp.md @@ -0,0 +1,236 @@ +--- +date: 2020-05-02 +title: "Flow and Retransmission Control Protocol (FRCP) implementation" +linkTitle: "Flow and Retransmission Control Protocol (FRCP)" +description: "A quick demo of FRCP" +author: Dimitri Staessens +--- + +With the longer weekend I had some fun implementing (parts of) the +[Flow and Retransmission Control Protocol (FRCP)](/docs/concepts/protocols/#flow-and-retransmission-control-protocol-frcp) +to the point that it's stable enough to bring you a very quick demo of it. + +FRCP is the Ouroboros alternative to TCP / QUIC / LLC. It assures +delivery of packets when the network itself isn't very reliable. + +The setup is simple: we run Ouroboros over the Ethernet loopback +adapter _lo_, +``` +systemctl restart ouroboros +irm i b t eth-dix l dix n dix dev lo +``` +to which we add some impairment +[_qdisc_](http://man7.org/linux/man-pages/man8/tc-netem.8.html): + +``` +$ sudo tc qdisc add dev lo root netem loss 8% duplicate 3% reorder 10% delay 1 +``` + +This causes the link to lose, duplicate and reorder packets. + +We can use the oping tool to uses different [QoS +specs](https://ouroboros.rocks/cgit/ouroboros/tree/include/ouroboros/qos.h) +and watch the behaviour. Quality-of-Service (QoS) specs are a +technology-agnostic way to request a network service (current +status - not finalized yet). I'll also capture tcpdump output. + +We start an oping server and tell Ouroboros for it to listen to the _name_ "oping": +``` +#bind the program oping to the name oping +irm b prog oping n oping +#register the name oping in the Ethernet layer that is attached to the loopback +irm n r oping l dix +#run the oping server +oping -l +``` + +We'll now send 20 pings. If you try this, it can be that the flow +allocation fails, due to the loss of a flow allocation packet (a bit +similar to TCP losing the first SYN). The oping client currently +doesn't retry flow allocation. The default payload for oping is 64 +bytes (of zeros); oping waits 2 seconds for all packets it has +sent. It doesn't detect duplicates. + +Let's first look at the _raw_ QoS cube. That's like best-effort +UDP/IP. In Ouroboros, however, it doesn't require a packet header at +all. + +First, the output of the client using a _raw_ QoS cube: +``` +$ oping -n oping -c 20 -i 200ms -q raw +Pinging oping with 64 bytes of data (20 packets): + +64 bytes from oping: seq=0 time=0.880 ms +64 bytes from oping: seq=1 time=0.742 ms +64 bytes from oping: seq=4 time=1.303 ms +64 bytes from oping: seq=6 time=0.739 ms +64 bytes from oping: seq=6 time=0.771 ms [out-of-order] +64 bytes from oping: seq=6 time=0.789 ms [out-of-order] +64 bytes from oping: seq=7 time=0.717 ms +64 bytes from oping: seq=8 time=0.759 ms +64 bytes from oping: seq=9 time=0.716 ms +64 bytes from oping: seq=10 time=0.729 ms +64 bytes from oping: seq=11 time=0.720 ms +64 bytes from oping: seq=12 time=0.718 ms +64 bytes from oping: seq=13 time=0.722 ms +64 bytes from oping: seq=14 time=0.700 ms +64 bytes from oping: seq=16 time=0.670 ms +64 bytes from oping: seq=17 time=0.712 ms +64 bytes from oping: seq=18 time=0.716 ms +64 bytes from oping: seq=19 time=0.674 ms +Server timed out. 
+ +--- oping ping statistics --- +20 packets transmitted, 18 received, 2 out-of-order, 10% packet loss, time: 6004.273 ms +rtt min/avg/max/mdev = 0.670/0.765/1.303/0.142 ms +``` + +The _netem_ did a good job of jumbling up the traffic! There were a +couple out-of-order, duplicates, and quite some packets lost. + +Let's dig into an Ethernet frame captured from the "wire". The most +interesting thing its small total size: 82 bytes. + +``` +13:37:25.875092 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype Unknown (0xa000), length 82: + 0x0000: 0042 0040 0000 0001 0000 0011 e90c 0000 .B.@............ + 0x0010: 0000 0000 203f 350f 0000 0000 0000 0000 .....?5......... + 0x0020: 0000 0000 0000 0000 0000 0000 0000 0000 ................ + 0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ + 0x0040: 0000 0000 +``` + +The first 12 bytes are the two MAC addresses (all zeros), then 2 bytes +for the "Ethertype" (the default for an Ouroboros layer is 0xa000, so +you can create more layers and seperate them by Ethertype[^1]. The +Ethernet Payload is thus 68 bytes. The Ouroboros _ipcpd-eth-dix_ adds +and extra header of 4 bytes with 2 extra "fields". The first field we +needed to take over from our [Data +Transfer](/docs/concepts/protocols/) protocol: the Endpoint Identifier +that identifies the flow. The _ipcpd-eth-dix_ has two endpoints, one +for the client and one for the server. 0x0042 (66) is the destination +EID of the server, 0x0043 (67) is the destination EID of the client. +The second field is the _length_ of the payload in octets, 0x0040 = +64. This is needed because Ethernet II has a minimum frame size of 64 +bytes and pads smaller frames (called _runt frames_)[^2]. The +remaining 64 bytes are the oping payload, giving us an 82 byte packet. + +That's it for the raw QoS. The next one is _voice_. A voice service +usually requires packets to be delivered with little delay and jitter +(i.e. ASAP). Out-of-order packets are rejected since they cause +artifacts in the audio output. The voice QoS will enable FRCP, because +it needs to track sequence numbers. + +``` +$ oping -n oping -c 20 -i 200ms -q voice +Pinging oping with 64 bytes of data (20 packets): + +64 bytes from oping: seq=0 time=0.860 ms +64 bytes from oping: seq=2 time=0.704 ms +64 bytes from oping: seq=3 time=0.721 ms +64 bytes from oping: seq=4 time=0.706 ms +64 bytes from oping: seq=5 time=0.721 ms +64 bytes from oping: seq=6 time=0.710 ms +64 bytes from oping: seq=7 time=0.721 ms +64 bytes from oping: seq=8 time=0.691 ms +64 bytes from oping: seq=10 time=0.691 ms +64 bytes from oping: seq=12 time=0.702 ms +64 bytes from oping: seq=13 time=0.730 ms +64 bytes from oping: seq=14 time=0.716 ms +64 bytes from oping: seq=15 time=0.725 ms +64 bytes from oping: seq=16 time=0.709 ms +64 bytes from oping: seq=17 time=0.703 ms +64 bytes from oping: seq=18 time=0.693 ms +64 bytes from oping: seq=19 time=0.666 ms +Server timed out. + +--- oping ping statistics --- +20 packets transmitted, 17 received, 0 out-of-order, 15% packet loss, time: 6004.243 ms +rtt min/avg/max/mdev = 0.666/0.716/0.860/0.040 ms +``` + +As you can see, packets are delivered in-order, and some packets are +missing. Nothing fancy. Let's look at a data packet: + +``` +14:06:05.607699 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype Unknown (0xa000), length 94: + 0x0000: 0045 004c 0100 0000 eb1e 73ad 0000 0000 .E.L......s..... + 0x0010: 0000 0000 0000 0012 a013 0000 0000 0000 ................ 
+ 0x0020: 705c e53a 0000 0000 0000 0000 0000 0000 p\.:............ + 0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ + 0x0040: 0000 0000 0000 0000 0000 0000 0000 0000 ................ + +``` + +The same 18-byte header is present. The flow endpoint ID is a +different one, and the length is also different. The packet is 94 +bytes, the payload length for the _ipcp-eth_dix_ is 0x004c = 76 +octets. So the FRCP header adds 12 bytes, the total overhead is 30 +bytes. Maybe a bit more detail on the FRCP header contents (more depth +is available the protocol documentation). The first 2 bytes are the +FLAGS (0x0100). There are only 7 flags, it's 16 bits for memory +alignment. This packet only has the DATA bit set. Then follows the +flow control window, which is 0 (not implemented yet). Then we have a +4 byte sequence number (eb1e 73ae = 3944641454)[^3] and a 4 byte ACK +number, which is 0. The remaining 64 bytes are the oping payload. + +Next, the data QoS: + +``` +$ oping -n oping -c 20 -i 200ms -q data +Pinging oping with 64 bytes of data (20 packets): + +64 bytes from oping: seq=0 time=0.932 ms +64 bytes from oping: seq=1 time=0.701 ms +64 bytes from oping: seq=2 time=200.949 ms +64 bytes from oping: seq=3 time=0.817 ms +64 bytes from oping: seq=4 time=0.753 ms +64 bytes from oping: seq=5 time=0.730 ms +64 bytes from oping: seq=6 time=0.726 ms +64 bytes from oping: seq=7 time=0.887 ms +64 bytes from oping: seq=8 time=0.878 ms +64 bytes from oping: seq=9 time=0.883 ms +64 bytes from oping: seq=10 time=0.865 ms +64 bytes from oping: seq=11 time=401.192 ms +64 bytes from oping: seq=12 time=201.047 ms +64 bytes from oping: seq=13 time=0.872 ms +64 bytes from oping: seq=14 time=0.966 ms +64 bytes from oping: seq=15 time=0.856 ms +64 bytes from oping: seq=16 time=0.849 ms +64 bytes from oping: seq=17 time=0.843 ms +64 bytes from oping: seq=18 time=0.797 ms +64 bytes from oping: seq=19 time=0.728 ms + +--- oping ping statistics --- +20 packets transmitted, 20 received, 0 out-of-order, 0% packet loss, time: 4004.491 ms +rtt min/avg/max/mdev = 0.701/40.864/401.192/104.723 ms +``` + +With the data spec, we have no packet loss, but some packets have been +retransmitted (hence the higher latency). The reason for the very high +latency is that the current implementation only ACKs on data packets, +this will be fixed soon. + +Looking at an Ethernet frame, it's again 94 bytes: + +``` +14:35:42.612066 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype Unknown (0xa000), length 94: + 0x0000: 0044 004c 0700 0000 81b8 0259 e2f3 eb59 .D.L.......Y...Y + 0x0010: 0000 0000 0000 0012 911a 0000 0000 0000 ................ + 0x0020: 86b3 273b 0000 0000 0000 0000 0000 0000 ..';............ + 0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ + 0x0040: 0000 0000 0000 0000 0000 0000 0000 0000 ................ + +``` + +The main difference is that it has 2 flags set (DATA + ACK), and it +thus contains both a sequence number (81b8 0259) and an +acknowledgement (e2f3 eb59). + +That's about it for now. More to come soon. + +Dimitri + +[^1]: Don't you love standards? One of the key design objectives for Ouroboros is exactly to avoid such shenanigans. Modify/abuse a header and Ouroboros should reject it because it _cannot work_, not because some standard says one shouldn't do it. +[^2]: Lesser known fact: Gigabit Ethernet has a 512 byte minimum frame size; but _carrier extension_ handles this transparently. +[^3]: In _network byte order_. 
\ No newline at end of file diff --git a/content/en/blog/20200507-python-lb.png b/content/en/blog/20200507-python-lb.png new file mode 100644 index 0000000..89e710e Binary files /dev/null and b/content/en/blog/20200507-python-lb.png differ diff --git a/content/en/blog/20200507-python.md b/content/en/blog/20200507-python.md new file mode 100644 index 0000000..2b05c91 --- /dev/null +++ b/content/en/blog/20200507-python.md @@ -0,0 +1,74 @@ +--- +date: 2020-05-07 +title: "A Python API for Ouroboros" +linkTitle: "Python" +description: "Python" +author: Dimitri Staessens +--- + +Support for other programming languages than C/C++ has been on my todo +list for quite some time. The initial approach was using +[SWIG](http://www.swig.org), but we found the conversion always +clunky, it didn't completely work as we wanted to, and a while back we +just decided to deprecate it. Apart from C/C++ we only had a [rust +wrapper](https://github.com/chritchens/ouroboros-rs). + +Until now! I finally took the time to sink my teeth into the bindings +for Python. I had some brief looks at the +[ctypes](https://docs.python.org/3/library/ctypes.html) library a +while back, but this time I looked into +[cffi](https://cffi.readthedocs.io/en/latest/) and I was amazed at how +simple it was to wrap the more difficult functions that manipulate +blocks of memory (flow\_read, but definitely the async fevent() call). +And now there is path towards a 'nice' Python API. + +Here is a taste of what the +[oecho](https://ouroboros.rocks/cgit/ouroboros/tree/src/tools/oecho/oecho.c) +tool looks like in Python: + +```Python +from ouroboros import * +import argparse + + +def client(): + f = flow_alloc("oecho") + f.writeline("Hello, PyOuroboros!") + print(f.readline()) + f.dealloc() + + +def server(): + print("Starting the server.") + while True: + f = flow_accept() + print("New flow.") + line = f.readline() + print("Message from client is " + line) + f.writeline(line) + f.dealloc() + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='A simple echo client/server') + parser.add_argument('-l', '--listen', help='run as a server', action='store_true') + args = parser.parse_args() + if args.listen is True: + server() + else: + client() +``` + +I have more time in the next couple of days, so I expect this to be +released after the weekend. + +Oh, and here is a picture of Ouroboros load-balancing between the C (top right) +and Python (top left) implementations using the C and Python clients: + +{{
}} + +Can't wait to get the full API done! + +Cheers, + +Dimitri diff --git a/content/en/blog/20201212-congestion-avoidance.md b/content/en/blog/20201212-congestion-avoidance.md new file mode 100644 index 0000000..0b010c5 --- /dev/null +++ b/content/en/blog/20201212-congestion-avoidance.md @@ -0,0 +1,358 @@ +--- +date: 2020-12-12 +title: "Congestion avoidance in Ouroboros" +linkTitle: "Congestion avoidance" +description: "API for congestion avoidance and the Ouroboros MB-ECN algorithm" +author: Dimitri Staessens +--- + +The upcoming 0.18 version of the prototype has a bunch of big +additions coming in, but the one that I'm most excited about is the +addition of congestion avoidance. Now that the implementation is +reaching its final shape, I just couldn't wait to share with the world +what it looks like, so here I'll talk a bit about how it works. + +# Congestion avoidance + +Congestion avoidance is a mechanism for a network to avoid situations +where the total traffic offered on a network element (link +or node) systemically exceeds its capacity to handle this traffic +(temporary overload due to traffic burstiness is not +congestion). While bursts can be handled by adding buffers to +network elements, the solution to congestion is to reduce the ingest +of traffic at the network endpoints that are sources for the traffic +over the congested element(s). + +I won't be going into too many details here, but there are two classes +of mechanisms to inform traffic sources of congestion. One is Explicit +Congestion Notification (ECN), where information is sent to the sender +that its traffic is traversing a congested element. This is a solution +that is, for instance, used by +[DataCenter TCP (DCTCP)](https://tools.ietf.org/html/rfc8257), +and is also supported by +[QUIC](https://www.ietf.org/archive/id/draft-ietf-quic-recovery-33.txt). +The other mechanism is implicit congestion detection, for instance by +inferring congestion from packet loss (most TCP flavors) or increases +in round-trip-time (TCP Vegas). + +Once the sender is aware that its traffic is experiencing congestion, +it has to take action. A simple (proven) way is the AIMD algorithm +(Additive Increase, Multiplicative Decrease). When there is no sign of +congestion, senders will steadily increase the amount of traffic they +are sending (Additive Increase). When congestion is detected, they +will quickly back off (Multiplicative Decrease). Usually this is +augmented with a Slow Start (Multiplicative Increase) phase when the +sender begins to send, to reach the maximum bandwidth more +quickly. AIMD is used by TCP and QUIC (among others), and Ouroboros is +no different. It's been proven to work mathematically. + +Now that the main ingredients are known, we can get to the +preparation of the course. + +# Ouroboros congestion avoidance + +Congestion avoidance is in a very specific location in the Ouroboros +architecture: at the ingest point of the network; it is the +responsibility of the network, not the client application. In +OSI-layer terminology, we could say that in Ouroboros, it's in "Layer +3", not in "Layer 4". + +Congestion has to be dealt with for each individual traffic +source/destination pair. In TCP this is called a connection; in +Ouroboros we call it a _flow_. + +Ouroboros _flows_ are abstractions for the most basic of packet flows.
+A flow is defined by two endpoints and all that a flow guarantees is +that there exist strings of bytes (packets) that, when offered at the +ingress endpoint, have a non-zero chance of emerging at the egress +endpoint. I say 'there exist' to allow, for instance, for maximum +packet lengths. If it helps, think of flow endpoints as an IP:UDP +address:port pair (but emphatically _NOT_ an IP:TCP address:port +pair). There is no protocol assumed for the packets that traverse the +flow. To the ingress and egress point, they are just a bunch of bytes. + +Now this has one major implication: We will need to add some +information to these packets to infer congestion indirectly or +explicitly. It should be obvious that explicit congestion notification +is the simplest solution here. The Ouroboros prototype (currently) +allows an octet for ECN. + +# Functional elements of the congestion API + +This section glances over the API in an informal way. A reference +manual for the actual C API will be added after 0.18 is in the master +branch of the prototype. The most important thing to keep in mind is +that the architecture dictates this API, not any particular algorithm +for congestion that we had in mind. In fact, to be perfectly honest, +up front I wasn't 100% sure that congestion avoidance was feasible +without adding additional fields to the DT protocol, such as a +packet counter, or sending some feedback for measuring the Round-Trip +Time (RTT). But as the algorithm below will show, it can be done. + +When flows are created, some state can be stored, which we call the +_congestion context_. For now it's not important to know what state is +stored in that context. If you're familiar with the inner workings of +TCP, think of it as a black-box generalization of the _transmission +control block_. Both endpoints of a flow have such a congestion +context. + +At the sender side, the congestion context is updated for each packet +that is sent on the flow. Now, the only information that is known at +the ingress is 1) that there is a packet to be sent, and 2) the length +of this packet. The call at the ingress is thus: + +``` + update_context_at_sender +``` + +This function has to inform the caller when it is allowed to actually send the +packet, for instance by blocking for a certain period. + +At the receiver flow endpoint, we have a bit more information: 1) that +a packet arrived, 2) the length of this packet, and 3) the value of +the ECN octet associated with this packet. The call at the egress is +thus: + +``` + update_context_at_receiver +``` + +Based on this information, the receiver can decide if and when to update +the sender. We are a bit more flexible in what can be sent; at this +point, the prototype allows sending a packet (which we call +FLOW_UPDATE) with a 16-bit Explicit Congestion Experienced (ECE) field. + +This implies that the sender can get this information from the +receiver, so it knows 1) that such a packet arrived, and 2) the value +of the ECE field. + +``` + update_context_at_sender_ece +``` + +That is the API for the endpoints. In each Ouroboros IPCP (think +'router'), the value of the ECN field is updated. + +``` + update_ecn_in_router +``` + +That's about as lean as it gets. Now let's have a look at the +algorithm that I designed and +[implemented](https://ouroboros.rocks/cgit/ouroboros/tree/src/ipcpd/unicast/pol/ca-mb-ecn.c?h=be) +as part of the prototype. + +# The Ouroboros multi-bit Forward ECN (MB-ECN) algorithm + +The algorithm is based on the workings of DataCenter TCP +(DCTCP).
Before I dig into the details, I will list the main +differences, without any judgement. + +* The rate for additive increase is the same _constant_ for all flows + (but could be made configurable for each network layer if + needed). This is achieved by having a window that is independent of + the Round-Trip Time (RTT). This may make it more fair, as congestion + avoidance in DCTCP (and in most -- if not all -- TCP variants) is + biased in favor of flows with smaller RTT[^1]. + +* Because it is operating at the _flow_ level, it estimates the + _actual_ bandwidth sent, including retransmissions, ACKs and what + not from protocols operating on the flow. DCTCP estimates bandwidth + based on which data offsets are acknowledged. + +* The algorithm uses 8 bits to indicate the queue depth in each + router, instead of a single bit (due to IP header restrictions) for + DCTCP. + +* MB-ECN sends a (small) out-of-band FLOW_UPDATE packet, while DCTCP updates + in-band TCP ECN/ECE bits in acknowledgment (ACK) packets. Note that + DCTCP sends an immediate ACK with ECE set at the start of + congestion, and sends an immediate ACK with ECE not set at the end + of congestion. Otherwise, the ECE is set accordingly for any + "regular" ACKs. + +* The MB-ECN algorithm can be implemented without the need for + dividing numbers (apart from bit shifts). At least in the Linux + kernel implementation, DCTCP has a division for estimating the + number of bytes that experienced congestion from the received acks + with ECE bits set. I'm not sure this can be avoided[^2]. + +Now, on to the MB-ECN algorithm. The values for some constants +presented here have only been quickly tested; a _lot_ more scientific +scrutiny is definitely needed here to make any statements about the +performance of this algorithm. I will just explain the operation, and +provide some very preliminary measurement results. + +First, like DCTCP, the routers mark the ECN field based on the +outgoing queue depth. The current minimum queue depth to trigger an +ECN mark is 16 packets (implemented as a bit shift of the queue size when +writing a packet). We perform a logical OR with the previous value of +the ECN field in the packet. If the width of the ECN field were a single bit, this +operation would be identical to DCTCP. + +At the _receiver_ side, the context maintains two state variables. + +* The floating sum (ECE) of the value of the (8-bit) ECN field over the +last 2^N packets is maintained (currently N=5, so 32 +packets). This is a value between 0 and 2^(8 + 5) - 1. + +* The number of packets received during a period of congestion. This + is just for internal use. + +If the ECE value is 0, no actions are performed at the receiver. + +If this ECE value becomes higher than 0 (there is some indication of +start of congestion), an immediate FLOW_UPDATE is sent with this +value. If a packet arrives with ECN = 0, the ECE value is _halved_. + +For every _increase_ in the ECE value, an immediate update is sent. + +If the ECE value remains stable or decreases, an update is sent only +every M packets (currently, M = 8). This is what the counter is for. + +If the ECE value returns to 0 after a period of congestion, an +immediate FLOW_UPDATE with the value 0 is sent. + +At the _sender_ side, the context keeps track of the actual congestion +window. The sender keeps track of: + +* The current sender ECE value, which is updated when receiving a + FLOW_UPDATE. + +* A bool indicating Slow Start, which is set to false when a + FLOW_UPDATE arrives. + +* A sender-side packet counter.
If this exceeds the value of N, the + ECE is reset to 0. This protects the sender from lost FLOW_UPDATES + that signal the end of congestion. + +* The window size multiplier W. For all flows, the window starts at a + predetermined size, 2^W ns. Currently W = 24, starting at + about 16.8ms. The power of 2 allows us to perform operations on the + window boundaries using bit shift arithmetic. + +* The current window start time (a single integer), based on the + multiplier. + +* The number of packets sent in the current window. If this is below a + PKT_MIN threshold before the start of a window period, the new + window size is doubled. If this is above a PKT_MAX threshold before + the start of a new window period, the new window size is halved. The + thresholds are currently set to 8 and 64, scaling the window width + to average sending ~36 packets in a window. When the window scales, + the value for the allowed bytes to send in this window (see below) + scales accordingly to keep the sender bandwidth at the same + level. These values should be set with the value of N at the + receiver side in mind. + +* The number of bytes sent in this window. This is updated when sending + each packet. + +* The number of allowed bytes in this window. This is calculated at + the start of a new window: doubled at Slow Start, multiplied by a + factor based on sender ECE when there is congestion, and increased + by a fixed (scaled) value when there is no congestion outside of + Slow Start. Currently, the scaled value is 64KiB per 16.8ms. + +There is one caveat: what if no FLOW_UPDATE packets arrive at all? +DCTCP (being TCP) will time out at the Retransmission TimeOut (RTO) +value (since its ECE information comes from ACK packets), but this +algorithm has no such mechanism at this point. The answer is that we +currently do not monitor flow liveness from the flow allocator, but a +Keepalive or Bidirectional Forwarding Detection (BFD)-like mechanism +for flows should be added for QoS maintenance, and can serve to +time out the flow and reset it (meaning a full reset of the +context). + +# MB-ECN in action + +From version 0.18 onwards[^3], the state of the flow -- including its +congestion context -- can be monitored from the flow allocator +statistics: + +```bash +$ cat /tmp/ouroboros/unicast.1/flow-allocator/66 +Flow established at: 2020-12-12 09:54:27 +Remote address: 99388 +Local endpoint ID: 2111124142794211394 +Remote endpoint ID: 4329936627666255938 +Sent (packets): 1605719 +Sent (bytes): 1605719000 +Send failed (packets): 0 +Send failed (bytes): 0 +Received (packets): 0 +Received (bytes): 0 +Receive failed (packets): 0 +Receive failed (bytes): 0 +Congestion avoidance algorithm: Multi-bit ECN +Upstream congestion level: 0 +Upstream packet counter: 0 +Downstream congestion level: 48 +Downstream packet counter: 0 +Congestion window size (ns): 65536 +Packets in this window: 7 +Bytes in this window: 7000 +Max bytes in this window: 51349 +Current congestion regime: Multiplicative dec +``` + +I ran a quick test using the ocbr tool (modified to show stats every +100ms) on a jFed testbed using 3 Linux servers (2 clients and a +server) in star configuration with a 'router' (a 4th Linux server) in +the center. The clients are connected to the 'router' over Gigabit +Ethernet; the link between the 'router' and server is capped to 100Mb +using ethtool[^4].
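+Before going over the output: to make the sender-side bookkeeping
+described above a bit more tangible, below is a heavily simplified
+C-style sketch of it (FLOW_UPDATE handling and the ECE reset counter
+are left out). The names, constants and the exact scaling of the
+multiplicative decrease are my own guesses for illustration; the real
+code is in
+[ca-mb-ecn.c](https://ouroboros.rocks/cgit/ouroboros/tree/src/ipcpd/unicast/pol/ca-mb-ecn.c?h=be).
+
+```C
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+struct ca_ctx {                   /* sender-side congestion context  */
+        bool     slow_start;      /* cleared on first FLOW_UPDATE    */
+        uint16_t ece;             /* last reported congestion level  */
+        uint8_t  w;               /* window width is 2^w ns, w = 24  */
+        uint64_t wnd_start;       /* start of current window (ns)    */
+        uint32_t pkts;            /* packets sent in this window     */
+        uint64_t snd_bytes;       /* bytes sent in this window       */
+        uint64_t max_bytes;       /* bytes allowed in this window    */
+};
+
+static void new_window(struct ca_ctx * ctx, uint64_t now)
+{
+        uint64_t add = 64 << 10;  /* additive step: 64 KiB / 2^24 ns */
+
+        /* Rescale the window to keep roughly 8..64 packets in it,
+           scaling the byte budget along with it (bit shifts only). */
+        if (ctx->pkts < 8 && ctx->w < 63) {
+                ctx->w++;
+                ctx->max_bytes <<= 1;
+        } else if (ctx->pkts > 64 && ctx->w > 0) {
+                ctx->w--;
+                ctx->max_bytes >>= 1;
+        }
+
+        if (ctx->slow_start) {                /* multiplicative incr. */
+                ctx->max_bytes <<= 1;
+        } else if (ctx->ece > 0) {
+                /* Multiplicative decrease: ece is at most ~2^13, so
+                   this halves the budget at full congestion and
+                   scales down linearly below that. */
+                ctx->max_bytes -= (ctx->max_bytes >> 14) * ctx->ece;
+        } else {                              /* additive increase    */
+                ctx->max_bytes += ctx->w > 24 ?
+                        add << (ctx->w - 24) : add >> (24 - ctx->w);
+        }
+
+        ctx->wnd_start = now - (now & ((1ULL << ctx->w) - 1));
+        ctx->pkts      = 0;
+        ctx->snd_bytes = 0;
+}
+
+/* update_context_at_sender: may a packet of len bytes be sent now? */
+static bool ca_snd_allowed(struct ca_ctx * ctx, uint64_t now, size_t len)
+{
+        if (now - ctx->wnd_start >= (1ULL << ctx->w))
+                new_window(ctx, now);
+
+        if (ctx->snd_bytes + len > ctx->max_bytes)
+                return false;    /* caller waits for the next window */
+
+        ctx->pkts++;
+        ctx->snd_bytes += len;
+
+        return true;
+}
+```
+
+The flow-allocator statistics shown earlier ("Packets in this window",
+"Max bytes in this window", "Current congestion regime") are a dump of
+this kind of context.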
+ +Output from the ocbr tool: + +``` +Flow 64: 998 packets ( 998000 bytes)in 101 ms => 9880.8946 pps, 79.0472 Mbps +Flow 64: 1001 packets ( 1001000 bytes)in 101 ms => 9904.6149 pps, 79.2369 Mbps +Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9882.8697 pps, 79.0630 Mbps +Flow 64: 998 packets ( 998000 bytes)in 101 ms => 9880.0143 pps, 79.0401 Mbps +Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9887.6627 pps, 79.1013 Mbps +Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9891.0891 pps, 79.1287 Mbps +New flow. +Flow 64: 868 packets ( 868000 bytes)in 102 ms => 8490.6583 pps, 67.9253 Mbps +Flow 65: 542 packets ( 542000 bytes)in 101 ms => 5356.5781 pps, 42.8526 Mbps +Flow 64: 540 packets ( 540000 bytes)in 101 ms => 5341.5105 pps, 42.7321 Mbps +Flow 65: 534 packets ( 534000 bytes)in 101 ms => 5285.6111 pps, 42.2849 Mbps +Flow 64: 575 packets ( 575000 bytes)in 101 ms => 5691.4915 pps, 45.5319 Mbps +Flow 65: 535 packets ( 535000 bytes)in 101 ms => 5291.0053 pps, 42.3280 Mbps +Flow 64: 561 packets ( 561000 bytes)in 101 ms => 5554.3455 pps, 44.4348 Mbps +Flow 65: 533 packets ( 533000 bytes)in 101 ms => 5272.0079 pps, 42.1761 Mbps +Flow 64: 569 packets ( 569000 bytes)in 101 ms => 5631.3216 pps, 45.0506 Mbps +``` + +With only one client running, the flow is congestion controlled to +about ~80Mb/s (indicating the queue limit at 16 packets may be a bit +too low a bar). When the second client starts sending, both flows go +quite quickly (at most 100ms) to a fair state of about 42 Mb/s. + +The IO graph from wireshark shows a reasonably stable profile (i.e. no +big oscillations because of AIMD), when switching the flows on the +clients on and off which is on par with DCTCP and not unexpected +keeping in mind the similarities between the algorithms: + +{{
}} + +The periodic "gaps" were not seen at the ocbr endpoint application and +may have been due to tcpdump not capturing everything at those +points, or possibly a bug somewhere. + +As said, a lot more work is needed analyzing this algorithm in terms +of performance and stability[^5]. But I am feeling some excitement about its +simplicity and -- dare I say it? -- elegance. + +Stay curious! + +Dimitri + +[^1]: Additive Increase increases the window size by 1 MSS each + RTT. Slow Start doubles the window size each RTT. + +[^2]: I'm pretty sure the kernel developers would if they could. +[^3]: Or the current "be" branch for the less patient. +[^4]: Using Linux traffic control (```tc```) to limit traffic adds + kernel queues and may interfere with MB-ECN. +[^5]: And the prototype implementation as a whole! diff --git a/content/en/blog/20201212-congestion.png b/content/en/blog/20201212-congestion.png new file mode 100644 index 0000000..8e5b89f Binary files /dev/null and b/content/en/blog/20201212-congestion.png differ diff --git a/content/en/blog/20201219-congestion-avoidance.md b/content/en/blog/20201219-congestion-avoidance.md new file mode 100644 index 0000000..240eb88 --- /dev/null +++ b/content/en/blog/20201219-congestion-avoidance.md @@ -0,0 +1,313 @@ +--- +date: 2020-12-19 +title: "Exploring Ouroboros with wireshark" +linkTitle: "Exploring Ouroboros with wireshark " +description: "" +author: Dimitri Staessens +--- + +I recently did some +[quick tests](/blog/2020/12/12/congestion-avoidance-in-ouroboros/#mb-ecn-in-action) +with the new congestion avoidance implementation, and thought to +myself that it was a shame that Wireshark could not identify the +Ouroboros flows, as that could give me some nicer graphs. + +Just to be clear, I think generic network tools like tcpdump and +wireshark -- however informative and nice-to-use they are -- are a +symptom of a lack of network security. The whole point of Ouroboros is +that it is _intentionally_ designed to make it hard to analyze network +traffic. Ouroboros is not a _network stack_[^1]: one can't simply dump +a packet from the wire and derive the packet contents all the way up +to the application by following identifiers for protocols and +well-known ports. Using encryption to hide the network structure from +the packet is shutting the door after the horse has bolted. + +To write an Ouroboros dissector, one needs to know the layered +structure of the network at the capturing point at that specific point +in time. It requires information from the Ouroboros runtime on the +capturing machine and at the exact time of the capture, to correctly +analyze traffic flows. I just wrote a dissector that works for my +specific setup[^2]. + +## Congestion avoidance test + +First, a quick refresher on the experiment layout: it's the same +4-node experiment as in the +[previous post](/blog/2020/12/12/congestion-avoidance-in-ouroboros/#mb-ecn-in-action) + +{{
}} + +I tried to draw the setup as best as I can in the figure above. + +There are 4 rack mounted 1U servers, connected over Gigabit Ethernet +(GbE). Physically there is a big switch connecting all of them, but +each "link" is separated as a port-based VLAN, so there are 3 +independent Ethernet segments. We create 3 ethernet _layers_, drawn +in a lighter gray, with a single unicast layer -- consisting of 4 +unicast IPC processes (IPCPs) -- on top, drawn in a darker shade of +gray. The link between the router and server has been capped to 100 +megabit/s using ```ethtool```[^3], and traffic is captured on the +Ethernet NIC at the "Server" node using ```tcpdump```. All traffic is +generated with our _constant bit rate_ ```ocbr``` tool trying to send +about 80 Mbit/s of application-level throughput over the unicast +layer. + +{{
}} + +The graph above shows the bandwidth -- as captured on the congested +100Mbit Ethernet link -- separated for each traffic flow, from the +same pcap capture as in my previous post. A flow can be identified by +a (destination address, endpoint ID)-pair, and since the destination +is the same for all of them, I could filter out the flows by simply selecting them +based on the (64-bit) endpoint identifier. + +What you're looking at is the following: first, a flow (green) starts; at around +T=14s, a new flow (red) enters that stops at around T=24s. At around +T=44s, another flow enters (blue) for about 14 seconds, and finally, a +fourth (orange) flow enters at T=63s. The first (green) flow exits at +around T=70s, leaving all the available bandwidth for the orange flow. + +The most important thing that I wanted to check is, when there are +multiple flows, _if_ and _how fast_ they would converge to the same +bandwidth. I'm not dissatisfied with the initial result: the answers +seem to be _yes_ and _pretty fast_, with no observable oscillation to +boot[^4]. + +## Protocol overview + +Now, the wireshark dissector can be used to present some more details +about the Ouroboros protocols in a familiar setting -- making it more +accessible to some -- so, let's have a quick look. + +The Ouroboros network protocol has +[5 fields](/docs/concepts/protocols/#network-protocol): + +``` +| DST | TTL | QOS | ECN | EID | +``` + +which we had to map to the Ethernet II protocol for our ipcpd-eth-dix +implementation. The basic Ethernet II MAC (layer-2) header is pretty +simple. It has 2 6-byte addresses (dst, src) and a 2-byte Ethertype. + +Since Ethernet doesn't do QoS or congestion, the main missing field +here is the EID. We could have mapped it to the Ethertype, but we +noticed that a lot of routers and switches drop unknown Ethertypes +(and, for the purposes of this blog post: it would have all but +prevented writing the dissector). So we made the ethertype +configurable per layer (so it can be set to a value that is not +blocked by the network), and added 2 16-bit fields after the Ethernet +MAC header for an Ouroboros layer: + +* Endpoint ID **eid**, which works just like in the unicast layer, to + identify the N+1 application (in our case: a data transfer flow and + a management flow for a unicast IPC process). + +* A length field **len**, which is needed because Ethernet NICs pad + frames that are smaller than 64 bytes in length with trailing zeros + (and we receive these zeros in our code). A length field is present + in Ethernet type I, but since most "Layer 3" protocols also had a + length field, it was re-purposed as Ethertype in Ethernet II. The + value of the **len** field is the length of the **data** payload. + +The Ethernet layer that spans that 100Mbit link has Ethertype 0xA000 +set (which is the Ouroboros default); the Ouroboros plugin hooks into +that ethertype. + +On top of the Ethernet layer, we have a unicast layer with the 5 +fields specified above. The dissector also shows the contents of the +flow allocation messages, which are (currently) sent to EID = 0.
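+To make the field layout a bit more concrete, here is a rough C-style
+sketch of the two headers as configured in this particular setup
+(32-bit addresses, 64-bit EIDs). This is purely illustrative -- the
+struct names and the exact field order are my own shorthand, not the
+prototype's serialization code:
+
+```C
+#include <stdint.h>
+
+/* Header the ipcpd-eth-dix adds right after the Ethernet II header. */
+struct eth_dix_hdr {
+        uint16_t eid;      /* endpoint id within the Ethernet layer */
+        uint16_t len;      /* payload length (Ethernet pads frames) */
+} __attribute__((packed)); /* 4 bytes                               */
+
+/* Network (data transfer) protocol header of the unicast layer.    */
+struct unicast_dt_hdr {
+        uint32_t dst;      /* destination address (width is         */
+                           /* configurable, up to 64 bits)          */
+        uint8_t  qos;      /* QoS cube                              */
+        uint8_t  ttl;      /* time-to-live                          */
+        uint8_t  ecn;      /* explicit congestion notification      */
+        uint64_t eid;      /* endpoint id within the unicast layer  */
+} __attribute__((packed)); /* 15 bytes                              */
+```
+
+Together with the 14-byte Ethernet II MAC header this accounts for the
+33 bytes of overhead in the packet breakdown further down.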
+ +So, the protocol header as analysed in the experiment is, starting +from the "wire": + +``` ++---------+---------+-----------+-----+-----+------ +| dst MAC | src MAC | Ethertype | eid | len | data /* ETH LAYER */ ++---------+---------+-----------+-----+-----+------ + + /* eid == 0 -> ipcpd-eth flow allocator, */ + /* this is not analysed */ + ++-----+-----+-----+-----+-----+------ +| DST | QOS | TTL | ECN | EID | DATA /* UNICAST LAYER */ ++-----+-----+-----+-----+-----+------ + + /* EID == 0 -> flow allocator */ + ++-----+-------+-------+------+------+-----+-------------+ +| SRC | R_EID | S_EID | CODE | RESP | ECE | ... QOS ....| /* FA */ ++-----+-------+-------+------+------+-----+-------------+ +``` + +## The network protocol + +{{
}} + +We will first have a look at packets captured around the point in time +where the second (red) flow enters the network, about 14 seconds into +the capture. The "N+1 Data" packets in the image above all belong to +the green flow. The ```ocbr``` tool that we use sends 1000-byte data +units that are zeroed-out. The packet captured on the wire is 1033 +bytes in length, so we have a protocol overhead of 33 bytes[^5]. We +can break this down to: + +``` + ETHERNET II HEADER / 14 / + 6 bytes Ethernet II dst + 6 bytes Ethernet II src + 2 bytes Ethernet II Ethertype + OUROBOROS ETH-DIX HEADER / 4 / + 2 bytes eid + 2 byte len + OUROBOROS UNICAST NETWORK HEADER / 15 / + 4 bytes DST + 1 byte QOS + 1 byte TTL + 1 byte ECN + 8 bytes EID + --- TOTAL / 33 / + 33 bytes +``` + +The **Data (1019 bytes)** reported by wireshark is what Ethernet II +sees as data, and thus includes the 19 bytes for the two Ouroboros +headers. Note that DST length is configurable, currently up to 64 +bits. + +Now, let's have a brief look at the values for these fields. The +**eid** is 65, this means that the _data-transfer flow_ established +between the unicast IPCPs on the router and the server (_uni-r_ and +_uni-s_ in our experiment figure) is identified by endpoint id 65 in +the eth-dix IPCP on the Server machine. The **len** is 1015. Again, no +surprises, this is the length of the Ouroboros unicast network header +(15 bytes) + the 1000 bytes payload. + +**DST**, the destination address is 4135366193, a 32-bit address +that was randomly assigned to the _uni-s_ IPCP. The QoS cube is 0, +which is the default best-effort QoS class. *TTL* is 59. The starting +TTL is configurable for a layer, the default is 60, and it was +decremented by 1 in the _uni-r_ process on the router node. The packet +experienced no congestion (**ECN** is 0), and the endpoint ID is a +64-bit random number, 475...56. This endpoint ID identifies the flow +endpoint for the ```ocbr``` server. + +## The flow request + +{{
}} + +The first "red" packet that was captured is the one for the flow +allocation request, **FLOW REQUEST**[^6]. As mentioned before, the +endpoint ID for the flow allocator is 0. + +A rather important remark is in order here: Ouroboros does not allow a +UDP-like _datagram service_ from a layer. By which I mean: fabricating +a packet with the correct destination address and some known EID and +dumping it in the network. All traffic that is offered to an Ouroboros +layer requires a _flow_ to be allocated. This keeps the network layer +in control of its resources; the protocol details inside a layer are a +secret to that layer. + +Now, what about that well-known EID=0 for the flow allocator (FA)? And +the directory (Distributed Hash Table, DHT) for that matter, which is +currently on EID=1? Doesn't that contradict the "no datagram service" +statement above? Well, no. These components are part of the layer and +are thus inside the layer. The DHT and FA are internal +components. They are direct clients of the Data Transfer component. +The globally known EID for these components is an absolute necessity +since they need to be able to reach endpoints more than a hop +(i.e. a flow in a lower layer) away. + +Let's now look inside that **FLOW REQUEST** message. We know it is a +request from the **msg code** field[^7]. + +This is the **only** packet that contains the source (and destination) +address for this flow. There is a small twist: this value is decoded +with different _endianness_ than the address in the DT protocol output +(probably a bug in my dissector). The source address 232373199 in the +FA message corresponds to the address 3485194509 in the DT protocol +(and in the experiment image at the top): the source of our red flow +is the "Client 2" node. Since this is a **FLOW REQUEST**, the remote +endpoint id is not yet known, and is set to 0[^8]. The source endpoint ID +-- a 64-bit randomly generated value unique to the source IPC +process[^9] -- is sent to the remote. The other fields are not +relevant for this message. + +## The flow reply + +{{
}} + +Now, the **FLOW REPLY** message for our request. It originates from our +machine, so you will notice that the TTL is the starting value of 60. +The destination address is what we sent in our original **FLOW +REQUEST** -- modulo some endianness shenanigans. The **FLOW REPLY** +message sends the newly generated source endpoint[^10] ID, and +this packet is the **only** packet that contains both endpoint IDs +for this flow. + +## Congestion / flow update + +{{
}} + +Now a quick look at the congestion avoidance mechanisms. The +information for the Additive Increase / Multiplicative Decrease algorithm is +gathered from the **ECN** field in the packets. When both flows are +active, they experience congestion since the requested bandwidth from +the two ```ocbr``` clients (180Mbit) exceeds the 100Mbit link, and the +figure above shows a packet marked with an ECN value of 11. + +{{
}} + +When the packets on a flow experience congestion, the flow allocator +at the endpoint (the one in our _uni-s_ IPCP) will update the sender with +an **ECE** _Explicit Congestion Experienced_ value; in this case, 297. +The higher this value, the quicker the sender will decrease its +sending rate. The algorithm is explained a bit in my previous +post. + +That's it for today's post; I hope it provides some new insights into how +Ouroboros works. As always, stay curious. + +Dimitri + +[^1]: Neither is RINA, for that matter. + +[^2]: This quick-and-dirty dissector is available in the + ouroboros-eth-uni branch on my + [github](https://github.com/dstaesse/wireshark/) + +[^3]: The prototype is able to handle Gigabit Ethernet; the 100Mbit cap is mostly + to keep the size of the capture files somewhat manageable. + +[^4]: Of course, this needs more thorough evaluation with more + clients, distributions on the latency, different configurations + for the FRCP protocol in the N+1 and all that jazz. I have, + however, limited amounts of time to spare and am currently + focusing on building and documenting the prototype and tools so + that more thorough evaluations can be done if someone feels like + doing them. + +[^5]: A 4-byte Ethernet Frame Check Sequence (FCS) is not included in + the 'bytes on the wire'. As a reference, the minimum overhead + for this kind of setup using UDP/IPv4 is 14 bytes Ethernet + 20 + bytes IPv4 + 8 bytes UDP = 42 bytes. + +[^6]: Actually, in a larger network there could be some DHT traffic + related to resolving the address, but in such a small network, + the DHT is basically a replicated database between all 4 nodes. + +[^7]: The reason it's not the first field in the protocol has to do + with performance of memory alignment in x86 architectures. + +[^8]: We haven't optimised the FA protocol not to send fields it + doesn't need for that particular message type -- yet. + +[^9]: Not the host machine, but that particular IPCP on the host + machine. You can have multiple IPCPs for the same layer on the + same machine, but in this case, expect correlation between their + addresses. 64-bits / IPCP should provide some security against + remotes trying to hack into another service on the same host by + guessing EIDs. + +[^10]: This marks the point in space-time where I notice the + misspelling in the dissector.
\ No newline at end of file diff --git a/content/en/blog/20201219-congestion.png b/content/en/blog/20201219-congestion.png new file mode 100644 index 0000000..5675438 Binary files /dev/null and b/content/en/blog/20201219-congestion.png differ diff --git a/content/en/blog/20201219-exp.svg b/content/en/blog/20201219-exp.svg new file mode 100644 index 0000000..68e09e2 --- /dev/null +++ b/content/en/blog/20201219-exp.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/content/en/blog/20201219-ws-0.png b/content/en/blog/20201219-ws-0.png new file mode 100644 index 0000000..fd7a83a Binary files /dev/null and b/content/en/blog/20201219-ws-0.png differ diff --git a/content/en/blog/20201219-ws-1.png b/content/en/blog/20201219-ws-1.png new file mode 100644 index 0000000..0f07fd0 Binary files /dev/null and b/content/en/blog/20201219-ws-1.png differ diff --git a/content/en/blog/20201219-ws-2.png b/content/en/blog/20201219-ws-2.png new file mode 100644 index 0000000..7cd8b7d Binary files /dev/null and b/content/en/blog/20201219-ws-2.png differ diff --git a/content/en/blog/20201219-ws-3.png b/content/en/blog/20201219-ws-3.png new file mode 100644 index 0000000..2a6f6d5 Binary files /dev/null and b/content/en/blog/20201219-ws-3.png differ diff --git a/content/en/blog/20201219-ws-4.png b/content/en/blog/20201219-ws-4.png new file mode 100644 index 0000000..3a0ef8c Binary files /dev/null and b/content/en/blog/20201219-ws-4.png differ diff --git a/content/en/blog/_index.md b/content/en/blog/_index.md index 43820eb..34904e5 100644 --- a/content/en/blog/_index.md +++ b/content/en/blog/_index.md @@ -1,13 +1,7 @@ --- -title: "Docsy Blog" +title: "Developer Blog" linkTitle: "Blog" menu: main: weight: 30 --- - - -This is the **blog** section. It has two categories: News and Releases. - -Files in these directories will be listed in reverse chronological order. - diff --git a/content/en/blog/news/20191006-new-site.md b/content/en/blog/news/20191006-new-site.md deleted file mode 100644 index c04ff2d..0000000 --- a/content/en/blog/news/20191006-new-site.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -date: 2019-10-06 -title: "New Website" -linkTitle: "New Ouroboros website" -description: "Announcing the new website" -author: Dimitri Staessens ---- diff --git a/content/en/blog/news/20200116-hn.md b/content/en/blog/news/20200116-hn.md deleted file mode 100644 index b80a7bd..0000000 --- a/content/en/blog/news/20200116-hn.md +++ /dev/null @@ -1,30 +0,0 @@ ---- -date: 2020-01-16 -title: "Getting back to work" -linkTitle: "Getting back to work" -description: "Show HN - Ouroboros" -author: Dimitri Staessens ---- - -Yesterday there was a bit of an unexpected spike in interest in -Ouroboros following a [post on -HN](https://news.ycombinator.com/item?id=22052416). I'm really -humbled by the response and grateful to all the people that show -genuine interest in this project. - -I fully understand that people would like to know a lot more details -about Ouroboros than the current site provides. It was the top -priority on the todo list, and this new interest gives me some -additional motivation to get to it. There's a lot to Ouroboros that's -not so trivial, which makes writing clear documentation a tricky -thing to do. - -I will also tackle some of the questions from the HN in a series of -blog posts in the next few days, replacing the (very old and outdated) -FAQ section. I hope these will be useful. - -Again thank you for your interest. 
- -Sincerely, - -Dimitri diff --git a/content/en/blog/news/20200212-ecmp.md b/content/en/blog/news/20200212-ecmp.md deleted file mode 100644 index 019b40d..0000000 --- a/content/en/blog/news/20200212-ecmp.md +++ /dev/null @@ -1,68 +0,0 @@ ---- -date: 2020-02-12 -title: "Equal-Cost Multipath (ECMP)" -linkTitle: "Adding Equal-Cost multipath (ECMP)" -description: "ECMP is coming to Ouroboros (finally)" -author: Dimitri Staessens ---- - -Some recent news -- Multi-Path TCP (MPTCP) implementation is [landing -in mainstream Linux kernel -5.6](https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-Starts-Multipath-TCP) --- finally got me to integrate the equal-cost multipath (ECMP) -implementation from [Nick Aerts's master -thesis](https://lib.ugent.be/nl/catalog/rug01:002494958) into -Ouroboros. And working on the ECMP implementation in gives me an -excuse to rant a little bit about MPTCP. - -The first question that comes to mind is: _Why is it called -multi-**path** TCP_? IP is routing packets, not TCP, and there are -equal-cost multipath options for IP in both [IS-IS and -OSPF](https://tools.ietf.org/html/rfc2991). Maybe _multi-flow TCP_ -would be a better name? This would also be more transparent to the -fact that running MPTCP over longer hops will make less sense, since -the paths are more likely to converge over the same link. - -So _why is there a need for multi-path TCP_? The answer, of course, is -that the Internet Protocol routes packets between IP endpoints, which -are _interfaces_, not _hosts_. So, if a server is connected over 4 -interfaces, ECMP routing will not be of any help if one of them goes -down. The TCP connections will time out. Multipath TCP, however, is -actually establishing 4 subflows, each over a different interface. If -an interface goes down, MPTCP will still have 3 subflows ready. The -application is listening the the main TCP connection, and will not -notice a TCP-subflow timing out[^1]. - -This brings us, of course, to the crux of the problem. IP names the -[point of attachment](https://tools.ietf.org/html/rfc1498); IP -addresses are assigned to interfaces. Another commonly used workaround -is a virtual IP interface on the loopback, but then you need a lot of -additional configuration (and if that were the perfect solution, one -wouldn't need MPTCP!). MPTCP avoids the network configuration mess, -but does require direct modification in the application using -[additions to the sockets -API](https://tools.ietf.org/html/draft-hesmans-mptcp-socket-03) in the -form of a bunch of (ugly) setsockopts. - -Now this is a far from ideal situation, but given its constraints, -MPTCP is a workable engineering solution that will surely see its -uses. It's strange that it took years for MPTCP to get to this stage. - -Now, of course, Ouroboros does not assign addresses to -points-of-attachments ( _flow endpoints_). It doesn't even assign -addresses to hosts/nodes! Instead, the address is derived from the -forwarding protocol machines inside each node. (For the details, see -the [article](https://arxiv.org/pdf/2001.09707.pdf)). The net effect -is that an ECMP routing algorithm can cleanly handle hosts with -multiple interfaces. Details about the routing algorithm are not -exposed to application APIs. Instead, Ouroboros applications request -an implementation-independent _service_. - -The ECMP patch for Ouroboros is coming _soon_. Once it's available I -will also add a couple of tutorials on it. - -Peace. - -Dimitri - -[^1]: Question: Why are the subflows not UDP? 
That would avoid a lot of duplicated overhead (sequence numbers etc)... Would it be too messy on the socket API side? \ No newline at end of file diff --git a/content/en/blog/news/20200216-ecmp.md b/content/en/blog/news/20200216-ecmp.md deleted file mode 100644 index ce632c9..0000000 --- a/content/en/blog/news/20200216-ecmp.md +++ /dev/null @@ -1,118 +0,0 @@ ---- -date: 2020-02-16 -title: "Equal-Cost Multipath (ECMP) routing" -linkTitle: "Equal-Cost multipath (ECMP) example" -description: "A very quick example of ECMP" -author: Dimitri Staessens ---- - -As promised, I added equal cost multipath routing to the Ouroboros -unicast IPCP. I will add some more explanations later when it's fully -tested and merge into the master branch, but you can already try it. -You will need to pull the _be_ branch. You will also need to have -_fuse_ installed to monitor the flows from _/tmp/ouroboros/_. The -following script will bootstrap a 4-node unicast network on your -machine that routes using ECMP: - -```bash -#!/bin/bash - -# create a local IPCP. This emulates the "Internet" -irm i b t local n local l local - -#create the first unicast IPCP with ecmp -irm i b t unicast n uni.a l net routing ecmp - -#bind the unicast IPCP to the names net and uni.a -irm b i uni.a n net -irm b i uni.a n uni.a - -#register these 2 names in the local IPCP -irm n r net l local -irm n r uni.a l local - -#create 3 more unicast IPCPs, and enroll them with the first -irm i e t unicast n uni.b l net -irm b i uni.b n net -irm b i uni.b n uni.b -irm n r uni.b l local - -irm i e t unicast n uni.c l net -irm b i uni.c n net -irm b i uni.c n uni.c -irm n r uni.c l local - -irm i e t unicast n uni.d l net -irm b i uni.d n net -irm b i uni.d n uni.d -irm n r uni.d l local - -#connect uni.b to uni.a this creates a DT flow and a mgmt flow -irm i conn name uni.b dst uni.a - -#now do the same for the others, creating a square -irm i conn name uni.c dst uni.b -irm i conn name uni.d dst uni.c -irm i conn name uni.d dst uni.a - -#register the oping application at 4 different locations -#this allows us to check the multipath implementation -irm n r oping.a i uni.a -irm n r oping.b i uni.b -irm n r oping.c i uni.c -irm n r oping.d i uni.d - -#bind oping program to oping names -irm b prog oping n oping.a -irm b prog oping n oping.b -irm b prog oping n oping.c -irm b prog oping n oping.d - -#good to go! -``` - -In order to test the setup, start an irmd (preferably in a terminal so -you can see what's going on). In another terminal, run the above -script and then start an oping server: - -```bash -$ ./ecmpscript -$ oping -l -Ouroboros ping server started. -``` - -This single server program will accept all flows for oping from any of -the unicast IPCPs. Ouroboros _multi-homing_ in action. - -Open another terminal, and type the following command: - -```bash -$ watch -n 1 'grep "sent (packets)" /tmp/ouroboros/uni.a/dt.*/6* | sed -n -e 1p -e 7p' -``` - -This will show you the packet statistics from the 2 data transfer -flows from the first IPCP (uni.a). 
- -On my machine it looks like this: - -``` -Every 1,0s: grep "sent (packets)" /tmp/ouroboros/uni.a/dt.*/6* | sed -n -e 1p -e 7p - -/tmp/ouroboros/uni.a/dt.1896199821/65: sent (packets): 10 -/tmp/ouroboros/uni.a/dt.1896199821/67: sent (packets): 6 -``` - -Now, from yet another terminal, run connect an oping client to oping.c -(the client should attach to the first IPCP, so oping.c should be the -one with 2 equal cost paths) and watch both counters increase: - -```bash -oping -n oping.c -i 100ms -``` - -When you do this to the other destinations (oping.b and oping.d) you -should see only one of the flow counters increasing. - -Hope you enjoyed this little demo! - -Dimitri diff --git a/content/en/blog/news/20200502-frcp.md b/content/en/blog/news/20200502-frcp.md deleted file mode 100644 index 28c5794..0000000 --- a/content/en/blog/news/20200502-frcp.md +++ /dev/null @@ -1,236 +0,0 @@ ---- -date: 2020-05-02 -title: "Flow and Retransmission Control Protocol (FRCP) implementation" -linkTitle: "Flow and Retransmission Control Protocol (FRCP)" -description: "A quick demo of FRCP" -author: Dimitri Staessens ---- - -With the longer weekend I had some fun implementing (parts of) the -[Flow and Retransmission Control Protocol (FRCP)](/docs/concepts/protocols/#flow-and-retransmission-control-protocol-frcp) -to the point that it's stable enough to bring you a very quick demo of it. - -FRCP is the Ouroboros alternative to TCP / QUIC / LLC. It assures -delivery of packets when the network itself isn't very reliable. - -The setup is simple: we run Ouroboros over the Ethernet loopback -adapter _lo_, -``` -systemctl restart ouroboros -irm i b t eth-dix l dix n dix dev lo -``` -to which we add some impairment -[_qdisc_](http://man7.org/linux/man-pages/man8/tc-netem.8.html): - -``` -$ sudo tc qdisc add dev lo root netem loss 8% duplicate 3% reorder 10% delay 1 -``` - -This causes the link to lose, duplicate and reorder packets. - -We can use the oping tool to uses different [QoS -specs](https://ouroboros.rocks/cgit/ouroboros/tree/include/ouroboros/qos.h) -and watch the behaviour. Quality-of-Service (QoS) specs are a -technology-agnostic way to request a network service (current -status - not finalized yet). I'll also capture tcpdump output. - -We start an oping server and tell Ouroboros for it to listen to the _name_ "oping": -``` -#bind the program oping to the name oping -irm b prog oping n oping -#register the name oping in the Ethernet layer that is attached to the loopback -irm n r oping l dix -#run the oping server -oping -l -``` - -We'll now send 20 pings. If you try this, it can be that the flow -allocation fails, due to the loss of a flow allocation packet (a bit -similar to TCP losing the first SYN). The oping client currently -doesn't retry flow allocation. The default payload for oping is 64 -bytes (of zeros); oping waits 2 seconds for all packets it has -sent. It doesn't detect duplicates. - -Let's first look at the _raw_ QoS cube. That's like best-effort -UDP/IP. In Ouroboros, however, it doesn't require a packet header at -all. 
- -First, the output of the client using a _raw_ QoS cube: -``` -$ oping -n oping -c 20 -i 200ms -q raw -Pinging oping with 64 bytes of data (20 packets): - -64 bytes from oping: seq=0 time=0.880 ms -64 bytes from oping: seq=1 time=0.742 ms -64 bytes from oping: seq=4 time=1.303 ms -64 bytes from oping: seq=6 time=0.739 ms -64 bytes from oping: seq=6 time=0.771 ms [out-of-order] -64 bytes from oping: seq=6 time=0.789 ms [out-of-order] -64 bytes from oping: seq=7 time=0.717 ms -64 bytes from oping: seq=8 time=0.759 ms -64 bytes from oping: seq=9 time=0.716 ms -64 bytes from oping: seq=10 time=0.729 ms -64 bytes from oping: seq=11 time=0.720 ms -64 bytes from oping: seq=12 time=0.718 ms -64 bytes from oping: seq=13 time=0.722 ms -64 bytes from oping: seq=14 time=0.700 ms -64 bytes from oping: seq=16 time=0.670 ms -64 bytes from oping: seq=17 time=0.712 ms -64 bytes from oping: seq=18 time=0.716 ms -64 bytes from oping: seq=19 time=0.674 ms -Server timed out. - ---- oping ping statistics --- -20 packets transmitted, 18 received, 2 out-of-order, 10% packet loss, time: 6004.273 ms -rtt min/avg/max/mdev = 0.670/0.765/1.303/0.142 ms -``` - -The _netem_ did a good job of jumbling up the traffic! There were a -couple out-of-order, duplicates, and quite some packets lost. - -Let's dig into an Ethernet frame captured from the "wire". The most -interesting thing its small total size: 82 bytes. - -``` -13:37:25.875092 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype Unknown (0xa000), length 82: - 0x0000: 0042 0040 0000 0001 0000 0011 e90c 0000 .B.@............ - 0x0010: 0000 0000 203f 350f 0000 0000 0000 0000 .....?5......... - 0x0020: 0000 0000 0000 0000 0000 0000 0000 0000 ................ - 0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ - 0x0040: 0000 0000 -``` - -The first 12 bytes are the two MAC addresses (all zeros), then 2 bytes -for the "Ethertype" (the default for an Ouroboros layer is 0xa000, so -you can create more layers and seperate them by Ethertype[^1]. The -Ethernet Payload is thus 68 bytes. The Ouroboros _ipcpd-eth-dix_ adds -and extra header of 4 bytes with 2 extra "fields". The first field we -needed to take over from our [Data -Transfer](/docs/concepts/protocols/) protocol: the Endpoint Identifier -that identifies the flow. The _ipcpd-eth-dix_ has two endpoints, one -for the client and one for the server. 0x0042 (66) is the destination -EID of the server, 0x0043 (67) is the destination EID of the client. -The second field is the _length_ of the payload in octets, 0x0040 = -64. This is needed because Ethernet II has a minimum frame size of 64 -bytes and pads smaller frames (called _runt frames_)[^2]. The -remaining 64 bytes are the oping payload, giving us an 82 byte packet. - -That's it for the raw QoS. The next one is _voice_. A voice service -usually requires packets to be delivered with little delay and jitter -(i.e. ASAP). Out-of-order packets are rejected since they cause -artifacts in the audio output. The voice QoS will enable FRCP, because -it needs to track sequence numbers. 
- -``` -$ oping -n oping -c 20 -i 200ms -q voice -Pinging oping with 64 bytes of data (20 packets): - -64 bytes from oping: seq=0 time=0.860 ms -64 bytes from oping: seq=2 time=0.704 ms -64 bytes from oping: seq=3 time=0.721 ms -64 bytes from oping: seq=4 time=0.706 ms -64 bytes from oping: seq=5 time=0.721 ms -64 bytes from oping: seq=6 time=0.710 ms -64 bytes from oping: seq=7 time=0.721 ms -64 bytes from oping: seq=8 time=0.691 ms -64 bytes from oping: seq=10 time=0.691 ms -64 bytes from oping: seq=12 time=0.702 ms -64 bytes from oping: seq=13 time=0.730 ms -64 bytes from oping: seq=14 time=0.716 ms -64 bytes from oping: seq=15 time=0.725 ms -64 bytes from oping: seq=16 time=0.709 ms -64 bytes from oping: seq=17 time=0.703 ms -64 bytes from oping: seq=18 time=0.693 ms -64 bytes from oping: seq=19 time=0.666 ms -Server timed out. - ---- oping ping statistics --- -20 packets transmitted, 17 received, 0 out-of-order, 15% packet loss, time: 6004.243 ms -rtt min/avg/max/mdev = 0.666/0.716/0.860/0.040 ms -``` - -As you can see, packets are delivered in-order, and some packets are -missing. Nothing fancy. Let's look at a data packet: - -``` -14:06:05.607699 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype Unknown (0xa000), length 94: - 0x0000: 0045 004c 0100 0000 eb1e 73ad 0000 0000 .E.L......s..... - 0x0010: 0000 0000 0000 0012 a013 0000 0000 0000 ................ - 0x0020: 705c e53a 0000 0000 0000 0000 0000 0000 p\.:............ - 0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ - 0x0040: 0000 0000 0000 0000 0000 0000 0000 0000 ................ - -``` - -The same 18-byte header is present. The flow endpoint ID is a -different one, and the length is also different. The packet is 94 -bytes, the payload length for the _ipcp-eth_dix_ is 0x004c = 76 -octets. So the FRCP header adds 12 bytes, the total overhead is 30 -bytes. Maybe a bit more detail on the FRCP header contents (more depth -is available the protocol documentation). The first 2 bytes are the -FLAGS (0x0100). There are only 7 flags, it's 16 bits for memory -alignment. This packet only has the DATA bit set. Then follows the -flow control window, which is 0 (not implemented yet). Then we have a -4 byte sequence number (eb1e 73ae = 3944641454)[^3] and a 4 byte ACK -number, which is 0. The remaining 64 bytes are the oping payload. 
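
Putting the byte counts above together, the 12-byte FRCP header as it appears in this capture could be sketched as the C struct below. The field names are mine; the actual definition in the Ouroboros sources may differ.

```c
#include <stdint.h>

/* Sketch of the 12-byte FRCP header described above; field names are
 * illustrative, not taken from the prototype.                         */
struct frcp_hdr {
        uint16_t flags;  /* 7 flags defined, 16 bits for alignment       */
        uint16_t window; /* flow control window (0: not implemented yet) */
        uint32_t seqno;  /* sequence number                              */
        uint32_t ackno;  /* acknowledgement number                       */
} __attribute__((packed));
```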
- -Next, the data QoS: - -``` -$ oping -n oping -c 20 -i 200ms -q data -Pinging oping with 64 bytes of data (20 packets): - -64 bytes from oping: seq=0 time=0.932 ms -64 bytes from oping: seq=1 time=0.701 ms -64 bytes from oping: seq=2 time=200.949 ms -64 bytes from oping: seq=3 time=0.817 ms -64 bytes from oping: seq=4 time=0.753 ms -64 bytes from oping: seq=5 time=0.730 ms -64 bytes from oping: seq=6 time=0.726 ms -64 bytes from oping: seq=7 time=0.887 ms -64 bytes from oping: seq=8 time=0.878 ms -64 bytes from oping: seq=9 time=0.883 ms -64 bytes from oping: seq=10 time=0.865 ms -64 bytes from oping: seq=11 time=401.192 ms -64 bytes from oping: seq=12 time=201.047 ms -64 bytes from oping: seq=13 time=0.872 ms -64 bytes from oping: seq=14 time=0.966 ms -64 bytes from oping: seq=15 time=0.856 ms -64 bytes from oping: seq=16 time=0.849 ms -64 bytes from oping: seq=17 time=0.843 ms -64 bytes from oping: seq=18 time=0.797 ms -64 bytes from oping: seq=19 time=0.728 ms - ---- oping ping statistics --- -20 packets transmitted, 20 received, 0 out-of-order, 0% packet loss, time: 4004.491 ms -rtt min/avg/max/mdev = 0.701/40.864/401.192/104.723 ms -``` - -With the data spec, we have no packet loss, but some packets have been -retransmitted (hence the higher latency). The reason for the very high -latency is that the current implementation only ACKs on data packets, -this will be fixed soon. - -Looking at an Ethernet frame, it's again 94 bytes: - -``` -14:35:42.612066 00:00:00:00:00:00 (oui Ethernet) > 00:00:00:00:00:00 (oui Ethernet), ethertype Unknown (0xa000), length 94: - 0x0000: 0044 004c 0700 0000 81b8 0259 e2f3 eb59 .D.L.......Y...Y - 0x0010: 0000 0000 0000 0012 911a 0000 0000 0000 ................ - 0x0020: 86b3 273b 0000 0000 0000 0000 0000 0000 ..';............ - 0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ - 0x0040: 0000 0000 0000 0000 0000 0000 0000 0000 ................ - -``` - -The main difference is that it has 2 flags set (DATA + ACK), and it -thus contains both a sequence number (81b8 0259) and an -acknowledgement (e2f3 eb59). - -That's about it for now. More to come soon. - -Dimitri - -[^1]: Don't you love standards? One of the key design objectives for Ouroboros is exactly to avoid such shenanigans. Modify/abuse a header and Ouroboros should reject it because it _cannot work_, not because some standard says one shouldn't do it. -[^2]: Lesser known fact: Gigabit Ethernet has a 512 byte minimum frame size; but _carrier extension_ handles this transparently. -[^3]: In _network byte order_. \ No newline at end of file diff --git a/content/en/blog/news/20200507-python-lb.png b/content/en/blog/news/20200507-python-lb.png deleted file mode 100644 index 89e710e..0000000 Binary files a/content/en/blog/news/20200507-python-lb.png and /dev/null differ diff --git a/content/en/blog/news/20200507-python.md b/content/en/blog/news/20200507-python.md deleted file mode 100644 index d4b3504..0000000 --- a/content/en/blog/news/20200507-python.md +++ /dev/null @@ -1,74 +0,0 @@ ---- -date: 2020-05-07 -title: "A Python API for Ouroboros" -linkTitle: "Python" -description: "Python" -author: Dimitri Staessens ---- - -Support for other programming languages than C/C++ has been on my todo -list for quite some time. The initial approach was using -[SWIG](http://www.swig.org), but we found the conversion always -clunky, it didn't completely work as we wanted to, and a while back we -just decided to deprecate it. 
Apart from C/C++ we only had a [rust -wrapper](https://github.com/chritchens/ouroboros-rs). - -Until now! I finally took the time to sink my teeth into the bindings -for Python. I had some brief looks at the -[ctypes](https://docs.python.org/3/library/ctypes.html) library a -while back, but this time I looked into -[cffi](https://cffi.readthedocs.io/en/latest/) and I was amazed at how -simple it was to wrap the more difficult functions that manipulate -blocks of memory (flow\_read, but definitely the async fevent() call). -And now there is path towards a 'nice' Python API. - -Here is a taste of what the -[oecho](https://ouroboros.rocks/cgit/ouroboros/tree/src/tools/oecho/oecho.c) -tool looks like in Python: - -```Python -from ouroboros import * -import argparse - - -def client(): - f = flow_alloc("oecho") - f.writeline("Hello, PyOuroboros!") - print(f.readline()) - f.dealloc() - - -def server(): - print("Starting the server.") - while True: - f = flow_accept() - print("New flow.") - line = f.readline() - print("Message from client is " + line) - f.writeline(line) - f.dealloc() - - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description='A simple echo client/server') - parser.add_argument('-l', '--listen', help='run as a server', action='store_true') - args = parser.parse_args() - if args.listen is True: - server() - else: - client() -``` - -I have more time in the next couple of days, so I expect this to be -released after the weekend. - -Oh, and here is a picture of Ouroboros load-balancing between the C (top right) -and Python (top left) implementations using the C and Python clients: - -{{
}} - -Can't wait to get the full API done! - -Cheers, - -Dimitri diff --git a/content/en/blog/news/20201212-congestion-avoidance.md b/content/en/blog/news/20201212-congestion-avoidance.md deleted file mode 100644 index f395a4f..0000000 --- a/content/en/blog/news/20201212-congestion-avoidance.md +++ /dev/null @@ -1,358 +0,0 @@ ---- -date: 2020-12-12 -title: "Congestion avoidance in Ouroboros" -linkTitle: "Congestion avoidance" -description: "API for congestion avoidance and the Ouroboros MB-ECN algorithm" -author: Dimitri Staessens ---- - -The upcoming 0.18 version of the prototype has a bunch of big -additions coming in, but the one that I'm most excited about is the -addition of congestion avoidance. Now that the implementation is -reaching its final shape, I just couldn't wait to share with the world -what it looks like, so here I'll talk a bit about how it works. - -# Congestion avoidance - -Congestion avoidance is a mechanism for a network to avoid situations -where the where the total traffic offered on a network element (link -or node) systemically exceeds its capacity to handle this traffic -(temporary overload due to traffic burstiness is not -congestion). While bursts can be handled with adding buffers to -network elements, the solution to congestion is to reduce the ingest -of traffic at the network endpoints that are sources for the traffic -over the congested element(s). - -I won't be going into too many details here, but there are two classes -of mechanisms to inform traffic sources of congestion. One is Explicit -Congestion Notification (ECN), where information is sent to the sender -that its traffic is traversing a congested element. This is a solution -that is, for instance, used by -[DataCenter TCP (DCTCP)](https://tools.ietf.org/html/rfc8257), -and is also supported by -[QUIC](https://www.ietf.org/archive/id/draft-ietf-quic-recovery-33.txt). -The other mechanism is implicit congestion detection, for instance by -inferring congestion from packet loss (most TCP flavors) or increases -in round-trip-time (TCP vegas). - -Once the sender is aware that its traffic is experiencing congestion, -it has to take action. A simple (proven) way is the AIMD algorithm -(Additive Increase, Multiplicative Decrease). When there is no sign of -congestion, senders will steadily increase the amount of traffic they -are sending (Additive Increase). When congestion is detected, they -will quickly back off (Multiplicative Decrease). Usually this is -augmented with a Slow Start (Multiplicative Increase) phase when the -senders begins to send, to reach the maximum bandwidth more -quickly. AIMD is used by TCP and QUIC (among others), and Ouroboros is -no different. It's been proven to work mathematically. - -Now that the main ingredients are known, we can get to the -preparation of the course. - -# Ouroboros congestion avoidance - -Congestion avoidance is in a very specific location in the Ouroboros -architecture: at the ingest point of the network; it is the -responsibility of the network, not the client application. In -OSI-layer terminology, we could say that in Ouroboros, it's in "Layer -3", not in "Layer 4". - -Congestion has to be dealt with for each individual traffic -source/destination pair. In TCP this is called a connection, in -Ouroboros we call it a _flow_. - -Ouroboros _flows_ are abstractions for the most basic of packet flows. 
-A flow is defined by two endpoints and all that a flow guarantees is -that there exist strings of bytes (packets) that, when offered at the -ingress endpoint, have a non-zero chance of emerging at the egress -endpoint. I say 'there exist' to allow, for instance, for maximum -packet lengths. If it helps, think of flow endpoints as an IP:UDP -address:port pair (but emphatically _NOT_ an IP:TCP address:port -pair). There is no protocol assumed for the packets that traverse the -flow. To the ingress and egress point, they are just a bunch of bytes. - -Now this has one major implication: We will need to add some -information to these packets to infer congestion indirectly or -explicitly. It should be obvious that explicit congestion notification -is the simplest solution here. The Ouroboros prototype (currently) -allows an octet for ECN. - -# Functional elements of the congestion API - -This section glances over the API in an informal way. A reference -manual for the actual C API will be added after 0.18 is in the master -branch of the prototype. The most important thing to keep in mind is -that the architecture dictates this API, not any particular algorithm -for congestion that we had in mind. In fact, to be perfectly honest, -up front I wasn't 100% sure that congestion avoidance was feasible -without adding additional fields fields to the DT protocol, such as a -packet counter, or sending some feedback for measuring the Round-Trip -Time (RTT). But as the algorithm below will show, it can be done. - -When flows are created, some state can be stored, which we call the -_congestion context_. For now it's not important to know what state is -stored in that context. If you're familiar with the inner workings of -TCP, think of it as a black-box generalization of the _tranmission -control block_. Both endpoints of a flow have such a congestion -context. - -At the sender side, the congestion context is updated for each packet -that is sent on the flow. Now, the only information that is known at -the ingress is 1) that there is a packet to be sent, and 2) the length -of this packet. The call at the ingress is thus: - -``` - update_context_at_sender -``` - -This function has to inform when it is allowed to actually send the -packet, for instance by blocking for a certain period. - -At the receiver flow endpoint, we have a bit more information, 1) that -a packet arrived, 2) the length of this packet, and 3) the value of -the ECN octet associated with this packet. The call at the egress is -thus: - -``` - update_context_at_receiver -``` - -Based on this information, receiver can decide if and when to update -the sender. We are a bit more flexible in what can be sent, at this -point, the prototype allows sending a packet (which we call -FLOW_UPDATE) with a 16-bit Explicit Congestion Experienced (ECE) field. - -This implies that the sender can get this information from the -receiver, so it knows 1) that such a packet arrived, and 2) the value -of the ECE field. - -``` - update_context_at_sender_ece -``` - -That is the API for the endpoints. In each Ouroboros IPCP (think -'router'), the value of the ECN field is updated. - -``` - update_ecn_in_router -``` - -That's about as lean as as it gets. Now let's have a look at the -algorithm that I designed and -[implemented](https://ouroboros.rocks/cgit/ouroboros/tree/src/ipcpd/unicast/pol/ca-mb-ecn.c?h=be) -as part of the prototype. - -# The Ouroboros multi-bit Forward ECN (MB-ECN) algorithm - -The algorithm is based on the workings of DataCenter TCP -(DCTCP). 
Before I dig into the details, I will list the main -differences, without any judgement. - -* The rate for additive increase is the same _constant_ for all flows - (but could be made configurable for each network layer if - needed). This is achieved by having a window that is independent of - the Round-Trip Time (RTT). This may make it more fair, as congestion - avoidance in DCTCP (and in most -- if not all -- TCP variants), is - biased in favor of flows with smaller RTT[^1]. - -* Because it is operating at the _flow_ level, it estimates the - _actual_ bandwidth sent, including retransmissions, ACKs and what - not from protocols operating on the flow. DCTCP estimates bandwidth - based on which data offsets are acknowledged. - -* The algorithm uses 8 bits to indicate the queue depth in each - router, instead of a single bit (due to IP header restrictions) for - DCTCP. - -* MB-ECN sends a (small) out-of-band FLOW_UPDATE packet, DCTCP updates - in-band TCP ECN/ECE bits in acknowledgment (ACK) packets. Note that - DCTCP sends an immediate ACK with ECE set at the start of - congestion, and sends an immediate ACK with ECE not set at the end - of congestion. Otherwise, the ECE is set accordingly for any - "regular" ACKs. - -* The MB-ECN algorithm can be implemented without the need for - dividing numbers (apart from bit shifts). At least in the linux - kernel implementation, DCTCP has a division for estimating the - number of bytes that experienced congestion from the received acks - with ECE bits set. I'm not sure this can be avoided[^2]. - -Now, on to the MB-ECN algorithm. The values for some constants -presented here have only been quickly tested; a _lot_ more scientific -scrutiny is definitely needed here to make any statements about the -performance of this algorithm. I will just explain the operation, and -provide some very preliminary measurement results. - -First, like DCTCP, the routers mark the ECN field based on the -outgoing queue depth. The current minimum queue depth to trigger and -ECN is 16 packets (implemented as a bit shift of the queue size when -writing a packet). We perform a logical OR with the previous value of -the packet. If the width of the ECN field would be a single bit, this -operation would be identical to DCTCP. - -At the _receiver_ side, the context maintains two state variables. - -* The floating sum (ECE) of the value of the (8-bit) ECN field over the -last 2N packets is maintained (currently N=5, so 32 -packets). This is a value between 0 and 28 + 5 - 1. - -* The number of packets received during a period of congestion. This - is just for internal use. - -If th ECE value is 0, no actions are performed at the receiver. - -If this ECE value becomes higher than 0 (there is some indication of -start of congestion), an immediate FLOW_UPDATE is sent with this -value. If a packet arrives with ECN = 0, the ECE value is _halved_. - -For every _increase_ in the ECE value, an immediate update is sent. - -If the ECE value remains stable or decreases, an update is sent only -every M packets (currently, M = 8). This is what the counter is for. - -If the ECE value returns to 0 after a period of congestion, an -immediate FLOW_UPDATE with the value 0 is sent. - -At the _sender_ side, the context keeps track of the actual congestion -window. The sender keeps track of: - -* The current sender ECE value, which is updated when receiving a - FLOW_UPDATE. - -* A bool indicating Slow Start, which is set to false when a - FLOW_UPDATE arrives. - -* A sender_side packet counter. 
If this exceeds the value of N, the - ECE is reset to 0. This protects the sender from lost FLOW_UPDATES - that signal the end of congestion. - -* The window size multiplier W. For all flows, the window starts at a - predetermined size, 2W ns. Currently W = 24, starting at - about 16.8ms. The power of 2 allows us to perform operations on the - window boundaries using bit shift arithmetic. - -* The current window start time (a single integer), based on the - multiplier. - -* The number of packets sent in the current window. If this is below a - PKT_MIN threshold before the start of a window period, the new - window size is doubled. If this is above a PKT_MAX threshold before - the start of a new window period, the new window size is halved. The - thresholds are currently set to 8 and 64, scaling the window width - to average sending ~36 packets in a window. When the window scales, - the value for the allowed bytes to send in this window (see below) - scales accordingly to keep the sender bandwidth at the same - level. These values should be set with the value of N at the - receiver side in mind. - -* The number bytes sent in this window. This is updated when sending - each packet. - -* The number of allowed bytes in this window. This is calculated at - the start of a new window: doubled at Slow Start, multiplied by a - factor based on sender ECE when there is congestion, and increased - by a fixed (scaled) value when there is no congestion outside of - Slow Start. Currently, the scaled value is 64KiB per 16.8ms. - -There is one caveat: what if no FLOW_UPDATE packets arrive at all? -DCTCP (being TCP) will timeout at the Retransmission TimeOut (RTO) -value (since its ECE information comes from ACK packets), but this -algorithm has no such mechanism at this point. The answer is that we -currently do not monitor flow liveness from the flow allocator, but a -Keepalive or Bidirectional Forwarding Detection (BFD)-like mechanism -for flows should be added for QoS maintenance, and can serve to -timeout the flow and reset it (meaning a full reset of the -context). - -# MB-ECN in action - -From version 0.18 onwards[^3], the state of the flow -- including its -congestion context -- can be monitored from the flow allocator -statics: - -```bash -$ cat /tmp/ouroboros/unicast.1/flow-allocator/66 -Flow established at: 2020-12-12 09:54:27 -Remote address: 99388 -Local endpoint ID: 2111124142794211394 -Remote endpoint ID: 4329936627666255938 -Sent (packets): 1605719 -Sent (bytes): 1605719000 -Send failed (packets): 0 -Send failed (bytes): 0 -Received (packets): 0 -Received (bytes): 0 -Receive failed (packets): 0 -Receive failed (bytes): 0 -Congestion avoidance algorithm: Multi-bit ECN -Upstream congestion level: 0 -Upstream packet counter: 0 -Downstream congestion level: 48 -Downstream packet counter: 0 -Congestion window size (ns): 65536 -Packets in this window: 7 -Bytes in this window: 7000 -Max bytes in this window: 51349 -Current congestion regime: Multiplicative dec -``` - -I ran a quick test using the ocbr tool (modified to show stats every -100ms) on a jFed testbed using 3 Linux servers (2 clients and a -server) in star configuration with a 'router' (a 4th Linux server) in -the center. The clients are connected to the 'router' over Gigabit -Ethernet, the link between the 'router' and server is capped to 100Mb -using ethtool[^4]. 
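
Before looking at the measurements, a brief aside on the marking rule described earlier: a minimal sketch of deriving the ECN octet from the output queue depth could look as follows. The function name, the shift and the saturation are illustrative assumptions, not the prototype code.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: ECN becomes non-zero from a queue depth of 16
 * packets onwards (a bit shift), is capped to one octet, and is OR'd
 * with the value already carried by the packet.                       */
uint8_t mark_ecn(uint8_t pkt_ecn, size_t queue_depth)
{
        size_t mark = queue_depth >> 4;

        if (mark > UINT8_MAX)
                mark = UINT8_MAX;

        return pkt_ecn | (uint8_t) mark;
}
```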
- -Output from the ocbr tool: - -``` -Flow 64: 998 packets ( 998000 bytes)in 101 ms => 9880.8946 pps, 79.0472 Mbps -Flow 64: 1001 packets ( 1001000 bytes)in 101 ms => 9904.6149 pps, 79.2369 Mbps -Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9882.8697 pps, 79.0630 Mbps -Flow 64: 998 packets ( 998000 bytes)in 101 ms => 9880.0143 pps, 79.0401 Mbps -Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9887.6627 pps, 79.1013 Mbps -Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9891.0891 pps, 79.1287 Mbps -New flow. -Flow 64: 868 packets ( 868000 bytes)in 102 ms => 8490.6583 pps, 67.9253 Mbps -Flow 65: 542 packets ( 542000 bytes)in 101 ms => 5356.5781 pps, 42.8526 Mbps -Flow 64: 540 packets ( 540000 bytes)in 101 ms => 5341.5105 pps, 42.7321 Mbps -Flow 65: 534 packets ( 534000 bytes)in 101 ms => 5285.6111 pps, 42.2849 Mbps -Flow 64: 575 packets ( 575000 bytes)in 101 ms => 5691.4915 pps, 45.5319 Mbps -Flow 65: 535 packets ( 535000 bytes)in 101 ms => 5291.0053 pps, 42.3280 Mbps -Flow 64: 561 packets ( 561000 bytes)in 101 ms => 5554.3455 pps, 44.4348 Mbps -Flow 65: 533 packets ( 533000 bytes)in 101 ms => 5272.0079 pps, 42.1761 Mbps -Flow 64: 569 packets ( 569000 bytes)in 101 ms => 5631.3216 pps, 45.0506 Mbps -``` - -With only one client running, the flow is congestion controlled to -about ~80Mb/s (indicating the queue limit at 16 packets may be a bit -too low a bar). When the second client starts sending, both flows go -quite quickly (at most 100ms) to a fair state of about 42 Mb/s. - -The IO graph from wireshark shows a reasonably stable profile (i.e. no -big oscillations because of AIMD), when switching the flows on the -clients on and off which is on par with DCTCP and not unexpected -keeping in mind the similarities between the algorithms: - -{{
}} - -The periodic "gaps" were not seen at the ocbr endpoint applicationand -may have been due to tcpdump not capturing everything that those -points, or possibly a bug somewhere. - -As said, a lot more work is needed analyzing this algorithm in terms -of performance and stability[^5]. But I am feeling some excitement about its -simplicity and -- dare I say it? -- elegance. - -Stay curious! - -Dimitri - -[^1]: Additive Increase increases the window size with 1 MSS each - RTT. Slow Start doubles the window size each RTT. - -[^2]: I'm pretty sure the kernel developers would if they could. -[^3]: Or the current "be" branch for the less patient. -[^4]: Using Linux traffic control (```tc```) to limit traffic adds - kernel queues and may interfere with MB-ECN. -[^5]: And the prototype implementation as a whole! diff --git a/content/en/blog/news/20201212-congestion.png b/content/en/blog/news/20201212-congestion.png deleted file mode 100644 index 8e5b89f..0000000 Binary files a/content/en/blog/news/20201212-congestion.png and /dev/null differ diff --git a/content/en/blog/news/20201219-congestion-avoidance.md b/content/en/blog/news/20201219-congestion-avoidance.md deleted file mode 100644 index 7391091..0000000 --- a/content/en/blog/news/20201219-congestion-avoidance.md +++ /dev/null @@ -1,313 +0,0 @@ ---- -date: 2020-12-19 -title: "Exploring Ouroboros with wireshark" -linkTitle: "Exploring Ouroboros with wireshark " -description: "" -author: Dimitri Staessens ---- - -I recently did some -[quick tests](/blog/2020/12/12/congestion-avoidance-in-ouroboros/#mb-ecn-in-action) -with the new congestion avoidance implementation, and thought to -myself that it was a shame that Wireshark could not identify the -Ouroboros flows, as that could give me some nicer graphs. - -Just to be clear, I think generic network tools like tcpdump and -wireshark -- however informative and nice-to-use they are -- are a -symptom of a lack of network security. The whole point of Ouroboros is -that it is _intentionally_ designed to make it hard to analyze network -traffic. Ouroboros is not a _network stack_[^1]: one can't simply dump -a packet from the wire and derive the packet contents all the way up -to the application by following identifiers for protocols and -well-known ports. Using encryption to hide the network structure from -the packet is shutting the door after the horse has bolted. - -To write an Ouroboros dissector, one needs to know the layered -structure of the network at the capturing point at that specific point -in time. It requires information from the Ouroboros runtime on the -capturing machine and at the exact time of the capture, to correctly -analyze traffic flows. I just wrote a dissector that works for my -specific setup[^2]. - -## Congestion avoidance test - -First, a quick refresh on the experiment layout, it's the the same -4-node experiment as in the -[previous post](/blog/2020/12/12/congestion-avoidance-in-ouroboros/#mb-ecn-in-action) - -{{
}} - -I tried to draw the setup as best as I can in the figure above. - -There are 4 rack mounted 1U servers, connected over Gigabit Ethernet -(GbE). Physically there is a big switch connecting all of them, but -each "link" is separated as a port-based VLAN, so there are 3 -independent Ethernet segments. We create 3 ethernet _layers_, drawn -in a lighter gray, with a single unicast layer -- consisting of 4 -unicast IPC processes (IPCPs) -- on top, drawn in a darker shade of -gray. The link between the router and server has been capped to 100 -megabit/s using ```ethtool```[^3], and traffic is captured on the -Ethernet NIC at the "Server" node using ```tcpdump```. All traffic is -generated with our _constant bit rate_ ```ocbr``` tool trying to send -about 80 Mbit/s of application-level throughput over the unicast -layer. - -{{
}} - -The graph above shows the bandwidth -- as captured on the congested -100Mbit Ethernet link --, separated for each traffic flow, from the -same pcap capture as in my previous post. A flow can be identified by -a (destination address, endpoint ID)-pair, and since the destination -is all the same, I could filter out the flows by simply selecting them -based on the (64-bit) endpoint identifier. - -What you're looking at is that first, a flow (green starts), at around -T=14s, a new flow enters (red) that stops at around T=24s. At around -T=44s, another flow enters (blue) for about 14 seconds, and finally, a -fourth (orange) flow enters at T=63s. The first (green) flow exits at -around T=70s, leaving all the available bandwidth for the orange flow. - -The most important thing that I wanted to check is that when there are -multiple flows, _if_ and _how fast_ they would converge to the same -bandwidth. I'm not dissatisfied with the initial result: the answers -seem to be _yes_ and _pretty fast_, with no observable oscillation to -boot[^4] - -## Protocol overview - -Now, the wireshark dissector can be used to present some more details -about the Ouroboros protocols in a familiar setting -- make it more -accessible to some -- so, let's have a quick look. - -The Ouroboros network protocol has -[5 fields](/docs/concepts/protocols/#network-protocol): - -``` -| DST | TTL | QOS | ECN | EID | -``` - -which we had to map to the Ethernet II protocol for our ipcpd-eth-dix -implementation. The basic Ethernet II MAC (layer-2) header is pretty -simple. It has 2 6-byte addresses (dst, src) and a 2-byte Ethertype. - -Since Ethernet doesn't do QoS or congestion, the main missing field -here is the EID. We could have mapped it to the Ethertype, but we -noticed that a lot of routers and switches drop unknown Ethertypes -(and, for the purposes of this blog post here: it would have all but -prevented to write the dissector). So we made the ethertype -configurable per layer (so it can be set to a value that is not -blocked by the network), and added 2 16-bit fields after the Ethernet -MAC header for an Ouroboros layer: - -* Endpoint ID **eid**, which works just like in the unicast layer, to - identify the N+1 application (in our case: a data transfer flow and - a management flow for a unicast IPC process). - -* A length field **len**, which is needed because Ethernet NICs pad - frames that are smaller than 64 bytes in length with trailing zeros - (and we receive these zeros in our code). A length field is present - in Ethernet type I, but since most "Layer 3" protocols also had a - length field, it was re-purposed as Ethertype in Ethernet II. The - value of the **len** field is the length of the **data** payload. - -The Ethernet layer that spans that 100Mbit link has Ethertype 0xA000 -set (which is the Ouroboros default), the Ouroboros plugin hooks into -that ethertype. - -On top of the Ethernet layer, we have a unicast, layer with the 5 -fields specified above. The dissector also shows the contents of the -flow allocation messages, which are (currently) sent to EID = 0. 
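
Going back to the two fields that the ipcpd-eth-dix adds after the Ethernet II header: as a mental model they can be sketched as the struct below. The names are mine, chosen to match the description above; the real definition in the sources may differ.

```c
#include <stdint.h>

/* Sketch of the 4 extra bytes the ipcpd-eth-dix adds after the
 * Ethernet II header; illustrative names only.                        */
struct dix_ouro_hdr {
        uint16_t eid; /* endpoint ID of the N+1 flow (0: flow allocator) */
        uint16_t len; /* payload length, needed because of frame padding */
} __attribute__((packed));
```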
- -So, the protocol header as analysed in the experiment is, starting -from the "wire": - -``` -+---------+---------+-----------+-----+-----+------ -| dst MAC | src MAC | Ethertype | eid | len | data /* ETH LAYER */ -+---------+---------+-----------+-----+-----+------ - - /* eid == 0 -> ipcpd-eth flow allocator, */ - /* this is not analysed */ - -+-----+-----+-----+-----+-----+------ -| DST | QOS | TTL | ECN | EID | DATA /* UNICAST LAYER */ -+-----+-----+-----+-----+-----+------ - - /* EID == 0 -> flow allocator */ - -+-----+-------+-------+------+------+-----+-------------+ -| SRC | R_EID | S_EID | CODE | RESP | ECE | ... QOS ....| /* FA */ -+-----+-------+-------+------+------+-----+-------------+ -``` - -## The network protocol - -{{
}} - -We will first have a look at packets captured around the point in time -where the second (red) flow enters the network, about 14 seconds into -the capture. The "N+1 Data" packets in the image above all belong to -the green flow. The ```ocbr``` tool that we use sends 1000-byte data -units that are zeroed-out. The packet captured on the wire is 1033 -bytes in length, so we have a protocol overhead of 33 bytes[^5]. We -can break this down to: - -``` - ETHERNET II HEADER / 14 / - 6 bytes Ethernet II dst - 6 bytes Ethernet II src - 2 bytes Ethernet II Ethertype - OUROBOROS ETH-DIX HEADER / 4 / - 2 bytes eid - 2 byte len - OUROBOROS UNICAST NETWORK HEADER / 15 / - 4 bytes DST - 1 byte QOS - 1 byte TTL - 1 byte ECN - 8 bytes EID - --- TOTAL / 33 / - 33 bytes -``` - -The **Data (1019 bytes)** reported by wireshark is what Ethernet II -sees as data, and thus includes the 19 bytes for the two Ouroboros -headers. Note that DST length is configurable, currently up to 64 -bits. - -Now, let's have a brief look at the values for these fields. The -**eid** is 65, this means that the _data-transfer flow_ established -between the unicast IPCPs on the router and the server (_uni-r_ and -_uni-s_ in our experiment figure) is identified by endpoint id 65 in -the eth-dix IPCP on the Server machine. The **len** is 1015. Again, no -surprises, this is the length of the Ouroboros unicast network header -(15 bytes) + the 1000 bytes payload. - -**DST**, the destination address is 4135366193, a 32-bit address -that was randomly assigned to the _uni-s_ IPCP. The QoS cube is 0, -which is the default best-effort QoS class. *TTL* is 59. The starting -TTL is configurable for a layer, the default is 60, and it was -decremented by 1 in the _uni-r_ process on the router node. The packet -experienced no congestion (**ECN** is 0), and the endpoint ID is a -64-bit random number, 475...56. This endpoint ID identifies the flow -endpoint for the ```ocbr``` server. - -## The flow request - -{{
}} - -The first "red" packet that was captured is the one for the flow -allocation request, **FLOW REQUEST**[^6]. As mentioned before, the -endpoint ID for the flow allocator is 0. - -A rather important remark is in place here: Ouroboros does not allow a -UDP-like _datagram service_ from a layer. With which I mean: fabricate -a packet with the correct destination address and some known EID and -dump it in the network. All traffic that is offered to an Ouroboros -layer requires a _flow_ to be allocated. This keeps the network layer -in control its resources; the protocol details inside a layer are a -secret to that layer. - -Now, what about that well-known EID=0 for the flow allocator (FA)? And -the directory (Distributed Hash Table, DHT) for that matter, which is -currently on EID=1? Doesn't that contradict the "no datagram service" -statement above? Well, no. These components are part of the layer and -are thus inside the layer. The DHT and FA are internal -components. They are direct clients of the Data Transfer component. -The globally known EID for these components is an absolute necessity -since they need to be able to reach endpoints more than a hop -(i.e. a flow in a lower layer) away. - -Let's now look inside that **FLOW REQUEST** message. We know it is a -request from the **msg code** field[^7]. - -This is the **only** packet that contains the source (and destination) -address for this flow. There is a small twist, this value is decoded -with different _endianness_ than the address in the DT protocol output -(probably a bug in my dissector). The source address 232373199 in the -FA message corresponds to the address 3485194509 in the DT protocol -(and in the experiment image at the top): the source of our red flow -is the "Client 2" node. Since this is a **FLOW REQUEST**, the remote -endpoint id is not yet known, and set to 0[^8. The source endpoint ID --- a 64-bit randomly generated value unique to the source IPC -process[^9] -- is sent to the remote. The other fields are not -relevant for this message. - -## The flow reply - -{{
}} - -Now, the **FLOW REPLY** message for our request. It originates our -machine, so you will notice that the TTL is the starting value of 60. -The destination address is what we sent in our original **FLOW -REQUEST** -- add some endianness shenanigans. The **FLOW REPLY** -mesage response sends the newly generated source endpoint[^10] ID, and -this packet is the **only** packet that contains both endpoint IDs -for this flow. - -## Congestion / flow update - -{{
}} - -Now a quick look at the congestion avoidance mechanisms. The -information for the Additive Increase / Multiple Decrease algorithm is -gathered from the **ECN** field in the packets. When both flows are -active, they experience congestion since the requested bandwidth from -the two ```ocbr``` clients (180Mbit) exceeds the 100Mbit link, and the -figure above shows a packet marked with an ECN value of 11. - -{{
}} - -When the packets on a flow experience congestion, the flow allocator -at the endpoint (the one our _uni-s_ IPCP) will update the sender with -an **ECE** _Explicit Congestion Experienced_ value; in this case, 297. -The higher this value, the quicker the sender will decrease its -sending rate. The algorithm is explained a bit in my previous -post. - -That's it for today's post, I hope it provides some new insights how -Ouroboros works. As always, stay curious. - -Dimitri - -[^1]: Neither is RINA, for that matter. - -[^2]: This quick-and-dirty dissector is available in the - ouroboros-eth-uni branch on my - [github](https://github.com/dstaesse/wireshark/) - -[^3]: The prototype is able to handle Gigabit Ethernet, this is mostly - to make the size of the capture files somewhat manageable. - -[^4]: Of course, this needs more thorough evaluation with more - clients, distributions on the latency, different configurations - for the FRCP protocol in the N+1 and all that jazz. I have, - however, limited amounts of time to spare and am currently - focusing on building and documenting the prototype and tools so - that more thorough evaluations can be done if someone feels like - doing them. - -[^5]: A 4-byte Ethernet Frame Check Sequence (FCS) is not included in - the 'bytes on the wire'. As a reference, the minimum overhead - for this kind of setup using UDP/IPv4 is 14 bytes Ethernet + 20 - bytes IPv4 + 8 bytes UDP = 42 bytes. - -[^6]: Actually, in a larger network there could be some DHT traffic - related to resolving the address, but in such a small network, - the DHT is basically a replicated database between all 4 nodes. - -[^7]: The reason it's not the first field in the protocol has to to - with performance of memory alignment in x86 architectures. - -[^8]: We haven't optimised the FA protocol not to send fields it - doesn't need for that particular message type -- yet. - -[^9]: Not the host machine, but that particular IPCP on the host - machine. You can have multiple IPCPs for the same layer on the - same machine, but in this case, expect correlation between their - addresses. 64-bits / IPCP should provide some security against - remotes trying to hack into another service on the same host by - guessing EIDs. - -[^10]: This marks the point in space-time where I notice the - misspelling in the dissector. 
\ No newline at end of file diff --git a/content/en/blog/news/20201219-congestion.png b/content/en/blog/news/20201219-congestion.png deleted file mode 100644 index 5675438..0000000 Binary files a/content/en/blog/news/20201219-congestion.png and /dev/null differ diff --git a/content/en/blog/news/20201219-exp.svg b/content/en/blog/news/20201219-exp.svg deleted file mode 100644 index 68e09e2..0000000 --- a/content/en/blog/news/20201219-exp.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/content/en/blog/news/20201219-ws-0.png b/content/en/blog/news/20201219-ws-0.png deleted file mode 100644 index fd7a83a..0000000 Binary files a/content/en/blog/news/20201219-ws-0.png and /dev/null differ diff --git a/content/en/blog/news/20201219-ws-1.png b/content/en/blog/news/20201219-ws-1.png deleted file mode 100644 index 0f07fd0..0000000 Binary files a/content/en/blog/news/20201219-ws-1.png and /dev/null differ diff --git a/content/en/blog/news/20201219-ws-2.png b/content/en/blog/news/20201219-ws-2.png deleted file mode 100644 index 7cd8b7d..0000000 Binary files a/content/en/blog/news/20201219-ws-2.png and /dev/null differ diff --git a/content/en/blog/news/20201219-ws-3.png b/content/en/blog/news/20201219-ws-3.png deleted file mode 100644 index 2a6f6d5..0000000 Binary files a/content/en/blog/news/20201219-ws-3.png and /dev/null differ diff --git a/content/en/blog/news/20201219-ws-4.png b/content/en/blog/news/20201219-ws-4.png deleted file mode 100644 index 3a0ef8c..0000000 Binary files a/content/en/blog/news/20201219-ws-4.png and /dev/null differ diff --git a/content/en/blog/news/_index.md b/content/en/blog/news/_index.md deleted file mode 100644 index c10cfa2..0000000 --- a/content/en/blog/news/_index.md +++ /dev/null @@ -1,5 +0,0 @@ ---- -title: "News About Docsy" -linkTitle: "News" -weight: 20 ---- diff --git a/content/en/blog/releases/_index.md b/content/en/blog/releases/_index.md deleted file mode 100644 index b1d9eb4..0000000 --- a/content/en/blog/releases/_index.md +++ /dev/null @@ -1,8 +0,0 @@ - ---- -title: "New Releases" -linkTitle: "Releases" -weight: 20 ---- - - diff --git a/content/en/blog/releases/upcoming.md b/content/en/blog/releases/upcoming.md deleted file mode 100644 index f984380..0000000 --- a/content/en/blog/releases/upcoming.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -date: 2019-10-06 -title: "Plans for 0.16" -linkTitle: "Ouroboros 0.16" -description: "Ouroboros 0.16" -author: Dimitri Staessens ---- diff --git a/content/en/docs/Releases/0_18.md b/content/en/docs/Releases/0_18.md new file mode 100644 index 0000000..c489d33 --- /dev/null +++ b/content/en/docs/Releases/0_18.md @@ -0,0 +1,109 @@ +--- +date: 2021-02-12 +title: "Ouroboros 0.18" +linkTitle: "Ouroboros 0.18" +description: "Major additions and changes in 0.18.0" +author: Dimitri Staessens +--- + +With version 0.18 come a number of interesting updates to the prototype. + +### Automated Repeat-Request (ARQ) and flow control + +We finished the implementation of the base retransmission +logic. Ouroboros will now send, receive and handle acknowledgments +under packet loss conditions. It will also send and handle window +updates for flow control. The operation of flow control is very +similar to the operation of window-based flow control in TCP, the main +difference being that our sequence numbers are per-packet instead of +per-byte. 
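
As a toy illustration of that difference (this is not the prototype's code, and the names are made up): with per-packet sequence numbers, checking whether a packet fits the advertised window is a comparison on packet counts, where TCP does the equivalent bookkeeping in bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy example: wrap-safe check whether seqno lies in the window
 * [lwe, rwe), with edges counted in packets rather than in bytes.     */
bool in_window(uint32_t seqno, uint32_t lwe, uint32_t rwe)
{
        return (uint32_t)(seqno - lwe) < (uint32_t)(rwe - lwe);
}
```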
+ +The previous version of FRCP had some partial implementation of the +ARQ functionality, such as piggybacking ACK information on _writes_ +and handling sequence numbers on _reads_. But now, Ouroboros will also +send (delayed) ACK packets without data if the application is not +sending, and it will finish sending unacknowledged data when a flow is +closed (this can be turned off with the FRCTFLINGER flag). + +Recall that Ouroboros has this logic implemented in the application +library; there is no separate component (or kernel) managing +transmit and receive buffers and retransmission. Furthermore, our +implementation doesn't add a thread to the application. If a +single-threaded application uses ARQ, it will remain single-threaded. + +It's not unlikely that in the future we will add the option for the +library to start a dedicated thread to manage ARQ, as this may improve +read/write call durations. Other +future additions may include fast-retransmit and selective ACK +support. + +The most important characteristic of Ouroboros FRCP compared to TCP +and derivative protocols (QUIC, SCTP, ...) is that it is 100% +independent of congestion control, which allows it to operate at +real RTT timescales (i.e. microseconds in datacenters) without fear of +RTT underestimates severely capping throughput. Another characteristic +is that the RTT estimate really measures the responsiveness of the +application, not of the kernel on the machine. + +A detailed description of the operation of ARQ can be found +in the [protocols](/docs/concepts/protocols/#operation-of-frcp) +section. + +### Congestion Avoidance + +The next big addition is congestion avoidance. The default +configuration of the unicast layer will now congestion-control all client +traffic sent over it[^1]. As noted above, congestion avoidance in +Ouroboros is completely independent of the operation of ARQ and flow +control. For more information about how this all works, have a look at +the developer blog +[here](/blog/2020/12/12/congestion-avoidance-in-ouroboros/) and +[here](/blog/2020/12/19/exploring-ouroboros-with-wireshark/). + +### Revision of the flow allocator + +We also made a change to the flow allocator, more specifically to the +Endpoint IDs, which now use 64-bit identifiers. The reason for this change is +to make it harder to guess these endpoint identifiers. In TCP, +applications can listen on sockets that are bound to a port on a (set +of) IP addresses. You can't imagine how many hosts are trying to +brute-force SSH login passwords on TCP port 22. To make this at least +a bit harder, Ouroboros has no well-known application ports, and after +this patch the endpoint identifiers are roughly equivalent to a 32-bit random +number. Note that in an ideal Ouroboros deployment, sensitive +applications such as SSH login should run on a different layer/network +than publicly available applications. + +### Revision of the ipcpd-udp + +The ipcpd-udp has gone through some revisions during its lifetime. In +the beginning, we wanted to emulate the operation of an Ouroboros +layer, having the flow allocator listen on a certain UDP port, and +mapping endpoint identifiers to random ephemeral UDP ports. As an +example, the source would create a UDP socket on an ephemeral port, +e.g. 30927, and send a request for a new flow to the fixed, well-known +Ouroboros UDP port (3531) at the receiver. The receiver would also create +a socket on an ephemeral UDP port, say 23705, and send a response back +to the source on UDP port 3531. 
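As an aside, a minimal sketch of the source side of that original scheme might look as follows in plain C. Only the fixed port 3531 and the use of an ephemeral source port come from the description above; the destination address and the payload are made-up placeholders, not the real flow allocator protocol.

```c
/*
 * Illustrative sketch of the first ipcpd-udp scheme: the source sends a
 * flow allocation request from an ephemeral UDP port to the fixed
 * Ouroboros UDP port (3531) at the destination. Payload and address are
 * placeholders for the example.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define OUROBOROS_UDP_PORT 3531

int main(void)
{
        int                fd;
        struct sockaddr_in dst;
        const char *       req = "FLOW_REQ"; /* placeholder payload */

        /* Ephemeral source port: the kernel picks one (e.g. 30927). */
        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(OUROBOROS_UDP_PORT);
        inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr); /* example address */

        /* The request goes to the fixed port; the receiver answers from
           its own socket (e.g. ephemeral port 23705). */
        if (sendto(fd, req, strlen(req), 0,
                   (struct sockaddr *) &dst, sizeof(dst)) < 0)
                perror("sendto");

        close(fd);

        return 0;
}
```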
Traffic for the "client" flow would then run over the UDP port pair +(30927, 23705). This caused a bunch of headaches with computers +behind NAT firewalls, making that scheme useful only in lab +environments. To make it more usable, the next revision used a single +fixed incoming UDP port for the flow allocator protocol, using an +ephemeral UDP port on the sender side per flow, and adding the flow +allocator endpoints as a "next header" inside UDP. So traffic would +always be sent to destination UDP port 3531. The benefits were that only a +single port was needed in the NAT forwarding rules, and that anyone +running Ouroboros would be able to receive allocation messages, which +nudged all users a bit towards participating in a mesh topology. +However, opening a certain UDP port is still a hassle, so in this +(most likely final) revision, we just run the flow allocator in the +ipcpd-udp as a UDP server on a (configurable) port. No NAT +firewall configuration is required if you only want to connect (but if you +want to accept connections, opening UDP port 3531 is still required). + +The full changelog can be browsed in +[cgit](/cgit/ouroboros/log/?showmsg=1). + +[^1]: This is not a claim that every packet inside a layer is + congestion-controlled: internal management traffic of the layer (flow + allocator protocol, etc.) is not congestion-controlled. \ No newline at end of file diff --git a/content/en/docs/Releases/_index.md b/content/en/docs/Releases/_index.md new file mode 100644 index 0000000..8328c33 --- /dev/null +++ b/content/en/docs/Releases/_index.md @@ -0,0 +1,6 @@ + +--- +title: "Releases" +linkTitle: "Release notes" +weight: 120 +--- -- cgit v1.2.3