---
date: 2021-12-29
title: "Behaviour of Ouroboros flows vs UDP sockets and TCP connections/sockets"
linkTitle: "Flows vs connections/sockets"
author: Dimitri Staessens
---

A couple of days ago, I received a very good question from someone who
was playing around with Ouroboros/O7s. He started from the
[_oecho_](https://ouroboros.rocks/cgit/ouroboros/tree/src/tools/oecho/oecho.c#n94) tool.

_oecho_ is a very simple application. It establishes what we call a
"raw" flow. Raw flows have no fancy features; they are the best-effort
class of packet transport (a bit like UDP), and they do not have a
Flow and Retransmission Control Protocol (FRCP) machine. This person
changed oecho to use a _reliable_ flow, modified it slightly, ran
into some unexpected behaviour, and then asked: **is it possible to
detect a half-closed connection?** Yes, it is, but it's not
implemented (yet). I think it's worth answering this in a fair bit
of detail, as it highlights some differences between O7s flows and
(TCP) connections.

A bit of knowledge of the core protocols in Ouroboros is needed; they
are described [here](/docs/concepts/protocols/), and the flow
allocator [here](/docs/concepts/fa/). If you haven't read these in a
while, it will be useful to revisit them first to make the most out of
this post.

## The oecho application

The oecho server waits for a client to request a flow, reads the
message from the client, sends it back, and deallocates the flow.

The client allocates a _raw_ flow (the QoS parameter for the flow is
_NULL_), writes a message, reads the response, and also deallocates
the flow. A condensed sketch of both sides follows.
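
Condensed to its essence, the code looks roughly like this (a sketch
after the linked oecho source; error handling is omitted, and the
exact signatures may differ between O7s versions):

```
#include <ouroboros/dev.h>

#include <string.h>

#define BUF_SIZE 256

/* Server: accept a raw flow, echo one message, deallocate. */
static int server_main(void)
{
        char buf[BUF_SIZE];

        while (1) {
                int     fd    = flow_accept(NULL, NULL);   /* blocks */
                ssize_t count = flow_read(fd, buf, BUF_SIZE);
                if (count > 0)
                        flow_write(fd, buf, count);
                flow_dealloc(fd);
        }
}

/* Client: allocate a raw flow (NULL QoS spec), send, read the echo. */
static int client_main(void)
{
        char * message = "Client says hi!";
        char   buf[BUF_SIZE];

        int fd = flow_alloc("oecho", NULL, NULL);          /* blocks */
        flow_write(fd, message, strlen(message) + 1);
        flow_read(fd, buf, BUF_SIZE);
        flow_dealloc(fd);

        return 0;
}
```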

In a schematic, the communication for this simple application looks
like this[^1]:

{{<figure width="90%" src="/blog/20211229-oecho-1.png">}}

All the API calls used are inherently _blocking_ calls. They wait for
some event to happen and do not always return immediately.

First, the client will allocate a flow to the server. The server's
_flow\_accept()_ call will return when it receives the request, the
client's _flow\_alloc()_ call will return when the response message is
received from the server. This exchange agrees on the Endpoint IDs and
possibly the flow characteristics (QoS) that the application will
use. For a raw flow, this will only set the Endpoint IDs that will be
used in the DT protocol[^2]. On the server side, the _flow\_accept()_
returns, and the server calls _flow\_read()_. While the _flow\_read()_
is still waiting on the server side, the flow allocation response is
underway to the client. The reception of the allocation response
causes the _flow\_alloc()_ call on the client side to return and the
(raw) flow is established[^3].

Now the client writes a packet, the server reads it and sends it
back. Immediately after sending that packet, the server _deallocates_
the flow. The response, containing a copy of the client message, is
still on its way to the client. After the client receives it, it also
deallocates the flow. Flow deallocation destroys the state associated
with the flow and will release the EIDs for reuse. In this case of
raw, unreliable flows, _flow\_dealloc()_ will return almost
immediately.
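
In timeline form (a textual rendering of the figure above; time runs
downward):

```
client                                      server
flow_alloc("oecho", ...)  [blocks]          flow_accept()  [blocks]
    -- flow allocation request  -->         flow_accept() returns
                                            flow_read()    [blocks]
flow_alloc() returns
    <-- flow allocation response --
flow_write(message)
    --         message          -->         flow_read() returns
                                            flow_write(message)
                                            flow_dealloc() returns at once
flow_read() returns
    <--     echoed message      --
flow_dealloc() returns at once
```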

## Flows vs connections

The most important thing to notice from the diagram for _oecho_ is
that flow deallocation _does not send any messages_! It only cleans up
_local_ state. Suppose the server sent a message to destroy the flow
immediately after it sends the response. What if that message to
destroy the flow arrived _before_ the response? When do we destroy
the state associated with the flow? Flows are not connections. Raw
flows like the one used in oecho behave like UDP: no guarantees. Now,
let's have a look at _reliable_ flows, which behave more like TCP.

## A modification to oecho with reliable flows

{{<figure width="90%" src="/blog/20211229-oecho-2.png">}}

To use a reliable flow, the client calls _flow\_alloc()_ with a
different QoS spec (qos_data). The flow allocation works exactly as
before, except that the request now contains a data QoS instead of a
raw QoS. Upon reception of this request, the server will create a
protocol machine for FRCP, the protocol in O7s that is in charge of
delivering packets reliably, in order and without duplicates. FRCP
also performs flow control to avoid sending more packets than the
server can process. When the flow allocation response arrives at the
client, it will also create an FRCP protocol instance. When these
FRCP instances are created, they are in an initial state where the
Delta-t timers are _timed out_. This is the state that allows
starting a new _run_. I will not explain every detail of FRCP here;
the details are in the
[protocols](/docs/concepts/protocols/#flow-and-retransmission-control-protocol-frcp)
section.
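
The change on the client side is minimal. A sketch (assuming the
predefined qos_data spec from ouroboros/qos.h; error handling
omitted):

```
#include <ouroboros/dev.h>
#include <ouroboros/qos.h>

/* Pass a data QoS spec instead of NULL (raw) to request a reliable */
/* flow; this is what triggers creation of the FRCP instances.      */
qosspec_t qs = qos_data;
int       fd = flow_alloc("oecho", &qs, NULL);
```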

Now, the client sends its first packet, with a randomly chosen
sequence number (100) and the Data Run Flag (DRF) enabled. The DRF
signals that there were no _previously unacknowledged_ packets in the
currently tracked packet sequence, which is what allows FRCP to avoid
a 3-way handshake.

When that packet with sequence number 100 arrives in the FRCP protocol
machine at the server, it will detect that DRF is set to 1, and that
it is in an initial state where all timers are timed out. It will
start accepting packets for this new run starting with sequence number
100. The server almost immediately sends a response packet back. It
has no active sending run, so a random sequence number is chosen (300)
and the DRF is set to 1. This packet will contain an acknowledgment
for the received packet. FRCP acknowledgements contain the lowest
acceptable packet number (so 101). After sending the packet, the
server calls _dealloc()_, which will block on FRCP still having
unacknowledged packets.

Now the client gets the return packet. It has no active incoming run
(the receiver state is the initial timed-out state), so like the
server, it will see that the DRF is set to 1 and accept this new
incoming run starting from sequence number 300. The client has no
more data packets to send, so the deallocation will send a _bare_
acknowledgement for 301 and exit. At the server side, the
_flow\_dealloc()_ call will exit after it receives that
acknowledgement. Not drawn in the figure is that the flow identifiers
(EIDs) will only time out internally after a full Delta-t timeout. TCP
does something similar and will not reuse closed connection state for
2 * Maximum Segment Lifetime (MSL).
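
Summarized in the same notation as the TCP diagram further below, the
exchange in the figure is:

```
Client                                               Server

  <SEQ=100><DRF=1> data     -->   new run accepted starting at 100,
                                  flow_read() returns

  flow_read() returns       <--   <SEQ=300><ACK=101><DRF=1> data
                                  flow_dealloc() blocks (300 unacked)

  flow_dealloc():
  <ACK=301> (bare ack)      -->   flow_dealloc() returns
```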

## Unexpected behaviour

{{<figure width="90%" src="/blog/20211229-oecho-3.png">}}

While playing around with the prototype, a modification was made to
oecho as above: another _flow\_read()_ was added to the client. As you
can see from the diagram, no packet will ever be sent for it, and, if
no timeout is set on the read operation, after the server has
deallocated the flow (and re-entered the loop to accept a new flow),
the client will remain in limbo, forever stuck on that
_flow\_read()_. And so, I got the following question:

```
I would have expected the second call to abort with an error
code. However, the client gets stuck while the server is waiting for a
new request. Is this expected? If so, is it possible to detect a
half-closed connection?
```
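
In code, the modified client boils down to this (a sketch; names as
in the earlier sketches):

```
int fd = flow_alloc("oecho", &qs, NULL);

flow_write(fd, message, strlen(message) + 1);
flow_read(fd, buf, BUF_SIZE);  /* returns the echoed message         */
flow_read(fd, buf, BUF_SIZE);  /* blocks forever: the server already */
                               /* deallocated, but deallocation      */
                               /* sends no packet to wake this call  */
flow_dealloc(fd);
```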

## A _"half-closed connection"_

So, first things first: the observation is correct, and that second
call should (and soon will) exit with an error, as the flow is no
longer valid. Currently it will only exit if there was an error in the
FRCP connection (a retransmitted packet fails to receive an
acknowledgment within a certain timeout). It should also exit on a
remotely deallocated flow. But how will Ouroboros detect that?

Now, a "half closed connection" comes from TCP. TCP afficionados will
probably think that I need to add something to FRCP, like
[FIN](https://www.googlecloudcommunity.com/gc/Cloud-Product-Articles/TCP-states-explained/ta-p/78462)
at the end of TCP to signal the end of a flow[^4]:

```
TCP A                                                TCP B

  1.  ESTABLISHED                                          ESTABLISHED

  2.  (Close)
      FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK>  --> CLOSE-WAIT

  3.  FIN-WAIT-2  <-- <SEQ=300><ACK=101><CTL=ACK>      <-- CLOSE-WAIT

  4.                                                       (Close)
      TIME-WAIT   <-- <SEQ=300><ACK=101><CTL=FIN,ACK>  <-- LAST-ACK

  5.  TIME-WAIT   --> <SEQ=101><ACK=301><CTL=ACK>      --> CLOSED

  6.  (2 MSL) CLOSED
```

While FRCP performs functions that are present in TCP, not everything
transfers so readily. Purely from a design perspective, it's just not
FRCP's job to keep a flow alive or to detect whether the flow is
alive. Its job is to deliver packets reliably, and everything it needs
to do that job is present. But would adding FINs work?

Well, the server can crash just before the dealloc() call, leaving us
in the current situation (the client won't receive FINs). To resolve
that, we would also need a keepalive mechanism. Yes, TCP also has a
keepalive mechanism. And would adding that solve it? Not to my
satisfaction. Ouroboros flows are not connections; they don't always
have an end-to-end protocol (FRCP) running[^5]. So if we added FIN and
keepalive to FRCP, we would still need to add something _similar_ for
flows that don't have FRCP: we would be duplicating the keepalive
functionality somewhere else. The main objective of O7s is to avoid
functional duplication. So, can we kill all these birds with one
stone? Detect flows that are down? Sure we can!

## Flow liveness monitoring

But we need to take a bird's-eye view of the flow first.

On the server side, the allocated flow has a flow endpoint with
internal Flow ID (FID 16), to which the oecho server writes using its
flow descriptor, fd=71. On the client side, the client reads/writes
from its fd=68, which behind the scenes links to the flow endpoint
with ID 9. On the network side, the flow allocator in the IPCPs also
reads and writes from these endpoints to transfer packets along the
network. So, the flow endpoint marks the boundary between the
application and the "network".

{{<figure width="80%" src="/blog/20211229-oecho-4.png">}}

This is drawn in the figure above. I'll repeat it because it is
important: the data structure associated with a flow at the endpoints
is this "flow endpoint". It forms the bridge between the application
and the network layer. The role of the IRMd is to manage these
endpoints and the associated data structures.

Flow deallocation is a two-step process: both the IPCP and the
application have a _dealloc()_ call. The endpoint is only destroyed
when _both_ the application process and the IPCP have signalled that
they are done with it. So a _flow\_dealloc()_ from the application
only ends the application's use of the endpoint; the IRMd keeps the
endpoint alive until it has given the IPCP the OK to also deallocate
the flow and the IPCP has signalled that it is done with it. Usually,
if all goes well, the application deallocates the flow first.

The IRMd also monitors all O7s processes. If it detects an application
or an IPCP crashing, it will automatically perform that process's half
of the flow deallocation, but not the complete deallocation. If an
IPCP crashes, applications still hold the FRCP state and can recover
the connection over a different flow[^6].

**Edit: the below section is not correct, but it's interesting to read
anyway**[^7]. There is a new post, documenting the
[actual implementation](/blog/2022/02/28/application-level-flow-liveness-monitoring/).

So, now it should be clear that the liveness of a flow has to be
detected in the flow allocator of the IPCPs, not in the application
(again, reminder: FRCP state is maintained inside the application).
The IPCP will detect that its flow has been deallocated locally
(either intentionally or because of a crash). It's paramount to do it
here, because of the recursive nature of the network. Flows are
everywhere, also between "router machines"! Routers usually restrict
themselves to raw flows: no retransmissions, no flow control, no fuss;
that's all too expensive to perform at high rates. But they do need to
be able to detect links going down. In IP networks, the OSPF protocol
may use something like Bidirectional Forwarding Detection (BFD) to
detect failed adjacencies. And then applications may use TCP keepalive
and FIN. Or HTTP keepalive. All unneeded functional duplication,
symptoms of a messy architecture, at least in my book. In Ouroboros,
this flow liveness check is implemented once, in the flow allocator.
It is the only place in the Ouroboros system where liveness checks are
needed. It handles failed allocation, broken connections, and
terminated or crashed applications. Clean. Shipshape. Nice and tidy.
Spick and span. We call it Flow Liveness Monitoring (FLM).

If I recall correctly, we implemented an FLM in the RINA/IRATI flow
allocator years ago, when we were working on PRISTINE and trying to
get loop-free alternate (LFA) routes working, which required detecting
flows going down. In Ouroboros it is not implemented yet. Maybe I'll
add it in the near future. Time is in short supply; the items on my
todo list are not.

## Flows vs connections, a "layered" view

To wrap it up, I tried to represent how O7s functionality is organized
in a way similar to the OSI/TCP models. I omitted the "physical
layer", which is handled by dedicated IPCP implementations, such as
the ipcpd-local, ipcpd-eth, etc.; it's not that important here. What
is important is that O7s splits the functionality that TCP/IP puts in
two layers (L3/L4) into **3 independent layers**[^8] (and protocols).
Let's go through O7s from bottom to top.

{{<figure width="80%" src="/blog/20211229-oecho-5.png">}}

The network forwarding layer moves packets between (unicast) IPCP
data transfer components (the forwarding elements in the model).

The network end-to-end layer does flow liveness monitoring (the FLM
explained in this post) and also congestion control/avoidance
(preventing applications from sending more traffic than the network
can handle). The lifetime of a flow starts at flow allocation, and
ends when one of the peers deallocates the flow or crashes (or an
IPCP at the client or server crashes).

The application end-to-end layer does flow control (preventing client
applications from sending more than the server application can
handle) and reliability (taken care of by FRCP). Integrity (e.g. a
Cyclic Redundancy Check to detect packet corruption), authentication
and encryption are also handled here. Each of these functions can be
enabled/disabled independently (and is derived from the QoS
specification passed to the _flow\_alloc()_ call). In essence the
lifetime of an FRCP connection is _infinite_ (see Watson's Delta-t
paper if this sounds weird), but FRCP is subdivided into "data runs".
A failure of a data run (i.e. an FRCP connection record times out
with unacknowledged packets) is the only thing that causes an FRCP
connection to terminate. It is up to the application how to deal with
this. An FRCP connection can last at most as long as an application
flow. It can potentially recover from IPCP crashes, but not from
application crashes.

Finally, the application session layer takes care of establishing,
maintaining, synchronizing and terminating application
sessions. Application sessions can be shorter, as long as, or longer
than the duration of an application flow.

Probably long enough for a blog post. Have yourselves a wonderful new
year, and above all, stay curious!

Dimitri

[^1]: We are omitting the role of the Ouroboros daemons (IPCPd's and
      IRMd) for now. There would be a name resolution step for "oecho"
      to an address in the IPCPds. Also, the IRMd at the server side
      brokers the flow allocation request to a valid oecho server. If
      the server is not running when the flow allocation request
      arrives at the IRMd, O7s can also start the oecho server
      application _in response_ to a flow allocation request. But
      going into those details is not needed for this discussion. We
      focus solely on the application perspective here.

[^2]: Flow allocation has no direct analogue in TCP or UDP, where the
      protocol to be used and the destination port are known in
      advance. In any case, flow allocation should not be confused
      with a TCP 3-way handshake.

[^3]: I will probably do another post on how flow allocation deals
      with lost messages, as it is also an interesting subject.

[^4]: Or even more bluntly tell me to "just use TCP instead of FRCP".

[^5]: A UDP server that has clients exit or crash is also left to its
      own devices to clean up the state associated with that UDP
      socket.

[^6]: This has not been implemented yet, and should make for a nice
      demo.

[^7]: After implementing the solution below, it became apparent to me
      that something was off. I needed to leak the FRCP timeout into
      the IPCP, which is a layer violation. I noted this fact in my
      commit message, but after more thought, I decided to retract my
      patch... it just couldn't be right. This layer violation didn't
      come up when we implemented FLM in the RINA flow allocator,
      because RINA puts the whole retransmission logic (called DTCP)
      in the IPCP.

[^8]: The "recursive layer boundary" in the figure uses the word layer
      in the sense of a RINA DIF. We didn't adopt the terminology DIF,
      since it has special meaning in RINA, and O7s' recursive layers
      are not interchangeable or compatible with RINA DIFs.