---
date: 2022-05-20
title: "What is there to learn from oping about flow liveness monitoring?"
linkTitle: "cleaning up flows"
author: Thijs Paelman
---

### Cleaning up flows

While I was browsing through some oping code
(trying to get a feel for how to do [broadcast](https://ouroboros.rocks/blog/2021/04/02/how-does-ouroboros-do-anycast-and-multicast/#broadcast)),
I stumbled upon the [cleaner thread](https://ouroboros.rocks/cgit/ouroboros/tree/src/tools/oping/oping_server.c?id=bec8f9ac7d6ebefbce6bd4c882c0f9616f561f1c#n54).
As we can see, it was used to clean up 'stale' flows (sanitized):

```C
void * cleaner_thread(void * o)
{
        int deadline_ms = 10000;
        int diff;

        while (true) {
                for (/* all active flows i */) {

                        diff = /* diff in ms between last valid ping packet and now */;

                        if (diff > deadline_ms) {
                                printf("Flow %d timed out.\n", i);
                                flow_dealloc(i);
                        }
                }
                sleep(1);
        }
}
```

But since version 19.x we have flow liveness monitoring (FLM), which does this for us!
So all this code could be thrown away, right?

Turns out I was semi-wrong!
It's all about semantics, or 'what do you want to achieve'.

If this thread was there to clean up flows whose peers abandoned them (and stopped sending keep-alives),
then we can throw it away by all means, because FLM does that job.

Or was it there to clean up flows that are still alive, but whose peers stopped sending ping packets (they *do* still send keep-alives, otherwise FLM kicks in)?
Then we should of course keep it, because this is a server-side decision to cut those peers off.
This might for instance protect against client implementations that connect, send a few pings, and then leave the flow open.
An even better illustration of the 'cleaner' thread might be to cut off peers after 100 pings,
showing that this decision to 'clean up' has nothing to do with flow timeouts.
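
Just to illustrate, here is a minimal sketch of that last variant. The per-flow counter, the `OPING_MAX_FLOWS` bound and the `check_ping_budget()` hook are hypothetical names I made up for this post; only `flow_dealloc()` comes from the code above. The point is merely that this cut-off is pure server policy, independent of any timeout:

```C
#include <stdio.h>

#define MAX_PINGS        100    /* server policy: serve at most 100 pings */
#define OPING_MAX_FLOWS  256    /* hypothetical bound on concurrent flows */

/* Hypothetical per-flow counter, incremented for every valid ping. */
static unsigned pings_served[OPING_MAX_FLOWS];

/* Called by the server after it answered a valid ping on flow fd. */
static void check_ping_budget(int fd)
{
        if (++pings_served[fd] < MAX_PINGS)
                return;

        printf("Flow %d served %u pings, cutting it off.\n",
               fd, pings_served[fd]);
        flow_dealloc(fd);
        pings_served[fd] = 0;
}
```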

### Keeping timed-out flows

At the other end of the spectrum, we have the flows that are timing out (no keep-alives are coming in anymore).
This is my proposal for the server-side parsing of messages:

```C
while (/* get next fd on which an event happened */) {
        msg_len = flow_read(fd, buf, OPING_BUF_SIZE);
        if (msg_len < 0) {
                /* if-statement is the only difference with before */
                if (msg_len == -EFLOWPEER) {
                        fset_del(server.flows, fd);
                        flow_dealloc(fd);
                }
                continue;
        }
        /* continue with parsing and responding */
}
```

Here the decision is taken to 'clean up' (= `flow_dealloc`) those flows that are timing out.
But it's an application decision!
We might just as well decide to keep such a flow open for another 10 minutes, for example to see if the client (or the network in between) recovers from an interruption.
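
A rough sketch of that alternative, assuming a hypothetical `peer_down_since[]` table (and the same made-up `OPING_MAX_FLOWS` bound as before): on `-EFLOWPEER` we only note the time, and an occasional sweep deallocates the flow if the peer hasn't recovered within the grace period.

```C
#include <time.h>               /* time(), time_t */

#define GRACE_S (10 * 60)       /* keep a timed-out flow around for 10 min */

/* Hypothetical bookkeeping: 0 means the peer is (still) considered alive. */
static time_t peer_down_since[OPING_MAX_FLOWS];

/* Called instead of an immediate flow_dealloc() when flow_read()
   returns -EFLOWPEER on fd; a later successful read on fd should
   reset peer_down_since[fd] back to 0. */
static void note_peer_down(int fd)
{
        if (peer_down_since[fd] == 0)
                peer_down_since[fd] = time(NULL);
}

/* Periodic sweep (e.g. from a cleaner-style thread): give the peer or
   the network GRACE_S seconds to recover before giving up on the flow. */
static void reap_dead_flows(void)
{
        int i;

        for (i = 0; i < OPING_MAX_FLOWS; ++i) {
                if (peer_down_since[i] == 0)
                        continue;
                if (time(NULL) - peer_down_since[i] > GRACE_S) {
                        fset_del(server.flows, i);
                        flow_dealloc(i);
                        peer_down_since[i] = 0;
                }
        }
}
```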

### Asymmetrical QoS
There will probably be some more [discussion](https://ouroboros.rocks/community/)
on asymmetric QoS[^1], but here we're only talking about asymmetric FLM.

When I was working on the server code, I tried to set an FLM timeout of 10 seconds on the server side,
so that it would time out (and [remove](#keeping-timed-out-flows))
flows from unresponsive clients.
I then discovered that `flow_accept(qosspec_t * qs, const struct timespec * timeo)` only writes to `qs`;
it does not read the server's QoS expectations from it.
Thus, at the moment, only the client sets the QoS (if the layer can provide it).
This will change, so that the server sends its QoS to the client too.
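
For reference, a sketch of the accept call as it works today; the header and the `timeout` field name on `qosspec_t` are assumptions of mine for illustration. Note that `qs` comes back *filled in* with what the client asked for (if the layer could give it), it does not carry the server's expectations in:

```C
#include <ouroboros/dev.h>      /* flow_accept(), flow_dealloc(), ...  */
#include <stdio.h>
#include <time.h>

static int accept_one(void)
{
        qosspec_t       qs;                            /* output only, today */
        struct timespec timeo = {.tv_sec = 10, .tv_nsec = 0};
        int             fd;

        fd = flow_accept(&qs, &timeo);
        if (fd < 0)
                return fd;

        /* qs now holds the QoS the client requested; 'timeout' is an
           assumed field name for the FLM deadline in ms. */
        printf("New flow %d, peer FLM timeout: %u ms.\n",
               fd, (unsigned) qs.timeout);

        return fd;
}
```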

When we're talking about FLM, this might result in the server saying something like:
> I don't trust you, I'm gone if I don't hear from you in 4 minutes, and if you want to wait for me for 2 days, suits me just fine

where the server sets an FLM timeout of 4 minutes and the client sets one of 2 days.  
In this scenario, the client has to send a keep-alive every minute[^2] to keep the server interested ([unless it cleans up anyway](#cleaning-up-flows)).
The server, on the other hand, only needs to send a keep-alive every 12 hours to keep the flow open (assuming no other traffic at all).

There might be cases where you want to synchronise this timeout (for instance by taking the smallest value),
but it still needs to be determined whether this should be done at the application level or by Ouroboros, and if so, how.
For a 'raw' flow, we go with the 'none-of-your-business, just send in time' principle.

### Conclusion

As an application, you have total freedom (and responsibility) over your flows.
Ouroboros will only inform you that your flow is timing out (and that your peer thus appears to be down),
but it's up to you to decide whether and when you deallocate your side of the flow.  
Asymmetric QoS will need more discussion and experimenting before we know what the best approach is.
Do we need reconciliation algorithms, and between which parties do they operate (what the application wants, what the layer below can deliver, and what the peer can promise)?

Excited for my first blog post & always learning,

Thijs

[^1]: like wireless links, with a potentially different BER or loss in each direction,
      or even IPsec, where you can have encryption in one direction but not the other

[^2]: fixed at 1/4 of the time-out period at the moment, see [previous post](https://ouroboros.rocks/blog/2022/02/28/application-level-flow-liveness-monitoring/)