What is there to learn from oping about flow liveness monitoring?
Cleaning up flows
While I was browsing through some oping code (trying to get a feeling about how to do broadcast), I stumbled about the cleaner thread. As we can see, it was used to clean up ‘stale’ flows (sanitized):
void * cleaner_thread(void * o)
{
int deadline_ms = 10000;
while (true) {
for (/* all active flows i */) {
diff = /* diff in ms between last valid ping packet and now */;
if (diff > deadline_ms) {
printf("Flow %d timed out.\n", i);
flow_dealloc(i);
}
}
sleep(1);
}
}
But we have since version 19.x flow liveness monitoring (FLM), which does this for us! So all this code could be thrown away, right?
Turns out I was semi-wrong! It’s all about semantics, or ‘what do you want to achieve’.
If this thread was there for cleaning up flows from which the peers stopped their flow (and stopped sending keep-alives), then we could throw it away by all means! Because FLM does that job.
Or was it there to clean up valid flows, but from which the peers didn’t send any ping packets anymore (they do send keep-alives, otherwise FLM kicks in)? Then we should of course keep it, because this is a server-side decision to cut those peers off. This might protect for example against client implementations which connect, send a few pings, but then leave the flow open. Or a better illustration of the ‘cleaner’ thread might be to cut off peers after a 100 pings, showing that this decision to ‘clean up’ has nothing to do with flow timeouts.
Keeping timed-out flows
On the other side of the spectrum, we have those flows that are timing out (no keep-alives are coming in anymore). This is my proposal for the server side parsing of messages:
while(/* get next fd on which an event happened */) {
msg_len = flow_read(fd, buf, OPING_BUF_SIZE);
if (msg_len < 0) {
/* if-statement is the only difference with before */
if (msg_len == -EFLOWPEER) {
fset_del(server.flows, fd);
flow_dealloc(fd);
}
continue;
}
/* continue with parsing and responding */
}
We can see here that the decision is taken to ‘clean up’ (= flow_dealloc
) those flows that are timing out.
But, as we can see, it’s an application decision!
We might as well decide to keep it open for another 10 min to see if the client (or the network in between) recovers from interruptions, e.g..
We might for example use this mechanism to show to the user that the peer seems to be down1 and even take measures (like saving or removing state), but also allow to just wait until the peer is live again.
Conclusion
As an application, you have total freedom (and responsibility) over your flows. Ouroboros will only inform you that your flow is timing out (and your peer thus appears to be down), but it’s up to you to decide if you deallocate your side of the flow and when.
Excited for my first blog post & always learning,
Thijs
- I’m thinking about things like the Overleaf banner:
Lost Connection. Reconnecting in 2 secs. Try Now
[return]