@dansup Amazing stuff, thank you for sharing the "behind the scenes".
Perhaps I am missing a piece of information (and that's on me), but do you perhaps have a staging or test server/environment set up? As in, a separate server/instance that gets OS/Docker/whatever underlying system updates first, before they go on the live one?
If not, it might be something to look into. Having a staging environment can help catch these kinds of things. That said, it will cost some time and thus money :-/
@dansup Alternatively, the standby server is also a pretty good idea (but keeping them in sync in terms of everything apart from data can be a handful). You'd also have to pick a delay between changes being applied on the live server and the same changes going on the standby server (say 1 or 2 days).
That all said, I'm just some rando on the internet and you're the one in the trenches there fixing stuff. Hope my ramblings might be of some use, if not, I still appreciate your time and the openness :)