After some digging, I was able to debug the issue and work on a fix thanks to ChatGPT.
Seriously, ChatGPT is a lifesaver. I'm not an ops person, but I was able to use the tips it provided to diagnose the issue, and after learning a bit about GRUB and shit I was able to fix it properly.
I'll be paying more attention to kernel updates and implementing a new update procedure + adding another standby app server to prevent this in the future.
Running prod services is fun, until it isn't.
@dansup Amazing stuff, thank you for sharing the "behind the scenes".
Perhaps I am missing a piece of information (and that's on me), but do you have a staging or test server/environment set up? As in, a separate server/instance that gets OS/Docker/whatever underlying system updates first, before they go on the live one?
If not, it might be something to look into. Having a staging environment can help catch these kinds of things. That said, it will cost some time and thus money :-/