So, what happened: one of the web servers (host 2) was struggling, and I decided to upgrade it the same way I upgraded host 1 yesterday: https://mastodon.social/@mastohost/109292313550726464
Yesterday it was really smooth and the upgrade to the cloud instance only caused 90 seconds of downtime.
Today, everything went wrong:
- the network interface for the private IP changed
- when I changed the configuration to the new network interface and restarted the network, the instance just stopped responding
So, I got on the phone with OVH to debug the problem, and we were able to bring the instance back online.
This left the service partially down for a little over an hour (partially, because host 1 was still running for anyone who was using that DNS configuration).
Really sorry about the trouble this might have caused.
Traffic grew 10x in one week, and it was impossible for me to predict this (or afford to scale for it) before it happened.
Hope you understand.