dansup

So the recent pixelfed.social outage was caused by the app server not being able to boot after a restart

I couldn't even ssh in, so I had to use recovery mode, where I was presented with this error message

Girl, the kernel wasn't the only thing panicking...

12 comments
Григорий Клюшников

looks like a dying hard drive or a very corrupted file system

dansup

@grishka it was a corrupted fs, but I fixed it!
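
(For context, "fixing a corrupted fs" from recovery mode usually boils down to running fsck against the unmounted disk. A minimal sketch, wrapped in Python only so the steps can be annotated; the device name is an assumption, not what dansup actually ran.)

import subprocess

DEVICE = "/dev/vda1"  # assumption: root partition on a typical DO droplet; check lsblk first

# -f forces a check even if the filesystem is marked clean, -y answers
# "yes" to every repair prompt. Only safe while the device is unmounted,
# i.e. from recovery/rescue mode, never on the running system.
result = subprocess.run(["fsck", "-fy", DEVICE])

# e2fsck exit codes: 0 = clean, 1 = errors were corrected,
# 2 = corrected but a reboot is needed, 4 and up = errors remain
print("fsck exited with", result.returncode)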

Ariadne Conill 🐰

@dansup why only one physical server for such a large instance?

dansup

@ariadne Good question! We do have a separate DB server, and once I finish migrating old local media to S3, I can add multiple app servers and a load balancer.

It do be kinda impressive that such a large fedi instance (using php) can scale this much on a DO VPS!
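
(As an illustration, that media migration could look roughly like the sketch below: walk the local storage directory and upload each file under the same relative key, so existing URLs keep working. The bucket name, local path, and use of boto3 are assumptions, not Pixelfed's actual tooling, which is PHP.)

import os
import boto3

BUCKET = "pixelfed-media"        # placeholder bucket name
LOCAL_ROOT = "/var/www/storage"  # placeholder path to local media

s3 = boto3.client("s3")

for dirpath, _dirs, files in os.walk(LOCAL_ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        # reuse the relative path as the S3 key so media URLs stay stable
        key = os.path.relpath(path, LOCAL_ROOT)
        s3.upload_file(path, BUCKET, key)
        print("uploaded", key)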

dansup

After some digging, I was able to debug the issue and work on a fix thanks to ChatGPT

Seriously, ChatGPT is a lifesaver. I'm not an ops person, but I was able to use the tips it provided to diagnose the issue, and after learning a bit about GRUB and shit I was able to fix it properly.
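
(The usual GRUB-side fix in this situation is regenerating the boot config and reinstalling the boot loader from a chroot inside recovery mode. A hedged sketch using the standard Debian/Ubuntu commands; whether this matches dansup's exact fix is a guess, and the target disk is a placeholder.)

import subprocess

# rebuild /boot/grub/grub.cfg so it points at a kernel that actually exists
subprocess.run(["update-grub"], check=True)

# reinstall the boot loader itself (device is a placeholder, not the real disk)
subprocess.run(["grub-install", "/dev/vda"], check=True)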

I'll be paying more attention to kernel updates and implementing a new update procedure + adding another standby app server to prevent this in the future.
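
(One piece such an update procedure might include, sketched under assumptions: after rebooting into a new kernel, poll the SSH port until the box answers, and treat silence as "go straight to recovery mode". Host and timings are placeholders.)

import socket
import time

HOST = "app1.example.com"  # placeholder, not the real server
PORT = 22                  # if sshd answers, the box survived the reboot

deadline = time.time() + 300  # give it five minutes to come back
while time.time() < deadline:
    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            print("host is back up")
            break
    except OSError:
        time.sleep(10)
else:
    print("host never came back, time for recovery mode")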

Running prod services is fun, until it isn't

Aprazeth

@dansup Amazing stuff, thank you for sharing the "behind the scenes".

Perhaps I am missing a piece of information (and that's on me), but do you perhaps have a staging or test server/environment set up? As in, a separate server/instance that gets OS/docker/whatever underlying system updates first, prior to them going on the live one?

If not, it might be something to look into. Having a staging environment can help catch these kinds of things. That said, it will cost some time and thus money :-/

Aprazeth

@dansup Alternatively, the standby server is also a pretty good idea (but keeping them in sync in terms of everything bar data can be a handful). You'd also have to pick a time after the changes were applied on the live server for them to also go on the standby server (say 1 or 2 days later).

That all said, I'm just some rando on the internet and you're the one in the trenches there fixing stuff. Hope my ramblings might be of some use, if not, I still appreciate your time and the openness :)

Radieschen

@aprazeth @dansup not having a staging environment also costs time and money.

Aprazeth

@radieschen @dansup absolutely, but having worked in environments/organisations where time/money/resources are tight, it can be a decision that unfortunately gets made.

Which is why I'd rather mention as many options as possible, so a solution that fits the situation/budget can be chosen. Designing the ultimate solution is far simpler when you don't have those restrictions, but sometimes "we gotta do the best we can with what we have".

utzer [Pleroma]
@dansup everyone knows this; a boot loader issue and/or kernel panic will break your legs every now and then. Local access to the server or rescue mode is your friend in such cases.

So, a pat on the back for your work.