dansup

So the recent pixelfed.social outage was caused by the app server not being able to boot after a restart

I couldn't even ssh in, so I had to use recovery mode, where I was presented with this error message

Girl, the kernel wasn't the only thing panicking...

12 comments
Григорий Клюшников

looks like a dying hard drive or a very corrupted file system

dansup

@grishka it was a corrupted fs, but I fixed it!
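
(For context, "fixing a corrupted fs" from recovery mode usually boils down to running fsck against the unmounted disk. A minimal sketch, wrapped in Python only so the steps can be annotated; the device name is an assumption, not what dansup actually ran.)

import subprocess

DEVICE = "/dev/vda1"  # assumption: root partition on a typical DO droplet; check lsblk first

# -f forces a check even if the filesystem is marked clean, -y answers
# "yes" to every repair prompt. Only safe while the device is unmounted,
# i.e. from recovery/rescue mode, never on the running system.
result = subprocess.run(["fsck", "-fy", DEVICE])

# e2fsck exit codes: 0 = clean, 1 = errors were corrected,
# 2 = corrected but a reboot is needed, 4 and up = errors remain
print("fsck exited with", result.returncode)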

Ariadne Conill 🐰

@dansup why only one physical server for such a large instance?

dansup

@ariadne Good question! We do have a separate DB server, and once I finish migrating old local media to S3, I can add multiple app servers and a load balancer.

It do be kinda impressive that such a large fedi instance (using php) can scale this much on a DO VPS!
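
(As an illustration, that media migration could look roughly like the sketch below: walk the local storage directory and upload each file under the same relative key, so existing URLs keep working. The bucket name, local path, and use of boto3 are assumptions, not Pixelfed's actual tooling, which is PHP.)

import os
import boto3

BUCKET = "pixelfed-media"        # placeholder bucket name
LOCAL_ROOT = "/var/www/storage"  # placeholder path to local media

s3 = boto3.client("s3")

for dirpath, _dirs, files in os.walk(LOCAL_ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        # reuse the relative path as the S3 key so media URLs stay stable
        key = os.path.relpath(path, LOCAL_ROOT)
        s3.upload_file(path, BUCKET, key)
        print("uploaded", key)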

dansup

After some digging, I was able to debug the issue and work on a fix thanks to ChatGPT

Seriously, ChatGPT is a lifesaver. I'm not an ops person, but I was able to use the tips it provided to diagnose the issue, and after learning a bit about GRUB and shit I was able to fix it properly.
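
(The usual GRUB-side fix in this situation is regenerating the boot config and reinstalling the boot loader from a chroot inside recovery mode. A hedged sketch using the standard Debian/Ubuntu commands; whether this matches dansup's exact fix is a guess, and the target disk is a placeholder.)

import subprocess

# rebuild /boot/grub/grub.cfg so it points at a kernel that actually exists
subprocess.run(["update-grub"], check=True)

# reinstall the boot loader itself (device is a placeholder, not the real disk)
subprocess.run(["grub-install", "/dev/vda"], check=True)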

I'll be paying more attention to kernel updates and implementing a new update procedure + adding another standby app server to prevent this in the future.
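
(One piece such an update procedure might include, sketched under assumptions: after rebooting into a new kernel, poll the SSH port until the box answers, and treat silence as "go straight to recovery mode". Host and timings are placeholders.)

import socket
import time

HOST = "app1.example.com"  # placeholder, not the real server
PORT = 22                  # if sshd answers, the box survived the reboot

deadline = time.time() + 300  # give it five minutes to come back
while time.time() < deadline:
    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            print("host is back up")
            break
    except OSError:
        time.sleep(10)
else:
    print("host never came back, time for recovery mode")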

Running prod services is fun, until it isn't

Aprazeth

@dansup Amazing stuff, thank you for sharing the "behind the scenes".

Perhaps I am missing a piece of information (and that's on me), but do you perhaps have a staging or test server/environment set up? As in, a separate server/instance that gets OS/docker/whatever underlying system updates first, prior to them going on the live one?

If not, it might be something to look into. Having a staging environment can help catch these kinds of things. That said, it will cost some time and thus money :-/

Aprazeth

@dansup Alternatively, the standby server is also a pretty good idea (but keeping them in sync in terms of everything bar data can be a handful). You'd also have to pick a time after the changes were applied on the live server for them to also go on the standby server (say 1 or 2 days later).

That all said, I'm just some rando on the internet and you're the one in the trenches there fixing stuff. Hope my ramblings might be of some use, if not, I still appreciate your time and the openness :)

Radieschen

@aprazeth @dansup not having a staging environment also costs time and money.

Aprazeth

@radieschen @dansup absolutely, but having worked in environments/organisations where time/money/resources are tight, it can be a decision that unfortunately gets made.

Which is why I'd rather mention as many options as possible, so a solution that fits the situation/budget can be chosen. Designing the ultimate solution is far simpler when you don't have those restrictions, but sometimes "we gotta do the best we can with what we have".

utzer [Pleroma]
@dansup everyone knows this; a boot loader issue and/or kernel panic will break your legs every now and then. Local access to the server or rescue mode is your friend in such cases.

So, a pat on the back for your work.