@thegibson spent til 2am on phone with vendor for a 2.5PB HPC storage system, reading kernel panics and module code. Clients and servers had gotten to a state where the metadata servers just boot-looped, blocking all work.
Eventually found complex workaround to use til bug was fixed by vendor. Got nice amount of recognition for my efforts and skills...
Next day approved & applied a change that obviously & trivially brought it down again, costing 100s of scientists 10s of kCPU-hours of work.
@thegibson foolishly I stayed in that HPC job for a while longer.
Somehow (despite general lack of networking background) was made responsible for overseeing a wholesale re-wire of the InfiniBand network for expansion of a cluster.
Rats' nests of cables... my oversight failed to catch a label with '6' being read as '9' by a colleague. Didn't have a downtime impact, but it took literally weeks for someone else to find and correct the performance issues that resulted. Unmanaged IB switches suck.