tef

ps blaming "human error" for catastrophic systems failures is like blaming gravity for buildings falling over

16 comments
Ben Cox

@tef Reliability engineering involves a lot of repeatedly asking the question "and what if THAT fails?"

tef

@ben not sure i agree, taking a reductionist approach to the problem tends to lead to devs obsessing over specific examples of problems over seeing the broader picture of risk management and operating within safety margins

Ben Cox

@tef It depends on where your focus is I guess. If you're in an industry where a mistake means someone gets killed, you do have to get down into those weeds.

tef

@ben reliability engineering is coping in the face of unknown errors, and your process is about coping with predicted risk

focusing on the accidents you know is missing out on the broader systematic process of managing risk

that's my point

tef

@ben the other, perhaps more salient point here, is that when you take the approach of going "ahahah! what if!" consistently in risk management you inevitably set up an antagonistic relationship with the people you're trying to help

it's alarm fatigue and everything you say sounds like "what if the sun collapses!"

there's always a point in which you throw your hands up and declare an act of god

or you end up with the ye old example of a reinforced door bolted to walls made of plasterboard

Ben Cox

@tef If you assume competence on the part of the safety engineers, it doesn't have to be that way.

You know nothing about my process other than a throwaway quip I made.

Ben Cox

@tef I guess you've misunderstood mine.

The Janx Devil

@ben It also involves frequently asking “Are the people doing the reliability engineering work getting enough sleep on the regular?”

Michael Gemar

@tef @mawhrin But it’s caused by *humans* failing to do stuff properly. Thinking of it like gravity implies there is nothing that could have been done.

tef

@michaelgemar @mawhrin

i am glad you don't work in the construction industry, as we'd all be camping outside

tef

@michaelgemar @mawhrin

the point you're missing here is that "we build processes that incorporate human error as a given, or they collapse, much like how we build literal buildings knowing full well we are subject to the laws of gravity"

tef

@michaelgemar @mawhrin

your point of "ah, but isn't it always human error!" is missing how human error is used to excuse systematic failures of risk management—the load bearing word here is "blaming"

your processes should account for "an intern pressed the wrong key because their manager threatened to fire them"

that's why we have things like building codes, building inspections, certified professionals involved, rather than simply being "someone forgot a brick lol, never mind"

Michael Gemar

@tef @mawhrin I don’t think we ultimately disagree.

Michael Gemar

@tef @mawhrin I agree, but it’s also us *building* the processes — they’re not just natural forces. If we fail to build resilient processes, that in itself is human error.

Fred Hebert

@michaelgemar @tef @mawhrin
Why are people shipping the code "causing" more outages than the people signing the checks? There are systemic and situational elements in play.

The lens of human error stalls investigations and prevents learning.

flere-imsaho

@michaelgemar @tef there's a reason the process looks like this and it's not because people don't know the risks of such deployments.

safe software engineering practices are not cost-effective. (until today happens.)
