Hector Martin

Ah yes, let's ship a kernel driver that parses update files that are pushed globally simultaneously to millions of users without progressive staging, and let's write it in a memory unsafe language so it crashes if an update is malformed, and let's have no automated boot recovery mechanism to disable things after a few failed boots. What could possibly go wrong?

🤦‍♂️

108 comments
gaytabase

@marcan they are about security, not reliability

syn

@dysfun @marcan what's more secure than a bricked computer after all

Wilfried Klaebe

I assume you're all joking, but please, use the terms correctly!

Off the top of my head, "IT security" means "ensuring confidentiality, integrity AND availability", and a bricked computer ticks at most two of those boxes.

@syn @dysfun @marcan

DaCool

@wonka I think they were just joking, nothing deeper than that.

Hector Martin

@dysfun Reminds me of gr"let's crash on integer overflows that aren't a security bug, and then let's try to fix one such overflow with a hilariously broken obviously unreviewed patch that instead of working around it replaced it with an actual overflow bug that still crashed, thus creating a local kernel panic DoS that anyone can trigger with a shell one-liner, also we don't count DoSes as CVEs so don't bother responsibly disclosing this but we're going to flame you on Twitter and embarrass ourselves so bad we end up deleting our Twitter account but at least we banned your dynamic IP address from our website and forum, take that!!!!!"security.

(Yes, this really happened after I crashed my grsecurity kernel Gentoo box years ago by pasting too much text into a terminal, then tweeted a repro. I stopped using grsecurity after that.)

reddit.com/r/programming/comme

gaytabase

@marcan that doesn't surprise me tbh, gibson is an arse

Graham Spookyland🎃/Polynomial

@dysfun @marcan that's GRC, not grsec (we're collectively bad at naming things)

Graham Spookyland🎃/Polynomial

@marcan @dysfun yeah Steve Gibson is the guy who looks like a vacuum cleaner salesman that makes snakeoil disk recovery software under the name "GRC" (and also cohosts a podcast), whereas Brad Spengler is the grsecurity guy who had a meltdown on Twitter.

Andrew Zonenberg

@gsuberland @marcan @dysfun Lol I knew Steve was nuts and full of snake oil but this is the first I've heard the vacuum cleaner line.

Bornach

@azonenberg @gsuberland @marcan @dysfun
Not seen many vacuum cleaner salesmen to be able to make a judgement but I can picture Steve Gibson being skilled at it
grc.com/pdp-8/deepthought-sbc.
On his GRC site, SG walks the viewer through the features of his "blinkenlights" program for a PDP-8 emulator

JaxxAI

@dysfun @marcan Availability is literally one of the three pillars of information security, also known as the CIA triad, along with confidentiality and integrity. A lack of reliability leads to unavailability and I now feel like I'm turning into Infosec Yoda.

April Phoenix

@marcan but you gotta understand antivirus is important and you can't wait even an hour for an update to roll out, it has to happen instantly /s 🦋

Chairmander

@marcan The thing that surprises me the most about this situation: How did something like this not happen waaaayyy sooner? This seems so incredibly fragile, how did it hold up for so long.

Michael Kohne

@suschi @l_prod Actually, I'd bet their engineers are probably pretty good, which is why they've gotten away with whatever hole in their process let this through for so long.

Cycling_Liz

@marcan I have no idea what any of that means, but I'm glad you and others understand it!

James Calligeros

@marcan the amount of enterprise software that does this - especially locally installed subscription-licensed software - is actually incredible. not all of it is as invasive as your crowdstrikes and beyondtrusts but damn near all of it is mission-critical in some way, shape or form, and has absolutely zero respect for the customer's change management process or the fact that the customer's machines are not in fact theirs to do with as they please. a customer of ours was unable to let me complete an upgrade of our software on their server as their pci dss compliance malware could not be disabled once installed without nuking it off the host machine (which due to how invasive it is, involves reimaging the machine entirely)

our helpdesk was able to continue operating uninterrupted all of this afternoon and evening because we simply do not install software on our production machines that does not respect our ownership of said machines (apart from windows server itself as the itsm tool we use is an asp dot net thing)

Javier

@marcan the language has nothing to do with it. It's a piece of crap code, plain and simple.

It would never have passed even a cursory code review if it were a bit more open. The only reason it's widely deployed is because it's mandated by committees that don't care how it works

Hector Martin

@javierg The language matters because segfaulting on invalid input only happens in memory unsafe languages. On a memory safe language you generally have to make a conscious decision about how to handle errors and unexpected situations.

Javier

@marcan and that's exactly how this kind of quality-free code is written: assuming nothing wrong ever happens. In "memory safe" languages it's the "reliably crash" that would stay in the code because nobody cares to check if it's replaced with actual error handling.

Hector Martin

@javierg At least with a memory-safe language someone had to make an *active decision* to reliably crash (making this something solvable by policy, e.g. ban such constructs in the linter), as opposed to no decision at all (which is impossible to protect against or have processes that forbid, once you're using a memory unsafe language).

Henri

@marcan @javierg if they used Rust they would just put unsafe everywhere, c’mon you know this.

soc

@slyecho @marcan @javierg With which part of

> something solvable by policy, e.g. ban such constructs

are you struggling?

Henri

@soc @marcan @javierg Yeah, I work in corporate software development, we have all kinds of rules, guidelines, code review at least by 2 persons, SonarQube and still a lot of crap gets through

Javier

@marcan
that's too hopeful. in this case it seems the bug was in the parser; evidently it's a codepath that has never been tested. thinking that any linter or development tool would "fix" that presumes a lot more discipline than what passes as "professional" at that kind of company.

the problem is their "success" in secrecy. for anything security- or management-related that's the perfect recipe for failure.

no tool can help those who don't have to do a good job to profit.

Esparta :ruby:

@marcan @javierg

re:

> At least with a memory-safe language someone had to make an *active decision* to reliably crash (making this something solvable by policy, e.g. ban such constructs in the linter),

I've seen entire teams making conscious, active decisions to break things to save their ass or the corporate reputation - if any.

I agree, it's way better if the programming language has all the controls and tries its best to avoid unconscious bad decisions.

feld

@marcan @javierg

> The language matters because segfaulting on invalid input only happens in memory unsafe languages.

but the error is PAGE_FAULT_IN_NONPAGED_AREA

Code you write does not get to handle this error gracefully. This is the kernel shooting it in the head. This is not something Rust magically solves. I literally reported an issue a couple weeks ago to a Rust program that was having this type of problem on FreeBSD

pid 10464 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)
pid 14270 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)
pid 16531 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)
pid 19441 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)

Orca🌻 | 🏴🏳️‍⚧️

@marcan@social.treehouse.systems

> ... ship a kernel driver that parses update files that ...
Fainted on-site, someone call an ambulance? ​:nkocampfiredrink:​

Raul

@marcan What a chain of dangerous/bad-practice deployment choices, no rollback options, + likely poor/incomplete testing.

I bet systems with Crowdstrike + Bitlocker on will cause some major headaches

Raul

@knightlie @marcan

Dave Farley (author of the "Continuous Delivery", and "Modern Software Engineering" books) highlights in this video four questions that ought to have mitigated the impact of this failed update, or rather, ought to help mitigate similar incidents in the future:

1. Why wasn't this caught by Testing?

2. Why didn't they use Canary Releasing?

3. Why didn't CrowdStrike have proper Observability of this as it happened?

4. Why no Rollback planned within the change?

youtu.be/MwjQVAwIATE?feature=s
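Question 2 (canary releasing) can be made concrete with a small sketch. The following is only an illustration, not how CrowdStrike or any real vendor stages updates: the function names, the machine ID strings, and the percentage scheme are all assumptions. Each machine is hashed into one of 100 stable buckets, and an update is offered only to machines whose bucket falls below the current rollout percentage, so the percentage can be raised gradually while watching for crashes.

```rust
/// FNV-1a 64-bit hash, written out by hand so that bucket assignment stays
/// stable across builds (std's default hasher makes no such promise).
fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical rollout gate: returns true if the machine identified by
/// `machine_id` should receive an update currently staged at `percent`
/// (0 to 100). The same machine always lands in the same bucket, so raising
/// `percent` only ever adds machines and never removes them.
fn in_rollout(machine_id: &str, percent: u64) -> bool {
    let bucket = fnv1a(machine_id.as_bytes()) % 100;
    bucket < percent
}
```

With a gate like this, a staged update might start at `in_rollout(id, 1)` and only move to higher percentages after the first machines have rebooted and reported in.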

HP van Braam :verified:

@marcan you do have to admit, those endpoints are very secure now.

h3artbl33d :openbsd: :ve:

@hp @marcan

I disagree. They are still plugged in. Intel vPro and the AMD equivalent have a KVM over IP built-in.

Kote Isaev

@marcan Imagine if someone wrote such a kernel driver in a memory-safe language. But a malformed update arrives anyway. So this memory-safe driver crashes, maybe with a different error code, leaving the system unbootable anyway. So it really requires complex changes, which will not happen.

Hector Martin

@koteisaev A memory-safe language would force you to make an active choice to crash, which at least gives you a chance to, you know, not do that and instead just bypass the update or fail gracefully.

Yes, it is still possible to write crap code for memory-safe languages, but you generally have to do that much more on purpose. With a memory-unsafe language you just don't think about it and the default option is to crash (or worse).

Martin Uecker

@marcan @koteisaev Wouldn't you get a panic by default for an out-of-bounds access in Rust? How would the end result be different? Also see Ariane 5 for an example how memory safety can fail rather spectacularly.

Imikoy

@uecker@mastodon.social @marcan@social.treehouse.systems @koteisaev@mastodon.online The programmer has to make sure the code does not panic by default in the kernel.

edit: panic!() in Linux calls BUG() (from rust/kernel/lib.rs), so it isn't a complete catastrophe to debug.

Hector Martin

@uecker @koteisaev It is trivial to set up a Rust build to make panic a compile time error, forcing you to use primitives that don't do that and to handle the errors gracefully, such as `.get()` instead of indexing with [].

You cannot do that in a memory unsafe language, fundamentally.
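To illustrate the `.get()` point, here is a minimal sketch of a parser for a made-up update format (a little-endian `u32` count followed by that many `u32` entries). The format, the `ParseError` type, and the function names are assumptions for illustration only. The `clippy::indexing_slicing` and `clippy::unwrap_used` lints shown are one way to turn "don't index, don't unwrap" into enforced policy when running Clippy; this is a weaker mechanism than making panics a hard compile-time error.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    Truncated,
}

/// Reads a little-endian u32 at `offset` using `.get()` instead of `[]`,
/// so a short or malformed buffer yields an error rather than a panic.
#[deny(clippy::indexing_slicing, clippy::unwrap_used)]
fn read_u32_le(data: &[u8], offset: usize) -> Result<u32, ParseError> {
    let end = offset.checked_add(4).ok_or(ParseError::Truncated)?;
    match data.get(offset..end) {
        Some(&[a, b, c, d]) => Ok(u32::from_le_bytes([a, b, c, d])),
        _ => Err(ParseError::Truncated),
    }
}

/// Parses the made-up format: a u32 entry count, then that many u32 entries.
/// A count that promises more data than the buffer holds is reported as an
/// error the caller has to handle.
#[deny(clippy::indexing_slicing, clippy::unwrap_used)]
fn parse_entries(data: &[u8]) -> Result<Vec<u32>, ParseError> {
    let count = read_u32_le(data, 0)? as usize;
    let mut entries = Vec::new();
    for i in 0..count {
        // Entry i starts after the 4-byte header; overflow-checked arithmetic.
        let offset = i
            .checked_mul(4)
            .and_then(|n| n.checked_add(4))
            .ok_or(ParseError::Truncated)?;
        entries.push(read_u32_le(data, offset)?);
    }
    Ok(entries)
}
```

A caller that gets `Err(ParseError::Truncated)` can then skip the bad update and keep running on the previous one, which is the "fail gracefully" option discussed above.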

Martin Uecker

@marcan @koteisaev You cannot do what? Force people to use `at` instead of `[]` in C++? That would certainly be possible. But the discussion of what "you could do" misses the point completely. What would happen in reality in a poor code base written on a limited budget? Most likely it would simply panic, no?

Martin Uecker

@marcan @koteisaev The argument that a memory safe language would have prevented the problem is incorrect, because terminating the program (kernel) on invalid operation *is* something that could also happen in a memory safe language and seems even the default in Rust for many things. Whether you could have done it differently (certainly!) and whether Rust makes this easier or not is a different discussion.

Hector Martin

@uecker @koteisaev The argument is that using a memory safe language would be a *requirement* to be *able* to avoid this class of problems, as evidenced by decades of memory safety bugs. Yes you can write crap code in any language, but it's plainly obvious to everyone who isn't in denial about the state of software engineering that approximately nobody can write correct and memory-safe complex code in memory-unsafe languages.

Kote Isaev

@marcan I agree that memory-safe languages are necessary. And many others here would agree on this.
But many coders write C and C++ as if these languages were memory-safe. Like, "Hey, Bob, why do you check this parameter against the array bounds here? I already checked it in the function which calls this code! Your check slows the code by 0.3%!".
But the problem that caused this outage is NOT a memory leak or out-of-bounds data read/write. It was a malformed "content update". Broken input data.

Martin Uecker

@marcan @koteisaev . My preferred solution is to use a subset of C and compile to eBPF which is then verified at run-time.

Kote Isaev

@marcan Here a broken piece of input comes in, and your memory-safe code dies with exit code 9000. I mean, the real problem is this whole situation where a faulty driver can crash the whole system, instead of the system marking this driver as faulty and not using it at the next boot, and if necessary rebooting with new settings, or using a special fallback driver whose purpose would be to report the problem on the next reboot. Then it would be quite a short outage and systems would be alive after a few reboots, read: a few minutes max.
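The "mark the driver as faulty after failed boots" idea can be sketched as a simple counter. This is only an illustration under stated assumptions: the names, the threshold of 3, and the in-memory state are all hypothetical, and a real mechanism would have to keep the counter in persistent storage that survives a crash. The counter is bumped before the driver loads and cleared only once the system confirms a good boot, so a crash loop leaves it climbing until the driver is skipped.

```rust
/// Assumed threshold for this sketch; not taken from any real product.
const MAX_FAILED_BOOTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum DriverAction {
    Load,
    Skip,
}

/// In-memory stand-in for a counter that would live in persistent storage.
struct BootState {
    unconfirmed_boots: u32,
}

impl BootState {
    /// Called early in boot, before the driver is loaded. If too many
    /// previous boots never reached `confirm_boot`, skip the driver.
    fn begin_boot(&mut self) -> DriverAction {
        if self.unconfirmed_boots >= MAX_FAILED_BOOTS {
            return DriverAction::Skip;
        }
        // Count this attempt up front; a crash leaves the increment behind.
        self.unconfirmed_boots += 1;
        DriverAction::Load
    }

    /// Called once the system has come up far enough to be considered healthy.
    fn confirm_boot(&mut self) {
        self.unconfirmed_boots = 0;
    }
}
```

In the `Skip` case the system would come up without the driver, which is where a fallback that only reports the problem could take over.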

ZanaGB

@marcan Which driver would that be? The only new driver I'm aware of that was released this week would be the new "OSS" nVidia driver.

EDIT: I'm an idiot, it's the CrowdStrike thing.

Stoneface Vimes

@marcan but they're all so clever - I simply cannot comprehend how it could all have gone so wrong. Give them some more money.

sudokush

you pumped your fist after posting this tweet huh

EndlessMason

@marcan
How come their tooling just lets you release a malformed thing and doesn't, say, do a basic "does it parse" check first?

Christoph Petrausch

@marcan the only question I have: why did it take so long for this?

Pierre Bourdon

@marcan somewhere in Crowdstrike's bugtracker, 3 bugs titled "move ruleset eval out of process", "always canary dynamic update files" and "add fuzzing to the update files parser", filed by SREs 5+ years ago, marked as P4, get resurrected and set as P0.

(Remember go/outage-2013 ? :p)

ambroisie

@delroth that go link seems to be dead 😢

Lion abt not making pride puns

@delroth @marcan I wonder if Microsoft are again scowling at third-parties making their system "unreliable" and weighing up auto-disabling additional kernel drivers after repeated failed boots again.

(Consumer Windows does so much attempted autorepair I'm kind of surprised it *can't* dig itself out of this hole if safe mode works. But maybe "trusted boot" and disk encryption built atop that screws all this up in an enterprise world.)

Pierre Bourdon

@LionsPhil @marcan they issue the signing certs, they can set any policy they want - but they don't.

Lion abt not making pride puns

@delroth @marcan Imagine a world in which they'd also said "no" to the likes of SecuROM the first time. *sigh*

Shiz

@LionsPhil safe mode does fix it, but I think you can only enter safe mode using the Bitlocker recovery key

lena

@marcan you can stay secure that nobody will be accessing that system again

furicle

@marcan isn't that the definition of av/endpoint security software?

Other than the non memory safe language bit, it has to work that way, with the incentives all around keeping it that way.

(I'm still amused (?) that ms gets to sell security for software it sells, separately, by monthly subscription)

Tammi🐈‍⬛🙂‍↔️ ⛓️‍💥

@marcan there is a process as crowdstrike has explained: just boot into safemode and delete the driver file

Ash_Crow

@tamtararam @marcan alternatively, just reboot the machine again and again until the driver update finishes before the BSOD happens.

Tammi🐈‍⬛🙂‍↔️ ⛓️‍💥

@marcan ppl will be like "haha windows bad" but they have not seen the shit companies install on enterprise servers. you don't even need ebpf or kernel modules to hang a system. just set up fanotify to block all file open requests and have the process that is supposed to approve those die and stop responding to them. and there you go: McAfee for Linux achieved.

Scotty Trees

@marcan is this related to the crowdstrike thing I'm seeing just now?

Frederic Thevenet

@marcan they _could_ have done all of what you said, but who cares about boring shit like that? Certainly not investors or share-holders, that's for sure!
So instead, how about rushing some half-arsed AI chat bot that nobody who actually needs to get things done will ever use? Yay! Now that's the spirit!
crowdstrike.com/platform/charl

Andrew

@marcan least reckless enterprise software developer

cameronbosch :endeavourOS:

@marcan And one that cannot be changed outside of our company because it isn't FOSS! What can POSSIBLY go wrong? 😂

The Penguin of Evil

@marcan Also it seems "and let's not sign it either"

Farce Majeure

@etchedpixels @marcan unsigned third party kernel drivers... what could go wrong...

Hector Martin

@etchedpixels The current rumor is that it was broken in a post-processing step, so it might have been signed internally after that; they just never tested the actual final blob that was distributed.

RussianDeepStateSock

@marcan and let's make sure FBI cybercrimes folks work at the provider of this service, and of course make sure this critical infrastructure defense service is also: <checks notes> front and center in many of the most significant cold war 2.0 nation state hacking allegations on the planet lmaaaaaooo now THAT is what I call safety and security.

It's like riding on top of a fully armed drone fighter with your core systems. I wonder if this makes you more or less of a target? lmaaaao

ajn142

@marcan CrowdStrike taking notes on your post so they know what to go Google.

DarkAthena ✒️

@marcan @lisamelton All with a name that evokes violence against people.

Robert

@marcan in a way this wouldn't have happened if NT was a micro kernel, am I wrong?

Scribe, Backwoods Artificer

@marcan is it bad that part of me wonders, like
what if it had been even worse
what if the systems were just completely unrecoverable

ydalton

@marcan Lol, any kind of string parsing in a non-Python language makes me uneasy, and especially when you do it in a kernel context 💀

Trillion Byter

@marcan brb. Need to add a few topics in my personal Jira.

Christian Berger DECT 2763

@marcan Well it's a nice stunt. I mean nobody would ever use "endpoint security" software on an important system. That would be ridiculous and a clear breach of, for example, the contract rules of Crowdstrike.

Jan ☕🎼🎹☁️🏋️‍♂️

@marcan Rust in the kernel wasn't available when this thing was written. And you just don't go rewriting stuff for the fun of it.

Not to defend them for whatever errors happened in the QA process, but hindsight is 20/20 here.
