Ah yes, let's ship a kernel driver that parses update files that are pushed globally simultaneously to millions of users without progressive staging, and let's write it in a memory unsafe language so it crashes if an update is malformed, and let's have no automated boot recovery mechanism to disable things after a few failed boots. What could possibly go wrong?

🤦‍♂️

19 Jul 2024 at 10:56 | Open on social.treehouse.systems

108 comments

gaytabase

@marcan they are about security, not reliability

19 Jul 2024 at 10:58 | Open on social.treehouse.systems

syn

@dysfun @marcan what's more secure than a bricked computer after all

19 Jul 2024 at 10:59 | Open on ohai.social

Wilfried Klaebe

I assume you're all joking, but please, use the terms correctly!

From the back of my head, "IT security" is "ensuring confidentiality, integrity AND availability", and a bricked computer only ticks - at maximum - two of those boxes.

@syn @dysfun @marcan

19 Jul 2024 at 17:08 | Open on chaos.social

DaCool

@wonka I think they were just joking, nothing deeper than that.

19 Jul 2024 at 18:07 | Open on layer8.space

Hector Martin

@dysfun Reminds me of gr"let's crash on integer overflows that aren't a security bug, and then let's try to fix one such overflow with a hilariously broken obviously unreviewed patch that instead of working around it replaced it with an actual overflow bug that still crashed, thus creating a local kernel panic DoS that anyone can trigger with a shell one-liner, also we don't count DoSes as CVEs so don't bother responsibly disclosing this but we're going to flame you on Twitter and embarrass ourselves so bad we end up deleting our Twitter account but at least we banned your dynamic IP address from our website and forum, take that!!!!!"security.

(Yes, this really happened after I crashed my grsecurity kernel Gentoo box years ago by pasting too much text into a terminal, then tweeted a repro. I stopped using grsecurity after that.)

https://www.reddit.com/r/programming/comments/4gn0dr/hector_martin_on_twitter_how_to_panic_a_current/

19 Jul 2024 at 11:04 | Open on social.treehouse.systems

gaytabase

@marcan that doesn't surprise me tbh, gibson is an arse

19 Jul 2024 at 11:06 | Open on social.treehouse.systems

Hector Martin

@dysfun I think you mean Brad Spengler.

19 Jul 2024 at 11:07 | Open on social.treehouse.systems

Graham Spookyland🎃/Polynomial

@dysfun @marcan that's GRC, not grsec (we're collectively bad at naming things)

19 Jul 2024 at 17:04 | Open on chaos.social

gaytabase

@gsuberland @marcan yeah but wasn't grsec his kernel extension?

19 Jul 2024 at 17:22 | Open on social.treehouse.systems

Hector Martin

@dysfun @gsuberland No, it's two completely different people.

20 Jul 2024 at 2:16 | Open on social.treehouse.systems

Graham Spookyland🎃/Polynomial

@marcan @dysfun yeah Steve Gibson is the guy who looks like a vacuum cleaner salesman that makes snakeoil disk recovery software under the name "GRC" (and also cohosts a podcast), whereas Brad Spengler is the grsecurity guy who had a meltdown on Twitter.

20 Jul 2024 at 2:55 | Open on chaos.social

Andrew Zonenberg

@gsuberland @marcan @dysfun Lol I knew Steve was nuts and full of snake oil but this is the first I've heard the vacuum cleaner line.

20 Jul 2024 at 3:04 | Open on ioc.exchange

Bornach

@azonenberg @gsuberland @marcan @dysfun
Not seen many vacuum cleaner salesmen to be able to make a judgement but I can picture Steve Gibson being skilled at it
https://www.grc.com/pdp-8/deepthought-sbc.htm
On his GRC site, SG walks the viewer through the features of his "blinkenlights" program for a PDP-8 emulator

20 Jul 2024 at 6:59 | Open on fosstodon.org

halva is

@dysfun @marcan cant break what's broken

19 Jul 2024 at 14:09 | Open on wetdry.world

Nicolás Alvarez

@dysfun @marcan security or checklist-compliance?

19 Jul 2024 at 15:18 | Open on social.treehouse.systems

JaxxAI

@dysfun @marcan Availability is literally one of the three pillars of information security, also known as the CIA triad, along with confidentiality and integrity. A lack of reliability leads to unavailability and I now feel like I'm turning into Infosec Yoda.

19 Jul 2024 at 17:04 | Open on floss.social

April Phoenix

@marcan but you gotta understand antivirus is important and you can't wait even an hour for an update to roll out, it has to happen instantly /s 🦋

19 Jul 2024 at 10:59 | Open on chaos.social

Chairmander

@marcan The thing that surprises me the most about this situation: How did something like this not happen waaaayyy sooner? This seems so incredibly fragile, how did it hold up for so long.

19 Jul 2024 at 11:05 | Open on mastodon.gamedev.place

suschi

@l_prod

Luck 🤷‍♀️

19 Jul 2024 at 11:09 | Open on mastodon.online

Michael Kohne

@suschi @l_prod Actually, I'd bet their engineers are probably pretty good, which is why they've gotten away with whatever hole in their process let this through for so long.

19 Jul 2024 at 11:37 | Open on mastodon.social

Sammy 🐾

@l_prod @marcan luck, i guess

19 Jul 2024 at 12:02 | Open on cherrykitten.gay

WowSuchCyber

@l_prod @marcan it happened years ago to McAfee https://www.zdnet.com/article/defective-mcafee-update-causes-worldwide-meltdown-of-xp-pcs/

Guess who was CTO of McAfee at the time

19 Jul 2024 at 17:19 | Open on toot.zof.sh

Cycling_Liz

@marcan I have no idea what any of that means, but I'm glad you and others understand it!

19 Jul 2024 at 11:08 | Open on mastodon.social

James Calligeros

@marcan the amount of enterprise software that does this - especially locally installed subscription-licensed software - is actually incredible. not all of it is as invasive as your crowdstrikes and beyondtrusts but damn near all of it is mission-critical in some way, shape or form, and has absolutely zero respect for the customer's change management process or the fact that the customer's machines are not in fact theirs to do with as they please. a customer of ours was unable to let me complete an upgrade of our software on their server as their pci dss compliance malware could not be disabled once installed without nuking it off the host machine (which due to how invasive it is, involves reimaging the machine entirely)

our helpdesk was able to continue operating uninterrupted all of this afternoon and evening because we simply do not install software on our production machines that does not respect our ownership of said machines (apart from windows server itself as the itsm tool we use is an asp dot net thing)

19 Jul 2024 at 11:12 | Open on social.treehouse.systems

Javier

@marcan the language has nothing to do with it. It's a piece of crap code, plain and simple.

It would never have passed even a cursory code review if it were a bit more open. The only reason it's widely deployed is because it's mandated by committees that don't care how it works

19 Jul 2024 at 11:15 | Open on mstdn.social

Hector Martin

@javierg The language matters because segfaulting on invalid input only happens in memory unsafe languages. On a memory safe language you generally have to make a conscious decision about how to handle errors and unexpected situations.

19 Jul 2024 at 11:23 | Open on social.treehouse.systems

Javier

@marcan and that's exactly how this kind of quality-free code is written: assuming nothing wrong ever happens. In "memory safe" languages it's the "reliably crash" that would stay in the code, because nobody cares to check whether it's been replaced with actual error handling.

19 Jul 2024 at 11:30 | Open on mstdn.social

Hector Martin

@javierg At least with a memory-safe language someone had to make an *active decision* to reliably crash (making this something solvable by policy, e.g. ban such constructs in the linter), as opposed to no decision at all (which is impossible to protect against or have processes that forbid, once you're using a memory unsafe language).

19 Jul 2024 at 12:24 | Open on social.treehouse.systems

Henri

@marcan @javierg if they used Rust they would just put unsafe everywhere, c’mon you know this.

19 Jul 2024 at 15:06 | Open on mdon.ee

soc

@slyecho @marcan @javierg With which part of

> something solvable by policy, e.g. ban such constructs

are you struggling?

19 Jul 2024 at 15:49 | Open on chaos.social

Henri

@soc @marcan @javierg Yeah, I work in corporate software development, we have all kinds of rules, guidelines, code review at least by 2 persons, SonarQube and still a lot of crap gets through

19 Jul 2024 at 15:54 | Open on mdon.ee

Javier

@marcan
that's too hopeful. in this case it seems the bug was in the parser; evidently it's a codepath that has never been tested. thinking that any linter or development tool would "fix" that presumes a lot more discipline than what passes as "professional" at that kind of company.

the problem is their "success" in secrecy. for anything security- or management-related that's the perfect recipe for failure.

no tool can help those who don't have to do a good job to profit.

19 Jul 2024 at 15:28 | Open on mstdn.social

Esparta :ruby:

@marcan @javierg

re:

> At least with a memory-safe language someone had to make an *active decision* to reliably crash (making this something solvable by policy, e.g. ban such constructs in the linter),

I've seen entire teams making conscious, active decisions to break things for the sake of saving their ass or the corporate reputation, if any.

I agree, it's way better if the programming language has all the controls and tries its best to prevent unconscious bad decisions.

19 Jul 2024 at 20:23 | Open on ruby.social

feld

@marcan @javierg

> The language matters because segfaulting on invalid input only happens in memory unsafe languages.

but the error is PAGE_FAULT_IN_NONPAGED_AREA

Code you write does not get to handle this error gracefully. This is the kernel shooting it in the head. This is not something Rust magically solves. I literally reported an issue a couple weeks ago to a Rust program that was having this type of problem on FreeBSD

pid 10464 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)
pid 14270 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)
pid 16531 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)
pid 19441 (qdrant), jid 113, uid 0: exited on signal 11 (no core dump - bad address)

19 Jul 2024 at 13:43 | Open on bikeshed.party

Orca🌻 | 🏴🏳️‍⚧️

@marcan@social.treehouse.systems

... ship a kernel driver that parses update files that ...

Fainted on-site, someone call an ambulance? :nkocampfiredrink:

19 Jul 2024 at 11:22 | Open on nya.one

Raul

@marcan What a chain of dangerous/bad-practice deployment choices, no rollback options, + likely poor/incomplete testing.

I bet systems with Crowdstrike + Bitlocker on will cause some major headaches

19 Jul 2024 at 11:23 | Open on social.treehouse.systems

Jamie Knight

@raulinbonn @marcan Yes, they did...

19 Jul 2024 at 17:27 | Open on social.vivaldi.net

Raul

@knightlie @marcan

Dave Farley (author of the "Continuous Delivery", and "Modern Software Engineering" books) highlights in this video four questions that ought to have mitigated the impact of this failed update, or rather, ought to help mitigate similar incidents in the future:

1. Why wasn't this caught by Testing?

2. Why didn't they use Canary Releasing?

3. Why didn't CrowdStrike have proper Observability of this as it happened?

4. Why no Rollback planned within the change?

https://youtu.be/MwjQVAwIATE?feature=shared

19 Jul 2024 at 19:55 | Open on social.treehouse.systems
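
Questions 2 to 4 above (canary releasing, observability, rollback) can be sketched roughly as a staged rollout. This is only an illustration, not how CrowdStrike or any real vendor distributes updates: the function names, the stage sizes (1%, 10%, 100%) and the 1% failure threshold are all assumptions made up for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a host identifier to a bucket in 0..100 (hypothetical helper).
/// `DefaultHasher::new()` is deterministic within one build, but its output
/// is not guaranteed to be stable across Rust versions; a real system would
/// pick a fixed hash function.
fn rollout_bucket(host_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    host_id.hash(&mut hasher);
    hasher.finish() % 100
}

/// A host receives the update only if its bucket is below the current stage.
/// Because buckets are fixed per host, raising the percentage only adds hosts.
fn should_receive_update(host_id: &str, stage_percent: u64) -> bool {
    rollout_bucket(host_id) < stage_percent
}

/// Decides the next stage from observed health data (assumed thresholds).
/// Returns `None` to halt the rollout when there is no data yet or when
/// more than 1% of installs reported a failure.
fn next_stage(current_percent: u64, failures: u64, installs: u64) -> Option<u64> {
    if installs == 0 || failures * 100 > installs {
        return None;
    }
    // Widen by a factor of ten per stage: 1% -> 10% -> 100%.
    Some((current_percent * 10).clamp(1, 100))
}

fn main() {
    for host in ["host-a", "host-b", "host-c"] {
        println!("{host}: in the 10% stage? {}", should_receive_update(host, 10));
    }
    println!("healthy canary:   {:?}", next_stage(1, 0, 500));
    println!("unhealthy canary: {:?}", next_stage(10, 50, 500));
}
```

The point of the halt branch is that a bad update stops at the first stage instead of reaching every machine at once.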

HP van Braam :verified:

@marcan you do have to admit, those endpoints are very secure now.

19 Jul 2024 at 11:55 | Open on mastodon.tmm.cx

h3artbl33d :openbsd: :ve:

@hp @marcan

I disagree. They are still plugged in. Intel vPro and the AMD equivalent have a KVM over IP built-in.

19 Jul 2024 at 21:40 | Open on exquisite.social

Kote Isaev

@marcan Imagine someone writes such a kernel driver in a memory-safe language, but a malformed update arrives anyway. This memory-safe driver still crashes, maybe with a different error code, leaving the system unbootable anyway. So it really requires complex changes, which will not happen.

19 Jul 2024 at 12:00 | Open on mastodon.online

Hector Martin

@koteisaev A memory-safe language would force you to make an active choice to crash, which at least gives you a chance to, you know, not do that and instead just bypass the update or fail gracefully.

Yes, it is still possible to write crap code for memory-safe languages, but you generally have to do that much more on purpose. With a memory-unsafe language you just don't think about it and the default option is to crash (or worse).

19 Jul 2024 at 12:22 | Open on social.treehouse.systems

Martin Uecker

@marcan @koteisaev Wouldn't you get a panic by default for an out-of-bounds access in Rust? How would the end result be different? Also see Ariane 5 for an example how memory safety can fail rather spectacularly.

19 Jul 2024 at 17:00 | Open on mastodon.social

Imikoy

@uecker@mastodon.social @marcan@social.treehouse.systems @koteisaev@mastodon.online The programmer has to make sure the code does not panic by default in the kernel.

edit: panic!() in Linux calls BUG() (from rust/kernel/lib.rs), so it isn't a complete catastrophe to debug.

19 Jul 2024 at 17:51 | Open on meow.miabaka.moe

Hector Martin

@uecker @koteisaev It is trivial to set up a Rust build to make panic a compile time error, forcing you to use primitives that don't do that and to handle the errors gracefully, such as `.get()` instead of indexing with [].

You cannot do that in a memory unsafe language, fundamentally.

20 Jul 2024 at 2:20 | Open on social.treehouse.systems
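
The `.get()` versus `[]` distinction above can be shown with a small sketch. The parser, the `ParseError` type and the one-byte "index" format are hypothetical and exist only for illustration; nothing here reflects the actual update file format. One way (among others) to enforce this style by policy is a linter rule such as Clippy's `indexing_slicing` restriction lint.

```rust
/// Errors a hypothetical update-file parser can report instead of crashing.
#[derive(Debug, PartialEq)]
enum ParseError {
    Truncated,
    BadIndex(usize),
}

/// Reads a one-byte index from `input` and looks it up in `table`.
/// Every access goes through `first()` or `get()`, which return `Option`,
/// so out-of-range data becomes an `Err` the caller has to deal with.
/// Writing `table[index]` instead would panic on a bad index.
fn lookup_entry(input: &[u8], table: &[u32]) -> Result<u32, ParseError> {
    let index = usize::from(*input.first().ok_or(ParseError::Truncated)?);
    table.get(index).copied().ok_or(ParseError::BadIndex(index))
}

fn main() {
    let table = [10, 20, 30];
    println!("{:?}", lookup_entry(&[1], &table)); // Ok(20)
    println!("{:?}", lookup_entry(&[200], &table)); // Err(BadIndex(200))
    println!("{:?}", lookup_entry(&[], &table)); // Err(Truncated)
}
```

A caller that receives `Err` can skip the update and keep running with the previous rules, which is the "fail gracefully" option discussed in this subthread.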

Martin Uecker

@marcan @koteisaev You cannot do what? Force people to use `.at()` instead of `[]` in C++? That would certainly be possible. But the discussion of what "you could do" misses the point completely. What would happen in reality in a poor code base written on a limited budget? Most likely it would simply panic, no?

20 Jul 2024 at 5:47 | Open on mastodon.social

Martin Uecker

@marcan @koteisaev The argument that a memory safe language would have prevented the problem is incorrect, because terminating the program (kernel) on invalid operation *is* something that could also happen in a memory safe language and seems even the default in Rust for many things. Whether you could have done it differently (certainly!) and whether Rust makes this easier or not is a different discussion.

20 Jul 2024 at 6:11 | Open on mastodon.social

Hector Martin

@uecker @koteisaev The argument is that using a memory safe language would be a *requirement* to be *able* to avoid this class of problems, as evidenced by decades of memory safety bugs. Yes you can write crap code in any language, but it's plainly obvious to everyone who isn't in denial about the state of software engineering that approximately nobody can write correct and memory-safe complex code in memory-unsafe languages.

20 Jul 2024 at 7:55 | Open on social.treehouse.systems

Kote Isaev

@marcan I agree that memory-safe languages are necessary. And many others here would agree on this.
But many coders write C and C++ as if these languages were memory-safe. Like, "Hey, Bob, why do you check this parameter against the array bounds here? I already checked it in the function that calls this code! Your check slows the code down by 0.3%!".
But the problem that caused this outage is NOT a memory leak or an out-of-bounds read/write. It was a malformed "content update". Broken input data.

20 Jul 2024 at 9:17 | Open on mastodon.online

Martin Uecker

@marcan @koteisaev . My preferred solution is to use a subset of C and compile to eBPF which is then verified at run-time.

20 Jul 2024 at 9:35 | Open on mastodon.social

Kote Isaev

@marcan Here a broken piece of input comes in, and your memory-safe code dies with exit code 9000. I mean, the whole situation where a faulty driver can crash the entire system is the problem. The system should instead mark the driver as faulty and not use it on the next boot, rebooting with new settings if necessary, or use a special fallback driver whose purpose is to report the problem on the next reboot. Then it would be quite a short outage, and systems would be alive after a few reboots; read: a few minutes at most.

20 Jul 2024 at 9:23 | Open on mastodon.online
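
The "mark the driver as faulty after failed boots" idea can be sketched as a small state machine. This is a toy model, not how Windows or any real boot loader works: the `BootRecord` type, its fields and the threshold of three are assumptions, and a real implementation would have to persist this state somewhere that survives a crash.

```rust
/// Hypothetical per-driver record that survives reboots.
#[derive(Debug, Default)]
struct BootRecord {
    consecutive_failed_boots: u32,
    disabled: bool,
}

/// Assumed threshold: give up on the driver after three failed boots in a row.
const MAX_FAILED_BOOTS: u32 = 3;

impl BootRecord {
    /// Called early in boot, before the driver loads. The attempt is counted
    /// as failed up front; only `boot_succeeded` clears the counter.
    /// Returns `true` if the driver should be loaded on this boot.
    fn begin_boot(&mut self) -> bool {
        if self.consecutive_failed_boots >= MAX_FAILED_BOOTS {
            self.disabled = true;
        }
        if self.disabled {
            return false;
        }
        self.consecutive_failed_boots += 1;
        true
    }

    /// Called once the system has finished booting with the driver loaded.
    fn boot_succeeded(&mut self) {
        self.consecutive_failed_boots = 0;
    }
}

fn main() {
    let mut record = BootRecord::default();
    // Simulate a driver that crashes the machine on every boot:
    // `boot_succeeded` is never reached.
    for attempt in 1..=4 {
        println!("boot {attempt}: load driver? {}", record.begin_boot());
    }
    println!("{record:?}");
}
```

In this simulation the first three boots try the driver and the fourth skips it, so the machine comes back up without manual intervention.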

ZanaGB

@marcan Which driver would that be? The only new driver I'm aware of that was released this week is the new "OSS" nVidia driver.

EDIT: I'm an idiot, it's the CrowdStrike thing.

19 Jul 2024 at 12:17 | Open on mastodon.social

Stoneface Vimes

@marcan but they're all so clever - I simply cannot comprehend how it could all have gone so wrong. Give them some more money.

19 Jul 2024 at 12:20 | Open on c.im

sudokush

you pumped your fist after posting this tweet huh

19 Jul 2024 at 12:25 | Open on mastodon.social

EndlessMason

@marcan
How come their tooling just lets you release a malformed thing and doesn't, say, do a basic "does it parse" check first?

19 Jul 2024 at 12:40 | Open on hachyderm.io
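
A basic "does it parse" gate could look something like the sketch below. The file format here (a 4-byte `UPD1` magic, a little-endian length, then the payload) is invented for the example and has nothing to do with the real update files; the idea is only that the check runs on the exact bytes about to be published.

```rust
/// Validates a blob in a made-up format: b"UPD1", a 4-byte little-endian
/// payload length, then exactly that many payload bytes.
/// The slice pattern avoids indexing, so short or malformed input cannot panic.
fn validate_update(blob: &[u8]) -> Result<(), String> {
    let (declared_len, payload) = match blob {
        [b'U', b'P', b'D', b'1', a, b, c, d, payload @ ..] => {
            (u32::from_le_bytes([*a, *b, *c, *d]) as usize, payload)
        }
        _ => return Err("missing or invalid header".to_string()),
    };
    if payload.len() != declared_len {
        return Err(format!(
            "header declares {declared_len} payload bytes, found {}",
            payload.len()
        ));
    }
    Ok(())
}

fn main() {
    // Hypothetical final artifacts, as they would leave the build pipeline.
    let good = [b'U', b'P', b'D', b'1', 2, 0, 0, 0, 0xAA, 0xBB];
    let zero_filled = [0u8; 16];

    println!("good:        {:?}", validate_update(&good));
    println!("zero-filled: {:?}", validate_update(&zero_filled));
    println!("truncated:   {:?}", validate_update(&good[..9]));
}
```

Running the gate as the last step, after any post-processing or packaging, is what makes it meaningful: it has to see the same bytes the customers' machines will see.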

MrCheeze :retro:

@marcan And let's do it all on a Friday!

19 Jul 2024 at 13:32 | Open on wetdry.world

Christoph Petrausch

@marcan the only question I have: why did it take so long for this?

19 Jul 2024 at 13:39 | Open on norden.social

Woke Leftist Trash

@marcan software. 🎩 🐇

19 Jul 2024 at 13:40 | Open on types.pl

Pierre Bourdon

@marcan somewhere in Crowdstrike's bugtracker, 3 bugs titled "move ruleset eval out of process", "always canary dynamic update files" and "add fuzzing to the update files parser", filed by SREs 5+ years ago, marked as P4, get resurrected and set as P0.

(Remember go/outage-2013 ? :p)

19 Jul 2024 at 13:48 | Open on mastodon.delroth.net

ambroisie

@delroth that go link seems to be dead 😢

19 Jul 2024 at 15:55 | Open on nixos.paris

Lion abt not making pride puns

@delroth @marcan I wonder if Microsoft are again scowling at third-parties making their system "unreliable" and weighing up auto-disabling additional kernel drivers after repeated failed boots again.

(Consumer Windows does so much attempted autorepair I'm kind of surprised it *can't* dig itself out of this hole if safe mode works. But maybe "trusted boot" and disk encryption built atop that screws all this up in an enterprise world.)

19 Jul 2024 at 17:43 | Open on plush.city

Pierre Bourdon

@LionsPhil @marcan they issue the signing certs, they can set any policy they want - but they don't.

19 Jul 2024 at 17:43 | Open on mastodon.delroth.net

Lion abt not making pride puns

@delroth @marcan Imagine a world in which they'd also said "no" to the likes of SecuROM the first time. *sigh*

19 Jul 2024 at 17:53 | Open on plush.city

Shiz

@LionsPhil safe mode does fix it, but I think you can only enter safe mode using the Bitlocker recovery key

19 Jul 2024 at 21:46 | Open on mastodon.social

lena

@marcan you can stay secure that nobody will be accessing that system again

19 Jul 2024 at 13:48 | Open on social.treehouse.systems

furicle

@marcan isn't that the definition of av/endpoint security software?

Other than the non memory safe language bit, it has to work that way, with the incentives all around keeping it that way.

(I'm still amused (?) that ms gets to sell security for software it sells, separately, by monthly subscription)

19 Jul 2024 at 13:56 | Open on mastodon.social

Tammi🐈‍⬛🙂‍↔️ ⛓️‍💥

cynical reply

@marcan there is a process as crowdstrike has explained: just boot into safemode and delete the driver file

19 Jul 2024 at 14:01 | Open on chaos.social

Ash_Crow

cynical reply

@tamtararam @marcan alternatively, just reboot the machine again and again until the driver update finishes before the BSOD happens.

19 Jul 2024 at 18:48 | Open on mastodon.social

Tammi🐈‍⬛🙂‍↔️ ⛓️‍💥

@marcan people will be like "haha windows bad" but they have not seen the shit companies install on enterprise servers. You don't even need eBPF or kernel modules to hang a system. Just set up fanotify to block all file open requests, then have the process that is supposed to approve them die and stop responding. And there you go: McAfee for Linux achieved.

19 Jul 2024 at 14:09 | Open on chaos.social

Scotty Trees

@marcan is this related to the crowdstrike thing I'm seeing just now?

19 Jul 2024 at 15:02 | Open on mastodon.social

Frederic Thevenet

@marcan they _could_ have done all of what you said, but who cares about boring shit like that? Certainly not investors or share-holders, that's for sure!
So instead, how about rushing some half-arsed AI chat bot that nobody who actually needs to get things done will ever use? Yay! Now that's the spirit!
https://www.crowdstrike.com/platform/charlotte-ai/

19 Jul 2024 at 15:07 | Open on mastodon.social

asmaloney (Andy) 🌎

@marcan ...and let's do it on a Friday!

19 Jul 2024 at 16:10 | Open on fosstodon.org

Andrew

@marcan least reckless enterprise software developer

19 Jul 2024 at 16:44 | Open on hackers.town

cameronbosch :endeavourOS:

@marcan And one that cannot be changed outside of our company because it isn't FOSS! What can POSSIBLY go wrong? 😂

19 Jul 2024 at 16:45 | Open on fosstodon.org

The Penguin of Evil

@marcan Also, it seems, "and let's not sign it either"

19 Jul 2024 at 16:57 | Open on mastodon.social

Farce Majeure

@etchedpixels @marcan unsigned third party kernel drivers... what could go wrong...

19 Jul 2024 at 17:13 | Open on better.boston

Hector Martin

@etchedpixels The current rumor is that it was broken in a post-processing step, so it might have been signed internally after that; they just never tested the actual final blob that was distributed.

20 Jul 2024 at 2:18 | Open on social.treehouse.systems

RussianDeepStateSock

@marcan and let's make sure FBI cybercrimes folks work at the provider of this service, and of course make sure this critical infrastructure defense service is also: <checks notes> front and center in many of the most significant cold war 2.0 nation state hacking allegations on the planet lmaaaaaooo now THAT is what I call safety and security.

It's like riding on top of a fully armed drone fighter with your core systems. I wonder if this makes you more or less of a target? lmaaaao

19 Jul 2024 at 17:22 | Open on mastodon.social

Buttered Jorts

@marcan CrowdStrike taking notes on your post so they know what to go Google.

19 Jul 2024 at 17:39 | Open on infosec.exchange

DarkAthena ✒️

@marcan @lisamelton All with a name that evokes violence against people.

19 Jul 2024 at 17:43 | Open on mstdn.social

Robert

@marcan in a way this wouldn't have happened if NT was a micro kernel, am I wrong?

19 Jul 2024 at 17:59 | Open on fosstodon.org

lily 🏳️‍⚧️

@rsf92@fosstodon.org @marcan@social.treehouse.systems god i wish nt was a microkernel

19 Jul 2024 at 23:07 | Open on possum.city

Stefano Marinelli

@marcan

A two-panel comic meme featuring a cartoon dog in a hat. In the first panel, the dog sits at a table holding a mug, surrounded by flames. In the second panel, a close-up of the dog's face shows it smiling and saying "THIS IS FINE." despite the fire raging around it. The meme portrays a character maintaining a calm facade in a clearly disastrous situation.

19 Jul 2024 at 18:29 | Open on mastodon.bsd.cafe

Rupert 🇪🇺🏴‍☠️😾🔭📷🍺🍪🌍🔥

@marcan

19 Jul 2024 at 18:37 | Open on mindly.social

Scribe, Backwoods Artificer

@marcan is it bad that part of me wonders, like
what if it had been even worse
what if the systems were just completely unrecoverable

19 Jul 2024 at 19:40 | Open on labyrinth.zone

Viss

@marcan 83 billion dollar company

19 Jul 2024 at 20:11 | Open on mastodon.social

ydalton

@marcan Lol, any kind of string parsing in a non-Python language makes me uneasy, and especially when you do it in a kernel context 💀

20 Jul 2024 at 11:24 | Open on mastodon.social

Trillion Byter

@marcan brb. Need to add a few topics in my personal Jira.

20 Jul 2024 at 14:52 | Open on mstdn.social

Christian Berger DECT 2763

@marcan Well it's a nice stunt. I mean nobody would ever use "endpoint security" software on an important system. That would be ridiculous and a clear breach of, for example, the contract rules of Crowdstrike.

21 Jul 2024 at 7:17 | Open on f-ckendehoelle.de

Jan ☕🎼🎹☁️🏋️‍♂️

@marcan Rust in the kernel wasn't available when this thing was written. And you just don't go rewriting stuff for the fun of it.

Not to defend them for whatever errors happened in the QA process, but hindsight is 20/20 here.

21 Jul 2024 at 18:23 | Open on fedi.kcore.org
