@marcan @koteisaev You cannot do what? Force people to use `at` instead of `[]` in C++? That would certainly be possible. But the discussion of what "you could do" misses the point completely. What would happen in reality in a poor code base written on a limited budget? Most likely it would simply panic, wouldn't it?

20 Jul 2024 at 5:47 | Open on mastodon.social

19 comments

Martin Uecker

@marcan @koteisaev The argument that a memory safe language would have prevented the problem is incorrect, because terminating the program (kernel) on an invalid operation *is* something that could also happen in a memory safe language, and it even seems to be the default in Rust for many things. Whether you could have done it differently (certainly!) and whether Rust makes this easier or not is a different discussion.

20 Jul 2024 at 6:11 | Open on mastodon.social

Hector Martin

@uecker @koteisaev The argument is that using a memory safe language would be a *requirement* to be *able* to avoid this class of problems, as evidenced by decades of memory safety bugs. Yes you can write crap code in any language, but it's plainly obvious to everyone who isn't in denial about the state of software engineering that approximately nobody can write correct and memory-safe complex code in memory-unsafe languages.

20 Jul 2024 at 7:55 | Open on social.treehouse.systems

Kote Isaev

@marcan I agree that memory-safe languages are necessary. And many others here would agree on this.
But many coders write C and C++ as if these languages were memory-safe. Like, "Hey, Bob, why do you check this parameter against the array bounds here? I already checked it in the function that calls this code! Your check slows the code down by 0.3%!".
But the problem that caused this outage is NOT a memory leak or an out-of-bounds data read/write. It was a malformed "content update". Broken input data.

20 Jul 2024 at 9:17 | Open on mastodon.online

Martin Uecker

@koteisaev @marcan My point is that memory safety does not help here, because a panic on an out-of-bounds access is memory safe and would still have the exact same effect. Hector's argument seems to be that other things you can optionally do in Rust would potentially allow you to avoid this, but this is not the result of using a memory safe language per se. I agree about the sad state of software engineering and I also agree about the advantages of memory safety in general.

20 Jul 2024 at 9:41 | Open on mastodon.social

Hector Martin

@uecker @koteisaev My point is that you can do those things in Rust and you can't in C.

The actual crash here was a NULL deref. That is one of the most classic footguns of memory-unsafe languages (not just those, also others like Java for some reason). In Rust there are no NULLs, only explicit Option<T>s, which force you to consider the case of there being no value. Yes, you can still just turn it into "panic if no value", but making it an explicit decision that the programmer has to make means it's a lot less likely to happen by accident and a lot more likely to be correctly handled with error propagation, and it also means you can outright ban that choice by policy and technical means.
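To make the Option<T> point concrete, here is a minimal sketch. The `Rule` type and both function names are hypothetical and exist only for illustration; nothing here is taken from the actual driver.

```rust
/// Hypothetical rule type, used only for illustration.
struct Rule {
    name: &'static str,
}

/// Looks up a rule by id. Absence is part of the return type (`Option`),
/// so there is no null pointer for the caller to dereference by accident.
fn find_rule(rules: &[Rule], id: usize) -> Option<&Rule> {
    rules.get(id)
}

/// The caller must decide what "no rule" means before it can touch the value.
fn rule_name(rules: &[Rule], id: usize) -> Result<&'static str, String> {
    match find_rule(rules, id) {
        Some(rule) => Ok(rule.name),
        None => Err(format!("no rule with id {id}")),
    }
}

fn main() {
    let rules = [Rule { name: "block-pipe" }];
    println!("{:?}", rule_name(&rules, 0));
    println!("{:?}", rule_name(&rules, 7));
    // Writing `find_rule(&rules, 7).unwrap()` instead would be the explicit
    // "panic if no value" choice that a project policy or lint can forbid.
}
```

The missing-value case shows up as an `Err` that the caller has to look at, rather than as a crash at the point of use.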

20 Jul 2024 at 10:47 | Open on social.treehouse.systems

Martin Uecker replied to Hector

@marcan @koteisaev I can almost agree with this, but my conclusion from this is not "let's dump C because it is fundamentally impossible to write good software in C and move to Rust, which fixes everything", but rather "there are some good ideas in Rust which help write better software, but there is also continuously improving tooling for C that one can use, so we can also gradually improve this".

20 Jul 2024 at 11:01 | Open on mastodon.social

Hector Martin

@koteisaev Broken input data that caused a NULL pointer dereference. Which is a memory safety problem. NULL pointer dereferences are impossible in safe Rust code.

(Also memory leaks are *not* a memory safety problem and are possible in safe Rust; this is a common misconception about what "memory safety" means.)

20 Jul 2024 at 10:44 | Open on social.treehouse.systems

Kote Isaev

@marcan Null deref?
Like, some `switch` construct got optimized into a kind of `foo[bar].doCrap(ctx, data)` construct? But the `foo[bar]` kind of construct without a default pathway can happen in Rust anyway if that thing is constructed dynamically (e.g. by another, earlier part of the arriving data).
This can be a common situation for some "rules engine" or other high-level code execution branching infrastructure.
Can Rust enforce "always have a meaningful result" for a `foo[bar]` construct?

20 Jul 2024 at 13:16 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev There's no mention of an indexed array. Rust guarantees at compile time that nothing can be NULL. If you need to have optional values then you have to use Option<T> and the compiler forces you to choose how to handle the lack of value.

For switch statements and such, Rust requires them to be exhaustive or have a default case if that is required for correctness (e.g. because a value is returned).
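A small sketch of the exhaustiveness rule, using a hypothetical `Verdict` enum invented for this example. In Rust a `match` has to cover every possible value, either by listing each variant or with a catch-all arm.

```rust
/// Hypothetical verdicts a rules engine might produce (illustration only).
enum Verdict {
    Allow,
    Deny,
    Quarantine,
}

/// A `match` has to cover every variant. Deleting any arm below is a
/// compile error (E0004, non-exhaustive patterns).
fn action_code(v: &Verdict) -> u8 {
    match v {
        Verdict::Allow => 0,
        Verdict::Deny => 1,
        Verdict::Quarantine => 2,
    }
}

/// For open-ended inputs such as raw integers, a catch-all arm is required.
fn verdict_from_byte(b: u8) -> Option<Verdict> {
    match b {
        0 => Some(Verdict::Allow),
        1 => Some(Verdict::Deny),
        2 => Some(Verdict::Quarantine),
        _ => None, // unknown value from the input: reported, not undefined behaviour
    }
}

fn main() {
    for b in [0u8, 1, 2, 200] {
        match verdict_from_byte(b) {
            Some(v) => println!("{b} -> action {}", action_code(&v)),
            None => println!("{b} -> rejected"),
        }
    }
}
```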

Array/slice indexing with [] *can* panic (which is still a BSOD but at least guaranteed not an exploit) but it is possible to ban that in the compiler/linter and enforce the use of the .get() method which returns an Option<T>, and such policy would be a good idea for critical kernel code. You can set up a Rust build such that it is *impossible* for any operation to panic, e.g. by making the panic symbol undefined so the project fails to link if it is referenced. This even bans things like unchecked integer division by a non-constant (since div by zero is a panic). All of the panicking operations would have non-panicking versions that you use instead.
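As a rough sketch of the non-panicking versions mentioned above: `slice::get` and `u32::checked_div` are standard library methods, and `clippy::indexing_slicing` is an existing Clippy restriction lint. The function names and the choice to attach the lint to a single function are assumptions made for this example; a real project would more likely set the policy crate-wide.

```rust
/// Non-panicking lookup: `slice::get` returns `Option<&T>` instead of
/// panicking on an out-of-range index the way `table[idx]` would.
#[deny(clippy::indexing_slicing)] // Clippy restriction lint that rejects `table[idx]`
fn lookup(table: &[u32], idx: usize) -> Option<u32> {
    table.get(idx).copied()
}

/// Non-panicking division: `checked_div` returns `None` for a zero divisor
/// instead of panicking.
fn average(total: u32, count: u32) -> Option<u32> {
    total.checked_div(count)
}

fn main() {
    let table = [10, 20, 30];
    println!("{:?}", lookup(&table, 1));
    println!("{:?}", lookup(&table, 21));
    println!("{:?}", average(60, 3));
    println!("{:?}", average(60, 0));
}
```

Both functions push the "what if it is out of range or zero" question onto the caller as an `Option`, which is what makes a no-panic build practical.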

20 Jul 2024 at 14:21 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan Still, this is NOT enough to prevent failure in a situation where "user input" is involved, even if it is some "content update" for a security driver.
The real problem there was not the fact that the kernel panic happened, but rather the fact that a recovery strategy that does not require manual intervention was not implemented.

20 Jul 2024 at 14:26 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev Yes it is. Again, you can statically forbid panic, and (safe) Rust already forbids unsafe memory accesses. Therefore, it is impossible to (memory error) BSOD regardless of how you handle user input in the general case. The language forces you to handle the bad input gracefully somehow (typically by returning and propagating an error).
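A minimal sketch of "returning and propagating an error" for malformed input. The update format (one count byte followed by that many one-byte fields), the `UpdateError` type and the function names are all assumptions invented for this example, not the real content update format.

```rust
use std::fmt;

/// Hypothetical error type for a malformed content update (illustration only).
#[derive(Debug, PartialEq)]
enum UpdateError {
    TooShort,
    BadFieldCount { expected: usize, found: usize },
}

impl fmt::Display for UpdateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UpdateError::TooShort => write!(f, "update is empty"),
            UpdateError::BadFieldCount { expected, found } => {
                write!(f, "header promises {expected} fields, found {found}")
            }
        }
    }
}

/// Assumed format: the first byte is a field count, followed by that many
/// one-byte fields.
fn parse_update(bytes: &[u8]) -> Result<Vec<u8>, UpdateError> {
    let (&count, fields) = bytes.split_first().ok_or(UpdateError::TooShort)?;
    let expected = usize::from(count);
    if fields.len() != expected {
        return Err(UpdateError::BadFieldCount { expected, found: fields.len() });
    }
    Ok(fields.to_vec())
}

/// `?` hands the error to the caller; nothing is dereferenced on failure.
fn apply_update(bytes: &[u8]) -> Result<usize, UpdateError> {
    let fields = parse_update(bytes)?;
    Ok(fields.len())
}

fn main() {
    match apply_update(&[21, 1, 2, 3]) {
        Ok(n) => println!("applied {n} fields"),
        Err(e) => println!("update rejected: {e}"),
    }
}
```

A header that promises more fields than the data contains ends up as an `Err` value at the top-level caller, which can log it and keep running with the previous rules.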

About the worst you can do is an infinite loop (but no language can protect against that because it equates to solving the halting problem).

20 Jul 2024 at 14:34 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan "(typically by returning and propagating an error)." Propagating to the exit of the process with an error code. Here we are again at the need for some standard of how to deal with faulty drivers in general, such as "reboot with replacement by error reporting code that will send an error dump somewhere", "isolate drivers in some container-like environment that would NOT cause a complete boot BSOD, except for special cases like a filesystem driver" (but then such EDR is impossible per se), at the OS level...

20 Jul 2024 at 14:44 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev Huh? No OS crashes when a driver returns an error, be it from the init function or a callback. It doesn't propagate to "exit of process", it propagates to the driver management layer and then the operation fails, be it an access or a driver init. If it's a user process invoking the driver, that operation returns an error code to the user process (if the user process chooses to handle that by crashing, that's its problem then).

On Windows when a driver fails to init that's a little exclamation mark on the device in Device Manager or similar, or a service error code, or whatever. On Linux the driver just doesn't bind to the device.

20 Jul 2024 at 14:52 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan If it were always so, then nobody would ever have seen THAT BSOD, as it was caused by a crashing kernel driver, a very privileged piece of software. So it means that, for Windows at least, it did not end with some standard protocol like an "exclamation mark on a software device".

20 Jul 2024 at 14:55 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev THAT BSOD was caused by a driver crashing, not a driver returning an error code, which is a very different thing because a crash is uncontrolled and cannot be safely handled, while an error code return is a safe and controlled condition.

Linux actually tries to prevent a full system panic, and only terminates the current process if the context is a user process. If you're lucky that means the machine keeps working as normal, if the crash didn't corrupt memory. More often than not, even in that case, the faulty driver had some mutexes locked and your system will slowly deadlock into oblivion as other processes try to lock the same mutex. There is no reasonable way around this. This is why uncontrolled crashes are bad and error returns are not.

20 Jul 2024 at 14:59 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan Sounds like an argument against a big kernel and in favor of more isolated drivers, and against "hyper-privileged" software in general...
The kernel could unlock all mutexes on process death (and even if a process leaked mutex locks without crashing), the same way file handles are freed even if you use the kill command on a process....
In userspace it resembles how nodejs domains were used to intercept errors to prevent an ungraceful process crash.

20 Jul 2024 at 15:05 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev

> Sounds like an argument against a big kernel and in favor of more isolated drivers, and against "hyper-privileged" software in general...

Which is what macOS did, and why this can't happen on the macOS version of CrowdStrike (it uses userspace drivers).

Linux has similar mechanisms, but can't discourage kernel drivers by policy like macOS did since it's not as tightly controlled, so CrowdStrike on Linux still uses a kernel driver even though it could choose not to, because they suck.

> The kernel could unlock all mutexes on process death (and even if a process leaked mutex locks without crashing), the same way file handles are freed even if you use the kill command on a process....

No. If a mutex is locked then there is no guarantee that the data protected by it is in a consistent state. You can't just "unlock all mutexes", then you just get data corruption which is worse than the partial deadlocks. Mutexes are low-level constructs. The whole point/job of the kernel is to keep track of resources in a safe manner so this can be done for userspace handles like file descriptors. The buck stops somewhere and within the kernel it is impossible to do this because at the end of the day there has to be some code in charge of atomicity/consistency for resource state and that code itself cannot be freely interruptible.

> In userspace it resembles how nodejs domains were used to intercept errors to prevent an ungraceful process crash.

... and this works because JavaScript is a high-level, memory-safe language. You can't do this with C.

20 Jul 2024 at 15:12 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan Thanks for the detailed explanations. I think I understand some things better now.

20 Jul 2024 at 15:15 | Open on mastodon.online

Martin Uecker

@marcan @koteisaev My preferred solution is to use a subset of C and compile to eBPF which is then verified at run-time.

20 Jul 2024 at 9:35 | Open on mastodon.social