Kote Isaev

@marcan I agree that memory-safe languages are necessary. And many others here would agree on this.
But many coders write C and C++ as if these languages were memory-safe. Like, "Hey, Bob, why do you check this parameter against the array bounds here? I already checked it in the function that calls this code! Your check slows the code down by 0.3%!".
But the problem that caused this outage was NOT a memory leak or an out-of-bounds read/write. It was a malformed "content update". Broken input data.

15 comments
Martin Uecker

@koteisaev @marcan My point is that memory safety does not help here. Because a panic on an out-of-bounds access is memory safe and would still have the exact same effect. Hector's argument seems to be that other things you can optionally do in Rust would potentially allow you to avoid this, but this is not the result of using a memory-safe language per se. I agree about the sad state of software engineering and I also agree about the advantages of memory safety in general.

Hector Martin

@uecker @koteisaev My point is that you can do those things in Rust and you can't in C.

The actual crash here was a NULL deref. That is one of the most classic footguns of memory-unsafe languages (not just those, also others like Java for some reason). In Rust there are no NULLs, only explicit Option<T>s, which force you to consider the case of there being no value. Yes, you can still just turn it into "panic if no value", but making it an explicit decision that the programmer has to make means it's a lot less likely to happen by accident and a lot more likely to be correctly handled with error propagation, and it also means you can outright ban that choice by policy and technical means.
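
To make the Option<T> point concrete, here is a minimal sketch. The `Rule` type and both function names are hypothetical, made up for illustration, and not taken from any real driver. The lookup returns `Option<&Rule>`, so the caller cannot use the result without deciding what happens when nothing is found; here the missing case becomes an error value.

```rust
/// Hypothetical rule table entry; the names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct Rule {
    pub id: u32,
    pub pattern: &'static str,
}

/// Look up a rule by id. "No such rule" is part of the return type,
/// so callers cannot forget that the value may be absent.
pub fn find_rule(rules: &[Rule], id: u32) -> Option<&Rule> {
    rules.iter().find(|r| r.id == id)
}

/// The caller must choose what "no rule" means. Here it becomes an error
/// that can be propagated instead of a crash.
pub fn pattern_for(rules: &[Rule], id: u32) -> Result<&'static str, String> {
    match find_rule(rules, id) {
        Some(rule) => Ok(rule.pattern),
        None => Err(format!("no rule with id {id}")),
    }
}
```

Writing `find_rule(rules, id).unwrap()` would still compile and would panic on a missing rule, but that choice is visible in the source and can be flagged in review or by a linter.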

Martin Uecker replied to Hector

@marcan @koteisaev I can almost agree with this, but my conclusion from this is not "let's dump C because it is fundamentally impossible to write good software in C and move to Rust, which fixes everything", but that there are some good ideas in Rust that help write better software, and there is also continuously improving tooling for C that one can use, so we can also gradually improve this.

Hector Martin

@koteisaev Broken input data that caused a NULL pointer dereference. Which is a memory safety problem. NULL pointer dereferences are impossible in safe Rust code.

(Also memory leaks are *not* a memory safety problem and are possible in safe Rust; this is a common misconception about what "memory safety" means.)
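
A small sketch of the leak point, using only the standard library. The function name is made up for illustration. `Box::leak` is a safe function: the allocation is never freed, yet no code can ever read or write it in an invalid way, so memory safety still holds.

```rust
/// Leaks a heap allocation on purpose, using only safe code.
pub fn leak_buffer(len: usize) -> &'static mut [u8] {
    let buf: Vec<u8> = vec![0; len];
    // The memory behind `buf` is never released, but the returned reference
    // stays valid for the rest of the program, so there is no unsafety here.
    Box::leak(buf.into_boxed_slice())
}
```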

Kote Isaev

@marcan Null deref?
Like, some `switch` construct got optimized into a kind of `foo[bar].doCrap(ctx, data)` construct? But a `foo[bar]` kind of construct without a default path can happen in Rust anyway if that table is constructed dynamically (e.g. by an earlier part of the arriving data).
This can be a common situation for a "rules engine" or other high-level infrastructure for branching code execution.
Can Rust enforce "always have a meaningful result" for a `foo[bar]` construct?

Hector Martin replied to Kote

@koteisaev There's no mention of an indexed array. Rust guarantees that nothing can be NULL at compile time. If you need to have optional values then you have to use Option<T> and the compiler forces you to choose how to handle the lack of value.

For switch statements and such, Rust requires them to be exhaustive or have a default case if that is required for correctness (e.g. because a value is returned).
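
As a rough illustration of the exhaustiveness rule, assuming a made-up `Kind` enum for a rules engine (nothing here comes from a real product):

```rust
/// Hypothetical message kinds; illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Allow,
    Block,
    Log,
}

/// Decode an untrusted tag byte into a Kind.
pub fn parse_kind(tag: u8) -> Option<Kind> {
    match tag {
        0 => Some(Kind::Allow),
        1 => Some(Kind::Block),
        2 => Some(Kind::Log),
        // A u8 has 256 possible values, so the compiler rejects this
        // match unless the remaining ones are covered by a catch-all arm.
        _ => None,
    }
}

/// Map a Kind to a label.
pub fn describe(kind: Kind) -> &'static str {
    // No catch-all arm here: adding a new variant to Kind makes this
    // function fail to compile until the new case is handled.
    match kind {
        Kind::Allow => "allow",
        Kind::Block => "block",
        Kind::Log => "log",
    }
}
```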

Array/slice indexing with [] *can* panic (which is still a BSOD but at least guaranteed not an exploit) but it is possible to ban that in the compiler/linter and enforce the use of the .get() method which returns an Option<T>, and such policy would be a good idea for critical kernel code. You can set up a Rust build such that it is *impossible* for any operation to panic, e.g. by making the panic symbol undefined so the project fails to link if it is referenced. This even bans things like unchecked integer division by a non-constant (since div by zero is a panic). All of the panicking operations would have non-panicking versions that you use instead.
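
A sketch of what the non-panicking style can look like. The function and its parameters are invented for illustration; the standard-library calls (`get`, `checked_add`, `checked_div`) are real. Every operation that could panic is replaced with a version that returns an Option, and `?` passes the failure up to the caller. A lint such as Clippy's `indexing_slicing` is one way to flag the `[]` form, though how a given project enforces this is an assumption here.

```rust
use std::convert::TryFrom;

/// Average of a window into `samples`, written without any panicking operation.
pub fn window_average(samples: &[u32], start: usize, len: usize) -> Option<u32> {
    // `start + len` could overflow; `checked_add` returns None instead.
    let end = start.checked_add(len)?;
    // `samples[start..end]` would panic when out of range; `get` returns None.
    let window = samples.get(start..end)?;
    let mut sum: u32 = 0;
    for &s in window {
        sum = sum.checked_add(s)?;
    }
    let count = u32::try_from(window.len()).ok()?;
    // `sum / count` would panic for an empty window; `checked_div` returns None.
    sum.checked_div(count)
}
```

For example, `window_average(&[2, 4, 6, 8], 1, 2)` yields `Some(5)`, while a window that runs past the end of the slice yields `None` rather than a panic.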

Kote Isaev replied to Hector

@marcan Still, this is NOT enough to prevent failure in a situation where "user input" is involved, even if it is a "content update" for a security driver.
The real problem there was not that the kernel panic happened but rather that a recovery strategy that does not require manual intervention was not implemented.

Hector Martin replied to Kote

@koteisaev Yes it is. Again, you can statically forbid panic, and (safe) Rust already forbids unsafe memory accesses. Therefore, it is impossible to (memory error) BSOD regardless of how you handle user input in the general case. The language forces you to handle the bad input gracefully somehow (typically by returning and propagating an error).

About the worst you can do is infinite loop (but no language can protect against that because it equates to solving the halting problem).
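
To sketch what "handle the bad input gracefully" can mean in practice, here is a parser for an invented update format. The layout (two magic bytes `CU`, a version byte, a length byte, then the payload) and all names are assumptions made for this example and have nothing to do with the real file format. Malformed input of any shape ends up as an `Err` value, never as an invalid memory access.

```rust
#[derive(Debug, PartialEq)]
pub enum ParseError {
    TooShort,
    BadMagic,
    LengthMismatch,
}

#[derive(Debug, PartialEq)]
pub struct Update<'a> {
    pub version: u8,
    pub payload: &'a [u8],
}

/// Parse an untrusted buffer using the hypothetical layout described above.
pub fn parse_update<'a>(data: &'a [u8]) -> Result<Update<'a>, ParseError> {
    let (version, len, rest) = match data {
        [b'C', b'U', version, len, rest @ ..] => (*version, usize::from(*len), rest),
        // At least four bytes, but the magic does not match.
        [_, _, _, _, ..] => return Err(ParseError::BadMagic),
        _ => return Err(ParseError::TooShort),
    };
    if rest.len() != len {
        return Err(ParseError::LengthMismatch);
    }
    Ok(Update { version, payload: rest })
}

/// A caller propagates the error with `?` instead of crashing.
pub fn payload_len(data: &[u8]) -> Result<usize, ParseError> {
    let update = parse_update(data)?;
    Ok(update.payload.len())
}
```

A buffer full of zeros, for instance, comes back as `Err(ParseError::BadMagic)`, and the caller decides what to do with that.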

Kote Isaev replied to Hector

@marcan "(typically by returning and propagating an error)." Propagating to the exit of the process with an error code. Here we are again at the need for some standard for how to deal with faulty drivers in general, such as "reboot and replace the driver with error-reporting code that sends an error dump somewhere" or "isolate drivers in some container-like environment that would NOT cause a complete boot BSOD, except in special cases like a filesystem driver" (but then such an EDR is impossible per se), at the OS level...

Hector Martin replied to Kote

@koteisaev Huh? No OS crashes when a driver returns an error, be it from the init function or a callback. It doesn't propagate to "exit of process", it propagates to the driver management layer and then the operation fails, be it an access or a driver init. If it's a user process invoking the driver, that operation returns an error code to the user process (if the user process chooses to handle that by crashing, that's its problem then).

On Windows when a driver fails to init that's a little exclamation mark on the device in Device Manager or similar, or a service error code, or whatever. On Linux the driver just doesn't bind to the device.
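
A toy model of that flow, not any real kernel API: the `Driver` trait, `bind` function, and `Sensor` type are all hypothetical. The point is only that an error returned from `init` is an ordinary value that the management layer records, and the rest of the system carries on.

```rust
#[derive(Debug, PartialEq)]
pub enum InitError {
    BadConfig,
}

/// Hypothetical driver interface; real kernels define their own.
pub trait Driver {
    fn init(&mut self, config: &[u8]) -> Result<(), InitError>;
}

#[derive(Debug, PartialEq)]
pub enum DeviceState {
    Bound,
    Failed(InitError),
}

/// The "driver management layer": a failed init just marks the device as failed.
pub fn bind(driver: &mut dyn Driver, config: &[u8]) -> DeviceState {
    match driver.init(config) {
        Ok(()) => DeviceState::Bound,
        Err(e) => DeviceState::Failed(e),
    }
}

pub struct Sensor {
    pub threshold: u8,
}

impl Driver for Sensor {
    fn init(&mut self, config: &[u8]) -> Result<(), InitError> {
        // An empty config is reported as an error instead of being dereferenced.
        let first = config.first().ok_or(InitError::BadConfig)?;
        self.threshold = *first;
        Ok(())
    }
}
```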

Kote Isaev replied to Hector

@marcan If it were always so, then nobody would ever have seen THAT BSOD, as it was caused by a crashing kernel driver, a very privileged piece of software. So it means that, for Windows at least, it did not end up with some standard protocol with an "exclamation mark on a software device".

Hector Martin replied to Kote

@koteisaev THAT BSOD was caused by a driver crashing, not a driver returning an error code, which is a very different thing because a crash is uncontrolled and cannot be safely handled, while an error code return is a safe and controlled condition.

Linux actually tries to prevent a full system panic, and only terminates the current process if the context is a user process. If you're lucky that means the machine keeps working as normal, if the crash didn't corrupt memory. More often than not, even in that case, the faulty driver had some mutexes locked and your system will slowly deadlock into oblivion as other processes try to lock the same mutex. There is no reasonable way around this. This is why uncontrolled crashes are bad and error returns are not.

Kote Isaev replied to Hector

@marcan Sounds like an argument against a big kernel and in favor of more isolated drivers, and against "hyper-privileged" software in general...
The kernel could unlock all mutexes on process death (and even if a process leaked mutex locks without crashing), the same way file handles are freed even if you use the kill command on a process....
In userspace it resembles how nodejs domains were used to intercept errors to prevent an ungraceful process crash.

Hector Martin replied to Kote

@koteisaev

> Sounds like an argument against a big kernel and in favor of more isolated drivers, and against "hyper-privileged" software in general...

Which is what macOS did, and why this can't happen on the macOS version of CrowdStrike (it uses userspace drivers).

Linux has similar mechanisms, but can't discourage kernel drivers by policy like macOS did since it's not as tightly controlled, so CrowdStrike on Linux still uses a kernel driver even though it could choose not to, because they suck.

> The kernel could unlock all mutexes on process death (and even if a process leaked mutex locks without crashing), the same way file handles are freed even if you use the kill command on a process....

No. If a mutex is locked then there is no guarantee that the data protected by it is in a consistent state. You can't just "unlock all mutexes", then you just get data corruption which is worse than the partial deadlocks. Mutexes are low-level constructs. The whole point/job of the kernel is to keep track of resources in a safe manner so this can be done for userspace handles like file descriptors. The buck stops somewhere and within the kernel it is impossible to do this because at the end of the day there has to be some code in charge of atomicity/consistency for resource state and that code itself cannot be freely interruptible.
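
The consistency problem can be seen in miniature with Rust's userspace `std::sync::Mutex`. This is only an analogy under that assumption, not kernel code, and the two-account "transfer" is invented for the example. A thread dies halfway through a two-step update; the standard library then marks the mutex as poisoned precisely because the protected data may be half-updated.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Returns (was the mutex poisoned?, the state left behind by the crash).
pub fn crash_mid_update() -> (bool, (i32, i32)) {
    // Two balances that should always sum to 100.
    let accounts = Arc::new(Mutex::new((100i32, 0i32)));
    let worker = Arc::clone(&accounts);

    let result = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        guard.0 -= 50; // first half of the transfer
        panic!("simulated crash mid-update");
        // The second half (adding 50 to guard.1) never runs.
    })
    .join();
    assert!(result.is_err());

    // The lock was released during unwinding, but the data is inconsistent.
    let outcome = match accounts.lock() {
        Ok(guard) => (false, *guard),
        Err(poisoned) => (true, *poisoned.into_inner()),
    };
    outcome
}
```

Here the mutex does get unlocked, yet the balances no longer sum to 100. Simply releasing the lock did not repair the invariant, which is the situation the paragraph above describes.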

> In userspace it resembles how nodejs domains were used to intercept errors to prevent an ungraceful process crash.

... and this works because Javascript is a high-level, memory-safe language. You can't do this with C.

Kote Isaev replied to Hector

@marcan Thanks for the detailed explanations. Now I think I understand some things better.
