@marcan Still, this is NOT enough to prevent failure in situations where "user input" is involved, even if it is a "content update" for a security driver.
The real problem there was not that a kernel panic happened, but that no recovery strategy that works without manual intervention was implemented.

20 Jul 2024 at 14:26 | Open on mastodon.online

8 comments

Hector Martin replied to Kote

@koteisaev Yes it is. Again, you can statically forbid panic, and (safe) Rust already forbids unsafe memory accesses. Therefore, it is impossible to (memory error) BSOD regardless of how you handle user input in the general case. The language forces you to handle the bad input gracefully somehow (typically by returning and propagating an error).

About the worst you can do is infinite loop (but no language can protect against that because it equates to solving the halting problem).
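The point about returning and propagating an error can be sketched in safe Rust. This is only an illustration under stated assumptions: the update format (first byte is an index into the bytes that follow) and the names `UpdateError`, `lookup_entry` and `apply_update` are hypothetical and do not come from any real driver.

```rust
/// Hypothetical error type for a malformed content update.
#[derive(Debug, PartialEq)]
enum UpdateError {
    Truncated,
    BadIndex(usize),
}

/// Hypothetical format: byte 0 is an index into the bytes that follow.
/// Returns the entry the update points at, never reading out of bounds.
fn lookup_entry(update: &[u8]) -> Result<u8, UpdateError> {
    // An empty buffer has no index byte at all.
    let (&index, table) = update.split_first().ok_or(UpdateError::Truncated)?;
    let index = index as usize;
    // `get` returns None instead of reading past the end of the slice.
    table.get(index).copied().ok_or(UpdateError::BadIndex(index))
}

/// The caller propagates the error with `?` instead of crashing.
fn apply_update(update: &[u8]) -> Result<(), UpdateError> {
    let _entry = lookup_entry(update)?;
    Ok(())
}
```

With this sketch, `apply_update(&[9])` returns `Err(UpdateError::BadIndex(9))`: the bad input becomes an ordinary error value that the caller has to deal with.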

20 Jul 2024 at 14:34 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan "(typically by returning and propagating an error)." Propagating to the exit of the process with an error code. Here we are again at the need for some standard, at the OS level, for dealing with faulty drivers in general, such as "reboot with the driver replaced by error-reporting code that will send an error dump somewhere" or "isolate drivers in some container-like environment that would NOT cause a complete boot BSOD, except for special cases like a filesystem driver" (but then such an EDR is impossible per se)...

20 Jul 2024 at 14:44 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev Huh? No OS crashes when a driver returns an error, be it from the init function or a callback. It doesn't propagate to "exit of process", it propagates to the driver management layer and then the operation fails, be it an access or a driver init. If it's a user process invoking the driver, that operation returns an error code to the user process (if the user process chooses to handle that by crashing, that's its problem then).

On Windows when a driver fails to init that's a little exclamation mark on the device in Device Manager or similar, or a service error code, or whatever. On Linux the driver just doesn't bind to the device.
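A toy model of that flow, written as ordinary userspace Rust rather than real kernel code. The `Driver` trait, the two driver types and `load_all` are hypothetical names made up for this sketch; it only shows that a failed init is recorded by the management layer while everything else carries on.

```rust
/// Hypothetical driver interface: `init` reports failure by returning an error.
trait Driver {
    fn name(&self) -> &str;
    fn init(&mut self) -> Result<(), String>;
}

struct GoodDriver;
struct BrokenDriver;

impl Driver for GoodDriver {
    fn name(&self) -> &str { "good" }
    fn init(&mut self) -> Result<(), String> { Ok(()) }
}

impl Driver for BrokenDriver {
    fn name(&self) -> &str { "broken" }
    fn init(&mut self) -> Result<(), String> {
        Err("malformed configuration".to_string())
    }
}

/// Hypothetical management layer: a failed init is recorded and that
/// driver is left unbound; the remaining drivers still load.
/// Returns (names of bound drivers, (name, reason) for each failure).
fn load_all(drivers: Vec<Box<dyn Driver>>) -> (Vec<String>, Vec<(String, String)>) {
    let mut bound = Vec::new();
    let mut failed = Vec::new();
    for mut driver in drivers {
        match driver.init() {
            Ok(()) => bound.push(driver.name().to_string()),
            Err(reason) => failed.push((driver.name().to_string(), reason)),
        }
    }
    (bound, failed)
}
```

Loading a `BrokenDriver` followed by a `GoodDriver` leaves "good" bound and "broken" in the failure list, roughly the equivalent of the exclamation mark in Device Manager.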

20 Jul 2024 at 14:52 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan If it were always so, then nobody would ever have seen THAT BSOD, as it was caused by a crashing kernel driver, which is very privileged software. So it means that, for Windows at least, it did not end up in some standard protocol with an "exclamation mark on a software device".

20 Jul 2024 at 14:55 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev THAT BSOD was caused by a driver crashing, not a driver returning an error code, which is a very different thing because a crash is uncontrolled and cannot be safely handled, while an error code return is a safe and controlled condition.

Linux actually tries to prevent a full system panic, and only terminates the current process if the context is a user process. If you're lucky that means the machine keeps working as normal, if the crash didn't corrupt memory. More often than not, even in that case, the faulty driver had some mutexes locked and your system will slowly deadlock into oblivion as other processes try to lock the same mutex. There is no reasonable way around this. This is why uncontrolled crashes are bad and error returns are not.
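The "slowly deadlock" effect can be imitated in userspace. In this rough sketch a thread takes a lock and then leaks the guard with `std::mem::forget`, standing in for a driver that died before reaching its unlock. Rust's `std::sync::Mutex` is used as a stand-in for a kernel mutex, which is an assumption of the sketch, and the function names are hypothetical.

```rust
use std::mem;
use std::sync::{Arc, Mutex, TryLockError};
use std::thread;

/// Simulates a driver that dies while holding a lock: the guard is
/// leaked, so the mutex is never released.
fn crash_while_holding(lock: Arc<Mutex<u32>>) {
    thread::spawn(move || {
        let guard = lock.lock().unwrap();
        mem::forget(guard); // the "crash": the unlock never happens
    })
    .join()
    .unwrap();
}

/// Returns true if a caller trying to take the lock right now would block.
fn would_block(lock: &Mutex<u32>) -> bool {
    matches!(lock.try_lock(), Err(TryLockError::WouldBlock))
}
```

After `crash_while_holding` returns, `would_block` reports `true` for every later caller. A caller using a plain `lock()` instead of `try_lock()` would wait forever, which is how unrelated processes end up stuck one by one.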

20 Jul 2024 at 14:59 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan Sounds like an argument against a big kernel and in favor of more isolated drivers, and against "hyper-privileged" software in general...
The kernel could unlock all mutexes on process death (and even if the process leaked mutex locks without crashing), the same way file handles are freed even if you use the kill command on a process...
In userspace it resembles how Node.js domains were used to intercept errors to prevent an ungraceful process crash.

20 Jul 2024 at 15:05 | Open on mastodon.online

Hector Martin replied to Kote

@koteisaev

> Sounds like an argument against a big kernel and in favor of more isolated drivers, and against "hyper-privileged" software in general...

Which is what macOS did, and why this can't happen on the macOS version of CrowdStrike (it uses userspace drivers).

Linux has similar mechanisms, but can't discourage kernel drivers by policy like macOS did since it's not as tightly controlled, so CrowdStrike on Linux still uses a kernel driver even though it could choose not to, because they suck.

> The kernel could unlock all mutexes on process death (and even if the process leaked mutex locks without crashing), the same way file handles are freed even if you use the kill command on a process...

No. If a mutex is locked then there is no guarantee that the data protected by it is in a consistent state. You can't just "unlock all mutexes", then you just get data corruption which is worse than the partial deadlocks. Mutexes are low-level constructs. The whole point/job of the kernel is to keep track of resources in a safe manner so this can be done for userspace handles like file descriptors. The buck stops somewhere and within the kernel it is impossible to do this because at the end of the day there has to be some code in charge of atomicity/consistency for resource state and that code itself cannot be freely interruptible.
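A small userspace illustration of why forcibly unlocking is unsafe, assuming a made-up `Pool` structure whose invariant is `used + free == 10`. Rust's `std::sync::Mutex` marks a lock as poisoned when its holder panics, for this same reason; the sketch overrides that with `into_inner()` to show what "just unlock it" would expose. All names here are hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical resource table with the invariant `used + free == 10`.
struct Pool {
    used: u32,
    free: u32,
}

/// Starts moving one unit from `free` to `used`, then panics halfway,
/// while the lock is still held.
fn crash_mid_update(pool: Arc<Mutex<Pool>>) {
    let _ = thread::spawn(move || {
        let mut p = pool.lock().unwrap();
        p.free -= 1;
        // The matching `p.used += 1` is never reached.
        panic!("simulated driver crash");
    })
    .join();
}

/// Takes the lock even though its previous holder died, and reports
/// whether the invariant still holds.
fn invariant_after_forced_unlock(pool: &Mutex<Pool>) -> bool {
    let p = match pool.lock() {
        Ok(guard) => guard,
        Err(poisoned) => poisoned.into_inner(), // "just unlock it anyway"
    };
    p.used + p.free == 10
}
```

Starting from `Pool { used: 3, free: 7 }`, the invariant holds before the simulated crash and is broken afterwards (3 + 6), even though the lock itself can be taken again. Whoever takes the lock next is working with corrupt state.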

> In userspace it resembles how Node.js domains were used to intercept errors to prevent an ungraceful process crash.

... and this works because JavaScript is a high-level, memory-safe language. You can't do this with C.

20 Jul 2024 at 15:12 | Open on social.treehouse.systems

Kote Isaev replied to Hector

@marcan Thanks for the detailed explanations. Now I think I understand some things better.

20 Jul 2024 at 15:15 | Open on mastodon.online