Lennart Poettering

So, if you ask me what my takeaway from the Crowdstrike issue is, I'd say: boot counting/boot assessment/automatic fallback should really be a MUST for today's systems. *Before* you invoke your first kernel you need to have tracking of boot attempts and logic for falling back to older versions automatically. It's a major shortcoming that this is not the default behaviour of today's distros, in particular commercial ones.

Of course systemd has supported this for a long time:

systemd.io/AUTOMATIC_BOOT_ASSE

40 comments
Lennart Poettering

And it's a shame that commercial distros do not hook into that. Their boot stack hasn't changed in more than a decade and is laughably bad at security (unsigned initrds, ffs!) and robustness, and if they have boot assessment enabled at all, they turn it into a fantastic DoS (by showing you a boot menu instead of reverting to a working boot choice).

David Haller

@pid_eins Is there any distro that has implemented automatic boot assessment, as you suggested?

makefu

@raito @david There were a number of Pull Requests for this feature (one implementation was even merged to master for 20 minutes) but none is currently available, no? I'd really love to use this feature, just today one of my boxes would have been saved by that 👍

Raito Bezarius

@makefu @david for UKI it was already available; for normal NixOS use cases, yes, it was merged recently, but I have been using it for a while

karlggest

@david
Any immutable one. All of them work this way by design.
@pid_eins

Lennart Poettering

@karlggestd @david not really true. The ones that use systemd-boot might, but boot counting in grub is pretty useless and manual if you ask me.

karlggest

@pid_eins @david wow, my mistake. I assumed that Aeon was the most delayed project (they are with the latest RC).

christian mock

@pid_eins @karlggestd @david You can do it, but it is involved. But basically, using BootNext entries in EFI and only setting the fixed boot order after the system is booted is what I implemented for an immutable digital signage project.
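The BootNext approach described here can be sketched roughly as follows. This is only an illustration, not the implementation from that project: it assumes the `efibootmgr` command-line tool is used to set the UEFI variables, and the entry numbers `0003` and `0001` are made up. BootNext is a one-shot variable that the firmware consumes on the next boot, so if the candidate never comes up far enough to commit itself, the unchanged BootOrder takes over again.

```python
import re
from typing import List

# UEFI boot entries are identified by four hex digits, e.g. "0003".
_ENTRY_ID = re.compile(r"^[0-9A-Fa-f]{4}$")


def _check(entry_id: str) -> str:
    if not _ENTRY_ID.match(entry_id):
        raise ValueError(f"not a UEFI boot entry number: {entry_id!r}")
    return entry_id


def trial_boot_command(candidate: str) -> List[str]:
    """Command that boots `candidate` exactly once on the next reboot.

    BootOrder is left alone, so a failed trial falls back to the old order.
    """
    return ["efibootmgr", "--bootnext", _check(candidate)]


def commit_command(candidate: str, fallback: str) -> List[str]:
    """Command to run from the successfully booted candidate.

    Only now is the fixed boot order changed to prefer the candidate.
    """
    return ["efibootmgr", "--bootorder", f"{_check(candidate)},{_check(fallback)}"]


# Hypothetical entry numbers, for illustration only.
print(" ".join(trial_boot_command("0003")))
print(" ".join(commit_command("0003", "0001")))
```

The functions only build the command lines; actually running them needs root and an EFI system, for example via `subprocess.run(cmd, check=True)`.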

Bou

@karlggestd @david @pid_eins Fedora Silverblue doesn't, according to my experience.

Eric Curtin

@david @pid_eins Yes, there are distros and commercial distros that hook into that: RHEL for Edge and Red Hat In-Vehicle Operating System automatically roll back after a number of failed boots. Any rpm-ostree/ostree/bootc based OS is capable of it.

poleguy

@pid_eins I'm not disagreeing. It makes me wonder how you would categorize/assess/mitigate the security and operations risk of having a system that's supposed to be on one kernel fall back to a previous one?

Lennart Poettering

@poleguy the way automatic boot assessment with systemd works is that on each boot we make one of three assessments: "good", "bad", "dontknow". If we make the "bad" assessment we'll count down the entry's counter (and if it is zero we give up on it in the future). If we make the "good" assessment we'll drop the counter entirely from the entry, marking it as good for basically all eternity. If we do "dontknow" we don't do a thing.
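As a rough model of the three-way policy just described (not systemd's actual code; the `BootEntry` type, the function names and the entry names are invented for illustration), the counter handling could look like this:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BootEntry:
    name: str
    # Remaining attempts; None means the counter was dropped (known good).
    tries_left: Optional[int]

    @property
    def usable(self) -> bool:
        return self.tries_left is None or self.tries_left > 0


def assess(entry: BootEntry, verdict: str) -> BootEntry:
    """Apply one boot's verdict ("good", "bad" or "dontknow") to an entry."""
    if verdict == "good":
        # Drop the counter: the entry counts as good from now on.
        return BootEntry(entry.name, None)
    if verdict == "bad":
        if entry.tries_left is None:
            return entry  # already marked good, never regress
        return BootEntry(entry.name, max(entry.tries_left - 1, 0))
    if verdict == "dontknow":
        return entry  # leave everything untouched
    raise ValueError(f"unknown verdict: {verdict!r}")


def pick_entry(entries: List[BootEntry]) -> Optional[BootEntry]:
    """Pick the first usable entry; `entries` is ordered newest first."""
    for entry in entries:
        if entry.usable:
            return entry
    return None


# A new image with three attempts, and an older image already marked good.
new = BootEntry("os-v2", tries_left=3)
old = BootEntry("os-v1", tries_left=None)
for _ in range(3):
    new = assess(new, "bad")
print(pick_entry([new, old]).name)  # os-v1
```

After three "bad" boots the new entry's counter reaches zero, so the selection falls back to the older entry without anyone having to pick it from a menu.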

Lennart Poettering

@poleguy this means that a bad actor can play games with us until the point we managed to do one boot that worked correctly, but from that point on, we'll never regress anymore.

I like to believe that that's quite a sensible and simple policy that should work for most cases. It balances robustness against chance for attackers to hold off updates indefinitely.

poleguy

@pid_eins thanks. That does seem reasonable for remotely managed systems, and better than the alternative, which is manual intervention. I worry a smidge about added complexity. I can't shake the feeling that we keep adding layers of complexity to our systems. It feels okay to add complexity that is proportional to the complexity of the problem being solved. In this case it seems sane. However these remotely managed systems all tend to have out of band methods to recover already, no?

Sheogorath 🦊

@pid_eins but would this really prevent it, when the configuration of a kernel driver goes bad? If I understand things correctly here (big if), only if you store that config in a volume that can be reverted it would be possible to fix the issue.

Otherwise you boot into the emergency shell and you are none the wiser than Windows systems are right now.

And given it's an endpoint protection that is supposed to react pretty instantly to changes, I don't see how you would get these in the A/B update.

Lennart Poettering

@sheogorath on Linux, drivers don't really have a "configuration" per se. At least not much you pass into the early, risky parts of the boot process. Subsystems might have some config. In a systemd world you wrap them in authenticated/signed PE addons or confext images, and those you drop next to a specific kernel image, thus you can revert them together as one or update as one and so on. Or in other words: the way we parameterize kernels in modern ways also makes it easy to do assessment/fallback.

Justin Azoff

@pid_eins how exactly is a successful boot defined though?

Boots to init?
Boot and all services are started successfully? Some services?

What happens if the system boots successfully, runs for ~60 seconds, and then the kernel panics when the first cron job/timer runs?

furicle

@JustinAzoff @pid_eins see the link at the start of the thread, flexible strategies available

Lennart Poettering

@JustinAzoff depends on the usecase. Different systems/OSes want different stuff there. Some might just check if system manages to reach some point in the boot process, others might want to also require network pings to work, other stuff might instead just want to check that some services stay up for some minimum amount of time and so on. systemd gives you the basic infra for this and some super basic tests in this sense, but individual OS images might want to fill in more tests/conditions.
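To make the idea of pluggable success conditions concrete, here is a small sketch of how several checks could be folded into one verdict. It is not how systemd implements it; the `Check` type, `assess_boot` and `uptime_at_least` are hypothetical names, and the thresholds are arbitrary.

```python
from typing import Callable, Iterable, Optional

# A check returns True (passed), False (failed) or None (cannot tell yet).
Check = Callable[[], Optional[bool]]


def assess_boot(checks: Iterable[Check]) -> str:
    """Fold individual check results into "good", "bad" or "dontknow"."""
    results = [check() for check in checks]
    if any(result is False for result in results):
        return "bad"
    if results and all(result is True for result in results):
        return "good"
    return "dontknow"


def uptime_at_least(seconds: float, read_uptime: Callable[[], float]) -> Check:
    """Check that passes once the system has stayed up long enough.

    Until then it reports None rather than False, because a short uptime
    alone does not prove that the boot is broken.
    """
    def check() -> Optional[bool]:
        return True if read_uptime() >= seconds else None
    return check


# Example: the boot target was reached, but the system has been up for only
# 60 of the required 300 seconds, so no verdict is given yet.
reached_target = lambda: True
print(assess_boot([reached_target, uptime_at_least(300, lambda: 60.0)]))  # dontknow
```

A minimum-uptime condition like this is one way to cover the "panics after 60 seconds" case: the entry is only marked good once it has survived past the chosen window.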

John Gordon

@JustinAzoff I assume anything to make boot more complex also opens up new threats.

Matěj Cepl 🇪🇺 🇨🇿 🇺🇦

@pid_eins

Well, the lesson for me (aside from other obvious ones) is that for the industrial systems it should be absolutely mandatory to be something like #SUSE #SLEMicro (or its Red Hat equivalent): snapshot based, with R/O system, where the system would automatically boot from an older snapshot if the current one fails.

The fact that airline computers are not something like this, is just mind-blowing.

Yes, preaching the same gospel @sysrich preached for years.

youtu.be/idZEJ0OYfWU

James Henstridge

@pid_eins for a system like Crowdstrike, you'd want to extend that to cover data files the kernel loads. I wonder how well that'd work with the rate of updates they were pushing out?

Lennart Poettering

@jamesh i think everyone agrees you have to cover the kernel itself and the initrd with these assessment/fallback schemes. I personally would also then cover the rootfs you boot into with that, but people have different opinions how far the coverage should reach, and how much you "pin" through a boot attempt.

vurpo 🏳️‍⚧️

@pid_eins unfortunately this wasn't the kind of issue that would be solved by falling back to old versions. The bug in the kernel module was there for a long time or possibly from the beginning, and falling back to an older version would still just have crashed in the same way

Lennart Poettering

@vurpo nope, of course boot assessment would catch this. Key is just that you "pin" enough as part of an attempt, and thus can revert sufficient parts to get things working.

On Linux you'd pin kernel *and* initrd at the very least, and in the model i propose even the entire /usr for each attempt, to maximize coverage of the assessment logic.

bse

@pid_eins @vurpo I would assume you also have to pin /lib/modules, or better get rid of that relic completely and move modules inside the UKI?

Lennart Poettering

@bse @vurpo kernel modules are pinned by the kernel's version number, i.e. looked for in /usr/lib/modules/`uname -r`/.
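A tiny sketch of that lookup rule, for illustration only: the module directory is derived purely from the kernel release string (what `uname -r` prints). The release strings below are made up.

```python
import os
from pathlib import PurePosixPath
from typing import Optional


def modules_dir(release: Optional[str] = None) -> PurePosixPath:
    """Directory where modules for a given kernel release are looked up.

    With no argument, the release of the running kernel is used
    (the same string `uname -r` prints).
    """
    if release is None:
        release = os.uname().release
    return PurePosixPath("/usr/lib/modules") / release


# Hypothetical release strings: different releases get different directories,
# so reverting to an older kernel also reverts to its own module set.
print(modules_dir("6.9.7-example"))
print(modules_dir("6.8.1-example"))
```

The pinning is keyed on the release string alone, so anything that shares a release string also shares the same directory.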

bse

@pid_eins @vurpo Yes, but what happens if you install a faulty out-of-tree module that gets built for all existing kernel versions, for example via dkms, and put into /lib/modules/*/?

Lars Marowsky-Brée 😷

@bse @pid_eins @vurpo openSUSE with snapper can reboot into a full older snapshot of the system (except user data), which has saved my butt a few times.

Gabe

@bse @pid_eins @vurpo If you deliberately bypass the system integrity and safety features, then they won't save you. It doesn't matter what those features are.

If you blindly sign initrds, checking won't help you. If you blindly mark a boot config as good, or if you replace your rollback image, or whatever...

The system won't save you from yourself infallibly. You'll still want staging, and monitoring, and disaster recovery.

Lennart Poettering

@bse @vurpo dkms really should synthesize separate menu items for its rebuilds. If it doesn't, it's simply broken and should be fixed.

bse

@pid_eins @vurpo Since both entries would be using the same kernel and hence use the same /lib/modules/$(uname -r)/, you need a mechanism to have multiple versions of your modules folder. If you're serious about preventing older boot entries from breaking retroactively, i think full system snapshots are the only option. Short of that, there might be some compromises like bundling a kernel and all modules, which of course does not protect userspace, but might be easier for commonly used distros.

Bou

@pid_eins wait, distros could just enable it and they don't? How come?

Lennart Poettering

@bou they love grub too much and how things were done in 1999...

Lars Marowsky-Brée 😷

@pid_eins The shocking thing is that this was a requirement for Carrier Grade Linux two decades ago already.
When it comes to reliability and availability as part of dependable computing, our (distributed or not) systems have somewhat regressed as they were scaled up.

Anthk

@pid_eins

You mean, like keeping old grub/lilo entries and kernels since forever?
