2️⃣0️⃣ Here's the 20th post highlighting key new features of the upcoming v257 release of systemd. #systemd257
systemd-journald is systemd's log service that collects logs from system services, acquires some metadata and then stores all that in a mostly append-only linear database on disk, keyed by a multitude of metadata fields.
One such field is the "invocation ID". The invocation ID is a 128-bit randomized ID that every systemd unit gets freshly assigned whenever it enters its start phase.
Thus, if the same service gets started 5 times in a row, it will have 5 different invocation IDs assigned. And if you want to pinpoint a specific invocation of a service, it's the invocation ID you can use to uniquely identify it. (Besides its use as logging metadata, the invocation ID is also passed to the services via the env var $INVOCATION_ID btw; and systemctl status also shows it these days.)
One could say that what the Linux boot ID does for the system…
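Inside a service, the ID discussed above can be picked up straight from the environment. A minimal Python sketch (the helper name is mine, not a systemd API):

```python
import os
import uuid

def current_invocation_id(env=os.environ):
    """Return the invocation ID of the running unit, or None outside systemd.

    systemd passes the 128-bit ID as 32 hex digits in $INVOCATION_ID;
    journald stores the same value in the _SYSTEMD_INVOCATION_ID field,
    so it can be used to filter the logs of exactly one service run."""
    raw = env.get("INVOCATION_ID")
    return uuid.UUID(hex=raw) if raw is not None else None
```

For a quick look from the outside, something like `systemctl show -p InvocationID --value foo.service` should print the current ID (unit name hypothetical), which can then be matched against `_SYSTEMD_INVOCATION_ID=` in journalctl.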
8️⃣ Here's the 8th post highlighting key new features of the upcoming v257 release of systemd.
Some time ago systemd introduced JSON-based user records as an extension of the classic UNIX `struct passwd`. These records can be provided via Varlink IPC or via drop-in files. The much richer set of account settings is documented here:
These fields are consumed and acted on by systemd-logind, systemd-homed and other components of systemd.
One field is particularly interesting: it contains a field for the public SSH keys of an account (i.e. the stuff you traditionally find in ~/.ssh/authorized_keys). By making it part of the user record itself it can be accessed and used by SSH without access to the home directory of the user.
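A hedged sketch of what such a drop-in record can look like (field names as I recall them from the JSON User Records spec; user name and key material are purely illustrative, and the drop-in location may vary):

```json
{
    "userName": "lennart",
    "realName": "Example User",
    "sshAuthorizedKeys": [
        "ssh-ed25519 AAAAC3Example user@host"
    ]
}
```

Placed as e.g. /etc/userdb/lennart.user, sshd can then be configured (via AuthorizedKeysCommand) to query the user record instead of ~/.ssh/authorized_keys.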
7️⃣ Here's the 7th post highlighting key new features of the upcoming v257 release of systemd. #systemd257
The graphical login prompt you see when your computer boots up is a sensitive UI: typically, when starting to work, without much thinking you'll type in your username and password, expecting it to log you in and provide you with your desktop session. However, what if someone just opened a website in a browser in full screen mode with contents that just *looks* like your login screen, …
… but actually is just some malware that exfiltrates the password you type in?
Since this kind of attack scenario is not new, many OSes provide a "SAK" concept, which stands for "Secure Attention Key". The idea is that there's a special key combination you can hit first, which no web page, web browser, app, or even desktop environment could possibly hook into, and that always brings you back to your *real* login screen, regardless of where you are.
5️⃣ Here's the 5th installment of posts highlighting key new features of the upcoming v257 release of systemd. #systemd257
Since its beginnings systemd has been a heavy user of the D-Bus IPC system. It provides D-Bus APIs, it calls D-Bus APIs, it schedules activation of the D-Bus broker, and even provides its own C D-Bus client library (sd-bus).
However, since early on our use of D-Bus was not without various major problems. One of the biggest goes something like this:
D-Bus' model is built around a central broker daemon, which is started during boot, but unfortunately relatively late (i.e. together with other regular daemons, instead of in early boot or the initrd). However, systemd brings up the system as a whole and hence needs IPC from the earliest moment on.
And then there are various components of systemd that the D-Bus broker relies on (i.e. consumes functionality of) and hence cannot themselves provide their services on D-Bus, …
4️⃣ Here's the 4th post highlighting key new features of the upcoming v257 release of systemd. #systemd257
One of the key features of systemd we have talked about in the past years are UKIs, i.e. "unified kernel images": a combination of a Linux kernel, an initrd, and more in a single unified PE binary that can be signed as a whole for SecureBoot, measured as a whole, and updated as a whole.
From my PoV, UKIs are a central concept for securing the Linux boot process.
But: they do have some disadvantages. They are typically (not strictly, but typically) built on OS vendor build systems instead of locally. This is different from the status quo ante, where the initrd is typically built on the deployed system (at least on generic distros), and thus highly adapted to the local system.
UKIs being vendor-built hence means they are a lot more rigid, less flexible than the traditional way. So far this meant you'd have to settle…
@pid_eins UKIs are fine, but systemd-stub churn every release makes it more of a liability than a stable foundation to build on.
Stuff like sysexts etc. could poke holes in established security models, similar (but restricted in scope) to how systemd poked giant holes into full disk encryption when it added DDI automount.
It's a massive effort to keep up with changes and validate them and it's pretty concerning. So you might end up not doing that, and end up with vulnerabilities in your product.
In systemd we started to do more and more Varlink IPC (instead of or in addition to D-Bus), and you might wonder what that is all about. In this AllSystemsGo talk I try to explain things a bit, enjoy: https://media.ccc.de/v/all-systems-go-2024-276-varlink-now-
@pid_eins - Varlink reminds me of the IPC system we used at BlackBerry like 12 years ago. I guess the big difference was back then it was basically JSON with additional binary integer types. Not exactly like CBOR.
So, if you ask me what my takeaway from the Crowdstrike issue is, I'd say: boot counting/boot assessment/automatic fallback should really be a MUST for today's systems. *Before* you invoke your first kernel you need to have tracking of boot attempts and logic for falling back to older versions automatically. It's a major shortcoming that this is not default behaviour of today's distros, in particular commercial ones.
Of course systemd has supported this for a long time:
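The mechanism behind this is systemd's Automatic Boot Assessment: the boot loader tracks attempts by renaming boot entries with a `+tries_left-tries_done` counter before the suffix, e.g. `fedora-6.8+3-0.conf`. A sketch of decoding such names (the parser is mine; only the naming scheme comes from the spec):

```python
import re

# Matches the "+tries_left" or "+tries_left-tries_done" counter that the
# boot loader embeds in a boot-entry filename, right before ".conf".
_COUNTER = re.compile(r"\+(\d+)(?:-(\d+))?(?=\.conf$)")

def boot_tries(entry: str):
    """Return (tries_left, tries_done) for a counted boot entry, else None."""
    m = _COUNTER.search(entry)
    if m is None:
        return None  # entry does not participate in boot counting
    return int(m.group(1)), int(m.group(2) or 0)
```

On each failed boot attempt the loader decrements tries_left and increments tries_done; once tries_left hits zero the entry is sorted behind known-good ones, which is what yields the automatic fallback.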
@pid_eins The shocking thing is that this was a requirement for Carrier Grade Linux two decades ago already.
When it comes to reliability and availability as part of dependable computing, our (distributed or not) systems have somewhat regressed as they were scaled up.
As I understand it, the BSOD today became more popular than ever, truly becoming mainstream and reported all over the news. Of course, in systemd we are ahead of the curve, as usual, and if you too want to experience your very own BSOD we have your back. Enjoy:
1️⃣6️⃣ Here's the 16th installment of posts highlighting key new features of the upcoming v256 release of systemd.
(Sorry for dropping the ball on posting these for a while!)
The last feature I want to discuss in this series of postings is the new "capsule" concept of systemd v256.
systemd can be invoked in two contexts: as a system manager, i.e. PID 1, where it manages, well, the system. And as a user manager where it runs services for a specific user.
At any time on a single systemd machine you'll have 1 system manager and 0…n user managers running. This model is built around the traditional UNIX security model: UIDs are the primary security concept of the system, and thus we give each relevant UID a service scope of its own.
While the original purpose of the per-user service manager is to provide functionality for actual human users logging in interactively, people have been using the concept for other purposes as well:
1️⃣3️⃣ Here's the 13th installment of posts highlighting key new features of the upcoming v256 release of systemd.
ssh is widely established as *the* mechanism for controlling Linux systems remotely, both interactively and with automated tools. It not only provides means for secure authentication and communication for a tty/shell, but also does this for file transfers (sftp), and IPC communication (D-Bus or Varlink).
It relies on TCP as network transport, which is great for remote operation around the globe but really sucks for local communication with a VM and similar, as it usually requires delegation of an address space, dhcp lease, dns and so on, which, while manageable, are certainly a major source of mistakes, fragility and headaches. In particular it means that logging into a system to debug networking doesn't really work, since without working networking you can't even log in. Sad!
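One way this gets addressed (sketched from memory of the v256 ssh integration, so paths and names may differ per distro): systemd ships an ssh client drop-in that routes special host names through systemd-ssh-proxy, so the connection goes over AF_VSOCK or AF_UNIX instead of TCP, roughly:

```
Host unix/* vsock/*
        ProxyCommand /usr/lib/systemd/systemd-ssh-proxy %h %p
```

With that in place, something like `ssh vsock/4` connects to the VM with CID 4 without any IP networking being involved.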
7️⃣ Here's the 7th installment of my series of posts highlighting key new features of the upcoming v256 release of systemd.
In systemd we put a lot of focus on operating with disk images, specifically file system images that carry an expressive GPT partition table – something that we call DDIs ("Discoverable Disk Images").
DDIs are supposed to carry dm-verity authentication information, i.e. every single access to them is typically cryptographically protected, and linked back to a set of signing keys maintained by the system (ideally in the kernel keyring). systemd uses DDIs for the system itself, for systemd-nspawn containers, for systemd portable services, for systemd-sysext system extensions, for systemd-confext configuration extensions and more.
5️⃣ Here's the 5th installment of my series of posts highlighting key new features of the upcoming v256 release of systemd.
I am pretty sure all of you are well aware of the venerable "sudo" tool, which has been a key component of most Linux distributions for a long time. On the surface it's a tool that allows an unprivileged user to acquire privileges temporarily, from within their existing login session, for just one command, or maybe for a subshell.
@pid_eins suid programs executing in the environment of the parent process means that I might become root in a user namespace and get the filesystem view of the current mount ns. This won't work with your approach, will it?
@pid_eins This is very nice - especially having it provide a clean context, that saves a ton of headaches and sanitization I need to do when using "sudo" in other programs! (not like that's an amazing thing anyway, but occasionally it's useful and justified)
Only having run0 coloring / changing the output of the called command is not something I always like, but maybe that can be optionally disabled...
It's used all over the place in userspace. In systemd we use it:
1. to detect if a block device has partition scanning off or on
2. in our udev test suite, to validate devices are in order
3. in udev rules, for some feature checks (in older versions of systemd)
Does anyone know where the kernel's github/gitlab project is? Would love to file an issue or placeholder revert PR, but somehow I cannot find it! Anyone?
(Yes, this is a joke, I am fully aware of the concept of mailing lists – as a historical concept from the 2005 era... Yes, I am too lazy to figure out how to report this properly. Hence social media it is.)
Credit where credit is due! I'd really like to take a minute and thank Jia Tan for helping us to finally get sd_notify() support merged into OpenSSH upstream!
PSA: In context of the xzpocalypse we now added an example reimplementation of sd_notify() to our man page:
It's pretty comprehensive (i.e. uses it for reload notification too), but still relatively short.
In the past, I have been telling anyone who wanted to listen that if all you want is sd_notify() then don't bother linking to libsystemd, since the protocol is stable and should be considered the API, not our C wrapper around it. After all, the protocol is so trivial
that one can explain it in one sentence: send an AF_UNIX datagram containing READY=1 to a socket whose path you find in the $NOTIFY_SOCKET env var.
But apparently turning that sentence (which appears in similar fashion in the man page) into code is not trivial, hence this new example code.
Hence, copy away, the thing is MIT licensed. And the protocol has been stable for a decade, and I am pretty sure it's going to remain stable for another decade at least.
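To illustrate just how trivial the protocol is, here is that one sentence as Python rather than C (my own sketch, not the man page example):

```python
import os
import socket

def sd_notify(state: str = "READY=1") -> bool:
    """Send a notification datagram to the service manager.

    Implements the protocol described above: one AF_UNIX datagram,
    sent to the socket named in $NOTIFY_SOCKET. Returns False when
    not running under a supervising service manager."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    # A leading '@' denotes a Linux abstract-namespace socket; the
    # socket API expects a leading NUL byte instead.
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True
```

The same function also covers the other states, e.g. `sd_notify("RELOADING=1")` or `sd_notify("STATUS=processing requests")`.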
You might think the answer to this is 7, i.e. regular files, directories, symlinks, block device nodes, char device nodes, fifos, and sockets. But you would actually be wrong: there's an 8th one. There's the concept of an anonymous inode on Linux, which has a file type of zero. You can easily acquire fds to inodes of this type via eventfd(). If you call fstat() on such fds, then (.st_mode & S_IFMT) == 0 will hold. 🤯
And I am pretty sure there's a lot of software you might be able to break given that they do not expect this case on the most basic of fs concepts.
Also note that these anonymous inodes are not actually as anonymous as one might think: because open fds appear in /proc/self/fd/ as magic symlinks, you can easily get an fs path which, when you call stat() on it, will return you a zero inode type.
Here's another little feature we scheduled for the next systemd release. Everyone knows SSH well, and it's great to connect to hosts remotely, and even do file transfer. It's probably *the* single most relevant way to talk to some host for administration and various other tasks. It's a bit fragile though: it requires networking, and that even if we talk to a local VM or full OS container. But precisely networking is one of the things you might want to administer via SSH, hence you have a cyclic…
…and risky dependency. But for the VM and full OS container case there's no real need to use SSH via the network: these things run on the local system, hence why bother with IP? To address that we are adding a small generator (that means: a plugin for systemd that generates units on the fly, based on system state, configuration) which binds SSH to a local AF_VSOCK socket in a VM, and to an AF_UNIX socket in a container. You can then use these to directly connect to the system without involving…
I recently implemented a fun little feature for systemd: inspired by MacOS' "target disk mode", a tiny tool called systemd-storagetm, that exposes all local block devices as NVMe-TCP devices, as they pop up. The idea is that if available in your initrd you can just boot into that (instead of into your full OS), and can access your disks via NVMe-TCP (in case you wonder what that is: it's the new hot shit for exposing block devices over the network, kinda like iSCSI, NBD, …, but cool).
@pid_eins super cool. NVMe-oF ftw! I'm actually looking for a way of trimming down some of the dependencies in systemd-udevd and udevadm to make it even smaller for a super small initrd. I only want systemd-udevd to initialize local storage devices for my use case. The Fedora systemd-udevd is dynamically linked to a large systemd .so. Is there a way of building it against smaller systemd libs like libudev etc.?