opencontainers/runc - runc - Linuxmonk: Open Source Repository Mirror

mirror of https://github.com/opencontainers/runc.git synced 2026-02-05 18:45:28 +01:00

Author	SHA1	Message	Date
Curd Becker	536e183451	Replace os.Is* error checking functions with their errors.Is counterpart Signed-off-by: Curd Becker <me@curd-becker.de>	2025-12-11 03:16:02 +01:00
Kir Kolyshkin	88f897160c	libct: startInitialization: add defer close This function calls Init, which normally never returns, so the defer only works if there is an error and we can safely use it to close those fds we opened. This was done for most but not all fds. Reported in issue 5008. Reported-by: Arina Cherednik <arinacherednik034@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-12-02 15:15:23 -08:00
Aleksa Sarai	435cc81be6	init: use securejoin for /proc/self/setgroups Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-11-01 21:24:05 +11:00
Aleksa Sarai	531ef794e4	console: use TIOCGPTPEER when allocating peer PTY When opening the peer end of a pty, the old kernel API required us to open /dev/pts/$num inside the container (at least since we fixed console handling many years ago in commit `244c9fc426` ("*: console rewrite")). The problem is that in a hostile container it is possible for /dev/pts/$num to be an attacker-controlled symlink that runc can be tricked into resolving when doing bind-mounts. This allows the attacker to (among other things) persist /proc/... entries that are later masked by runc, allowing an attacker to escape through the kernel.core_pattern sysctl (/proc/sys/kernel/core_pattern). This is the original issue reported by Lei Wang and Li Fu Bang in CVE-2025-52565. However, it should be noted that this is not entirely a newly-discovered problem. Way back in Linux 4.13 (2017), I added the TIOCGPTPEER ioctl, which allows us to get a pty peer without touching the /dev/pts inside the container. The original threat model was around an attacker replacing /dev/pts/$n or /dev/pts/ptmx with some malicious inode (a DoS inode, or possibly a PTY they wanted a confused deputy to operate on). Unfortunately, there was no practical way for runc to cache a safe O_PATH handle to /dev/pts/ptmx (unlike other runtimes like LXC, which switched to TIOCGPTPEER way back in 2017). Since it wasn't clear how we could protect against the main attack TIOCGPTPEER was meant to protect against, we never switched to it (even though I implemented it specifically to harden container runtimes). Unfortunately, it turns out that mount sources are a threat we didn't fully consider. Since TIOCGPTPEER already solves this problem entirely for us in a race free way, we should just use that. In a later patch, we will add some hardening for /dev/pts/$num opening to maintain support for very old kernels (Linux 4.13 is very old at this point, but RHEL 7 is still kicking and is stuck on Linux 3.10).
Fixes: GHSA-qw9x-cqr3-wc7r CVE-2025-52565 Reported-by: Lei Wang <ssst0n3@gmail.com> (CVE-2025-52565) Reported-by: lfbzhm <lifubang@acmcoder.com> (CVE-2025-52565) Reported-by: Aleksa Sarai <cyphar@cyphar.com> (TIOCGPTPEER) Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-11-01 21:24:03 +11:00
Kir Kolyshkin	8476df83b5	libct: add/use isDevNull, verifyDevNull The /dev/null in a container should not be trusted, because when /dev is a bind mount, /dev/null is not created by runc itself. 1. Add isDevNull which checks the fd minor/major and device type, and verifyDevNull which does the stat and the check. 2. Rewrite maskPath to open and check /dev/null, and use its fd to perform mounts. Move the loop over the MaskPaths into the function, and rename it to maskPaths. 3. reOpenDevNull: use verifyDevNull and isDevNull. 4. fixStdioPermissions: use isDevNull instead of stat. Fixes: GHSA-9493-h29p-rfm2 CVE-2025-31133 Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-11-01 21:24:02 +11:00
Antti Kervinen	eda7bdf80c	Add memory policy support Implement support for Linux memory policy in OCI spec PR: https://github.com/opencontainers/runtime-spec/pull/1282 Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>	2025-10-07 15:06:37 +03:00
lifubang	bf38646497	libct: we should set envs after we are in the jail of the container Because we have to set a default HOME env for the current container user, so we should set it after we are in the jail of the container, or else we'll use host's `/etc/passwd` to get a wrong HOME value. Please see: #4688. Signed-off-by: lifubang <lifubang@acmcoder.com>	2025-04-01 15:22:29 +00:00
Kir Kolyshkin	431b8bb4d8	int/linux: add/use Getwd Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-26 14:16:53 -07:00
Rodrigo Campos	9c5e687b6f	libct: Use chown(uid, -1) to not change the gid There is no behavior change, it is just more readable to use -1 to mean don't touch this. Please note that if the GID is not mapped in the userns, by using -1 for that no error is returned. We just avoid dealing with it completely, as we want here. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2025-03-14 16:52:20 +01:00
Kir Kolyshkin	10ca66bff5	runc exec: implement CPU affinity As per - https://github.com/opencontainers/runtime-spec/pull/1253 - https://github.com/opencontainers/runtime-spec/pull/1261 CPU affinity can be set in two ways: 1. When creating/starting a container, in config.json's Process.ExecCPUAffinity, which is then applied to all execs. 2. When running an exec, in process.json's CPUAffinity, which is applied to a given exec and overrides the value from (1). Add some basic tests. Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a process to that of a container's cgroup, as soon as it is moved to that cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that. Because of the above, - it's impossible to really test initial CPU affinity without adding debug logging to libcontainer/nsenter; - for older kernels, there can be a brief moment when exec's affinity is different than either initial or final affinity being set; - exec's final CPU affinity, if not specified, can be different depending on the kernel, therefore we don't test it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-02 19:17:41 -08:00
Kir Kolyshkin	a75076b4a4	Switch to opencontainers/cgroups This removes libcontainer/cgroups packages and starts using those from github.com/opencontainers/cgroups repo. Mostly generated by: git rm -f libcontainer/cgroups find . -type f -name "*.go" -exec sed -i \ 's\|github.com/opencontainers/runc/libcontainer/cgroups\|github.com/opencontainers/cgroups\|g' \ {} + go get github.com/opencontainers/cgroups@v0.0.1 make vendor gofumpt -w . Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-28 15:20:33 -08:00
Kir Kolyshkin	99f9ed94dc	runc exec: fix setting process.Scheduler Commit `770728e1` added Scheduler field into both Config and Process, but forgot to add a mechanism to actually use Process.Scheduler. As a result, runc exec does not set Process.Scheduler ever. Fix it, and a test case (which fails before the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	b9114d91e2	runc exec: fix setting process.ioPriority Commit `bfbd0305b` added IOPriority field into both Config and Process, but forgot to add a mechanism to actually use Process.IOPriority. As a result, runc exec does not set Process.IOPriority ever. Fix it, and a test case (which fails before the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	73849e797f	libct: simplify Caps inheritance For all other properties that are available in both Config and Process, the merging is performed by newInitConfig. Let's do the same for Capabilities for the sake of code uniformity. Also, thanks to the previous commit, we no longer have to make sure we do not call capabilities.New(nil). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	f26ec92221	libct: rm Rootless* properties from initConfig They are passed in initConfig twice, so it does not make sense. NB: the alternative to that would be to remove Config field from initConfig, but it results in a much bigger patch and more maintenance down the road. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	2a86c35768	libct: document initConfig and friends This is one of the dark corners of runc / libcontainer, so let's shed some light on it. initConfig is a structure which is filled in [mostly] by newInitConfig, and one of its hidden aspects is it contains a process config which is the result of merge between the container and the process configs. Let's document how all this happens, where the fields are coming from, which one has a preference, and how it all works. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	52f702af56	libct: earlier Rootless vs AdditionalGroups check Since the UID/GID/AdditionalGroups fields are now numeric, we can address the following TODO item in the code (added by commit `d2f49696` back in 2016): > TODO: We currently can't do > this check earlier, but if libcontainer.Process.User was typesafe > this might work. Move the check to much earlier phase, when we're preparing to start a process in a container. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-06 17:49:17 -08:00
Kir Kolyshkin	7dc2486889	libct: switch to numeric UID/GID/groups This addresses the following TODO in the code (added back in 2015 by commit `845fc65e5`): > // TODO: fix libcontainer's API to better support uid/gid in a typesafe way. Historically, libcontainer internally uses strings for user, group, and additional (aka supplementary) groups. Yet, runc receives those credentials as part of runtime-spec's process, which uses integers for all of them (see [1], [2]). What happens next is: 1. runc start/run/exec converts those credentials to strings (a User string containing "UID:GID", and a []string for additional GIDs) and passes those onto runc init. 2. runc init converts them back to int, in the most complicated way possible (parsing container's /etc/passwd and /etc/group). All this conversion and, especially, parsing is totally unnecessary, but is performed on every container exec (and start). The only benefit of all this is, a libcontainer user could use user and group names instead of numeric IDs (but runc itself is not using this feature, and we don't know if there are any other users of this). Let's remove this back and forth translation, hopefully increasing runc exec performance. The only remaining need to parse /etc/passwd is to set HOME environment variable for a specified UID, in case $HOME is not explicitly set in process.Env. This can now be done right in prepareEnv, which simplifies the code flow a lot. Alas, we can not use standard os/user.LookupId, as it could cache host's /etc/passwd or the current user (even with the osusergo tag). PS Note that the structures being changed (initConfig and Process) are never saved to disk as JSON by runc, so there is no compatibility issue for runc users. Still, this is a breaking change in libcontainer, but we never promised that libcontainer API will be stable (and there's a special package that can handle it -- github.com/moby/sys/user). Reflect this in CHANGELOG. For 3998. 
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user [2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-06 17:49:17 -08:00
Kir Kolyshkin	06f1e07655	libct: speedup process.Env handling The current implementation sets all the environment variables passed in Process.Env in the current process, one by one, then uses os.Environ to read those back. As pointed out in [1], this is slow, as runc calls os.Setenv for every variable, and there may be a few thousands of those. Looking into how os.Setenv is implemented, it is indeed slow, especially when cgo is enabled. Looking into why it was implemented the way it is, I found commit `9744d72c` and traced it to [2], which discusses the actual reasons. It boils down to these two: - HOME is not passed into container as it is set in setupUser by os.Setenv and has no effect on config.Env; - there is a need to deduplicate the environment variables. Yet it was decided in [2] to not go ahead with this patch, but later [3] was opened with the carry of this patch, and merged. Now, from what I see: 1. Passing environment to exec is way faster than using os.Setenv and os.Environ (tests show ~20x speed improvement in a simple Go test, and ~3x improvement in real-world test, see below). 2. Setting environment variables in the runc context may result in some ugly side effects (think GODEBUG, LD_PRELOAD, or _LIBCONTAINER_*). 3. Nothing in runtime spec says that the environment needs to be deduplicated, or the order of preference (whether the first or the last value of a variable with the same name is to be used). We should stick to what we have in order to maintain backward compatibility. So, this patch: - switches to passing env directly to exec; - adds deduplication mechanism to retain backward compatibility; - takes care to set PATH from process.Env in the current process (so that supplied PATH is used to find the binary to execute), also to retain backward compatibility; - adds HOME to process.Env if not set; - ensures any StartContainer CommandHook entries with no environment set explicitly are run with the same environment as before.
Thanks to @lifubang who noticed that peculiarity. The benchmark added by the previous commit shows ~3x improvement: │ before │ after │ │ sec/op │ sec/op vs base │ ExecInBigEnv-20 61.53m ± 1% 21.87m ± 16% -64.46% (p=0.000 n=10) [1]: https://github.com/opencontainers/runc/pull/1983 [2]: https://github.com/docker-archive/libcontainer/pull/418 [3]: https://github.com/docker-archive/libcontainer/pull/432 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-01-09 18:22:53 +08:00
Kir Kolyshkin	7334ee01e6	libct/configs: rm IOPrioClassMapping This is an internal implementation detail and should not be either public or visible. Amend setIOPriority to do own class conversion. Fixes: `bfbd0305` ("Add I/O priority") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:17:44 -08:00
Kir Kolyshkin	5d3942eec3	libct: unify IOPriority setting For some reason, io priority is set in different places between runc start/run and runc exec: - for runc start/run, it is done in the middle of (linuxStandardInit).Init, close to the place where we exec runc init. - for runc exec, it is done much earlier, in (setnsProcess) start(). Let's move setIOPriority call for runc exec to (linuxSetnsInit).Init, so it is in the same logical place as for runc start/run. Also, move the function itself to init_linux.go as it's part of init. Should not have any visible effect, except part of runc init is run with a different I/O priority. While at it, rename setIOPriority to setupIOPriority, and make it accept the whole configs.Config, for uniformity with other similar functions. Fixes: `bfbd0305` ("Add I/O priority") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:15:31 -08:00
Kir Kolyshkin	2dc3ea4b87	libct: simplify setIOPriority/setupScheduler calls Move the nil check inside, simplifying the callers. Fixes: `bfbd0305` ("Add I/O priority") Fixes: `770728e1` ("Support `process.scheduler`") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:06:20 -08:00
Kir Kolyshkin	a56f85f87b	libct/*: switch from configs to cgroups Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-11 19:08:40 -08:00
lifubang	871057d863	drop runc-dmz solution according to overlay solution Because we have the overlay solution, we can drop runc-dmz binary solution since it has too many limitations. Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-10-28 15:18:07 +00:00
Kir Kolyshkin	f2d56241d8	Merge pull request #4405 from amghazanfari/main replace strings.SplitN with strings.Cut	2024-10-04 14:01:23 -07:00
Amir M. Ghazanfari	faffe1b9ee	replace strings.SplitN with strings.Cut Signed-off-by: Amir M. Ghazanfari <a.m.ghazanfari76@gmail.com>	2024-09-28 10:02:21 +03:30
lifubang	10c951e335	add ErrCgroupNotExist For some rootless containers, runc has no access to the cgroup, but the container is still running, so we should return the `ErrNotRunning` and `ErrCgroupNotExist` errors separately. Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-09-23 23:27:35 +00:00
Akihiro Suda	429e06a518	libct: Signal: honor RootlessCgroups `signalAllProcesses()` depends on the cgroup and is expected to fail when runc is running in rootless without an access to the cgroup. When `RootlessCgroups` is set to `true`, runc just ignores the error from `signalAllProcesses` and may leak some processes running. (See the comments in PR 4395) In the future, runc should walk the process tree to avoid such a leak. Note that `RootlessCgroups` is a misnomer; it is set to `false` despite the name when cgroup v2 delegation is configured. This is expected to be renamed in a separate commit. Fix issue 4394 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2024-09-11 03:54:52 +09:00
Akihiro Suda	e7848482e2	Revert "libcontainer: seccomp: pass around *os.File for notifyfd" This reverts commit `20b95f23ca`. > Conflicts: > libcontainer/init_linux.go Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2024-07-03 17:28:12 +09:00
Kir Kolyshkin	584afc6756	libct/system: ClearRlimitNofileCache for go 1.23 Go 1.23 tightens access to internal symbols, and even puts runc into "hall of shame" for using an internal symbol (recently added by commit `da68c8e3`). So, while not impossible, it becomes harder to access those internal symbols, and it is a bad idea in general. Since Go 1.23 includes https://go.dev/cl/588076, we can clean the internal rlimit cache by setting the RLIMIT_NOFILE for ourselves, essentially disabling the rlimit cache. Once Go 1.22 is no longer supported, we will remove the go:linkname hack. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-01 13:02:29 -07:00
ls-ggg	da68c8e37b	libct: clean cached rlimit nofile in go runtime As reported in issue #4195, the new version(since 1.19) of go runtime will cache rlimit-nofile. Before executing execve, the rlimit-nofile of the process will be restored with the cache. In runc, this will cause the rlimit-nofile set by the parent process for the container to become invalid. It can be solved by clearing the cache. Signed-off-by: ls-ggg <335814617@qq.com> (cherry picked from commit `f9f8abf310`) Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-05-08 10:40:13 +00:00
Kir Kolyshkin	0eb8bb5f66	Format sources with gofumpt v0.6 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-04-25 08:39:26 -07:00
Aleksa Sarai	02120488a4	Merge pull request from GHSA-xr7r-f8xq-vfvv fix GHSA-xr7r-f8xq-vfvv and harden fd leaks	2024-02-01 07:04:29 +11:00
Kir Kolyshkin	8454bbb613	Merge pull request #4175 from cyphar/fd-file-switch init: use *os.File for passed file descriptors	2024-01-31 10:40:28 -08:00
lifubang	35aa63ea87	never send procError after the socket closed Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-01-25 04:52:11 +00:00
Aleksa Sarai	f2f16213e1	init: close internal fds before execve If we leak a file descriptor referencing the host filesystem, an attacker could use a /proc/self/fd magic-link as the source for execve to execute a host binary in the container. This would allow the binary itself (or a process inside the container in the 'runc exec' case) to write to a host binary, leading to a container escape. The simple solution is to make sure we close all file descriptors immediately before the execve(2) step. Doing this earlier can lead to very serious issues in Go (as file descriptors can be reused, any (*os.File) reference could start silently operating on a different file) so we have to do it as late as possible. Unfortunately, there are some Go runtime file descriptors that we must not close (otherwise the Go scheduler panics randomly). The only way of being sure which file descriptors cannot be closed is to sneakily go:linkname the runtime internal "internal/poll.IsPollDescriptor" function. This is almost certainly not recommended but there isn't any other way to be absolutely sure, while also closing any other possible files. In addition, we can keep the logrus forwarding logfd open because you cannot execve a pipe and the contents of the pipe are so restricted (JSON-encoded in a format we pick) that it seems unlikely you could even construct shellcode. Closing the logfd causes issues if there is an error returned from execve. In mainline runc, runc-dmz protects us against this attack because the intermediate execve(2) closes all of the O_CLOEXEC internal runc file descriptors and thus runc-dmz cannot access them to attack the host. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-24 00:20:58 +11:00
Aleksa Sarai	8e1cd2f56d	init: verify after chdir that cwd is inside the container If a file descriptor of a directory in the host's mount namespace is leaked to runc init, a malicious config.json could use /proc/self/fd/... as a working directory to allow for host filesystem access after the container runs. This can also be exploited by a container process if it knows that an administrator will use "runc exec --cwd" and the target --cwd (the attacker can change that cwd to be a symlink pointing to /proc/self/fd/... and wait for the process to exec and then snoop on /proc/$pid/cwd to get access to the host). The former issue can lead to a critical vulnerability in Docker and Kubernetes, while the latter is a container breakout. We can (ab)use the fact that getcwd(2) on Linux detects this exact case, and getcwd(3) and Go's Getwd() return an error as a result. Thus, if we just do os.Getwd() after chdir we can easily detect this case and error out. In runc 1.1, a /sys/fs/cgroup handle happens to be leaked to "runc init", making this exploitable. On runc main it just so happens that the leaked /sys/fs/cgroup gets clobbered and thus this is only consistently exploitable for runc 1.1. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Co-developed-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com> [refactored the implementation and added more comments] Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-24 00:20:58 +11:00
Aleksa Sarai	7094efb192	init: use *os.File for passed file descriptors While it doesn't make much of a practical difference, it seems far more reasonable to use os.NewFile to wrap all of our passed file descriptors to make sure they're tracked by the Go runtime and that we don't double-close them. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-22 17:34:14 +11:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	ba0b5e2698	libcontainer: remove all mount logic from nsexec With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complications (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process.
The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: `fda12ab101` ("Support idmap mounts on volumes") Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:40 +11:00
Aleksa Sarai	9387eac3a5	init: don't pre-flight-check the set[ug]id arguments While we do cache the mappings when using userns paths, there's no need to do this in this particular case, since we are in the namespace and set[ug]id() give unambiguous EINVAL error codes if the id is unmapped. This appears to also be the only code which does Host[UG]ID calculations from inside "runc init". Ref: `1a5fdc1c5f` ("init: support setting -u with rootless containers") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-05 17:46:08 +11:00
Kir Kolyshkin	dcf1b731f5	runc kill: fix sending KILL to non-pidns container Commit `f8ad20f` made it impossible to kill leftover processes in a stopped container that does not have its own PID namespace. In other words, if a container init is gone, it is no longer possible to use `runc kill` to kill the leftover processes. Fix this by moving the check if container init exists to after the special case of handling the container without own PID namespace. While at it, fix the minor issue introduced by commit `9583b3d`: if signalAllProcesses is used, there is no need to thaw the container (as freeze/thaw is either done in signalAllProcesses already, or not needed at all). Also, make signalAllProcesses return an error early if the container cgroup does not exist (as it relies on it to do its job). This way, the error message returned is more generic and easier to understand ("container not running" instead of "can't open file"). Finally, add a test case. Fixes: `f8ad20f` Fixes: `9583b3d` Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-11-27 09:15:39 -08:00
lfbzhm	95a93c132c	Merge pull request #4045 from fuweid/support-pidfd-socket [feature request] *: introduce pidfd-socket flag	2023-11-22 09:13:55 +08:00
Wei Fu	94505a046a	*: introduce pidfd-socket flag The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process. It's unsafe to call kill(2) since the pid can be recycled. It's good to provide the pidfd of init process through the pidfd-socket. It's similar to the console-socket. With the pidfd, the container manager like containerd-shim can send the signal to target process safely. And for the standard init process, we can have polling support to get exit event instead of blocking on wait4. Signed-off-by: Wei Fu <fuweid89@gmail.com>	2023-11-21 18:28:50 +08:00
Zheao.Li	98511bb40e	linux: Support setting execution domain via linux personality carry #3126 Co-authored-by: Aditya R <arajan@redhat.com> Signed-off-by: Zheao.Li <me@manjusaka.me>	2023-10-27 19:33:37 +08:00
Akihiro Suda	0274ca2580	Merge pull request #4025 from lifubang/feat-sched-carry-3962 [Carry 3962] Support `process.scheduler`	2023-10-12 08:07:50 +09:00
Bjorn Neergaard	6f7266c3f7	libcontainer: drop system.Setxid Since Go 1.16, [Go issue 1435][1] is solved, and the stdlib syscall implementations work on Linux. While they are a bit more flexible/heavier-weight than the implementations that were copied to libcontainer/system (working across all threads), we compile with Cgo, and using the libc wrappers should be just as suitable. [1]: https://github.com/golang/go/issues/1435 Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>	2023-10-11 13:04:34 -06:00
utam0k	770728e16e	Support `process.scheduler` Spec: https://github.com/opencontainers/runtime-spec/pull/1188 Fix: https://github.com/opencontainers/runc/issues/3895 Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: utam0k <k0ma@utam0k.jp> Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-10-04 15:53:18 +08:00
Kir Kolyshkin	6538e6d0bd	libct: fix a typo syncrhonisation ==> synchronisation Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-10-03 17:51:54 -07:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter synchronisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00

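Note on commit faffe1b9ee (replace strings.SplitN with strings.Cut): a small sketch of what such a replacement typically looks like when splitting a KEY=VALUE string. The helper names splitEnvOld and splitEnvNew are hypothetical and do not come from the runc tree; the actual call sites touched by the commit are not shown in this log.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEnvOld is a hypothetical "before" version using strings.SplitN.
// It needs a length check to tell "KEY" apart from "KEY=".
func splitEnvOld(kv string) (key, value string, ok bool) {
	parts := strings.SplitN(kv, "=", 2)
	if len(parts) != 2 {
		return kv, "", false
	}
	return parts[0], parts[1], true
}

// splitEnvNew is the equivalent using strings.Cut (Go 1.18+), which
// returns the same three values directly and allocates no slice.
func splitEnvNew(kv string) (key, value string, ok bool) {
	return strings.Cut(kv, "=")
}

func main() {
	for _, kv := range []string{"PATH=/bin:/usr/bin", "A=b=c", "HOME"} {
		k1, v1, ok1 := splitEnvOld(kv)
		k2, v2, ok2 := splitEnvNew(kv)
		fmt.Printf("%q: old=(%q,%q,%v) new=(%q,%q,%v)\n", kv, k1, v1, ok1, k2, v2, ok2)
	}
}
```

Only the first separator is consumed in both versions, so "A=b=c" yields the key "A" and the value "b=c".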

169 Commits