1
0
mirror of https://github.com/opencontainers/runc.git synced 2026-02-05 18:45:28 +01:00

125 Commits

Author SHA1 Message Date
Kir Kolyshkin
652269729d libc/int: use strings.Builder
Generated by modernize@latest (v0.21.0).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-16 15:04:04 -08:00
Curd Becker
536e183451 Replace os.Is* error checking functions with their errors.Is counterpart
Signed-off-by: Curd Becker <me@curd-becker.de>
2025-12-11 03:16:02 +01:00
Akihiro Suda
64c3c8eea6 Merge pull request #4994 from kolyshkin/gofumpt-extra
Enable gofumpt extra rules
2025-11-28 09:30:57 +09:00
Kir Kolyshkin
5fbc3bb019 libct/int: TestFdLeaks: deflake
Since the recent CVE fixes, TestFdLeaksSystemd sometimes fails:

	=== RUN   TestFdLeaksSystemd
	    exec_test.go:1750: extra fd 9 -> /12224/task/13831/fd
	    exec_test.go:1753: found 1 extra fds after container.Run
	--- FAIL: TestFdLeaksSystemd (0.10s)

It might have been caused by the change to the test code in commit
ff6fe13 ("utils: use safe procfs for /proc/self/fd loop code") -- we are
now opening a file descriptor during the logic to get a list of file
descriptors. If the file descriptor happens to be allocated to a
different number, you'll get an error.

Let's try to filter out the fd used to read a directory.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-20 15:36:14 +08:00
Aleksa Sarai
3b75374cc7 runtime-spec: update pids.limit handling to match new guidance
The main update is actually in github.com/opencontainers/cgroups, but we
need to also update runtime-spec to a newer pre-release version to get
the updates from there as well.

In short, the behaviour change is now that "0" is treated as a valid
value to set in "pids.max", "-1" means "max" and unset/nil means "do
nothing". As described in the opencontainers/cgroups PR, this change is
actually backwards compatible because our internal state.json stores
PidsLimit, and that entry is marked as "omitempty". So, an old runc
would omit PidsLimit=0 in state.json, and this will be parsed by a new
runc as being "nil" -- and both would treat this case as "do not set
anything".

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-11 15:15:27 +11:00
Kir Kolyshkin
67840cce4b Enable gofumpt extra rules
Commit b2f8a74d "clothed" the naked return as inflicted by gofumpt
v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category,
not enabled by default. The only other "extra" rule is to group adjacent
parameters with the same type, which also makes sense.

Enable gofumpt "extra" rules, and reformat the code accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-10 13:18:45 -08:00
Aleksa Sarai
ff6fe13246 utils: use safe procfs for /proc/self/fd loop code
From a safety perspective this might not be strictly required, but it
paves the way for us to remove utils.ProcThreadSelf.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Kir Kolyshkin
89e59902c4 Modernize code for Go 1.24
Brought to you by

	modernize -fix -test ./...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-08-27 19:11:02 -07:00
Kir Kolyshkin
a638f1330b .golangci.yml: add nolintlint, fix found issues
The errrolint linter can finally ignore errors from Close,
and it also ignores direct comparisons of errors from x/sys/unix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-24 11:59:54 -07:00
Kir Kolyshkin
65e0f2b719 libct/int: use destroyContainer
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-24 10:02:47 -07:00
Kir Kolyshkin
1aebfa3eab libct/int: don't use _ = runContainerOk
There is no need to explicitly ignore returned value.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-24 10:02:47 -07:00
Kir Kolyshkin
5ac77ed6d9 libct/int: add/use needUserNS helper
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-13 08:40:27 -07:00
Kir Kolyshkin
a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
Kir Kolyshkin
7dc2486889 libct: switch to numeric UID/GID/groups
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):

> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.

Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.

Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).

What happens next is:

1. runc start/run/exec converts those credentials to strings (a User
   string containing "UID:GID", and a []string for additional GIDs) and
   passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
   possible (parsing container's /etc/passwd and /etc/group).

All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).

The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).

Let's remove this back and forth translation, hopefully increasing
runc exec performance.

The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).

PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.

Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.

For 3998.

[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 17:49:17 -08:00
Kir Kolyshkin
6171da6005 libct/configs: add HookList.SetDefaultEnv
1. Make CommandHook.Command a pointer, which reduces the amount of data
   being copied when using hooks, and allows to modify command hooks.

2. Add SetDefaultEnv, which is to be used by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
Kir Kolyshkin
a56f85f87b libct/*: switch from configs to cgroups
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-11 19:08:40 -08:00
Sebastiaan van Stijn
30b530ca94 libct/userns: split userns detection from internal userns code
Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).

Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.

This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-06-30 20:06:30 +02:00
lifubang
4ea0bf88fd update/add some tests for rlimit
issues:
https://github.com/opencontainers/runc/issues/4195
https://github.com/opencontainers/runc/pull/4265#discussion_r1588599809

Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-08 10:57:10 +00:00
Aleksa Sarai
8e8b136c49 tree-wide: use /proc/thread-self for thread-local state
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai
09822c3da8 configs: disallow ambiguous userns and timens configurations
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).

In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.

It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:09 +11:00
Francis Laniel
c47f58c4e9 Capitalize [UG]idMappings as [UG]IDMappings
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
2023-07-21 13:55:34 +02:00
Kir Kolyshkin
f8ad20f500 runc kill: drop -a option
As of previous commit, this is implied in a particular scenario. In
fact, this is the one and only scenario that justifies the use of -a.

Drop the option from the documentation. For backward compatibility, do
recognize it, and retain the feature of ignoring the "container is
stopped" error when set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:30:40 -07:00
Kir Kolyshkin
9583b3d1c2 libct: move killing logic to container.Signal
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.

Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing container's init
will result in a bunch of other processes left running (and thus the
inability to remove the cgroup).

The solution to the above problem is killing all the container
processes, not just init.

The problem with the current implementation is, the killing logic is
implemented in libcontainer's initProcess.wait, and thus only available
to libcontainer users, but not the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
 - func destroy(c *Container) calls signalAllProcesses;
 - runc kill implements -a flag.

This code became very tangled over time. Let's simplify things by moving
the killing all processes from initProcess.wait to container.Signal,
and documents the new behavior.

In essence, this also makes `runc kill` to automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:29:25 -07:00
Kir Kolyshkin
2a7dcbbb40 libct: fix shared pidns detection
When someone is using libcontainer to start and kill containers from a
long lived process (i.e. the same process creates and removes the
container), initProcess.wait method is used, which has a kludge to work
around killing containers that do not have their own PID namespace.

The code that checks for own PID namespace is not entirely correct.
To be exact, it does not set sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above mentioned kludge
does not work.

Fix the issue, add a test case (which fails without the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:23:29 -07:00
Kir Kolyshkin
5b8f8712a4 libct: signalAllProcesses: remove child reaping
There are two very distinct usage scenarios for signalAllProcesses:

* when used from the runc binary ("runc kill" command), the processes
  that it kills are not the children of "runc kill", and so calling
  wait(2) on each process is totally useless, as it will return ECHLD;

* when used from a program that have created the container (such as
  libcontainer/integration test suite), that program can and should call
  wait(2), not the signalling code.

So, the child reaping code is totally useless in the first case, and
should be implemented by the program using libcontainer in the second
case. I was not able to track down how this code was added, my best
guess is it happened when this code was part of dockerd, which did not
have a proper child reaper implemented at that time.

Remove it, and add a proper documentation piece.

Change the integration test accordingly.

PS the first attempt to disable the child reaping code in
signalAllProcesses was made in commit bb912eb00c, which used a
questionable heuristic to figure out whether wait(2) should be called.
This heuristic worked for a particular use case, but is not correct in
general.

While at it:
 - simplify signalAllProcesses to use unix.Kill;
 - document (container).Signal.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:23:29 -07:00
Kir Kolyshkin
f2e71b085d libct/int: make TestFdLeaks more robust
The purpose of this test is to check that there are no extra file
descriptors left open after repeated calls to runContainer. In fact,
the first call to runContainer leaves a few file descriptors opened,
and this is by design.

Previously, this test relied on two things:
1. some other tests were run before it (and thus all such opened-once
   file descriptors are already opened);
2.  explicitly excluding fd opened to /sys/fs/cgroup.

Now, if we run this test separately, it will fail (because of 1 above).
The same may happen if the tests are run in a random order.

To fix this, add a container run before collection the initial fd list,
so those fds that are opened once are included and won't be reported.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-02-22 02:58:47 -08:00
Kir Kolyshkin
be7e03940f libct/int: wording nits
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-02-22 02:58:47 -08:00
Kir Kolyshkin
7c75e84e22 libc/int: add/use runContainerOk wrapper
This is to de-duplicate the code that checks that err is nil
and that the exit code is zero.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-02-22 02:58:47 -08:00
Kir Kolyshkin
98fe566c52 runc: do not set inheritable capabilities
Do not set inheritable capabilities in runc spec, runc exec --cap,
and in libcontainer integration tests.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-05-12 08:14:50 +10:00
Kir Kolyshkin
0fec1c2d8c libct: Mount: rm {Pre,Post}mountCmds
Those were added by commit 59c5c3ac0 back in Apr 2015, but AFAICS were
never used and are obsoleted by more generic container hooks (initially
added by commit 05567f2c94 in Sep 2015).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 15:51:55 -08:00
Kir Kolyshkin
953e56c56f libct/int: runContainer: drop console arg
It is not and was never ever used.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 20:10:22 -08:00
Kir Kolyshkin
972aea3af0 libct/configs/validate: allow / in sysctl names
Runtime spec says:

> sysctl (object, OPTIONAL) allows kernel parameters to be modified at
> runtime for the container. For more information, see the sysctl(8)
> man page.

and sysctl(8) says:

> variable
>    The name of a key to read from. An example is
>    kernel.ostype. The '/' separator is also accepted in place of a '.'.

Apparently, runc config validator do not support sysctls with / as a
separator. Fortunately this is a one-line fix.

Add some more test data where / is used as a separator.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-10-29 09:45:55 -07:00
Akihiro Suda
95f8ecdd53 fix libcontainer/integration/exec_test.go:1859:8: undefined: ioutil
Fix 4d17654479

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-10-28 14:56:03 +09:00
Akihiro Suda
4d17654479 Merge pull request #2576 from kinvolk/alban/userns-2484-take2
Open bind mount sources from the host userns
2021-10-28 14:50:33 +09:00
Mauricio Vásquez
8542322dfe libcontainer: Add unit tests with userns and mounts
Add a unit test to check that bind mounts that have a part of its
path non accessible by others still work when using user namespaces.

To do this, we also modify newRoot() to return rootfs directories that
can be traverse by others, so the rootfs created works for all test
(either running in a userns or not).

Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-10-16 17:29:33 +02:00
Kir Kolyshkin
5516294172 Remove io/ioutil use
See https://golang.org/doc/go1.16#ioutil

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-10-14 13:46:02 -07:00
Kir Kolyshkin
3bc606e9d3 libct/int: adapt to Go 1.15
1. Use t.TempDir instead of ioutil.TempDir. This means no need for an
   explicit cleanup, which removes some code, including newTestBundle
   and newTestRoot.

2. Move newRootfs invocation down to newTemplateConfig, removing a need
   for explicit rootfs creation. Also, remove rootfs from tParam as it
   is no longer needed (there was a since test case in which two
   containers shared the same rootfs, but it does not look like it's
   required for the test).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-27 01:41:47 -07:00
Kir Kolyshkin
5dc3260431 libct/int/TestFreeze: test freeze/thaw via Set
In addition to freezing and thawing a container via Pause/Resume,
there is a way to also do so via Set.

This way was broken though and is being fixed by a few preceding
commits. The test is added to make sure this is fixed and won't regress.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-14 23:42:35 -07:00
Kir Kolyshkin
e969d42156 libct/int/testPids: logging nits
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-02 17:31:03 -07:00
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Aleksa Sarai
ed4781029f merge branch 'pr-2781'
Sebastiaan van Stijn (7):
  errcheck: utils
  errcheck: signals
  errcheck: tty
  errcheck: libcontainer
  errcheck: libcontainer/nsenter
  errcheck: libcontainer/configs
  errcheck: libcontainer/integration

LGTM: AkihiroSuda cyphar
Closes #2781
2021-05-25 12:31:52 +10:00
Aleksa Sarai
54904516e6 libcontainer: fix integration failure in "make test"
When running inside a Docker container, systemd is not available. The
new TestFdLeaksSystemd forgot to include the relevant t.Skip section.

Fixes: a7feb42395 ("libct/int: add TestFdLeaksSystemd")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-05-23 17:55:09 +10:00
Aleksa Sarai
c7c70ce810 *: clean t.Skip messages
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-05-23 17:53:01 +10:00
Sebastiaan van Stijn
a899505377 errcheck: libcontainer/integration
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-20 14:17:40 +02:00
Kir Kolyshkin
a7feb42395 libct/int: add TestFdLeaksSystemd
Add a test to check that container.Run do not leak file descriptors.

Before the previous commit, it fails like this:

    exec_test.go:2030: extra fd 8 -> socket:[659703]
    exec_test.go:2030: extra fd 11 -> socket:[658715]
    exec_test.go:2033: found 2 extra fds after container.Run

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-06 12:37:55 -07:00
Kir Kolyshkin
6faed0e486 libct/int: use ok(t, err)
... in all the places it makes sense to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-15 13:03:17 -07:00
Kir Kolyshkin
af3c5699a5 libct/int: remove unused code
Since commit 88e8350de2 the error message is different, so the check
is not working. In addition, for the cgroup v2 case, and it seems that
PID controller is always available these days.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-15 12:46:17 -07:00
Kir Kolyshkin
7b802a7da4 libct/int: better test container names
1. Do not create the same container named "test" over and over.

2. Fix randomization issues when generating container and cgroup names.
   The issues were:

    * math/rand used without seeding
    * complex rand/md5/hexencode sequence

   In both cases, replace with nanosecond time encoded with digits and
   lowercase letters.

3. Add test name to container and cgroup names. For example, this is
   how systemd log has changed:

   Before: Started libcontainer container test16ddfwutxgjte.
   After: Started libcontainer container TestPidsSystemd-4oaqvr.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-15 12:37:59 -07:00
Qiang Huang
2d38476c96 Merge pull request #2840 from kolyshkin/ignore-kmem
Ignore kernel memory settings
2021-04-13 09:44:14 +08:00
Kir Kolyshkin
52390d6804 Ignore kernel memory settings
This is somewhat radical approach to deal with kernel memory.

Per-cgroup kernel memory limiting was always problematic. A few
examples:

 - older kernels had bugs and were even oopsing sometimes (best example
   is RHEL7 kernel);
 - kernel is unable to reclaim the kernel memory so once the limit is
   hit a cgroup is toasted;
 - some kernel memory allocations don't allow failing.

In addition to that,

 - users don't have a clue about how to set kernel memory limits
   (as the concept is much more complicated than e.g. [user] memory);
 - different kernels might have different kernel memory usage,
   which is sort of unexpected;
 - cgroup v2 do not have a [dedicated] kmem limit knob, and thus
   runc silently ignores kernel memory limits for v2;
 - kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
   https://github.com/torvalds/linux/commit/0158115f702b).

In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).

This should result in less bugs and better user experience.

The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).

[v2: add a warning in specconv that limits are ignored]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-12 12:18:11 -07:00