Originally, the file mode was indeed written in octal (see e.g.
commit 5273b3d). It was later found that JSON does not allow octal
values, so the examples were changed to decimal in commit ccf3a24,
but the "typically an octal value" bit (added by commit cdcabde)
remains.
Change it to emphasize the fact that this is in decimal.
Also, add a note to config-linux.md saying the same thing.
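As a rough illustration of why the examples had to change, the following
Python sketch (not part of the spec) shows a JSON parser rejecting a
leading-zero octal literal while accepting the decimal equivalent. The
`parses` helper is made up for this example and `fileMode` is only used
as a sample key:

```python
import json


def parses(text):
    """Return True if text is valid JSON (hypothetical helper for this sketch)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Octal 0644 is 420 in decimal, which is the form JSON can carry.
decimal_mode = 0o644

octal_ok = parses('{"fileMode": 0644}')  # leading zero: not valid JSON
decimal_ok = parses('{"fileMode": %d}' % decimal_mode)
```

Here `octal_ok` ends up false and `decimal_ok` true, so a mode has to be
converted to decimal before it is written into config.json.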
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The history of this is a little complicated, but in short there is an
argument to be made that several misunderstandings resulted in the spec
sometimes implying (and runtimes interpreting) a pids.limit value of 0
to be equivalent to "max" or otherwise having unfortunate handling of
the value.
The slightly longer background is the following:
1. When commit 834fb5db52 ("spec: linux: add support for the PIDs
cgroup") added support, we did not yet have textual documentation of
cgroup configuration values. In addition, we had not yet started
using pointers to indicate optional fields and detect unset fields.
However, the initial commit did imply that pids.limit=0 should be
treated as a real value.
2. Commit 2ce2c866ff ("runtime: config: linux: add cgroups
information") labeled "pids.limit" as being a REQUIRED field. This
may seem trivial, but consider this foreshadowing for point 5.
3. Later, commit 9b19cd2fab ("config: linux: update description of
PidsLimit") was added to explicitly make pids.limit=0 equivalent to
max (at the time there was a kernel patch proposed to make setting
pids.max to 0 illegal, though it was never merged).
This is often pointed to as the reason runtimes interpret the value
this way, however...
4. Soon after, 488f174af9 ("Make optional Cgroup related config params
pointers along with `omitempty` json tag.") converted it to a pointer
and changed the code comment to state that the "default value" means
"no limit" -- and the default value was now a pointer so the default
value is nil not 0. At this stage, using 0 to mean "no limit" would
arguably no longer be correct.
5. However, because the field was marked as REQUIRED in point 2, a while
later commit ef9ce84cf9 ("specs-go/config: fix required items
type") changed the value back to a non-pointer but didn't modify the
code comment -- and so ended up codifying the "0 means no limit"
behaviour.
I would argue this commit is the reason why runtimes have interpreted
the behaviour this way (though runc likely did it because of point 3
since I authored both patches, and other runtimes probably looked at
runc to see how they should interpret this confusing history -- my
bad!).
So, let's finally have some clarity and add wording to conclusively
state that the correct representation of max is -1 (like every other
cgroup configuration value) and that users should not treat 0 as a
special value of any kind. A nil value means "do not touch it" (just
like every other cgroup configuration value too).
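The intended semantics can be sketched roughly as follows. This is only
an illustration in Python, not the Go API: `pids_max_content` is a
hypothetical helper, and rejecting negative values other than -1 is an
assumption of this sketch rather than something stated above.

```python
from typing import Optional


def pids_max_content(limit: Optional[int]) -> Optional[str]:
    """Map a pids.limit value to what would be written to pids.max.

    Hypothetical helper: returns None when nothing should be written.
    """
    if limit is None:
        return None  # unset: do not touch the cgroup value
    if limit == -1:
        return "max"  # -1 is the representation of "no limit"
    if limit < -1:
        # Assumption of this sketch: other negative values are rejected.
        raise ValueError("invalid pids.limit: %d" % limit)
    return str(limit)  # 0 is an ordinary value, not an alias for "max"
```

With this mapping, `pids_max_content(0)` yields "0" and only
`pids_max_content(-1)` yields "max".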
Note that a pids.max value of 0 is actually different to 1 now that
CLONE_INTO_CGROUP exists (at the time pids was added to the kernel and
the spec, this feature didn't exist and so it may have seemed redundant
to have two equivalent values -- hence my attempt to make 0 an illegal
value for the kernel implementation).
For the Go API, this is effectively a partial revert of commit
ef9ce84cf9 ("specs-go/config: fix required items type") which turned
the limit value into a bare int64.
Fixes: 2ce2c866ff ("runtime: config: linux: add cgroups information")
Fixes: 9b19cd2fab ("config: linux: update description of PidsLimit")
Fixes: ef9ce84cf9 ("specs-go/config: fix required items type")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The thinking is that the runtimes should not do the filtering of values,
but instead just apply the values in order. This way the possible
MB-lines in l3CacheSchema will become overwritten by memBwSchema values
(if the domains overlap).
Note that we can't just concatenate the values because kernel will error
out if the same domain is attempted to be set multiple times within one
write() call.
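A minimal sketch of the "apply in order" idea, in Python for brevity.
`schemata_writes` is a hypothetical helper and the schema strings are
made-up sample values; the point is only that each field becomes its own
write, with no filtering or merging:

```python
def schemata_writes(l3_cache_schema=None, mem_bw_schema=None):
    """Return the payloads to write to the schemata file, one write() each."""
    writes = []
    # Order matters: memBwSchema goes last so it overrides any MB line
    # that was already present in l3CacheSchema.
    for value in (l3_cache_schema, mem_bw_schema):
        if value:
            writes.append(value)
    return writes


writes = schemata_writes("L3:0=ffff0\nMB:0=50", "MB:0=20")
```

In this example the MB setting for domain 0 is written twice, but in two
separate writes, so the second value takes effect instead of the kernel
rejecting a duplicate domain.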
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Nodes is required only in some memory policy modes, while some other
modes require that there must be no nodes.
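As a hedged illustration, a validation along these lines could look like
the Python sketch below. The grouping of modes is an assumption based on
set_mempolicy(2) behaviour and covers only a few modes; `check_nodes` is
a hypothetical helper, not spec text:

```python
# Assumption: grouping based on set_mempolicy(2) behaviour, not on the spec text.
MODES_REQUIRING_NODES = {"MPOL_BIND", "MPOL_INTERLEAVE"}
MODES_FORBIDDING_NODES = {"MPOL_DEFAULT", "MPOL_LOCAL"}


def check_nodes(mode, nodes):
    """Raise ValueError when nodes do not match what the mode expects."""
    if mode in MODES_REQUIRING_NODES and not nodes:
        raise ValueError("%s requires nodes" % mode)
    if mode in MODES_FORBIDDING_NODES and nodes:
        raise ValueError("%s must not specify nodes" % mode)
```

For example, `check_nodes("MPOL_BIND", "0-1")` passes, while
`check_nodes("MPOL_LOCAL", "0")` raises.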
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
Specify "/" as an explicit value for linux.intelRdt.closID to assign a
container to the default CLOS, corresponding to the root of the resctrl
filesystem.
This addition is important after the recently introduced
intelRdt.enableMonitoring field. There is no way to express "enable
monitoring but keep the container in the default CLOS". Users would
otherwise have to rely on pre-created CLOSes or may quickly exhaust
available CLOS entries - in some configurations the number of available
CLOSes (on top of the default) may be as low as three.
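One way a runtime might resolve the value, sketched in Python. The mount
point `/sys/fs/resctrl` is the conventional location and is assumed
here, and `clos_dir` is a hypothetical helper:

```python
import posixpath

RESCTRL_ROOT = "/sys/fs/resctrl"  # conventional mount point, assumed here


def clos_dir(clos_id):
    """Return the resctrl directory for a closID (hypothetical helper)."""
    if clos_id == "/":
        return RESCTRL_ROOT  # default CLOS: the root of the resctrl filesystem
    return posixpath.join(RESCTRL_ROOT, clos_id)
```

So `clos_dir("/")` points at the resctrl root and no additional CLOS is
consumed.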
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Add a parameter for enabling per-container resctrl monitoring.
This supersedes and replaces the previous "enableCMT" and "enableMBM"
settings whose functionality was very vaguely specified. A separate
parameter for every monitoring metric does not seem to make much sense, in
particular because in the resctrl filesystem it is not possible to
selectively enable a subset of the monitoring features. You always get
all the metrics that the system provides. Also, with separate settings
(and corresponding check if the specific metric is available) the user
cannot specify "enable whatever is available" - setting everything to
"true" might fail because one of the metrics is not available on the
platform. In addition, having separate parameters is very
future-unproof, making support for new monitoring metrics unnecessarily
cumbersome to add. New metrics are certain to be added in new hardware
generations, e.g. perf/energy monitoring in the near future
(https://lkml.org/lkml/2025/5/21/1631), and requiring an update to the
runtime-spec for each one of them feels like overkill without much
benefit. It is easier to have one switch for "enable container-specific
metrics" and let the user read whatever metrics the platform provides.
Moreover, it is not even possible to turn off monitoring (from the
resctrl filesystem). For example, you always get the metrics for all
CTRL_MON groups (closIDs). However, that is not always very useful as
there likely are a lot of applications packed in the same group. The new
intelRdt.enableMonitoring parameter will enable creation of a MON group
specific to a single container allowing monitoring of resctrl metrics on
per-container granularity.
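A small Python sketch of where such a MON group could live. The
`mon_groups` subdirectory is part of the resctrl filesystem layout;
naming the group after the container ID is an assumption of this sketch,
and `mon_group_dir` is a hypothetical helper:

```python
import posixpath


def mon_group_dir(clos_dir, container_id, enable_monitoring):
    """Return the MON group directory to create, or None if monitoring is off."""
    if not enable_monitoring:
        return None
    # Assumption: the MON group is named after the container ID.
    return posixpath.join(clos_dir, "mon_groups", container_id)
```

The metrics for the container would then be read from whatever
monitoring files the platform exposes under that directory.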
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
* config-linux: add schemata field to IntelRdt
Add a new "schemata" field to the Linux IntelRdt configuration. This
addresses the complexity of separate schema fields and resolves the
issue of supporting currently uncovered RDT features like L2 cache
allocation and CDP (Code and Data Prioritization).
The new field is for specifying the complete schemata (all schemas) to
be written to the schemata file in Linux resctrl fs. The aim is for
simple usage and runtime implementation (by not requiring any
parsing/filtering of data or otherwise re-implement parsing or
validation of the Linux resctrl interface) and also to support all RDT
features now and in the future (i.e. schemas like L2, L2CODE, L2DATA,
L3CODE and L3DATA and who knows L4 or something else in the future).
Behavior of existing fields is not changed but it is required that the
new schemata field is applied last.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
* Add linux.intelRdt.schemata to features.md
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
---------
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Enable setting a NUMA memory policy for the container. New
linux.memoryPolicy object contains inputs to the set_mempolicy(2)
syscall.
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.
This approach is similar to the existing "devices" field used for block
devices but uses a dictionary keyed by the interface name instead.
The proposed scheme is based on the existing representation of a network
device by `struct net_device`
https://docs.kernel.org/networking/netdevices.html.
This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.
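To make the shape of the data concrete, here is a hedged Python sketch
of a mapping keyed by host interface name. The optional `name` entry
(renaming the device inside the container) is an assumption of this
example, as are the interface names and the `planned_moves` helper:

```python
def planned_moves(net_devices):
    """List (host_name, container_name) pairs for a netdevices-style mapping."""
    moves = []
    for host_name, device in sorted(net_devices.items()):
        # Assumption: an optional "name" entry renames the device in the container.
        moves.append((host_name, device.get("name", host_name)))
    return moves


config = {"eth0": {"name": "container_eth0"}, "ens4": {}}
moves = planned_moves(config)
```

The sketch only plans which devices move and under which name; address
and route configuration is deliberately left out, matching the scope
described above.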
Signed-off-by: Antonio Ojea <aojea@google.com>
Most of these either redirect (so changing saves an extra redirect),
or have a TLS version available.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The time namespace is a new kernel feature available in 5.6+ to
isolate the system monotonic and boot-time clocks.
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
Linux 5.19 introduced a new seccomp flag:
SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
It is useful for seccomp notify when handling notification from Golang
programs which are often preempted by the runtime with SIGURG.
Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.
This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
`cfs_burst_us` into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then be used additionally for burst.
The patch adds the support/control for CFS bandwidth burst.
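The effect can be illustrated with a deliberately simplified model in
Python. It tracks one budget per period and ignores multiple CPUs and
other kernel details; `simulate` and all the numbers are made up for
this example:

```python
def simulate(demands_us, quota_us, burst_us):
    """Return the runtime granted in each period under a simplified burst model."""
    carried = 0  # unused quota carried over, never more than burst_us
    granted = []
    for demand in demands_us:
        budget = quota_us + carried
        run = min(demand, budget)
        granted.append(run)
        carried = min(burst_us, budget - run)
    return granted


demands = [20000, 20000, 160000]
without_burst = simulate(demands, quota_us=100000, burst_us=0)
with_burst = simulate(demands, quota_us=100000, burst_us=50000)
```

In this model the third period is capped at 100000 us without burst, but
may run for 150000 us with a 50000 us burst, because unused time from
the earlier periods was accumulated up to the cap.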
Fixes https://github.com/opencontainers/runtime-spec/issues/1119
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
This setting can be used to mimic cgroup v1 behavior on cgroup v2,
when setting the new memory limit during update operation.
In cgroup v1, a limit which is lower than the current usage is rejected.
In cgroup v2, such a low limit causes an OOM kill.
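A minimal Python sketch of the check a runtime could perform before
writing the new limit. The parameter name `check_before_update`, the
helper itself, and treating -1 as "no limit" are assumptions of this
example:

```python
def check_memory_update(new_limit, current_usage, check_before_update):
    """Reject a limit below current usage when the check is enabled."""
    unlimited = new_limit == -1  # assumption: -1 means "no limit"
    if check_before_update and not unlimited and new_limit < current_usage:
        raise ValueError(
            "new limit %d is below current usage %d" % (new_limit, current_usage)
        )
```

With the check disabled the function never raises, which corresponds to
writing the limit and letting the kernel act on it.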
Ref: https://github.com/opencontainers/runc/issues/3509
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Clarify that device nodes need not be under `/dev`, but that the runtimes need
to be informed of all the device nodes that are used by the container.
Virtual-machine based runtimes such as Kata Containers need to be able to
perform adjustment on device nodes, and cannot be required to deep-scan
file-systems to do so.
The proposed wording was chosen to avoid any regression for any workload
mounting nodes elsewhere, while at the same time clarifying that correct
behaviour cannot be guaranteed if a device node is created on the host and used
by the container without being passed in the devices list.
This fixes issue #1147.
Signed-off-by: Christophe de Dinechin <christophe@dinechin.org>
cgroups v2 supports secure delegation of cgroups. Accordingly,
control over a cgroup (that is, creation of new child cgroups and
movement of processes and threads among the cgroup subtree exposed
to a container) can be safely delegated to a container. Adjusting
the ownership enables real-world use cases like systemd-based
containers fully isolated in user namespaces.
To encourage adoption of this feature, and secure implementation,
define the semantics of cgroup ownership. Changing/setting the
cgroup ownership should only be performed when:
- using cgroups v2, and
- container will have a new cgroup namespace, and
- cgroupfs will be mounted read/write.
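The three conditions above can be sketched as a single predicate (Python
used for illustration; the function and parameter names are
hypothetical):

```python
def should_chown_cgroup(cgroup_v2, new_cgroup_namespace, cgroupfs_read_write):
    """Return True only when all three delegation conditions hold."""
    return cgroup_v2 and new_cgroup_namespace and cgroupfs_read_write
```

If any one of the conditions is false, the ownership is left untouched.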
The specific files whose ownership should be changed are listed.
In terms of current practice, this is already the behaviour of crun
(which also chown's the memory.oom.group file), and there is a pull
request for runc: https://github.com/opencontainers/runc/pull/3057.
Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
It makes sense for a runtime to reject a cgroup which is frozen
(for both new and existing container), otherwise the runtime
command (create/run/exec) may end up being stuck.
It makes sense for a runtime to make sure the cgroup for a new container
is empty (i.e. there are no processes in it), and reject it otherwise.
The scenario in which a non-empty cgroup is used for a new container
has multiple problems, for example:
* If two or more containers share the same cgroup, and each container
has its own limits configured, the order of container starts
ultimately determines whose limits will be effectively applied.
* If two or more containers share the same cgroup, and one of the
containers is paused/unpaused, all others are paused/unpaused, too.
* If cgroup.kill is used to forcefully kill the container, it will also
kill other processes that are not part of this container but merely
belong to the same cgroup.
* When a systemd cgroup manager is used, this becomes even worse. For
example, a stop (or even a failed start) of any container results in a
stopTransientUnit command being sent to systemd, and so (depending
on unit properties) other containers can receive SIGTERM, be killed
after a timeout etc.
* Many other bad scenarios are possible, as the implicit assumption
of 1:1 container:cgroup mapping is broken.
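Both checks can be sketched together as follows. This is a Python
illustration only; `check_cgroup` is a hypothetical helper that takes
the frozen state and the list of PIDs as already-read inputs:

```python
def check_cgroup(frozen, pids, new_container):
    """Raise ValueError if the cgroup should be rejected (hypothetical helper)."""
    if frozen:
        # Applies to both new and existing containers.
        raise ValueError("cgroup is frozen")
    if new_container and pids:
        # A new container must start in an empty cgroup.
        raise ValueError("cgroup is not empty")
```

An existing container's cgroup is expected to contain processes, so the
emptiness check only applies when a new container is being created.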
https://github.com/opencontainers/runc/issues/3132
https://github.com/containers/crun/issues/716
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The previous non-rsvd max/limit_in_bytes does not account for reserved
huge page memory, making it possible for a process to reserve all the
huge page memory, without being able to allocate it (due to hugetlb
cgroup page fault accounting restrictions).
In practice this makes it possible to successfully mmap more huge page
memory than allowed via the cgroup settings, but when using the memory
the process will get a SIGBUS and crash. This is bad for applications
that mmap at startup (which succeeds) but then crash when they start
to use the memory, e.g. postgres does this by default.
This patch updates and clarifies `LinuxResources.HugepageLimits` and
`LinuxHugepageLimit` by defaulting the configuration to the rsvd hugetlb
cgroup (when supported) and falling back to page fault accounting if it
is not supported.
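A sketch of the file selection this implies, in Python. The helper name
and the way rsvd support is passed in are assumptions of this example;
a real runtime would detect support by checking whether the rsvd file
exists:

```python
def hugetlb_limit_file(page_size, rsvd_supported, cgroup_v2):
    """Pick the hugetlb control file to write the limit to."""
    suffix = "max" if cgroup_v2 else "limit_in_bytes"
    if rsvd_supported:
        # Preferred: reservation accounting, enforced at mmap time.
        return "hugetlb.%s.rsvd.%s" % (page_size, suffix)
    # Fallback: page fault accounting.
    return "hugetlb.%s.%s" % (page_size, suffix)
```

For a 2MB page size on cgroup v2 with rsvd support this selects
`hugetlb.2MB.rsvd.max`.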
Fixes https://github.com/opencontainers/runtime-spec/issues/1050
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
An attempt to make the spec easier to interpret by grouping all ClosID
related constraints in one place.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Per-cgroup kernel memory accounting (and explicit limiting) is
problematic in the Linux kernel for too many reasons to quote here.
Besides, cgroup v2 does not even have a kernel memory limit knob,
and the one in cgroup v1 was made obsolete in kernel v5.4 [1].
Mark memory.kernel and memory.kernelTCP as NOT RECOMMENDED, in addition
to OPTIONAL. This is a way to say "we do not want anyone (runtimes or users)
to set those limits, unless they have good understanding and strong
reasons to do so".
[1] https://github.com/torvalds/linux/commit/0158115f702b0ba208ab0b
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The specs already support overriding the errno code for the syscalls
but the default value is hardcoded to EPERM.
Add a new attribute to override the default value.
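The fallback can be sketched like this in Python. The parameter name
`default_errno_ret` and the helper are assumptions of this example, not
names taken from this message:

```python
import errno


def default_action_errno(default_errno_ret=None):
    """Errno returned by the default action in this sketch."""
    # Without an override the previous hardcoded behaviour (EPERM) is kept.
    return errno.EPERM if default_errno_ret is None else default_errno_ret
```

For instance, `default_action_errno(errno.ENOSYS)` would make the
default action report ENOSYS instead of EPERM.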
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>