runtime-spec

mirror of https://github.com/opencontainers/runtime-spec.git synced 2026-02-05 18:45:18 +01:00

Author	SHA1	Message	Date
Kir Kolyshkin	09ec668274	config-linux,schema: fix FileMode description Originally, the file mode was indeed written in octal (see e.g. commit `5273b3d`), but it was found out later that JSON does not allow octal values so the examples were changed to decimal in commit `ccf3a24`, but the "typically an octal value" bit (added by commit `cdcabde`) remains. Change it to emphasize the fact that this is in decimal. Also, add a note to config-linux.md saying the same thing. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-10-14 16:28:38 -07:00
Kir Kolyshkin	5e89a5d370	Merge pull request #1196 from ipuustin/rdt-clarifications Clarify Intel RDT configuration	2025-10-10 15:15:44 -07:00
Aleksa Sarai	869b2d5b0c	linux: clarify pids cgroup settings The history of this is a little complicated, but in short there is an argument to be made that several misunderstandings resulted in the spec sometimes implying (and runtimes interpreting) a pids.limit value of 0 to be equivalent to "max" or otherwise having unfortunate handling of the value. The slightly longer background is the following: 1. When commit `834fb5db52` ("spec: linux: add support for the PIDs cgroup") added support, we did not yet have textual documentation of cgroup configuration values. In addition, we had not yet started using pointers to indicate optional fields and detect unset fields. However, the initial commit did imply that pids.limit=0 should be treated as a real value. 2. Commit `2ce2c866ff` ("runtime: config: linux: add cgroups information") labeled "pids.limit" as being a REQUIRED field. This may seem trivial, but consider this foreshadowing for point 5. 3. Later, commit `9b19cd2fab` ("config: linux: update description of PidsLimit") was added to explicitly make pids.limit=0 equivalent to max (at the time there was a kernel patch proposed to make setting pids.max to 0 illegal, though it was never merged). This is often pointed to as being the reason for runtimes interpreting this behaviour this way, however... 4. Soon after, `488f174af9` ("Make optional Cgroup related config params pointers along with `omitempty` json tag.") converted it to a pointer and changed the code comment to state that the "default value" means "no limit" -- and the default value was now a pointer so the default value is nil not 0. At this stage, using 0 to mean "no limit" would arguably no longer be correct. 5. However, because the field was marked as REQUIRED in point 2, a while later commit `ef9ce84cf9` ("specs-go/config: fix required items type") changed the value back to a non-pointer but didn't modify the code comment -- and so ended up codifying the "0 means no limit" behaviour. 
I would argue this commit is the reason why runtimes have interpreted the behaviour this way (though runc likely did it because of point 3 since I authored both patches, and other runtimes probably looked at runc to see how they should interpret this confusing history -- my bad!). So, let's finally have some clarity and add wording to conclusively state that the correct representation of max is -1 (like every other cgroup configuration value) and that users should not treat 0 as a special value of any kind. A nil value means "do not touch it" (just like every other cgroup configuration value too). Note that a pids.max value of 0 is actually different to 1 now that CLONE_INTO_CGROUP exists (at the time pids was added to the kernel and the spec, this feature didn't exist and so it may have seemed redundant to have two equivalent values -- hence my attempt to make 0 an illegal value for the kernel implementation). For the Go API, this is effectively a partial revert of commit `ef9ce84cf9` ("specs-go/config: fix required items type") which turned the limit value into a bare int64. Fixes: `2ce2c866ff` ("runtime: config: linux: add cgroups information") Fixes: `9b19cd2fab` ("config: linux: update description of PidsLimit") Fixes: `ef9ce84cf9` ("specs-go/config: fix required items type") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-09-18 11:56:22 +10:00
Ismo Puustinen	a6c310aa55	config-linux: clarify when the RDT sub-directory should be removed. Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>	2025-08-28 19:37:30 +03:00
Ismo Puustinen	b280c07d44	config-linux: clarify the "MB:"-line filtering in RDT. The thinking is that the runtimes should not do the filtering of values, but instead just apply the values in order. This way the possible MB-lines in l3CacheSchema will become overwritten by memBwSchema values (if the domains overlap). Note that we can't just concatenate the values because kernel will error out if the same domain is attempted to be set multiple times within one write() call. Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>	2025-08-28 19:34:19 +03:00
Antti Kervinen	84b6c2c45c	docs: fix and elaborate the nodes field in Linux memory policy Nodes is required only in some memory policy modes, while some other modes require that there must be no nodes. Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>	2025-08-27 16:52:15 +03:00
Akihiro Suda	8675d5698f	Merge pull request #1289 from marquiz/devel/rdt-default-clos config-linux: define default clos for linux.intelRdt	2025-08-27 14:58:32 +09:00
Markus Lehtonen	e51a839d16	config-linux: define default clos for linux.intelRdt Specify "/" as an explicit value for linux.intelRdt.closID to assign a container to the default CLOS, corresponding to the root of the resctrl filesystem. This addition is important after the recently introduced intelRdt.enableMonitoring field. There is no way to express "enable monitoring but keep the container in the default CLOS". Users would otherwise have to rely on pre-created CLOSes or may quickly exhaust available CLOS entries - in some configurations the number of available CLOSes (on top of the default) may be as low as three. Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>	2025-08-15 13:50:28 +03:00
Akihiro Suda	bfdffd548a	Merge pull request #1282 from askervin/5aD-oci-mempolicy Add support for Linux memory policy	2025-08-04 17:16:26 +09:00
Markus Lehtonen	34a39b9070	config-linux: add intelRdt.enableMonitoring (#1287) Add a parameter for enabling per-container resctrl monitoring. This supersedes and replaces the previous "enableCMT" and "enableMBM" settings whose functionality was very vaguely specified. Separate parameter for every monitoring metric does not seem to make much sense, in particular because in the resctrl filesystem it is not possible to selectively enable a subset of the monitoring features. You always get all the metrics that the system provides. Also, with separate settings (and corresponding check if the specific metric is available) the user cannot specify "enable whatever is available" - setting everything to "true" might fail because one of the metrics is not available on the platform. In addition, having separate parameters is very future-unproof, making support for new monitoring metrics unnecessarily cumbersome to add. New metrics are certain to be added in new hardware generations, e.g. perf/energy monitoring in the near future (https://lkml.org/lkml/2025/5/21/1631), and requiring an update to the runtime-spec for each one of them feels like overkill without much benefit. It is easier to have one switch for "enable container-specific metrics" and let the user read whatever metrics the platform provides. Moreover, it is not even possible to turn off monitoring (from the resctrl filesystem). For example, you always get the metrics for all CTRL_MON groups (closIDs). However, that is not always very useful as there likely are a lot of applications packed in the same group. The new intelRdt.enableMonitoring parameter will enable creation of a MON group specific to a single container allowing monitoring of resctrl metrics on per-container granularity. Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>	2025-06-29 12:08:38 +09:00
Markus Lehtonen	d2f4f9097a	config-linux: add schemata field to IntelRdt (#1230 ) * config-linux: add schemata field to IntelRdt Add a new "schemata" field to the Linux IntelRdt configuration. This addresses the complexity of separate schema fields and resolves the issue of supporting currently uncovered RDT features like L2 cache allocation and CDP (Code and Data Prioritization). The new field is for specifying the complete schemata (all schemas) to be written to the schemata file in Linux resctrl fs. The aim is for simple usage and runtime implementation (by not requiring any parsing/filtering of data or otherwise re-implement parsing or validation of the Linux resctrl interface) and also to support all RDT features now and in the future (i.e. schemas like L2, L2CODE, L2DATA, L3CODE and L3DATA and who knows L4 or something else in the future). Behavior of existing fields is not changed but it is required that the new schemata field is applied last. Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com> * Add linux.intelRdt.schemata to features.md Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com> --------- Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>	2025-05-09 21:00:57 +09:00
sharmaann	0ed7cf6493	docs: add missing backticks for code formatting Signed-off-by: sharmaann <shakerian.arman@yahoo.com>	2025-04-25 20:13:58 +03:30
Antti Kervinen	57c949588e	Add support for Linux memory policy Enable setting a NUMA memory policy for the container. New linux.memoryPolicy object contains inputs to the set_mempolicy(2) syscall. Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>	2025-04-23 10:32:29 +03:00
Antonio Ojea	e935f995dd	Define Linux Network Devices (#1271) The proposed "netdevices" field provides a declarative way to specify which host network devices should be moved into a container's network namespace. This approach is similar to the existing "devices" field used for block devices but uses a dictionary keyed by the interface name instead. The proposed scheme is based on the existing representation of a network device by the `struct net_device` https://docs.kernel.org/networking/netdevices.html. This proposal focuses solely on moving existing network devices into the container namespace. It does not cover the complexities of network configuration or network interface creation, emphasizing the separation of device management and network configuration. Signed-off-by: Antonio Ojea <aojea@google.com>	2025-04-01 18:56:57 +09:00
z63d	221c198895	Fix description of errnoRet in Seccomp Signed-off-by: z63d <kaita.nakamura0830@gmail.com>	2025-02-07 13:04:48 +09:00
Akihiro Suda	9de64c0aea	config-linux: update for libseccomp v2.6.0 libseccomp v2.6.0 was released on Jan 23, 2025. https://github.com/seccomp/libseccomp/releases/tag/v2.6.0 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2025-01-29 09:39:54 +09:00
Akihiro Suda	8cfc4074b2	specs-go: sync SCMP_ARCH_* constants with libseccomp main (#1229 ) The following constants are defined in the main branch of libseccomp, but not included in its latest release (v2.5) yet: * SCMP_ARCH_LOONGARCH64 (seccomp/libseccomp@6966ec7) * SCMP_ARCH_M68K (seccomp/libseccomp@dd5c9c2) * SCMP_ARCH_SH (seccomp/libseccomp@c12945d) * SCMP_ARCH_SHEB (seccomp/libseccomp@c12945d) These constant names are unlikely to change before v2.6 GA, so we can safely refer to them in specs-go. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2024-12-09 20:36:42 +09:00
Sebastiaan van Stijn	9ceba9f40b	update http links to https Most of these either redirect (so changing saves an extra redirect), or have a TLS version available. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-11-04 12:28:14 +01:00
Kir Kolyshkin	2149fb504e	config-linux: describe the format of cpus and mems Also, s/in/on/g. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-11 13:01:23 -07:00
utam0k	f66aad4730	Update ociVersion in config-linux.md example Signed-off-by: utam0k <k0ma@utam0k.jp>	2023-04-30 03:34:08 +00:00
utam0k	9d7c878757	Clarify I/O throttling differences between cgroup v1 and v2 Signed-off-by: utam0k <k0ma@utam0k.jp>	2023-04-03 13:02:11 +00:00
Kir Kolyshkin	8a09257551	Merge pull request #1116 from kailun-qin/add-hugetlb-rsvd config-linux: add support for rsvd hugetlb cgroup	2023-03-21 09:48:51 -07:00
daobao qiao	77c37f1e9a	Update config-linux.md fix time_namespaces url error. Signed-off-by: daobao qiao <201028369@qq.com>	2023-03-06 09:50:36 +08:00
Akihiro Suda	58ec43f9fc	Merge pull request #1148 from c3d/issue/1147-device-location config-linux: Clarify where device nodes can be created	2023-02-15 18:04:56 +09:00
Qiang Huang	7301c34549	Merge pull request #1151 from KentaTada/add-time-namespac Add support for time namespace	2023-02-01 11:38:51 +08:00
Kenta Tada	36bb632767	Add support for time namespace The time namespace is a new kernel feature available in 5.6+ to isolate the system monotonic and boot-time clocks. Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>	2023-01-24 21:20:51 +09:00
Akihiro Suda	6188d9e9ef	Merge pull request #1120 from kailun-qin/add-cfs-burst config-linux: add CFS bandwidth burst	2023-01-23 20:05:01 +09:00
Kir Kolyshkin	494a5a6aca	Merge pull request #1158 from kolyshkin/check-before-update config-linux: add memory.checkBeforeUpdate	2022-09-09 13:48:39 -07:00
Alban Crequy	4bcd065f24	seccomp: Add flag SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV Linux 5.19 introduced a new seccomp flag: SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV It is useful for seccomp notify when handling notification from Golang programs which are often preempted by the runtime with SIGURG. Signed-off-by: Alban Crequy <albancrequy@microsoft.com>	2022-09-07 12:11:41 +02:00
Kailun Qin	d931d4b8ab	config-linux: add CFS bandwidth burst Burstable CFS controller is introduced in Linux 5.14. This helps with parallel workloads that might be bursty. They can get throttled even when their average utilization is under quota. And they may be latency sensitive at the same time so that throttling them is undesired. This feature borrows time now against the future underrun, at the cost of increased interference against the other system users, by introducing `cfs_burst_us` into CFS bandwidth control to enact the cap on unused bandwidth accumulation, which will then be used additionally for burst. The patch adds the support/control for CFS bandwidth burst. Fixes https://github.com/opencontainers/runtime-spec/issues/1119 Signed-off-by: Kailun Qin <kailun.qin@intel.com>	2022-09-02 09:40:53 -04:00
Kir Kolyshkin	9e658bcd71	config-linux: add memory.checkBeforeUpdate This setting can be used to mimic cgroup v1 behavior on cgroup v2, when setting the new memory limit during update operation. In cgroup v1, a limit which is lower than the current usage is rejected. In cgroup v2, such a low limit is causing an OOM kill. Ref: https://github.com/opencontainers/runc/issues/3509 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-08-29 10:48:45 -07:00
Christophe de Dinechin	3565df5d7e	config-linux: Clarify where device nodes can be created Clarify that device nodes need not be under `/dev`, but that the runtimes need to be informed of all the device nodes that are used by the container. Virtual-machine based runtimes such as Kata Containers need to be able to perform adjustments on device nodes, and cannot be required to deep-scan file-systems to do so. The proposed wording was chosen to avoid any regression for any workload mounting nodes elsewhere, while at the same time clarifying that correct behaviour cannot be guaranteed if a device node is created on the host and used by the container without being passed in the devices list. This fixes issue #1147. Signed-off-by: Christophe de Dinechin <christophe@dinechin.org>	2022-08-10 10:25:39 +02:00
Vincent Batts	e54040a9b1	Merge pull request #1136 from wineway/main config-linux: add idle option for container cgroup	2022-04-20 10:56:59 -04:00
Aleksa Sarai	6969a0a09a	merge branch 'pr-1133' Akihiro Suda (1): typo: seccompFD -> seccompFd LGTMs: guiseppe cyphar Closes #1133	2022-03-11 13:09:03 +11:00
Fraser Tweedale	600a8bd6d6	cgroup ownership: clarify that some files may not exist Not all files listed in /sys/kernel/cgroup/delegate necessarily exist in all cgroups. For example, see this issue and PR: - https://github.com/opencontainers/runc/issues/3387 - https://github.com/opencontainers/runc/pull/3389 Expand the cgroup ownership semantics to ensure that runtime authors are aware of this possibility and implementations handle it gracefully. Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>	2022-02-22 12:44:36 +10:00
wineway	9d363b36f6	config-linux: add idle option for container cgroup Signed-off-by: wineway <wangyuweihx@gmail.com>	2022-02-16 16:22:14 +08:00
Akihiro Suda	b05eb53f3d	typo: seccompFD -> seccompFd Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2022-02-02 14:24:29 +09:00
Aleksa Sarai	8958f93039	merge branch 'pr-1125' Kir Kolyshkin (1): config-linux: MAY reject an unfit cgroup LGTMs: guiseppe tianon cyphar Closes #1125	2021-12-14 18:12:23 +11:00
Fraser Tweedale	f4ef391443	specify cgroup ownership semantics cgroups v2 supports secure delegation of cgroups. Accordingly, control over a cgroup (that is, creation of new child cgroups and movement of processes and threads among the cgroup subtree exposed to a container) can be safely delegated to a container. Adjusting the ownership enables real-world use cases like systemd-based containers fully isolated in user namespaces. To encourage adoption of this feature, and secure implementation, define the semantics of cgroup ownership. Changing/setting the cgroup ownership should only be performed when: - using cgroups v2, and - container will have a new cgroup namespace, and - cgroupfs will be mounted read/write. The specific files whose ownership should be changed are listed. In terms of current practice, this is already the behaviour of crun (which also chown's the memory.oom.group file), and there is a pull request for runc: https://github.com/opencontainers/runc/pull/3057. Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>	2021-10-22 16:44:51 +10:00
Kir Kolyshkin	104385da20	config-linux: MAY reject an unfit cgroup It makes sense for runtime to reject a cgroup which is frozen (for both new and existing container), otherwise the runtime command (create/run/exec) may end up being stuck. It makes sense for runtime to make sure the cgroup for a new container is empty (i.e. there are no processes in it), and reject it otherwise. The scenario in which a non-empty cgroup is used for a new container has multiple problems, for example: * If two or more containers share the same cgroup, and each container has its own limits configured, the order of container starts ultimately determines whose limits will be effectively applied. * If two or more containers share the same cgroup, and one of the containers is paused/unpaused, all others are paused, too. * If cgroup.kill is used to forcefully kill the container, it will also kill other processes that are not part of this container but merely belong to the same cgroup. * When a systemd cgroup manager is used, this becomes even worse. For example, a stop (or even a failed start) of any container results in stopTransientUnit command being sent to systemd, and so (depending on unit properties) other containers can receive SIGTERM, be killed after a timeout etc. * Many other bad scenarios are possible, as the implicit assumption of 1:1 container:cgroup mapping is broken. https://github.com/opencontainers/runc/issues/3132 https://github.com/containers/crun/issues/716 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-29 19:14:45 -07:00
Vincent Batts	0d6cc581ae	Merge pull request #1076 from Creatone/creatone/mon-support config-linux: Add Intel RDT CMT and MBM Linux support	2021-09-10 07:50:17 -04:00
Kailun Qin	a650533920	config-linux: add support for rsvd hugetlb cgroup The previous non-rsvd max/limit_in_bytes does not account for reserved huge page memory, making it possible for a process to reserve all the huge page memory, without being able to allocate it (due to hugetlb cgroup page fault accounting restrictions). In practice this makes it possible to successfully mmap more huge page memory than allowed via the cgroup settings, but when using the memory the process will get a SIGBUS and crash. This is bad for applications trying to mmap at startup (and it succeeds), but the program crashes when starting to use the memory. E.g. postgres is doing this by default. This patch updates and clarifies `LinuxResources.HugepageLimits` and `LinuxHugepageLimit` by defaulting the configuration to the rsvd hugetlb cgroup (when supported) and falling back to page fault accounting if not supported. Fixes https://github.com/opencontainers/runtime-spec/issues/1050 Signed-off-by: Kailun Qin <kailun.qin@intel.com>	2021-08-06 13:31:00 -04:00
Paweł Szulik	cc7f6ec598	config-linux: Add Intel RDT CMT and MBM Linux support Add support for Intel Resource Director Technology (RDT) / Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM). Example: "linux": { "intelRdt": { "enableCMT": true, "enableMBM": true } } This is the prerequisite of this runc proposal: https://github.com/opencontainers/runc/issues/2519 For more information about Intel RDT CMT and MBM, please refer to: https://github.com/opencontainers/runc/issues/2519 Signed-off-by: Paweł Szulik <pawel.szulik@intel.com>	2021-07-13 08:53:11 +02:00
Markus Lehtonen	0c021c1a44	config-linux: clarify the handling of ClosID RDT parameter An attempt to make the spec easier to interpret by grouping all ClosID related constraints in one place. Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>	2021-04-26 11:49:36 +03:00
Markus Lehtonen	9e6594453b	config-linux: fix indentation on IntelRdt Also, split out the rules regarding interdependency of parameters into a separate list. Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>	2021-04-26 11:49:26 +03:00
Rodrigo Campos	58798e75e9	Add Seccomp Notify support This adds the specification for Seccomp Userspace Notification and the Golang bindings. This contains: - New fields in the seccomp section to use with seccomp userspace notification. - Additional SeccompState struct containing the container state and file descriptors passed for seccomp. This was discussed in the OCI Weekly Discussion on September 16th, 2020. After review on github, this implementation was changed to the "Proposal with listenerPath and listenerExtraMetadata". For more information see: - https://github.com/opencontainers/runtime-spec/pull/1073#issuecomment-719465555 Docs presented on the community meeting (for the old implementation using hooks): - https://hackmd.io/El8Dd2xrTlCaCG59ns5cwg#September-16-2020 - https://docs.google.com/document/d/1xHw5GQjMj6ZKR-40aKmTWZRkvlPuzMGQRu-YpOFQc30/edit Documentation for this feature: - https://www.kernel.org/doc/html/v5.0/userspace-api/seccomp_filter.html#userspace-notification - man pages: seccomp_user_notif.2 at https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif - brauner's blog: https://brauner.github.io/2020/07/23/seccomp-notify.html This PR is an alternative proposal to PR 1038. While similar in nature, the main difference is that this PR adds optional metadata to be sent to the seccomp agent and specifies how the UNIX socket MUST be used. Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>	2021-03-09 18:54:39 +01:00
Kir Kolyshkin	f02cd4a427	config-linux: mark memory.kernel[TCP] as NOT RECOMMENDED Per-cgroup kernel memory accounting (and explicit limiting) is problematic in the Linux kernel for too many reasons to quote here. Besides, cgroup v2 does not even have a kernel memory limit knob, and the one in cgroup v1 was made obsolete in kernel v5.4 [1]. Mark memory.kernel and memory.kernelTCP as NOT RECOMMENDED, in addition to OPTIONAL. This is a way to say "we do not want anyone (runtimes or users) to set those limits, unless they have good understanding and strong reasons to do so". [1] https://github.com/torvalds/linux/commit/0158115f702b0ba208ab0b Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-08 10:50:37 -08:00
Giuseppe Scrivano	f7ef278d1b	seccomp: allow to override default errno return code the specs already support overriding the errno code for the syscalls but the default value is hardcoded to EPERM. Add a new attribute to override the default value. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2021-02-22 16:47:57 +01:00
Giuseppe Scrivano	ec964dfa30	seccomp: expect error with invalid errnoRet Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2021-02-22 16:47:57 +01:00
Iceber Gu	2978430a52	config-linux: fix personality link Signed-off-by: Iceber Gu <wei.cai-nat@daocloud.io>	2021-02-13 14:45:23 +08:00
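
The pids limit semantics described in commit `869b2d5b0c` (nil means "do not touch", -1 means "max", and 0 is an ordinary value) can be sketched in Go roughly as follows. This is a minimal illustration, not the specs-go definition: the `linuxPids` type and the `pidsMaxValue` helper are hypothetical names, and rejecting values below -1 is an assumption of this sketch rather than something the commit message states.

```go
package main

import (
	"fmt"
	"strconv"
)

// linuxPids is a hypothetical, trimmed-down stand-in for the pids cgroup
// settings. The pointer lets an unset limit (nil) differ from an explicit 0.
type linuxPids struct {
	Limit *int64 `json:"limit,omitempty"`
}

// pidsMaxValue returns the string a runtime could write to pids.max and
// whether anything should be written at all.
func pidsMaxValue(p *linuxPids) (value string, write bool, err error) {
	if p == nil || p.Limit == nil {
		// Unset: leave the existing cgroup value untouched.
		return "", false, nil
	}
	switch limit := *p.Limit; {
	case limit == -1:
		// -1 is the representation of "no limit".
		return "max", true, nil
	case limit < -1:
		// Assumption of this sketch: other negative values are rejected.
		return "", false, fmt.Errorf("invalid pids limit %d", limit)
	default:
		// 0 is passed through like any other number; it is not special.
		return strconv.FormatInt(limit, 10), true, nil
	}
}

func main() {
	zero, unlimited := int64(0), int64(-1)
	for _, p := range []*linuxPids{nil, {Limit: &zero}, {Limit: &unlimited}} {
		value, write, err := pidsMaxValue(p)
		fmt.Printf("value=%q write=%v err=%v\n", value, write, err)
	}
}
```

Keeping the limit as a pointer is what makes the three cases distinguishable; with a bare `int64`, an omitted field and an explicit 0 would decode to the same value.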


228 Commits