The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.
This approach is similar to the existing "devices" field used for block
devices, but uses a dictionary keyed by the interface name instead.
The proposed scheme is based on the kernel's existing representation of
network devices, `struct net_device` (see
https://docs.kernel.org/networking/netdevices.html).
This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.
Signed-off-by: Antonio Ojea <aojea@google.com>
This PR proposes updates to the OCI runtime spec with
z/OS platform-specific details, including adding
namespaces, adding the noNewPrivileges flag, and removing
devices. These changes are currently in use by the
IBM z/OS Container Platform (zOSCP) product; details
can be found at
https://www.ibm.com/products/zos-container-platform.
Signed-off-by: Neil Johnson <najohnsn@us.ibm.com>
Signed-off-by: Kershaw Mehta <kershaw@us.ibm.com>
Add `features.md` and `features-linux.md`, to formalize the `runc features` JSON that was introduced in runc v1.1.0.
A runtime caller MAY use this JSON to detect the features implemented by the runtime.
The spec corresponds to https://github.com/opencontainers/runc/blob/v1.1.0/types/features/features.go
(opencontainers/runc PR 3296, opencontainers/runc PR 3310)
Differences since runc v1.1.0:
- Add `.linux.intelRdt.enabled` field
- Add `.linux.cgroup.rdma` field
- Add `.linux.seccomp.knownFlags` and `.linux.seccomp.supportedFlags` fields (Implemented in runc PR 3588)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The time namespace is a kernel feature, available since Linux 5.6, that
isolates the system monotonic and boot-time clocks.
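For context, time_namespaces(7) documents that clock offsets are written to `/proc/PID/timens_offsets` as lines of the form `<clock-id> <offset-secs> <offset-nanosecs>`, and that this must happen before any process has been created in the new namespace. A minimal Go sketch that only renders that text; the `ClockOffset` type and `formatTimensOffsets` helper are hypothetical, and nothing here touches a real namespace:

```go
package main

import (
	"fmt"
	"strings"
)

// ClockOffset is a hypothetical description of one clock adjustment.
type ClockOffset struct {
	Clock string // "monotonic" or "boottime"
	Secs  int64  // may be negative
	Nanos uint32 // must be in [0, 999999999]
}

// formatTimensOffsets renders offsets in the timens_offsets line format.
func formatTimensOffsets(offsets []ClockOffset) (string, error) {
	var b strings.Builder
	for _, o := range offsets {
		if o.Clock != "monotonic" && o.Clock != "boottime" {
			return "", fmt.Errorf("clock %q is not virtualized by time namespaces", o.Clock)
		}
		if o.Nanos >= 1_000_000_000 {
			return "", fmt.Errorf("nanoseconds out of range: %d", o.Nanos)
		}
		fmt.Fprintf(&b, "%s %d %d\n", o.Clock, o.Secs, o.Nanos)
	}
	return b.String(), nil
}

func main() {
	s, err := formatTimensOffsets([]ClockOffset{{"monotonic", 86400, 0}, {"boottime", 86400, 0}})
	if err != nil {
		panic(err)
	}
	fmt.Print(s)
}
```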
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
The burstable CFS controller was introduced in Linux 5.14. It helps with
parallel workloads that might be bursty: they can get throttled even
when their average utilization is under quota, and they may also be
latency-sensitive, so throttling them is undesirable.
This feature borrows time now against future underrun, at the cost
of increased interference with other users of the system, by introducing
`cfs_burst_us` into CFS bandwidth control. It caps how much unused
bandwidth can accumulate; the accumulated bandwidth can then be spent,
on top of the quota, on bursts.
This patch adds support for controlling the CFS bandwidth burst.
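To make the accumulation rule concrete, here is a deliberately simplified Go model: at each period the group is refilled by `quota`, but the carried-over runtime is capped at `quota + burst`. It ignores per-CPU runtime slices and timer details, so treat it as an illustration of the idea rather than the kernel's scheduler code; `simulate` is a hypothetical helper:

```go
package main

import "fmt"

// simulate returns, for each period, whether the group was throttled.
// quota, burst and demand share one unit (e.g. microseconds of CPU time).
func simulate(quota, burst uint64, demand []uint64) []bool {
	var avail uint64
	throttled := make([]bool, len(demand))
	for i, d := range demand {
		// Refill, then cap accumulated unused runtime at quota+burst.
		avail += quota
		if limit := quota + burst; avail > limit {
			avail = limit
		}
		used := d
		if used > avail {
			used = avail
			throttled[i] = true
		}
		avail -= used
	}
	return throttled
}

func main() {
	demand := []uint64{0, 0, 150} // two idle periods, then a burst
	fmt.Println(simulate(100, 0, demand))  // [false false true]
	fmt.Println(simulate(100, 50, demand)) // [false false false]
}
```

With `burst` at 0 the spike in the third period is throttled even though average usage (50 per period) is well under the quota of 100; with `burst` at 50 the idle periods bank enough runtime to absorb it.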
Fixes https://github.com/opencontainers/runtime-spec/issues/1119
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
This setting can be used to mimic cgroup v1 behavior on cgroup v2
when setting a new memory limit during an update operation.
In cgroup v1, a limit lower than the current usage is rejected.
In cgroup v2, such a low limit causes an OOM kill.
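A minimal Go sketch of the check a runtime could perform before writing a new limit on cgroup v2, assuming current usage is read from the cgroup's `memory.current` file and that `-1` means "unlimited". The helper names are hypothetical; note the check is inherently racy, since usage can grow between the read and the write:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryCurrent parses the contents of a cgroup v2 memory.current file.
func parseMemoryCurrent(contents string) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(contents), 10, 64)
}

// checkMemoryLimit mimics cgroup v1: a new limit below usage is rejected
// instead of being applied (which on v2 would trigger the OOM killer).
func checkMemoryLimit(newLimit, usage int64) error {
	if newLimit == -1 { // unlimited always passes
		return nil
	}
	if newLimit < usage {
		return fmt.Errorf("new memory limit %d is below current usage %d", newLimit, usage)
	}
	return nil
}

func main() {
	usage, err := parseMemoryCurrent("104857600\n") // example file contents
	if err != nil {
		panic(err)
	}
	fmt.Println(checkMemoryLimit(50<<20, usage))  // rejected
	fmt.Println(checkMemoryLimit(200<<20, usage)) // <nil>
}
```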
Ref: https://github.com/opencontainers/runc/issues/3509
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Add the domainname field so that container runtimes can handle it specially, as they already do for hostname. The current workaround of adding a sysctl for kernel.domainname only works with rootful execution in most cases; this change allows rootless execution as well.
Container runtimes will be able to handle it the same way as hostname, calling setdomainname(2) to set the value exposed in /proc/sys/kernel/domainname.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Currently the docs don't say anything about what "pageSize" is other
than the fact that it is a string. Documenting it makes it easier for
developers to understand how it works, and may help avoid mistakes that
are hard to spot.
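As an illustration, assuming pageSize follows the `<size><unit-prefix>B` convention used in hugetlb cgroup control-file names such as `hugetlb.2MB.limit_in_bytes`, a minimal Go sketch that converts a page size in bytes into that string. The `hugePageSizeString` helper is hypothetical:

```go
package main

import "fmt"

// hugePageSizeString formats a huge page size in bytes using the largest
// unit (GB, MB, KB) the size reaches, e.g. 2 MiB -> "2MB".
func hugePageSizeString(bytes uint64) string {
	const (
		kb = 1 << 10
		mb = 1 << 20
		gb = 1 << 30
	)
	switch {
	case bytes >= gb:
		return fmt.Sprintf("%dGB", bytes/gb)
	case bytes >= mb:
		return fmt.Sprintf("%dMB", bytes/mb)
	default:
		return fmt.Sprintf("%dKB", bytes/kb)
	}
}

func main() {
	for _, sz := range []uint64{64 << 10, 2 << 20, 1 << 30} {
		fmt.Println(hugePageSizeString(sz)) // 64KB, 2MB, 1GB
	}
}
```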
Signed-off-by: Odin Ugedal <odin@ugedal.com>
"man 7 user_namespaces" explains the format of uid_map and gid_map:
<containerID> <hostID> <mapSize>
The order of the fields within each JSON map entry does not matter. But for
clarity, I find the spec easier to understand if the JSON fields are listed in
the same order as the fields in the underlying uid_map/gid_map
files.
I am about to file a PR in runtime-tools because the fields in
uid_map/gid_map were parsed in the wrong order.
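A minimal Go sketch of the mapping written out in that field order, so the JSON layout and the file layout line up one-to-one. The `IDMapping` type mirrors the three fields named above and `formatIDMap` is a hypothetical helper; it only produces the text, it does not write any `/proc` file:

```go
package main

import (
	"fmt"
	"strings"
)

// IDMapping lists its fields in the same order as a uid_map/gid_map line.
type IDMapping struct {
	ContainerID uint32 `json:"containerID"`
	HostID      uint32 `json:"hostID"`
	Size        uint32 `json:"size"`
}

// formatIDMap renders one "<containerID> <hostID> <mapSize>" line per mapping.
func formatIDMap(mappings []IDMapping) string {
	var b strings.Builder
	for _, m := range mappings {
		fmt.Fprintf(&b, "%d %d %d\n", m.ContainerID, m.HostID, m.Size)
	}
	return b.String()
}

func main() {
	fmt.Print(formatIDMap([]IDMapping{
		{ContainerID: 0, HostID: 1000, Size: 1},
		{ContainerID: 1, HostID: 100000, Size: 65536},
	}))
}
```

Swapping `ContainerID` and `HostID` here is exactly the kind of bug that is easy to miss when the JSON order differs from the file order.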
Signed-off-by: Alban Crequy <alban@kinvolk.io>
The disableOOMKiller setting is backed by memory.oom_control, so this
commit moves it in with the rest of the memory-controller config.
Looking at the history, the initial request landing a setting for this
in the Docker/OCI ecosystem seems to be [1], which added
Cgroup.OomKillDisable. That commit was carried from libcontainer into
runC [2] where it is now Resources.OomKillDisable [3]. From runC it
was carried into this repo (with some renaming) in [4]. Subsequent
early doc updates landed in [5,6]. In none of those can I find
discussion about why the setting is not already under memory. I
expect the reason is that the runC structures are flat, so "under
memory" is not a thing there. But in this spec, resources has
per-controller sub-properties. The fact that disableOOMKiller
belonged to the memory controller may have been overlooked in [4] and
never revisited until now.
[1]: https://github.com/docker/libcontainer/pull/417
Subject: cgroups: add support for oom control
[2]: 295c70865d
Subject: cgroups: add support for oom control
[3]: https://github.com/opencontainers/runc/blob/v1.0.0-rc3/libcontainer/configs/cgroup_unix.go#L113-L114
[4]: https://github.com/opencontainers/runtime-spec/pull/51
Subject: Add Go types for specification
[5]: https://github.com/opencontainers/runtime-spec/pull/137
Subject: Adding cgroups path to the Spec.
[6]: https://github.com/opencontainers/runtime-spec/pull/199
Subject: runtime: config: linux: add cgroups informations
Signed-off-by: W. Trevor King <wking@tremily.us>
The kernel ABI for these values is a string, which accepts the value `-1`
to mean "unlimited" or an integer up to 2^63 for an amount of memory in
bytes.
While the internal representation in the kernel is unsigned, this is not
exposed directly in any ABI. Because of the user-kernel memory split, values
over 2^63 are not really useful; indeed that much memory is not supported,
as physical memory is limited to 52 bits even with the forthcoming switch to
five-level page tables. So it is much more natural to support the value `-1`
for unlimited. This is especially true because the actual number needed to
represent the maximum has varied between kernel versions and between 32- and
64-bit architectures, so there is no single value a runtime can reliably
compute; writing the string `-1` to the cgroup files is necessary.
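A minimal Go sketch of the resulting conversion, assuming the spec value is a signed 64-bit integer: `-1` passes through as the literal string `-1`, any other negative value is rejected, and non-negative values are written in decimal. The `memoryLimitValue` helper is hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
)

// memoryLimitValue converts a spec limit into the string written to the
// cgroup file. int64 tops out at 2^63-1, matching the useful range above.
func memoryLimitValue(limit int64) (string, error) {
	if limit == -1 {
		return "-1", nil // "unlimited", independent of kernel version or word size
	}
	if limit < 0 {
		return "", fmt.Errorf("invalid memory limit %d: only -1 may be negative", limit)
	}
	return strconv.FormatInt(limit, 10), nil
}

func main() {
	for _, l := range []int64{-1, 1 << 30} {
		s, err := memoryLimitValue(l)
		fmt.Println(s, err)
	}
}
```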
See also discussion in
- https://github.com/opencontainers/runc/pull/1494
- https://github.com/opencontainers/runc/pull/1492
- https://github.com/opencontainers/runc/pull/1375
- https://github.com/opencontainers/runc/issues/1421
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
These names are long enough without the prefix:
linux.resources.blockIO.weight is just as specific as
linux.resources.blockIO.blkioWeight, and likewise for the other fields.
Generated with:
$ sed -i s/blkioWeight/weight/g $(git grep -l blkioWeight)
$ sed -i s/blkioLeaf/leaf/g $(git grep -l blkioLeaf)
$ sed -i s/blkioThrottle/throttle/g $(git grep -l blkioThrottle)
Signed-off-by: W. Trevor King <wking@tremily.us>
The process property has been optional since c41ea83d (config: Make
process optional, 2017-02-27, #701), which landed yesterday.
Mrunal wanted to continue testing a config which has enough for a
'start' invocation [1], so I've kept the old JSON as
minimal-for-start.json (washing it through 'make -C schema fmt' to
adjust the args indenting).
[1]: https://github.com/opencontainers/runtime-spec/pull/805#issuecomment-300811461
Signed-off-by: W. Trevor King <wking@tremily.us>
Also fill in some known-good and known-bad examples. We can make this
as detailed as we want, but this commit just adds enough to know that:
* The full-file spec examples are valid.
* The JSON Schema can distinguish valid examples from invalid JSON.
This will help catch JSON Schema typos like those being addressed by
[1].
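The shape of such a harness can be sketched in Go: each example carries its expected verdict, and any mismatch is reported. Here `json.Valid` is a stand-in that only checks JSON syntax, swapped in for real JSON Schema validation; the `example` type and `runExamples` helper are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// example is one known-good or known-bad document.
type example struct {
	name  string
	doc   string
	valid bool
}

// validate is a placeholder for JSON Schema validation (syntax check only).
func validate(doc []byte) bool { return json.Valid(doc) }

// runExamples returns one message per example whose verdict is wrong.
func runExamples(examples []example) []string {
	var failures []string
	for _, ex := range examples {
		if got := validate([]byte(ex.doc)); got != ex.valid {
			failures = append(failures, fmt.Sprintf("%s: got valid=%v, want %v", ex.name, got, ex.valid))
		}
	}
	return failures
}

func main() {
	failures := runExamples([]example{
		{"good/minimal", `{"ociVersion": "1.0.0"}`, true},
		{"bad/truncated", `{"ociVersion": `, false},
	})
	fmt.Println(len(failures), failures)
}
```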
[1]: https://github.com/opencontainers/runtime-spec/pull/784
Signed-off-by: W. Trevor King <wking@tremily.us>