Add a parameter for enabling per-container resctrl monitoring.
This supersedes and replaces the previous "enableCMT" and "enableMBM"
settings, whose functionality was only vaguely specified. A separate
parameter for every monitoring metric does not make much sense, in
particular because in the resctrl filesystem it is not possible to
selectively enable a subset of the monitoring features. You always get
all the metrics that the system provides. Also, with separate settings
(and corresponding check if the specific metric is available) the user
cannot specify "enable whatever is available" - setting everything to
"true" might fail because one of the metrics is not available on the
platform. In addition, separate parameters are not future-proof: they
make support for new monitoring metrics unnecessarily cumbersome to add.
New metrics are certain to be added in new hardware generations, e.g.
perf/energy monitoring in the near future
(https://lkml.org/lkml/2025/5/21/1631), and requiring an update to the
runtime-spec for each one of them feels like overkill without much
benefit. It is easier to have one switch for "enable container-specific
metrics" and let the user read whatever metrics the platform provides.
Moreover, it is not even possible to turn off monitoring (from the
resctrl filesystem). For example, you always get the metrics for all
CTRL_MON groups (closIDs). However, that is not always very useful as
there are likely a lot of applications packed in the same group. The new
intelRdt.enableMonitoring parameter will enable creation of a MON group
specific to a single container, allowing monitoring of resctrl metrics at
per-container granularity.
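As a rough illustration of the layout described above, the sketch below
computes where a per-container MON group would live. The mon_groups
directory under a CTRL_MON group is part of the Linux resctrl interface;
the function name, the default mount point and the use of the container
ID as the group name are assumptions made for this example only.

```python
from pathlib import PurePosixPath

RESCTRL_ROOT = "/sys/fs/resctrl"  # conventional mount point (assumption)


def mon_group_path(container_id, clos_id="", root=RESCTRL_ROOT):
    """Return the resctrl directory of a per-container MON group.

    An empty clos_id selects the root (default) CTRL_MON group.
    Naming the MON group after the container ID is an assumption.
    """
    base = PurePosixPath(root)
    if clos_id:
        base = base / clos_id
    return str(base / "mon_groups" / container_id)
```

A runtime following this sketch would create that directory and add the
container's processes to its tasks file.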
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
* config-linux: add schemata field to IntelRdt
Add a new "schemata" field to the Linux IntelRdt configuration. This
addresses the complexity of separate schema fields and resolves the
issue of supporting currently uncovered RDT features like L2 cache
allocation and CDP (Code and Data Prioritization).
The new field is for specifying the complete schemata (all schemas) to
be written to the schemata file in Linux resctrl fs. The aim is for
simple usage and runtime implementation (by not requiring any
parsing/filtering of data or otherwise re-implement parsing or
validation of the Linux resctrl interface) and also to support all RDT
features now and in the future (i.e. schemas like L2, L2CODE, L2DATA,
L3CODE and L3DATA and who knows L4 or something else in the future).
Behavior of existing fields is not changed but it is required that the
new schemata field is applied last.
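The ordering requirement can be sketched as follows. This is only an
illustration: it assumes that "schemata" holds a list of complete schema
lines and that the helper name schemata_lines is free to choose. The
existing l3CacheSchema and memBwSchema fields come first and the new
field last, so its lines are written after them.

```python
def schemata_lines(intel_rdt):
    """Return schema lines in write order: existing fields first, schemata last."""
    lines = []
    for key in ("l3CacheSchema", "memBwSchema"):
        value = intel_rdt.get(key)
        if value:
            lines.append(value)
    # Assumption: "schemata" is a list of complete schema lines that the
    # runtime passes through without parsing or filtering.
    lines.extend(intel_rdt.get("schemata", []))
    return lines
```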
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
* Add linux.intelRdt.schemata to features.md
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
---------
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Enable setting a NUMA memory policy for the container. New
linux.memoryPolicy object contains inputs to the set_mempolicy(2)
syscall.
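set_mempolicy(2) takes a mode and a node mask. The sketch below shows
one way a runtime might turn a node list into that bit mask. The
"0-2,5" list syntax and both function names are assumptions for
illustration; the commit itself does not define them.

```python
def parse_node_list(nodes):
    """Parse a node list such as "0-2,5" into a sorted list of node IDs."""
    ids = set()
    for part in nodes.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"invalid node range: {part}")
            ids.update(range(lo, hi + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


def node_mask(nodes):
    """Build the bit mask passed as nodemask: bit N is set for node N."""
    mask = 0
    for node in parse_node_list(nodes):
        mask |= 1 << node
    return mask
```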
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.
This approach is similar to the existing "devices" field used for block
devices but uses a dictionary keyed by the interface name instead.
The proposed scheme is based on the existing representation of network
device by the `struct net_device`
https://docs.kernel.org/networking/netdevices.html.
This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.
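A minimal sketch of how a runtime could read such a dictionary is shown
below. The optional "name" entry used to rename the interface inside the
container, and the function name planned_moves, are assumptions made for
this example; only the "keyed by host interface name" shape comes from
the proposal.

```python
def planned_moves(net_devices):
    """Map each host interface name to its name inside the container."""
    moves = {}
    for host_name, opts in net_devices.items():
        # Assumption: an optional "name" entry renames the interface;
        # without it the host name is kept.
        moves[host_name] = (opts or {}).get("name", host_name)
    return moves
```

The sketch only plans the move; addresses, routes and interface creation
stay out of scope, matching the separation described above.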
Signed-off-by: Antonio Ojea <aojea@google.com>
The time namespace is a new kernel feature available in 5.6+ to
isolate the system monotonic and boot-time clocks.
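For illustration, the kernel accepts per-namespace clock offsets through
/proc/<pid>/timens_offsets, one "<clock> <secs> <nanosecs>" line per
clock. The sketch below formats such content; the input dictionary shape
and the function name are assumptions, not part of this change.

```python
def timens_offsets(offsets):
    """Format offsets for /proc/<pid>/timens_offsets.

    offsets maps "monotonic" and/or "boottime" to (secs, nanosecs).
    """
    lines = []
    for clock in ("monotonic", "boottime"):
        if clock in offsets:
            secs, nanosecs = offsets[clock]
            lines.append(f"{clock} {secs} {nanosecs}")
    return "".join(line + "\n" for line in lines)
```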
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.
This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
`cfs_burst_us` into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then be used additionally for burst.
The patch adds the support/control for CFS bandwidth burst.
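The effect can be illustrated with a deliberately simplified model (not
the kernel's actual accounting): every period adds `quota` of runtime,
and unused runtime carries over up to a cap of quota + burst. The
function name and the model itself are assumptions for this sketch.

```python
def throttled_periods(demands, quota, burst=0):
    """Count periods in which demand exceeds the runtime available.

    Simplified model: each period refills `quota`; unused runtime
    accumulates, capped at quota + burst.
    """
    available = 0
    throttled = 0
    for demand in demands:
        available = min(available + quota, quota + burst)
        if demand > available:
            throttled += 1
        available -= min(demand, available)
    return throttled
```

With demands of [40, 150, 40, 150] against a quota of 100, the average
utilization is 95, yet the two busy periods are throttled without burst
and none are throttled with a burst of 100.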
Fixes https://github.com/opencontainers/runtime-spec/issues/1119
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
This setting can be used to mimic cgroup v1 behavior on cgroup v2,
when setting the new memory limit during update operation.
In cgroup v1, a limit which is lower than the current usage is rejected.
In cgroup v2, such a low limit causes an OOM kill.
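The intended check can be sketched like this. The function and
parameter names are hypothetical; a runtime would read the current usage
from the cgroup before writing the new limit.

```python
def may_apply_limit(new_limit, current_usage, check_before_update):
    """Decide whether a memory limit update should be written.

    Mimics cgroup v1: with the check enabled, a limit below the current
    usage is rejected instead of being applied.
    """
    if new_limit < 0:  # -1 means "unlimited" and is always safe
        return True
    if check_before_update and new_limit < current_usage:
        return False
    return True
```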
Ref: https://github.com/opencontainers/runc/issues/3509
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The spec already supports overriding the errno code for individual
syscalls, but the default value is hardcoded to EPERM.
Add a new attribute to override the default value.
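One possible reading of the resulting lookup is sketched below. It
assumes every action involved is SCMP_ACT_ERRNO, that the new attribute
is called defaultErrnoRet, and that it applies to syscalls not matched
by any rule; these are assumptions for illustration, not a statement of
what the attribute is finally named or how runtimes implement it.

```python
import errno


def errno_for_syscall(name, seccomp):
    """Return the errno a blocked syscall would see (SCMP_ACT_ERRNO only)."""
    for rule in seccomp.get("syscalls", []):
        if name in rule.get("names", []):
            # Per-syscall override, EPERM when the rule sets none.
            return rule.get("errnoRet", errno.EPERM)
    # Assumption: the new attribute covers the default action.
    return seccomp.get("defaultErrnoRet", errno.EPERM)
```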
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Allow users to specify cgroup v2 resources.
Each element in the map refers to a file in the cgroup v2 hierarchy,
and the element value is the content to write to that file.
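A minimal sketch of applying such a map follows. The function name and
the rejection of keys containing a path separator are assumptions made
for this example; cgroup_dir stands for the container's cgroup v2
directory.

```python
import os


def apply_unified(cgroup_dir, unified):
    """Write each map entry to the cgroup file named by its key."""
    for name, content in unified.items():
        # Assumption: keys are plain file names, never paths.
        if "/" in name or name in ("", ".", ".."):
            raise ValueError(f"invalid cgroup file name: {name!r}")
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(content)
```

For example, {"memory.high": "1073741824", "pids.max": "100"} results in
two writes, one per file.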
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Currently the docs don't say anything about what the "pageSize" is other
than the fact that it is a string. Documenting the format makes it easier
for developers to understand how it works, and may help avoid mistakes
which are hard to spot.
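As an illustration of the kind of mistake a documented format prevents,
the sketch below parses a page size, assuming the form
<size><unit-prefix>B with an upper-case K, M or G prefix (for example
64KB, 2MB, 1GB). That format and the function name are assumptions here;
this message does not spell the format out.

```python
import re

_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def page_size_bytes(page_size):
    """Convert a page size string such as "2MB" to bytes."""
    match = re.fullmatch(r"([1-9][0-9]*)([KMG])B", page_size)
    if match is None:
        raise ValueError(f"unrecognized pageSize: {page_size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]
```

Under this assumption a value such as "2mb" or "2M" is rejected early
instead of silently failing to match a hugetlb control file.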
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Add support for Intel Resource Director Technology (RDT) /
Memory Bandwidth Allocation (MBA). Add memory bandwidth resource
constraints in Linux-specific configuration.
In this PR, the spec for memory bandwidth (memBwSchema) keeps
the same format as the existing spec for L3 cache (l3CacheSchema)
for consistency and compatibility in runtime-spec 1.x.
Example:
    "linux": {
        "intelRdt": {
            "closID": "guaranteed_group",
            "l3CacheSchema": "L3:0=7f0;1=1f",
            "memBwSchema": "MB:0=20;1=70"
        }
    }
This is the prerequisite of this runc proposal:
https://github.com/opencontainers/runc/issues/1596
For more information about Intel RDT/MBA, please refer to:
https://github.com/opencontainers/runc/issues/1596
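Since both schema fields share one format, a single parser covers them.
The sketch below is an illustration only; the function name is
hypothetical and values are kept as strings because their meaning
differs per resource (a bit mask for L3, a percentage for MB).

```python
def parse_schema(line):
    """Split "MB:0=20;1=70" into ("MB", {0: "20", 1: "70"})."""
    resource, sep, rest = line.partition(":")
    if not sep or not rest:
        raise ValueError(f"malformed schema line: {line!r}")
    values = {}
    for item in rest.split(";"):
        domain_id, eq, value = item.partition("=")
        if not eq:
            raise ValueError(f"malformed schema entry: {item!r}")
        values[int(domain_id)] = value
    return resource.strip(), values
```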
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Creating a dedicated RDT Class of Service (CLOS) for each running
container, even if they have exactly the same schema, will lead to a
shortage of CLOS, since there is a hardware limit on the number of CLOS,
around 16 per platform.
This PR adds one parameter, 'closID', to the existing spec to allow the
user to specify which RDT Class of Service (CLOS) the container will be
placed in. Containers with the same schema can then be placed into one
single CLOS.
Example:
    "linux": {
        "intelRdt": {
            "closID": "guaranteed_group",
            "l3CacheSchema": "L3:0=ffff0;1=3ff"
        }
    }
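The saving can be illustrated with a small sketch that groups containers
by the class they end up in. It assumes, for illustration only, that a
container without closID gets a class of its own named after the
container; the function name is hypothetical.

```python
from collections import defaultdict


def containers_by_clos(configs):
    """Group container IDs by the CLOS they would be placed in."""
    groups = defaultdict(list)
    for container_id, intel_rdt in configs.items():
        # Assumption: without closID, fall back to a per-container class.
        groups[intel_rdt.get("closID", container_id)].append(container_id)
    return dict(groups)
```

Two containers sharing "guaranteed_group" then consume one CLOS instead
of two.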
Signed-off-by: Lin Yang <lin.a.yang@intel.com>
We're using JSON Schema draft-04 [1], as declared by our '$schema'
properties [2]. In draft-04, the 'id' keyword alters the resolution
scope. But our current '$ref' values use JSON Pointers [3,4] with
relative references like 'defs-linux.json#/definitions/Device' that
ignore the 'id's.
By draft-07, 'id' has become '$id', and [5]:
  The root schema of a JSON Schema document SHOULD contain an "$id"
  keyword with a URI (containing a scheme).
But since [6], including any URI that cannot be retrieved generates an
error:
  $ ./validate config-schema.json test/config/good/minimal.json
  Could not read schema from HTTP, response status is 404 Not Found
While a root 'id' entry would be nice, we don't currently host these
anywhere with a useful URI. We could use [7], but then testing pull
requests would be difficult.
By draft-07, the purpose of internal '$id' entries is clearly
explained [5]:
  Providing a plain name fragment enables a subschema to be relocated
  within a schema without requiring that JSON Pointer references are
  updated.
We don't need that, because we control all the references. In the
infrequent event of a subschema move, we can update the consuming
references in the same commit.
The draft-07 $ref docs also explain that $ref targets may be URNs [8]:
  The URI is not a network locator, only an identifier. A schema need
  not be downloadable from the address if it is a network-addressable
  URL, and implementations SHOULD NOT assume they should perform a
  network operation when they encounter a network-addressable URI.
I haven't found analogous wording for $id, but it's possible that
gojsonschema is being overly aggressive with its attempted retrievals.
This commit removes all of our 'id' entries. The resulting JSON
Schema is valid (regardless of where you host it) and does not
generate the 404s.
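For illustration, the removal could be scripted roughly as below. This
is a sketch and not how the change was actually produced: it drops only
string-valued 'id' keys, on the assumption that a dict-valued 'id' is a
property definition named "id" rather than the draft-04 keyword.

```python
def strip_ids(schema):
    """Return a copy of a JSON Schema without string-valued 'id' keywords."""
    if isinstance(schema, dict):
        return {
            key: strip_ids(value)
            for key, value in schema.items()
            # Keep a property that merely happens to be named "id".
            if not (key == "id" and isinstance(value, str))
        }
    if isinstance(schema, list):
        return [strip_ids(item) for item in schema]
    return schema
```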
Reported by Tom Godkin [9] and William Martin [10].
[1]: https://tools.ietf.org/html/draft-zyp-json-schema-04#section-7.2
[2]: https://tools.ietf.org/html/draft-zyp-json-schema-04#section-6
[3]: https://tools.ietf.org/html/draft-ietf-appsawg-json-pointer-07
[4]: https://tools.ietf.org/html/rfc6901
[5]: https://tools.ietf.org/html/draft-handrews-json-schema-00#section-9.2
[6]: 83a7f6369d
[7]: https://raw.githubusercontent.com/opencontainers/runtime-spec/v1.0.1/schema/config-schema.json
[8]: https://tools.ietf.org/html/draft-handrews-json-schema-00#section-8
[9]: https://github.com/opencontainers/runc/issues/1680
[10]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/L9ME-YRPmmc
Subject: runtime-spec validation questions
Date: Thu, 4 Jan 2018 15:47:50 +0000
Message-ID: <CAMp6QwMTJab5K25=CVy=6OZV6NRX0s-nMLGwqC8ZMpFEp5bF_Q@mail.gmail.com>
Signed-off-by: W. Trevor King <wking@tremily.us>
It's backed by memory.oom_control, so this commit moves it in with
the rest of the memory-controller config.
Looking at the history, the initial request landing a setting for this
in the Docker/OCI ecosystem seems to be [1], which added
Cgroup.OomKillDisable. That commit was carried from libcontainer into
runC [2] where it is now Resources.OomKillDisable [3]. From runC it
was carried into this repo (with some renaming) in [4]. Subsequent
early doc updates landed in [5,6]. In none of those can I find
discussion about why the setting is not already under memory. I
expect the reason is that the runC structures are flat, so "under
memory" is not a thing there. But in this spec, resources has
per-controller sub-properties. The fact that disableOOMKiller
belonged to the memory controller may have been overlooked in [4] and
never revisited until now.
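For config authors, the move amounts to the following transformation,
sketched under the assumption that the property keeps its name and only
changes parent (from linux.resources to linux.resources.memory). The
helper name is hypothetical.

```python
import copy


def move_oom_kill_disable(config):
    """Return a copy with disableOOMKiller moved under resources.memory."""
    new_config = copy.deepcopy(config)
    resources = new_config.get("linux", {}).get("resources", {})
    if "disableOOMKiller" in resources:
        value = resources.pop("disableOOMKiller")
        resources.setdefault("memory", {})["disableOOMKiller"] = value
    return new_config
```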
[1]: https://github.com/docker/libcontainer/pull/417
Subject: cgroups: add support for oom control
[2]: 295c70865d
Subject: cgroups: add support for oom control
[3]: https://github.com/opencontainers/runc/blob/v1.0.0-rc3/libcontainer/configs/cgroup_unix.go#L113-L114
[4]: https://github.com/opencontainers/runtime-spec/pull/51
Subject: Add Go types for specification
[5]: https://github.com/opencontainers/runtime-spec/pull/137
Subject: Adding cgroups path to the Spec.
[6]: https://github.com/opencontainers/runtime-spec/pull/199
Subject: runtime: config: linux: add cgroups informations
Signed-off-by: W. Trevor King <wking@tremily.us>
The kernel ABI to these values is a string, which accepts the value `-1`
to mean "unlimited" or an integer up to 2^63 for an amount of memory in
bytes.
While the internal representation in the kernel is unsigned, this is not
exposed in any ABI directly. Because of the user-kernel memory split, values
over 2^63 are not really useful; indeed that much memory is not supported,
as physical memory is limited to 52 bits in the forthcoming switch to five
level page tables. So it is much more natural to support the value `-1` for
unlimited. The actual number needed to represent the maximum has varied
across kernel versions and across 32 and 64 bit architectures, so the value
to use cannot be determined reliably; it is necessary to write the string
`-1` to the cgroup files instead.
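A sketch of the resulting conversion from a signed 64-bit config value
to the string written to the cgroup file is shown below. The function
name is hypothetical, and rejecting other negative values is an
assumption of this example.

```python
INT64_MAX = 2**63 - 1


def cgroup_limit_string(limit):
    """Return the string to write for a memory limit; -1 means unlimited."""
    if limit == -1:
        return "-1"
    # Assumption: any other negative value is treated as a config error.
    if not 0 <= limit <= INT64_MAX:
        raise ValueError(f"memory limit out of range: {limit}")
    return str(limit)
```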
See also discussion in
- https://github.com/opencontainers/runc/pull/1494
- https://github.com/opencontainers/runc/pull/1492
- https://github.com/opencontainers/runc/pull/1375
- https://github.com/opencontainers/runc/issues/1421
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
These are long enough without the prefix, and
linux.resources.blockIO.blkioWeight, etc. are just as specific as
linux.resources.blockIO.weight.
Generated with:
  $ sed -i s/blkioWeight/weight/g $(git grep -l blkioWeight)
  $ sed -i s/blkioLeaf/leaf/g $(git grep -l blkioLeaf)
  $ sed -i s/blkioThrottle/throttle/g $(git grep -l blkioThrottle)
Signed-off-by: W. Trevor King <wking@tremily.us>
The only discussion related to this is in [1,2], where the
relationship between oomScoreAdj and disableOOMKiller is raised. But
since 429f936 (Adding cgroups path to the Spec, 2015-09-02, #137)
resources has been tied to cgroups, and oomScoreAdj is not about
cgroups. For example, we currently have (in config-linux.md):
  You can configure a container's cgroups via the resources field of
  the Linux configuration.
I suggested we move the property from linux.resources.oomScoreAdj to
linux.oomScoreAdj so config authors and runtimes don't have to worry
about what cgroupsPath means if the only entry in resources is
oomScoreAdj. Michael responded with [4]:
  If anything it should probably go on the process
So that's what this commit does.
I've gone with the four-space indents here to keep Pandoc happy (see
7795661 (runtime.md: Fix sub-bullet indentation, 2016-06-08, #495)),
but have left the existing entries in this list unchanged to reduce
churn.
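Because the value is a per-process setting rather than a cgroup one,
applying it boils down to a write to /proc/<pid>/oom_score_adj, whose
valid range is -1000 to 1000. The sketch below assumes the property
lands at process.oomScoreAdj; the function name is hypothetical.

```python
def oom_score_adj_write(pid, config):
    """Return (path, content) for the oom_score_adj write, or None if unset."""
    value = config.get("process", {}).get("oomScoreAdj")
    if value is None:
        return None
    if not -1000 <= value <= 1000:
        raise ValueError(f"oomScoreAdj out of range: {value}")
    return f"/proc/{pid}/oom_score_adj", str(value)
```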
[1]: https://github.com/opencontainers/runtime-spec/pull/236
[2]: https://github.com/opencontainers/runtime-spec/pull/292
[3]: https://github.com/opencontainers/runtime-spec/pull/137
[4]: https://github.com/opencontainers/runtime-spec/issues/782#issuecomment-299990075
Signed-off-by: W. Trevor King <wking@tremily.us>
Before this commit, linux.seccomp.syscalls was required, but we didn't
require an entry in the array. That means '"syscalls": []' would be
technically valid, and I'm pretty sure that's not what we want.
If it makes sense to have a seccomp property that does not need
syscalls entries, then syscalls should be optional (which is what this
commit is doing).
If it does not make sense to have an empty/unset syscalls then it
should be required and have a minimum length of one.
Before 652323c (improve seccomp format to be more expressive,
2017-01-13, #657), syscalls was omitempty (and therefore more
optional-feeling, although there was no real Markdown spec for seccomp
before 3ca5c6c, config-linux.md: fix seccomp, 2017-03-02, #706, so
it's hard to know). This commit has gone with OPTIONAL, because a
seccomp config which only sets defaultAction seems potentially valid.
The SCMP_ACT_KILL example is prompted by:
On Tue, Apr 25, 2017 at 01:32:26PM -0700, David Lyle wrote [1]:
> Technically, OPTIONAL is the right value, but unless you specify the
> default action for seccomp to be SCMP_ACT_ALLOW the result will be
> an error at run time.
>
> I would suggest an additional clarification to this fact in
> config-linux.md would be very helpful if marking syscall as
> OPTIONAL.
I've phrased the example more conservatively, because I'm not sure
that SCMP_ACT_ALLOW is the only possible value to avoid an error. For
example, perhaps a SCMP_ACT_TRACE default with an empty syscalls array
would not die on the first syscall. The point of the example is to
remind config authors that without a useful syscalls array, the
default value is very important ;).
Also add the previously-missing 'required' property to the seccomp
JSON Schema entry.
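A config linter could flag the case from the example along these lines.
The check is intentionally narrow: it only reports SCMP_ACT_KILL with no
syscalls entries, since it is unclear which other defaults are equally
fatal. The function name is hypothetical.

```python
def kills_every_syscall(seccomp):
    """True when defaultAction is SCMP_ACT_KILL and no syscalls are listed."""
    return (
        seccomp.get("defaultAction") == "SCMP_ACT_KILL"
        and not seccomp.get("syscalls")
    )
```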
[1]: https://github.com/opencontainers/runtime-spec/pull/768#issuecomment-297156102
Signed-off-by: W. Trevor King <wking@tremily.us>
Maintainers feel (and I agree) that there's no point in explicitly
allowing a null value when callers can simply leave the property unset
[1]. This commit removes all references to "pointer" and "null" from
the JSON Schema to support that decision. While optional properties
may sometimes be represented as pointer types in Go [2], optional
properties should be represented in JSON Schema by not including the
properties in the 'required' array.
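The difference can be illustrated with a toy checker (a deliberately
tiny subset of JSON Schema, not a real validator; the names and the
example property names are made up). A property left out of 'required'
may be absent, but an explicit null no longer matches its declared type.

```python
_TYPES = {"string": str, "boolean": bool, "integer": int,
          "object": dict, "array": list}


def check_object(instance, schema):
    """Check 'required' and simple 'type' constraints; return error strings."""
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"{name}: required property is missing")
    for name, value in instance.items():
        expected = schema.get("properties", {}).get(name, {}).get("type")
        # None never matches a declared type, so null is rejected.
        if expected and not isinstance(value, _TYPES[expected]):
            errors.append(f"{name}: expected {expected}")
    return errors
```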
[1]: https://github.com/opencontainers/runtime-spec/pull/555#issuecomment-272020515
[2]: style.md "Optional settings should not have pointer Go types"
Signed-off-by: W. Trevor King <wking@tremily.us>