Virtualize /proc/pressure/{cpu, io, memory} by doing a simple
passthrough of write()/poll() syscalls to the underlying cgroup's
/sys/fs/cgroup/x/y/z/{cpu, io, memory}.pressure file.
Implementation is a bit tricky because FUSE notifications must
be issued asynchronously and we have to use a separate thread for this.
My main concern here was to ensure that no thread leaks are possible,
because they can be a potential DoS for the host.
If the PSITRIGGERTEST macro is defined, then instead of polling on a real
fd we do a simple nanosleep() with a 1 second delay. This is needed to enable
CI testing of this feature.
For "real-world" testing, I used the example program from [1],
but with the cpu counter instead of memory. To generate cpu pressure I used
the "sysbench --threads="$(nproc)" cpu run" command.
Link: https://www.kernel.org/doc/Documentation/accounting/psi.rst [1]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Define SIG_NOTIFY_POLL_WAKEUP as SIGRTMIN + 0 and install
a no-op signal handler. This signal will be used to manage
notification threads.
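A minimal sketch of what installing such a handler can look like, not the
actual patch. The helper name install_notify_wakeup_handler is hypothetical,
and the idea that the signal exists to interrupt a blocking call in a
notification thread is an assumption on my part.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Same definition as described above; SIGRTMIN is evaluated at run time. */
#define SIG_NOTIFY_POLL_WAKEUP (SIGRTMIN + 0)

/* Intentionally empty: delivery of the signal is the only effect we want. */
static void noop_handler(int sig)
{
	(void)sig;
}

/* Hypothetical helper. Returns 0 on success, -1 on failure (errno is set). */
static int install_notify_wakeup_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = noop_handler;
	sigemptyset(&sa.sa_mask);
	/* No SA_RESTART: interrupted blocking calls return EINTR. */
	sa.sa_flags = 0;

	return sigaction(SIG_NOTIFY_POLL_WAKEUP, &sa, NULL);
}
```

A handler is used instead of SIG_IGN because an ignored signal does not
interrupt a blocked thread, while a handled one does.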
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
This change is safe because I'm adding a union and we have enough
space for a void *.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
The following new metrics are now available in /proc/meminfo:
- Zswap: the total amount of memory consumed by the zswap compression
backend
- Zswapped: amount of application memory swapped out to zswap
Signed-off-by: Kadin Sayani <kadin.sayani@canonical.com>
If the loadavg thread falls behind schedule for any reason, the calculation can
overflow, resulting in an unintended sleep duration of approximately 70 minutes.
To prevent this, the logic has been updated to skip the sleep in cases where the
calculation would overflow.
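A minimal sketch of the guard, not the actual patch. The helper name
remaining_sleep_us and the 32-bit microsecond types are assumptions chosen for
illustration. With unsigned arithmetic, subtracting an elapsed time larger than
the interval wraps around to a value near 2^32 us (about 71.6 minutes), which
would be consistent with the sleep duration described above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how long the thread should still sleep, given the
 * refresh interval and the time already spent on this iteration.
 * Returns 0 (skip the sleep) when the thread is on or behind schedule,
 * instead of letting the unsigned subtraction wrap around.
 */
static uint32_t remaining_sleep_us(uint32_t interval_us, uint32_t elapsed_us)
{
	if (elapsed_us >= interval_us)
		return 0;

	return interval_us - elapsed_us;
}
```

The caller would then only sleep when the returned value is non-zero.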
Signed-off-by: Deyan Doychev <deyan@siteground.com>
The kernel has supported PSI (pressure stall information) since 4.20
via procfs /proc/pressure/{io,cpu,memory} and
cgroupv2 {io.pressure, cpu.pressure, memory.pressure}.
This patch adds a read-only PSI procfs,
so people can get pressure information now.
Full monitoring functionality is still under investigation.
Signed-off-by: Feng Sun <loyou85@gmail.com>
When the LXCFS daemon runs in the root cgroup of the cgroup2 tree,
we need to go down the tree when checking for memory.swap.current.
We already have some logic to go up the tree from the LXCFS daemon's
cgroup, but it's useless when the LXCFS daemon sits in the cgroup2 tree root.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Adds a --runtime-dir cli flag which overrides the /run dir in the lxcfslib.
This ended up being kind of tricky because of how lxcfslib can be reloaded and
its use of a library constructor.
In order to read the cli flag and then set a variable in the library, I removed
the constructor and made init happen as part of the fuse load/reload.
I also added the runtime field to the lxcfs_opts struct and upped its version
for backwards compatibility.
Signed-off-by: Sebastien Dabdoub <sebastien.dabdoub@gmail.com>
- permission mask change 0755 -> 0700
- prevent potential NULL-pointer dereference in lxcfs_fuse_init()
- commit message edits
- one commit was squashed
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
What was RUNTIME_PATH is now named DEFAULT_RUNTIME_PATH so this should not change current behavior.
It is done in preparation for adding a flag to override the runtime path on startup.
Signed-off-by: Sebastien Dabdoub <sebastien.dabdoub@gmail.com>
- permission mask change 0755 -> 0700
- commit message edits
- use tabs everywhere instead of spaces
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
096972f7 and fc8f593b introduced task personality retrieval to fix
incorrect /proc file info in some cases.
Linux governs access to personalities based on system ptrace policy,
which may be restricted by an LSM (e.g. Yama).
This patch implements a simple check for access to init's personality to
make sure ptrace usage is allowed. If it is not, access from containers to
proc files is denied with a "Permission denied" error.
> closes #636 (follow-up to #553 and #609).
Signed-off-by: Samuel FORESTIER <samuel+dev@forestier.app>
Since memory.swap.max = 0 is valid under v2, limits of 0 must not be
treated differently. Instead, use UINT64_MAX as the default limit. This aligns
with cgroups v1 behaviour anyway since 'limit_in_bytes' files contain a large
number for unspecified limits (2^63).
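A minimal sketch of the idea, not the actual patch. The helper name
parse_swap_limit is hypothetical. It treats the cgroup v2 string "max" (and
anything unparsable) as "no limit" by returning UINT64_MAX, so that a literal 0
remains a valid limit.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert the contents of a swap limit file into a
 * byte count. UINT64_MAX means "no limit"; 0 is a real limit of zero.
 */
static uint64_t parse_swap_limit(const char *s)
{
	unsigned long long v;
	char *end;

	if (!s || strncmp(s, "max", 3) == 0)
		return UINT64_MAX;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno != 0 || end == s)
		return UINT64_MAX;

	return (uint64_t)v;
}
```

With this, callers compare against UINT64_MAX to detect an unset limit instead
of testing for 0.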
Resolves: #534
Signed-off-by: Alex Hudspith <alex@hudspith.io>
On cgroups v2, there are no swap current/max files at the cgroup root, so
can_use_swap must look lower in the hierarchy to determine if swap accounting
is enabled. To also account for memory accounting being turned off at some
level, walk the hierarchy upwards from lxcfs' own cgroup.
Signed-off-by: Alex Hudspith <alex@hudspith.io>
[ added a check that the cgroup pointer is not NULL in lxcfs_init() ]
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
During our private discussion, Stéphane proposed
adding a new option, --enable-cgroup, to explicitly
enable the old cgroup emulation code.
It's worth mentioning that the cgroup code in LXCFS
is not widely used, because it was written before
the cgroup namespace era and is not relevant these days.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
It's just dangerous to allow passthrough of the write()
syscall anywhere under the emulated sysfs subtree.
Let's forbid it.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
The "total_cache" from memory.stat of cgroup includes
the memory used by tmpfs files ("total_shmem"). Considering
it as available memory is wrong because files created
on a tmpfs file system cannot be simply reclaimed.
So the available memory is calculated as the sum of:
* Memory the kernel knows is free
* Memory that is contained in the kernel active file LRU,
that can be reclaimed if necessary
* Memory that is contained in the kernel inactive file
LRU, that can be reclaimed if necessary
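The sum above can be sketched as follows. This is an illustration only: the
struct and field names are hypothetical, and treating "free" as limit minus
usage is an assumption rather than something taken from the patch.

```c
#include <stdint.h>

/* Hypothetical container; all values are in bytes. */
struct mem_snapshot {
	uint64_t limit;         /* cgroup memory limit */
	uint64_t usage;         /* current cgroup memory usage */
	uint64_t active_file;   /* active file LRU */
	uint64_t inactive_file; /* inactive file LRU */
};

/*
 * Available memory = free + active file LRU + inactive file LRU.
 * Shmem/tmpfs pages are deliberately not counted, since they cannot
 * simply be reclaimed.
 */
static uint64_t mem_available(const struct mem_snapshot *m)
{
	uint64_t free_mem = m->limit > m->usage ? m->limit - m->usage : 0;

	return free_mem + m->active_file + m->inactive_file;
}
```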
Signed-off-by: Kyeong Yoo <kyeong.yoo@alliedtelesis.co.nz>
	struct cg_proc_stat *cur;
	...
	lxcfs_debug("Removing stat node for %s\n", cur);
should be:
	lxcfs_debug("Removing stat node for %s\n", cur->cg);
Only reproducible when the DEBUG macro is defined.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Clean up start_loadavg code:
- add a new external symbol load_daemon_v2 with the pthread_create-like signature
- make hacky casts of pthread_t to int (and reverse) unnecessary for new API users
Related to: #610
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>