Ironically, when running as root via sudo, the path to test-bwrap may
be unreachable in the user namespace, as root does not have
permission and CAP_DAC_OVERRIDE only works for mapped uids.
Fix by using /proc/self/exe for nested bwrap.
`--bind /proc /proc` combined with `--unshare-all` results in `/proc` being
for the wrong pid namespace, causing `namespace_ids_read` to fail by
either reading the wrong process or dying with an error.
For example try: unshare -rpfm --mount-proc make check
There's an effort to migrate Linux filesystems to handle the y2038
problem, which is great. However, recently a kernel change landed
that emits a warning when mounting a filesystem that doesn't
handle it, and this notably shows up even when *remounting* e.g.
for a read-only bind mount:
Using e.g. `rpm-ostree install cowsay` there's a spam of:
```
[ 189.529594] xfs filesystem being remounted at /sysroot supports timestamps until 2038 (0x7fffffff)
```
Now, particularly when creating our bind mounts, let's
ask the kernel to be quiet about it. This is not a major event
worthy of a kernel log.
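A minimal sketch of what asking the kernel to be quiet could look like, assuming the `MS_SILENT` mount flag is the mechanism used. The helper names `readonly_bind_remount_flags` and `remount_readonly_quietly` are hypothetical and not taken from the bubblewrap sources.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper: flags for remounting an existing bind mount
   read-only while suppressing the kernel log message (MS_SILENT). */
unsigned long
readonly_bind_remount_flags (void)
{
  return MS_BIND | MS_REMOUNT | MS_RDONLY | MS_SILENT;
}

/* Hypothetical wrapper around mount(2); calling it requires
   CAP_SYS_ADMIN in the current mount namespace. */
int
remount_readonly_quietly (const char *dest)
{
  return mount ("none", dest, NULL, readonly_bind_remount_flags (), NULL);
}
```

Keeping the flag computation separate from the mount(2) call makes it easy to check that every remount path includes `MS_SILENT`.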
As pointed out by Stephen Röttger <sroettger@google.com>, in
drop_privs() we only drop root in the setuid case if geteuid() is
0. Typically geteuid() == 0 means we were setuid root and have not yet
switched away from it.
However, it is possible to make the geteuid() check fail by passing a
--userns2 namespace which doesn't have 0 mapped (i.e. where geteuid()
will return the overflow uid instead).
If you do this, the pid 1 process in the sandbox will continue running
as host uid 0, while dropping the dumpable flag, and at this point the
user can ptrace attach the process and have root permissions.
We fix this by not relying on the geteuid() call to know when we need
to drop root uid, but rather keep track of whether we already switched
from it.
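The difference between the two approaches can be sketched as follows. This is an illustration only: `struct priv_state` and both function names are hypothetical, and 65534 is merely the usual default overflow uid.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical bookkeeping, not bubblewrap's actual variables. */
struct priv_state
{
  bool started_setuid; /* we were launched as a setuid root binary */
  bool root_dropped;   /* we have already switched away from uid 0 */
};

/* Old-style check: trusts the euid as seen from the current user
   namespace, which an unmapped uid 0 turns into the overflow uid. */
bool
should_drop_root_by_euid (uid_t euid_as_seen)
{
  return euid_as_seen == 0;
}

/* New-style check: relies only on what the process itself recorded. */
bool
should_drop_root_by_state (const struct priv_state *state)
{
  return state->started_setuid && !state->root_dropped;
}
```

With an unmapped uid 0, `should_drop_root_by_euid (65534)` is false even though the process is still host root, whereas the state-based check still requests the drop.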
In the non-setuid case if we're not running as uid 0 in the final
namespace but we need devpts (e.g. use --dev) we mount the devpts as
uid 0 and then change to the actual numerical uid at the end. This
final unshare(CLONE_NEWUSER) will reset the cap bounding set we
previously cleared.
This change clears the cap bounding set again after the unshare call.
This is not really a security problem because we always set
NO_NEW_PRIVS which is essentially a superset of capability bounds, so
there is no way the container can use the bounding set to gain
caps. However, it's nice to be consistent and not display settings
which look like potential problems.
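Clearing the bounding set again might look roughly like the sketch below. The helper name `clear_cap_bounding_set` is hypothetical, and ignoring EPERM is an assumption made only so the sketch can run unprivileged; real code would likely treat that as fatal.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Hypothetical helper: try to drop every capability from the bounding
   set. Returns the number of capabilities the kernel reported, or -1
   on an unexpected error. */
int
clear_cap_bounding_set (void)
{
  int cap;

  /* Capability sets are 64-bit masks, so 64 is a safe upper bound. */
  for (cap = 0; cap < 64; cap++)
    {
      int present = prctl (PR_CAPBSET_READ, cap, 0, 0, 0);

      if (present < 0)
        break; /* past the last capability known to this kernel */

      /* EPERM means we lack CAP_SETPCAP; tolerated in this sketch only. */
      if (present == 1 &&
          prctl (PR_CAPBSET_DROP, cap, 0, 0, 0) < 0 &&
          errno != EPERM)
        return -1;
    }

  return cap;
}
```

Calling such a helper once before and once after the final unshare keeps the bounding set empty on both sides of the namespace switch.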
Fixes https://github.com/containers/bubblewrap/issues/350
See 6b3dd4f10c for the original change
that drops the cap bounding set in the first location.
Release 0.4.0
- Add support for reusing existing namespaces with --userns and --pidns
- Stores namespace info in status json
- In setuid mode pid 1 is now marked dumpable
- Now builds with musl libc
This enables these options in this case and also ensures we set[ug]id
to the destination ids early in entering the namespace because
otherwise creating files during sandbox setup fails if the real user
id isn't mapped in the destination user namespace (and to make us
actually be that user/group).
This allows a sandbox to share a pid namespace with another sandbox.
For this to work the namespace passed in must be owned by the user
namespace that bwrap is using, which implies either that you pass in
--userns pointing to it, or run under that user namespace already. In the
former case you'd typically take the userns from a running bwrap
--unshare-user instance, whereas the second case happens when using
bwrap in the setuid mode without user namespaces.
If both --unshare-pid and --pidns are specified then we first
switch to the pid namespace, and then unshare from there. This is
useful if you want a pid-isolated sandbox that is visible to another
sandbox.
The implementation is a bit tricky, as it needs to fork() in order
to activate the setns():ed pid namespaces, which means we have to
pass through the final pid via a socket to make the kernel translate
the pid to the initial pid namespace for us to waitpid() on it.
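One way to have the kernel report a pid across a unix socket is the SCM_CREDENTIALS ancillary message, sketched below within a single process. Treat this as an assumption about the mechanism rather than a copy of the implementation: `pid_seen_by_receiver` and `struct peer_cred` are hypothetical names, and the local struct mirrors the layout of Linux's `struct ucred`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SCM_CREDENTIALS
#define SCM_CREDENTIALS 0x02 /* Linux value, fallback if headers hide it */
#endif

/* Same layout as Linux's struct ucred, declared locally for the sketch. */
struct peer_cred { pid_t pid; uid_t uid; gid_t gid; };

/* Hypothetical helper: send one byte over a socketpair and return the
   sender pid as the kernel reports it to the receiver, or -1 on failure. */
pid_t
pid_seen_by_receiver (void)
{
  int sv[2];
  int one = 1;
  char byte = 0;
  pid_t result = -1;
  union { char buf[CMSG_SPACE (sizeof (struct peer_cred))];
          struct cmsghdr align; } control;
  struct iovec iov = { &byte, 1 };
  struct msghdr msg;
  struct cmsghdr *cmsg;

  if (socketpair (AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
    return -1;

  memset (&msg, 0, sizeof msg);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control.buf;
  msg.msg_controllen = sizeof control.buf;

  /* Ask the kernel to attach sender credentials to received messages. */
  if (setsockopt (sv[0], SOL_SOCKET, SO_PASSCRED, &one, sizeof one) == 0 &&
      send (sv[1], &byte, 1, 0) == 1 &&
      recvmsg (sv[0], &msg, 0) == 1)
    for (cmsg = CMSG_FIRSTHDR (&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR (&msg, cmsg))
      if (cmsg->cmsg_level == SOL_SOCKET &&
          cmsg->cmsg_type == SCM_CREDENTIALS)
        {
          struct peer_cred cred;
          memcpy (&cred, CMSG_DATA (cmsg), sizeof cred);
          result = cred.pid; /* pid in the receiver's pid namespace */
        }

  close (sv[0]);
  close (sv[1]);
  return result;
}
```

Here both ends live in the same process, so the reported pid is simply our own. When the sender sits in a child pid namespace, the receiver instead sees the pid as numbered in its own namespace, which is the value it can pass to waitpid().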
This allows you to reuse an existing user namespace to set up all the
other namespaces, entering that instead of creating a new one. The
reason you want to do this is that you can then also reuse other
namespaces that are owned by the user namespace. Typically you use
this to partially re-enter a previously created bubblewrap sandbox.
This also adds --userns2 which is similar to --userns, but this is
switched into at the end instead of the start. Bubblewrap sometimes
creates such nested user namespaces[1], and to be able to reuse such a
setup we need to similarly reuse both namespaces via --userns2.
Technically using setns() is probably safe even in the privileged
case, because we got passed in a file descriptor to the namespace, and
that can only be gotten if you have ptrace permissions against the
target, and then you could do whatever to the namespace
anyway. However, for practical reasons this isn't usable for bwrap,
because (as described in a comment in acquire_privs()) setuid mode
causes root to own the namespaces that it creates. So as you will not
be able to access these namespaces for reuse anyway, it's best to
disable it (in case of unexpected security issues).
[1] This is to work around an issue with mounting devpts without uid 0
mapped in the user namespace, where the outer namespace owns all the
other namespaces but the inner one has the right mappings.
Now that we're properly getting rid of root in these we can mark it
dumpable, which enables use of some /proc files, like /proc/$pid/root, that
were previously not accessible for pid 1 in the sandbox.
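Marking a process dumpable is done through prctl(2). The following is a small sketch with a hypothetical helper name, `mark_dumpable`, rather than the code from the change itself.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set the dumpable flag and return its new value
   as reported by the kernel, or -1 on error. */
int
mark_dumpable (void)
{
  if (prctl (PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
    return -1;

  return prctl (PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

Doing this only after root has really been dropped matters, since a dumpable process can be ptrace-attached by its owner.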
It turns out we have this check in drop_privs():
    if (getuid () == 0 && setuid (opt_sandbox_uid) < 0)
Which is supposed to drop back to the regular uid in the case
we're in setuid mode and we're in the monitor_child() or do_init()
processes.
Unfortunately we're setuid, not plain root, so uid is not 0, but euid is zero.
This caused the monitoring processes to be running partially as root
which shows up weird in /proc.
Fix this by checking euid for 0 instead.
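To illustrate why the uid check never fires for a setuid binary, here is a sketch with the ids passed in as parameters. Both function names are hypothetical, and uid 1000 in the note below is just an example of an unprivileged caller.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Condition as originally written: looks at the real uid, which stays
   that of the invoking user when the binary is setuid root. */
bool
old_check_drops_root (uid_t uid, uid_t euid)
{
  (void) euid;
  return uid == 0;
}

/* Condition after the fix: looks at the effective uid, which is 0 for
   a setuid root binary that has not yet dropped privileges. */
bool
new_check_drops_root (uid_t uid, uid_t euid)
{
  (void) uid;
  return euid == 0;
}
```

For a setuid process started by uid 1000, the pair (uid, euid) is (1000, 0): the old condition is false and the setuid() call is skipped, while the new one is true.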
Make sure the namespace information that is written to info.json
and json-status.json matches the namespace id inside the sandbox.
Closes: #323
Approved by: alexlarsson
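For context on the namespace ids mentioned above: on Linux a namespace can be identified by the inode number of its /proc/<pid>/ns/<name> entry. A minimal sketch of reading one follows; the helper name `read_namespace_id` is hypothetical, and which /proc mount is visible to the caller determines which process is actually read.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: return the inode number that identifies the
   namespace behind ns_path, or 0 if it cannot be read. */
ino_t
read_namespace_id (const char *ns_path)
{
  struct stat st;

  if (stat (ns_path, &st) < 0)
    return 0;

  return st.st_ino;
}
```

For example, `read_namespace_id ("/proc/self/ns/pid")` yields the id of the caller's pid namespace, provided /proc belongs to that pid namespace.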