This enables these options in this case and also ensures we set[ug]id
to the destination ids early in entering the namespace because
otherwise creating files during sandbox setup fails if the real user
id isn't mapped in the destination user namespace (and to make us
actually be that user/group).
This allows a sandbox to share a pid namespace with another sandbox.
For this to work the namespace passed in must be owned by the user
namespace that bwrap is using, which implies either that you pass in
--userns pointing, or run under that user namespace already. In the
former case you'd typically take the userns from a running bwrap
--unshare-user instance, whereas the second case happens when using
bwrap in the setuid mode without user namespaces.
If both --unshare-pid and --pidns are specified then we first
switch to the pid namespace, and then unshare from there. This is
useful if you want a pid-isolated sandob that is visible to another
sandbox.
The implementation is a bit tricky, as it needs to fork() in order
to activate the setns():ed pid namespaces, which means we have to
pass through the final pid via a socket to make the kernel translate
the pid to the initial pid namespace for us to waitpid() on it.
This allows you to reuse an existing user namespace to set up all the
other namespaces, entering that instead of creating a new one. The
reason you want to do this is that you can then also reuse other
namespaces that are owned by the user namespace. Typically you use
this to partially re-enter a previoulsy created bubblewrap sandbox.
This also adds --userns2 which is similar to --userns, but this is
switched into at the end instead of the start. Bubblewrap sometimes
creates nested such user namespaces[1], and to be able to reuse such a
setup we need to similarly reuse both namespaces via --userns2.
Technically using setns() is probably safe even in the privileged
case, because we got passed in a file descriptor to the namespace, and
that can only be gotten if you have ptrace permissions against the
target, and then you could do whatever to the namespace
anyway. However, for practical reasons this isn't useable for bwrap,
because (as described in a comment in acquire_privs()) setuid mode
causes root to own the namespaces that it creates. So as you will not
be able to access these namespaces for reuse anyway, its best to
disable it (in case of unexpected security issues).
[1] This is to work around an issue with mounting devpts without uid 0
mapped in the user namespace, where the outer namespace owns all the
other namespaces but the inner one has the right mappings.
Now that we're properly getting rid of root in these we can mark it
dumpable, which enables use of some /proc files, like /proc/$pid/root that
was previously not accessible for pid1 in the sandbox.
It turns out we have this check in drop_privs():
if (getuid () == 0 && setuid (opt_sandbox_uid) < 0)
Which is supposed to drop back to the regular uid in the case
we're in setuid mode and we're in the monitor_child() or do_init()
processes.
Unfortunately we're setuid, not plain root, so uid is not 0, but euid is zero.
This caused the monitoring processes to be running partially as root
which shows up weird in /proc.
Fix this by checking euid for 0 instead.
When writing the info-fd or the json-status-fd and a new pid name-
space is created, write the namespace id to the info/json. The
namespace id is encoded in the inode (and device) number of the
symbolic link in /proc/<pid>/ns/<namespace>, see namespaces(7)
for more information.
We obtain the information before we drop privileges, which is needed
when bwrap is run as setuid because then the namespace information
is not readable after having dropped the privileges.
Additionally we no retain the CAP_SYS_PTRACE capability because
that is needed to de-reference the links in /proc/<pid>/ns.
The user namespace information is omitted, because at the time when
we obtain the namespace id information, the user namespace is not yet
the new one, i.e. unshare (CLONE_NEWUSER), comes later.
Closes: #323
Approved by: alexlarsson
When optional cgroup unsharing was requested and the test for it
succeeds, set opt_unshare_cgroup to TRUE so it can be used later
to inspect if cgroup unsharing is enabled.
Closes: #323
Approved by: alexlarsson
An attacker could pre-create /tmp/.bubblewrap-$UID and make it a
non-directory, non-symlink (in which case mounting our tmpfs would fail,
causing denial of service), or make it a symlink under their control
(potentially allowing bad things if the protected_symlinks sysctl is
not enabled).
Instead, temporarily mount the tmpfs on a directory that we are sure
exists and is not attacker-controlled. /tmp (the directory itself, not
a subdirectory) will do.
Fixes: #304
Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923557
Signed-off-by: Simon McVittie <smcv@debian.org>
Closes: #305
Approved by: cgwalters
The exit code is only reported if it exited after a successful exec.
This is accomplished with a pipe, where the write end is closed on exec.
To distinguish between pipe-close pre-exec and at-exec,
data is written to the pipe immediately before calling exec
so if it is closed before exec the pipe is empty
and if it is closed during exec it contains a 1 byte value.
To further distinguish between a successful exec and a failed exec,
on exec failure a second value is written.
Signed-off-by: Richard Maw <richard.maw@codethink.co.uk>
Closes: #257
Closes: #293
Approved by: cgwalters
These ignore source files not existing which allows bwrap using
applications to avoid repeatedly checking if files exist.
Closes: #283
Approved by: alexlarsson
Some variables like base_path ("/run/user/%d/.bubblewrap") are
declared with the cleanup attribute in main(), but this cleanup is not
run when in the parent process, since it calls exit() in monitor_child().
Use return statements instead of exit() so that cleanup attributes
will be run.
Closes: #271
Approved by: smcv
Keep a reference to the previous working directory and use it for the
umount.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes: #256
Approved by: cgwalters
Skip these mounts when the process will keep CAP_SYS_ADMIN as it will
anyway able to umount them.
This fix the case of running bwrap inside of a bwrap with a new pid
namespace and mount /proc.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes: #256
Approved by: cgwalters
This is preparatory work for supporting recursive bwrap. Without this, using
`mount()` on the second `/` won't work, since it won't be a mount point.
Closes: #256
Approved by: cgwalters
This shouldn't matter unless someone wants to run an inadvisably-named
executable, but it's best-practice for commands that pass on some
of their arguments to a subsequent command.
It allows an invocation like:
bwrap --ro-bind /container / -- "$@"
to search PATH in the container for an executable named according to
"$1", even if $1 has a pathological value like
"--this-has-a-stupid-name--", or even a value that might be
deliberately trying to break bwrap's parsing like "--bind".
Fixes: #259
Signed-off-by: Simon McVittie <smcv@collabora.com>
Closes: #261
Approved by: cgwalters
For NFS mounts if we call mkdir() on a read-only mount (such as when
we've created a read-only bind mount) the kernel will nor return EEXIST
even when the directory exists, instead returning EROFS.
So, we add (and use) an ensure_dir() helper that stats before calling
mkdir.
Closes: #258
Approved by: giuseppe
Per open(2) man page:
When O_PATH is specified in flags, flag bits other than O_CLOEXEC
, O_DIRECTORY, and O_NOFOLLOW are ignored.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Closes: #244
Approved by: cgwalters
The options --file, --bind-data and --ro-bind-data take a file
descriptor as first argument. This FD is then used to set up the new
root file system. These FDs are then closed after this step. However,
when the privileged mode is used, this step is run in a child process,
hence leaking the FDs in the parent process.
To avoid future omissions, this patch walk through all the file
descriptors tied to an option and try to close them. To avoid a double
close, it marks FDs closed by setup_newroot as such.
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Closes: #243
Approved by: cgwalters
This means we can compile on Debian 8 'jessie', currently the
"oldstable" distribution. It's consistent with what would happen
if we knew PR_CAP_AMBIENT at compile-time but the kernel didn't support
it at runtime.
Signed-off-by: Simon McVittie <smcv@collabora.com>
Closes: #242
Approved by: cgwalters
The first attempt caused a use-after-free because the arguments parsed
from --args are passed to parse_args_recurse(), and the other cases
there may take those pointers (without copying) into SetupOp structures,
which persist after data is freed.
Fix that by treating data more like the argv to main(): an allocation
which exists throughout the life of the program. Do that by hoisting its
declaration out as a global, and then pulling the allocated data into a
cleanup_free variable in main(), to tie its lifecycle to main().
The alternative is to strdup() each one of the argv elements when they
are used in parse_args_recurse(), but that would mean a lot more
allocations and frees, and a lot of code churn.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
https://github.com/projectatomic/bubblewrap/issues/224
Closes: #237
Approved by: smcv
I was looking at what fds flatpak injects, and realized this is actually just a
bubblewrap bug; there's no reason for us to leak this to the child, so don't.
It took me a while to work out/remember that our `close_extra_fds()` bits are
only intended to handle processes *other* than the final target, i.e. we want to
close fds in our init process. For the final child process, we need to support
passing arbitrary fds though, so `close_extra_fds()` can't apply to the child.
Closes: #221
Approved by: smcv
This reverts the actual leak fix from commit
4b8fa95704, which seems to make --args
fail to work.
Signed-off-by: Simon McVittie <smcv@collabora.com>
Closes: #230
Approved by: cgwalters
As specified by mount(2):
Specifying MS_MGC_VAL was required in kernel versions prior to 2.4, but
since Linux 2.4 is no longer required and is ignored if specified.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Closes: #220
Approved by: cgwalters
`systemd-nspawn` and `docker` at least both have these by default;
the only difference AFAICS now is that nspawn also adds `/dev/mqueue`
by default, but we require a separate arg for that.
This should increase compatibility with apps using the `/dev/fd`.
Closes: https://github.com/projectatomic/bubblewrap/issues/191
Closes: #207
Approved by: alexlarsson
These make explicit various implicit assumptions about how the SetupOps
are constructed. This fixes various Coverity issues about potential NULL
pointer dereferences.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
Closes: #209
Approved by: cgwalters
This is pretty unnecessary, since they will automatically be closed by
the kernel when bubblewrap’s PID 1 exits, but cleaning them up shuts up
Coverity.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
Closes: #209
Approved by: cgwalters
In <https://github.com/projectatomic/bubblewrap/pull/101>, specifically
commit cde7fab7ec we started dropping
all capabilities, even if the caller was privileged.
This broke rpm-ostree, which runs RPM scripts using bwrap, and some
of those scripts depend on capabilities (mostly `CAP_DAC_OVERRIDE`).
Fix this by retaining capabilities by default if the caller's uid is zero.
I considered having the logic be to simply retain any capabilities the invoking
process has (imagine filecaps binaries like `ping` or
`/usr/bin/gnome-keyring-daemon` using bwrap) but we currently explicitly abort
in that scenario to catch broken packages which used file capabilites for bwrap
itself (we switched to suid). For now this works, and if down the line there's a
real-world use case for capability-bearing non-zero-uid processes to invoke
bwrap *and* retain those privileges, we can revisit.
Another twist here is that we need to do some gymnastics to first avoid calling
`capset()` if we don't need to, as that can fail due to systemd installing a
seccomp filter that denies it (for dubious reasons). Then we also need to ignore
`EPERM` when dropping caps in the init process. (I considered unilaterally
handling `EPERM`, but it seems nicer to avoid calling `capset()` unless we need to)
Closes: https://github.com/projectatomic/bubblewrap/issues/197
Closes: #205
Approved by: alexlarsson
When --unshare-user is used in the unprivileged case, all caps are
left to the sandboxed application. Change it to leave only the
specified ones.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes: #101
Approved by: alexlarsson
When using namespaces, permit to leave some capabilities in the
sandbox. This can be helpful to run a system instance of systemd.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes: #101
Approved by: alexlarsson
This apparently only is emitted with `CFLAGS='-O2 -Wp,-D_FORTIFY_SOURCE=2'`,
which is used by RPM builds in Fedora at least. Enable that by default.
Note that I fixed it by using `TEMP_FAILURE_RETRY`, which should also ensure
that we don't spuriously continue if the `read` was interrupted by `EINTR` for
some reason.
Closes: #190
Approved by: alexlarsson
In scenarios such as running bwrap in test frameworks (`bwrap make check`),
one wants all of the processes to go away if the parent process
dies, or if the bwrap process is directly killed.
This ensures that in all cases (both with `--unshare-pid` and without), we use
`prctl(PR_SET_PDEATHSIG)` on both our outer and inner init procesesses if
`--die-with-parent` is specified.
Tests-by: Colin Walters <walters@verbum.org>
Closes: #165
Approved by: emdej
Bwrap fails to start on host missing the magic sysrq-trigger file,
this change skips the read only bind mount when the file is absent.
Closes: #179
Approved by: cgwalters