Not so long ago I stumbled upon an internet conversation about a topic I found fairly interesting. It went something like this: does running applications within containers help security at all? And does it help if I run the container as an unprivileged user rather than as root?
As is always the case, the comments/answers to that question were diverse and somewhat contradictory, depending on the responder. Some people claimed that yes, it does, and that you should think of containers as a sort of VM; others pointed out that containers were not designed as a security tool, and that therefore running in a container has no benefit over running outside of one.
Were this a different time, I would have spent a few hours arguing with said internet users, but I decided instead to collect my thoughts and answer that question in a more organised way. Bonus points for the fact that, by writing on this blog, the information will not be attached to some fleeting social media post, and will hopefully remain available for a longer time.
Back to the Basics - What is a container?
Ok, I partially lied with the title: there are many good articles on the web about containers, so I won't duplicate that information once again, as it is one kagi search away. I will nevertheless summarize the fundamental building blocks that I will later use to expand on the security benefits.
Containers, to some extent, can be seen as the combination of a few features:
- Namespaces
- Cgroups
- Seccomp
- Capabilities
In fact, the general idea behind containers is to package an application together with its own dependencies/runtime, and provide an isolated view of the system when running it, so that there are no conflicts between libraries, dependencies, etc.
Aha! So it is not a security feature!
While providing a "separate" view of the system, however, we also get some side benefits that improve the security of the system. Let's go in order though.
Namespaces
There are a bunch of namespaces in Linux. A namespace is a feature that allows
processes to see a set of resources as the only resources in the
system. There are mount
namespaces, PID
namespaces, UTS
namespaces, etc.
What this means concretely is that we can add a bunch of processes to a specific mount
namespace, which means those processes will have a perspective of the
filesystem which is completely different from the "real" perspective outside of it.
Let's go through an example to see namespaces in practice.
In one shell we can run the following:
$ docker run alpine sleep 10000
This will create a sleep process that will give us the opportunity to explore the system from the perspective of processes inside the container.
We can then track down the PID of the sleep process and use nsenter to enter the "container" namespaces:
$ ps aux | grep sleep
root 10689 0.2 0.0 1608 4 ? Ss 21:03 0:00 sleep 10000
$ sudo nsenter -m -u -i -n -p -t 10689 /bin/sh
/ #
The nsenter command essentially means the following: enter the namespaces of the target (-t) PID 10689 and run /bin/sh, doing so for the mount (-m) namespace, the UTS (-u) namespace (hostname), the IPC (-i) namespace, the network (-n) namespace and the PID (-p) namespace.
Essentially I am entering all the namespaces that Docker uses for the container by default. Note that there is one notable namespace missing: the user namespace. This is because on my local machine, where I am doing these tests, I don't have docker configured to use user namespaces, which is a non-default feature that requires explicit configuration.
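For reference, enabling user namespace remapping for Docker typically boils down to a daemon.json entry like the following and a daemon restart (shown here as a sketch; "default" makes Docker create and use a dockremap user):
$ cat /etc/docker/daemon.json
{
  "userns-remap": "default"
}
$ sudo systemctl restart docker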
Exploring Namespaces
Now that we have a shell inside the container namespaces, let's have a look around. First, we can check some simple stuff:
/ # cat /etc/hostname
59f18340216d
Obviously the hostname is the ID of the container, not the one of the host system. This is thanks to the UTS namespace (the -u flag of my nsenter command).
In fact, had I not included -u in the nsenter command, the output of the hostname command would have been the hostname of my host machine (try it!).
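A quick way to see it, reusing the same PID as before (the host name shown here is obviously made up):
$ sudo nsenter -m -i -n -p -t 10689 /bin/sh
/ # hostname
my-host-machine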
Let's have a look at the network:
/ # ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
12: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.4/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
Inside the container I have just a loopback interface and some virtual interface
which is actually managed by docker
(notice the 172.17.0.0/16
address).
Obviously, in the case of the network, docker does a lot of other things besides creating a separate namespace, which is why, for example, I have internet access by default.
Similar considerations can be made looking around the filesystem:
/ # cat /etc/passwd
root:x:0:0:root:/root:/bin/ash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
[...]
vpopmail:x:89:89::/var/vpopmail:/sbin/nologin
ntp:x:123:123:NTP:/var/empty:/sbin/nologin
smmsp:x:209:209:smmsp:/var/spool/mqueue:/sbin/nologin
guest:x:405:100:guest:/dev/null:/sbin/nologin
nobody:x:65534:65534:nobody:/:/sbin/nologin
This file is not my system's /etc/passwd
, it is a completely separate one.
Also, I can't see any of the host mounts if I run mount.
Namespaces Wrapping Up
How can we check which namespaces the container runs in? The simplest way (that I know of) is to look at the namespace IDs under the /proc/PID/ns/* path.
For example, we can compare the namespaces of the process running inside our container,
with those of our init
process on the host system:
# for f in $(ls /proc/10689/ns); do readlink "/proc/10689/ns/$f"; done
cgroup:[4026531835]
ipc:[4026535492]
mnt:[4026535490]
net:[4026535495]
pid:[4026535493]
time:[4026531834]
user:[4026531837]
uts:[4026535491]
# for f in $(ls /proc/1/ns); do readlink "/proc/1/ns/$f"; done
cgroup:[4026531835]
ipc:[4026531839]
mnt:[4026531840]
net:[4026532000]
pid:[4026531836]
time:[4026531834]
user:[4026531837]
uts:[4026531838]
We can see that only the cgroup
, time
and user
namespaces are the same.
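Another handy tool, if your util-linux is recent enough, is lsns, which prints one row per namespace (its ID, type and the command) for a given PID:
$ sudo lsns -p 10689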
Cgroups
The second feature that we need to discuss, even if much more briefly compared to namespaces, is cgroups. There is a lot of complexity around cgroups (there are two different versions, many different options, etc.), but on a fundamental level, cgroups allow us to restrict certain resources for a group of processes.
In concrete terms, cgroups are what docker uses when we run a container with the --memory, --cpus or --pids-limit flags.
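As a quick sketch (image and limits picked arbitrarily), the following asks docker to set up the corresponding cgroup limits for us:
$ docker run --memory=256m --cpus=0.5 --pids-limit=64 alpine sleep 10000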
We can make a very simple example using the pids
controller.
Let's spawn a single shell in a window and check its own PID:
$ echo $$
687
Let's now create a new cgroup
(in another window):
# cd /sys/fs/cgroup
# mkdir test
# echo 687 > test/cgroup.procs
At this point the PID 687
is part of the test
cgroup.
Let's check the number of PIDs it currently contains:
# cat test/pids.current
1
To see this number change, we can run sh twice in a row in the shell we spawned.
Checking again we get:
# cat test/pids.current
3
Now we can set a limit of, for example, 5 PIDs:
# echo 5 > test/pids.max
So what happens if we keep spawning shells?
sh: 1: Cannot fork
At some point we get an error, because we have exhausted the PIDs available in this cgroup.
Seccomp
Seccomp is another very complex topic, especially because under the hood it uses BPF
filters.
However, for the purpose of this post, I will simplify it down to the core. The kernel exposes a number of functions, which we call syscalls. Some of these syscalls are dangerous or just risky, and there is no need for all applications to have access to all of them. Seccomp allows us to limit the syscalls available to programs.
In the context of containers, it is possible to create Seccomp profiles which essentially
are JSON specs that define which syscalls we want to allow or deny for a given container.
In fact, by default docker
already uses a
default profile!
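For comparison, the default profile can also be disabled entirely (not something I recommend):
$ docker run --security-opt seccomp=unconfined -it alpine sh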
Let's now create the simplest possible profile, and try to forbid the chmod
syscall:
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64"
  ],
  "syscalls": [
    {
      "name": "chmod",
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
Basically this profile says: "allow everything (default), but throw an error in case the chmod syscall is invoked".
Let's call this profile profile.json
and let's run a container using it:
$ docker run --security-opt seccomp=profile.json -it alpine sh
/ #
You can verify that even though we are the root
user, we can't chmod
anything:
/ # cd tmp/
/tmp # touch test
/tmp # chmod 777 test
chmod: test: Operation not permitted
/tmp # id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
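For comparison, with Docker's default seccomp profile the same chmod goes through, since chmod is in its allowlist:
$ docker run -it alpine sh
/ # touch /tmp/test && chmod 777 /tmp/test && echo OK
OK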
Capabilities
The last feature I want to discuss is Linux capabilities. These have absolutely nothing to do specifically with containers, but containers have decent tooling around them, which makes them easier to use.
Capabilities in short are slices of the root
user's power. There are quite a lot of them,
which you can see by running man 7 capabilities
, and some of them are particularly risky or
interesting, depending on your point of view.
Anyway, why talk about capabilities in the context of containers? Well, because as I said, containers have good tooling around capabilities; in particular, they support a simple way to drop all capabilities inside containers, or to allow only a few of them, in a granular way.
Capabilities are also the feature that allows us to do certain privileged actions within
containers, without running containers as full root
user.
For example, let's run our container dropping all capabilities:
$ docker run --cap-drop=ALL -it alpine sh
Note: dropping all capabilities should be the default for any container which does not require them. Abusing capabilities (i.e., privileges) is one of the simplest ways to escape containers and compromise the host system! If you run in Kubernetes, dropping capabilities can be done via the securityContext spec.
To try some privileged action, we can install tcpdump
:
# apk add tcpdump
If we now try to sniff an interface, even as root, we get:
/ # tcpdump -i eth0
tcpdump: eth0: You don't have permission to capture on that device
(socket: Operation not permitted)
This is because tcpdump uses raw sockets to listen on the network, and raw sockets require the CAP_NET_RAW capability. Let's try adding only this capability to the container.
# docker run --cap-drop=ALL --cap-add=CAP_NET_RAW -it alpine sh
# apk add tcpdump
# tcpdump -i eth0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:09:25.754379 IP 151.101.66.132.443 > 84dd5599ce4e.53112: Flags [FP.], seq 2181022380:2181022404, ack 1351641877, win 294, options [nop,nop,TS val 3705369401 ecr 3549003202], length 24
Everything worked as expected.
There is a whole lot more to discuss about capabilities, specifically about how they are tied to processes, how they are inherited, etc., but I will probably keep that for a book I am writing (maybe, one day...).
If we want to know which capabilities a process is running with, we can use:
/ # cat /proc/$$/status | grep -i cap
CapInh: 0000000000002000
CapPrm: 0000000000002000
CapEff: 0000000000002000
CapBnd: 0000000000002000
CapAmb: 0000000000000000
We can decode that obscure number with capsh
:
capsh --decode=0000000000002000
0x0000000000002000=cap_net_raw
Will we Talk About Security?
So far we have only talked about separate features, one by one, without really putting the puzzle together to answer the very question we opened this post with: do containers help security? Does it help if I run containers as a low-privileged user?
Let's now try to answer this question and motivate the answer. The answer is yes, with a lot of different caveats.
Containers can make it much simpler to run (untrusted) programs in a much more isolated environment, with a much smaller impact in case of compromise, compared to standard native processes. They can, but they don't necessarily do so. Containers are ultimately just a collection of Linux features; they are not really "anything" else. Virtual machines have a hypervisor which helps create a concrete boundary between the host and the guest, where the guest runs its own kernel. Containers, on the other hand, are nothing more than processes constrained by some features, running on the same OS.
It is completely possible to run a container as insecurely as a process outside of it.
For example, run an image with the root
user, with the --privileged
flag, with all
capabilities added, with no seccomp
profile and maybe mounting /
to /host
inside,
and you will have exactly a root
process on the host, with the same level of impact in
case it gets compromised.
Even worse, you can have all this and run a completely untrusted artifact that you pulled from who-knows-where on the internet with docker pull or by building a Dockerfile. I would argue that this is worse than running a binary that was at least installed from your distro's default package manager.
However, if I have to run some "random" software, I am extremely confident in saying that it is an order of magnitude easier to run said software securely in a container than it is to run it natively, for example as a systemd unit.
Before making a final, concrete, example I want to discuss why each of the features discussed until now represents, to some extent, a security feature:
- Namespaces: modifying, and specifically restricting, the view that a process has of the system has huge security advantages. If I cannot see a mount, I cannot access files inside it. If I can't see an interface, I can't sniff it or modify it. Even more, if I am root inside the container but my user on the host is actually a low-privileged user, I can do limited damage to the system. Namespaces allow us to greatly reduce the attack surface available to compromised processes, which means both the probability and the blast radius of a compromise are smaller. A side effect of namespaces is also that the tools available are generally only those inside the container filesystem. If the container uses a minimal image, it might not even have a shell, ls, echo or any binary that can be used to enumerate or exploit the system.
- Cgroups: from the security point of view, cgroups prevent resource exhaustion, so they do provide a certain resilience against DoS attacks, such as an (admittedly old-school) fork bomb. If PIDs are limited, a process won't be able to exhaust them for the whole system. Similarly, CPU and memory limits can prevent a misbehaving process from crashing the whole system or starving other crucial applications.
- Seccomp: this one is fairly obvious, as limiting the syscalls available to a process can prevent certain dangerous actions against the system, such as an exploit that (ab)uses a particular syscall. The truth is that it is hard to build seccomp profiles for applications, though, especially since it is not possible to accurately and completely enumerate the syscalls a program may invoke across all codepaths. Usually applications are traced/profiled, but this is just a statistical methodology (see the sketch right after this list).
- Capabilities: the root user, especially without user namespaces, can do a lot even within a container, because it is talking to the same kernel as the host. Being able to limit the privileges available greatly reduces the agency of an attacker. Some container-escape techniques rely specifically on certain root superpowers, therefore this gives a simple answer to the second question: does it help if I run a container as a low-privileged user? Yes.
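To make the seccomp point above a bit more concrete, this is the kind of (statistical) tracing I mean; ./myapp is just a placeholder, and the resulting list is only as complete as the codepaths you manage to exercise:
$ strace -f -c ./myapp
With -f, strace follows child processes; with -c, it prints a per-syscall summary when the program exits, which can then be used as a starting point for an allowlist.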
I will repeat it once again: it is possible to use all of these features when running a regular process on the host too. Containers are just a very handy way to do all of this at once, with a couple of flags. At least, this is my experience; I have yet to meet anybody who, for every systemd unit, creates a bunch of namespaces, drops all capabilities, configures cgroups and applies a seccomp profile. If someone does that, congratulations, you are a master of systemd :).
PoC or GTFO
Following the dearest traditions of the web, let's end with a concrete example.
Let's say we want to run a very secure web application: a web application with an API endpoint which takes a command and an argument as parameters, and executes it. Some sort of devilish backup utility controlled via API, for example. The reason I am choosing a program that calls other external programs is to avoid cheating with a single, self-contained binary, which works especially well in a container.
The code is the following (don't use it anywhere!):
use rocket::serde::{json::Json, Deserialize};
use std::process::Command;
use std::str;

#[macro_use]
extern crate rocket;

#[derive(Deserialize)]
#[serde(crate = "rocket::serde")]
struct Task {
    command: String,
    args: String,
}

#[post("/backup", data = "<task>")]
fn backup(task: Json<Task>) -> String {
    let _child = Command::new(&task.command)
        .args([&format!("{}", &task.args)])
        .spawn()
        .expect("Could not run the command");
    format!("OK!")
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![backup])
}
To run it locally, the following needs to be included in the Cargo.toml
:
[dependencies]
rocket = { version = "=0.5.0-rc.4", features = ["json"] }
This example is really close to the quick start
in rocket
's documentation.
Essentially what we have is a very simple webserver that listens on a single endpoint, /backup, which we can pretend is used to "run backups".
The program is intentionally vulnerable; it's basically RCE-as-a-service, but it simulates a regular RCE vulnerability. Actually, this is even better behaved than some real vulnerabilities, as Rust's Command does not spawn a shell, so shell metacharacters (such as <, >, etc.) are not supported.
We can run any command we want, even if "by-design" it was only supposed to run
restic
, borg
or similar tools:
# Let the program "curl" itself
$ curl -XPOST localhost:8000/backup -d '{"command": "curl", "args": "localhost:8000"}'
# Create a file on the disk
$ curl -XPOST localhost:8000/backup -d '{"command": "touch", "args": "/tmp/test"}'
$ ls /tmp
[...]
test
Now, assume that we want to run this application "securely" on our system, and let's take into consideration the following options:
- Run as a Systemd unit
- Run in a container
Let's consider also the following requirements:
- Permissions. This is a "backup" application, which means it might need access to the paths to back up. Let's say for the sake of this conversation that we want at a minimum to back up /var/log and /opt/myapp. /opt/myapp is owned by appuser while /var/log is owned by root:syslog (common in many distros).
- Exposed. The application is exposed publicly, or anyway within the network.
- Untrusted. We are installing a new fancy "backup" service, we don't own the code, and we don't know who can commit to it. Basically, it is an untrusted application.
I hope none of these requirements sounds too made-up: it is fairly common for self-hosters to run applications like this, and it is also fairly common that some application needs access to some part of the filesystem.
Scenario 1: Systemd unit
The first thing that immediately comes to mind is that we don't want the application to run as root. So we will probably have to create a dedicated user on the system for it, say wtfbackup. Then we need to figure out the next step, which is how to give this unprivileged user access to the /var/log folder, which is owned by root.
Some files might be world-readable, but others might only be readable by root and the members of some groups. For example, on my system:
-rw-r----- 1 syslog adm 171848 Nov 14 00:00 kern.log
So once again we have a few options:
- Add the wtfbackup user to the adm group, which most likely has other permissions "elsewhere".
- Add the DAC_READ_SEARCH capability to the application binary and to the restic binary. This means that the binary will have read access to everything, including /etc/shadow, /home/**/.ssh/id_rsa, etc.
We have one more option (or a bunch more), which is to use several systemd settings, such as ProtectHome=, ProtectSystem=, PrivateTmp= or even ReadOnlyDirectories=, ReadWriteDirectories=, etc. These options, and others like them, make parts of the filesystem either read-only or completely inaccessible. It will not surprise anybody that, under the hood, our friendly namespaces are used (specifically, mount namespaces).
Let's say that we are super diligent and knowledgeable, and we know all these options for Systemd, so now we have a unit that can only access certain parts of the filesystem.
How do we also limit which binaries and functionalities we can run?
One way is to restrict the capabilities, using the CapabilityBoundingSet= option, NoNewPrivileges=, and other options (see the docs).
We will need to spend quite some time limiting what the application can access, especially because there are quite a few binaries that are "dangerous". For example, if an attacker managed to invoke sh, python, etc., then a reverse shell becomes a possibility, as do other attacks such as a fork bomb. So what do we do?
Well, we keep digging into systemd options: there are options to limit network connections, options to limit resources, and so on.
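Just to give an idea of what this looks like in practice, here is a sketch (the property values are purely illustrative, the wtfbackup path is made up, and this is not a complete hardening recipe) of applying several of these options at once with systemd-run:
$ sudo systemd-run --uid=wtfbackup \
    -p ProtectSystem=strict -p ProtectHome=yes -p PrivateTmp=yes \
    -p NoNewPrivileges=yes \
    -p CapabilityBoundingSet=CAP_DAC_READ_SEARCH \
    -p AmbientCapabilities=CAP_DAC_READ_SEARCH \
    -p MemoryMax=256M -p TasksMax=32 \
    /opt/wtfbackup/wtfbackup
The same properties can of course go into a proper unit file; the point is simply how many knobs you need to know about to get there.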
I don't want to go into even more detail and provide a prototype for the perfect answer, because - to be honest - I don't know the perfect answer myself with systemd. I have not seen a single tutorial, post, page, or systemd unit template that uses more than a handful of these options.
I checked on my own machine and looked at a few random units, such as docker.service or the teamviewer service. I couldn't find even one of the security options listed above.
This in itself does not prove anything, but I think it is a reasonable argument that if some features are obscure, probably not many people are using them. In any case, the point I want to make is a different one: we are talking about systemd security options, not just "systemd".
Scenario 2: Container
Let's look at the container scenario now.
Before even looking at the details, it is important to note a fundamental difference. In the container case, the software author, who knows the software much better than the user, can configure a lot of security features for the program.
For example, the author knows exactly which external binaries are/should be called, and can provide a container with those (and only those), with the proper capabilities already added. The author can also provide a low-privileged user by default for the container, so that not everyone in the world needs to be knowledgeable enough to run the application securely.
This said, we can adopt common Docker best practices while building the image (which, again, is often the responsibility of the software maintainers, not the user), such as the following:
- Use a multi-stage build, with the final base image being as minimal as possible, ideally scratch or distroless.
- Add only the necessary capabilities to the binaries.
- Run as a low-privileged user.
Pretty much everything else comes out-of-the-box. Only the filesystem paths which
are explicitly mounted will be available inside the container, for example. We don't
even need to think about protecting our /etc
, because unless it is mounted specifically,
by default it's not accessible to the processes inside the container.
A Dockerfile for such a build could look like the following:
FROM rust:1.73 AS builder
ENV USER=root
WORKDIR /code
RUN apt-get update && apt-get install -y pkg-config restic libcap2-bin
COPY Cargo.toml /code/Cargo.toml
COPY Cargo.lock /code/Cargo.lock
COPY src/ /code/src
RUN cargo build --release
# Set DAC_READ_SEARCH for both restic and wtfbackup.
# This is required to read all files to backup, which
# can be owned by arbitrary users.
RUN setcap cap_dac_read_search+eip /usr/bin/restic
# Set both DAC_READ_SEARCH and NET_BIND_SERVICE for wtfbackup
# (the latter so it can bind to a port). Note that setcap
# replaces the file's capability set, so both capabilities
# must be set in a single invocation.
RUN setcap 'cap_dac_read_search,cap_net_bind_service+eip' /code/target/release/wtfbackup
# Note that building without docker buildx will result
# in the capabilities being lost, because COPY does not
# preserve extended file attributes by default
FROM gcr.io/distroless/cc-debian12
USER nonroot
COPY --from=builder /code/target/release/wtfbackup /
COPY --from=builder /usr/bin/restic /
ENTRYPOINT ["/wtfbackup"]
This image can be built with the following command:
docker buildx build -t wtfbackup:v0.1 --output type=docker --platform=linux/amd64 .
Note that restic is included in the image, assuming it is one of the tools we are supposed to be able to invoke.
We can then run the container with the following command:
docker run -p 8000:8000 -v $PWD/Rocket.toml:/Rocket.toml -it --cap-add=CAP_NET_BIND_SERVICE --cap-add=CAP_DAC_READ_SEARCH wtfbackup:v0.1
Note that we have to explicitly add the capabilities we want to run the
container with, or better, that we want to allow within the container at all.
Even if the files have the capability added (we use setcap
), if the container
is launched without specifically permitting them, we will get an error like the
following:
exec user process caused: operation not permitted
So, without any configuration from the user, how much damage can we do? Let's try a few commands:
$ curl -XPOST localhost:8000/backup -d '{"command": "id", "args": ""}'
# Nope
Could not run the command: Os { code: 2, kind: NotFound, message: "No such file or directory" }
$ curl -XPOST localhost:8000/backup -d '{"command": "/restic", "args": "-h"}'
# Prints help from restic
$ curl -XPOST localhost:8000/backup -d '{"command": "/bin/ls", "args": "/"}'
# Nope
$ curl -XPOST localhost:8000/backup -d '{"command": "/bin/sh", "args": ""}'
# Nope
$ curl -XPOST localhost:8000/backup -d '{"command": "cat", "args": "/proc/1/status"}'
# Nope
$ curl -XPOST localhost:8000/backup -d '{"command": "echo", "args": "test > /test"}'
# Nope
And we could continue for a while. We can look inside the image using
dive
to see what's available.
dive wtfbackup:v0.1
Looking inside the layers, we can see that in the image there are basically no
binaries. There are some libraries (like libc
), there are timezone files and
that's basically it.
What damage can be done if this application is exposed through the web? Well, at
the moment it would be possible to somehow exhaust resources (or try to),
potentially mess with backups by using restic
, but not much more.
Sure, it is possible that a memory corruption vulnerability is found in restic or in some rust library, and that it can be exploited in some way. For example, restic could be used to "restore" a shell from an untrusted (remote) repository, and maybe that could be a way to gain access; but even if that were the case, escaping the container and doing damage to the host is going to be very complex.
Wrapping Up
If we look critically at what we have been discussing, we can now finally reach a conclusion and provide an answer to the original question. Do containers help security? Yes, they do.
They enable users of applications to use certain features of Linux which reduce the attack surface and reduce the impact if said applications are vulnerable and compromised. In addition, they have the advantage that a lot of the security work can be done by the software writer, rather than by the user.
Are containers the only way to achieve said security? Of course not. Systemd has a wide range of security options which allow users to reach the same level of isolation. Flatpak and similar systems also use namespaces, cgroups, seccomp and capabilities to isolate applications and their dependencies. However, I think it is pretty obvious that the features containers use under the hood are security features, even if containers themselves might not have been designed specifically as a security tool (and who can really say what they were designed for?). Containers are just a very convenient way to use those features transparently (and sometimes unknowingly).
Obviously it is still important to follow good security practices, run images from trusted sources, expose publicly only what's really needed and so on, but this applies universally.
If you find an error, want to propose a correction, or you simply have any kind of comment and observation, feel free to reach out via email or via Mastodon.