Not so long ago I stumbled upon an internet conversation about a topic I found fairly interesting. It went something like this: does running applications within containers help security at all? And does it help if I run the container as an unprivileged user rather than as root?
As is always the case, the comments/answers to that question were diverse and somewhat contradictory, depending on the responder. Some people claimed that yes, it does, and that you should think of containers as a sort of VM; others pointed out that containers were not designed as a security tool, and that therefore running in a container has no benefit over running outside of one.
Were this a different time, I would have spent a few hours arguing with said internet users, but I decided instead to collect my thoughts and answer that question in a more organised way. Bonus points for the fact that, by writing on this blog, the information will not be attached to some fleeting social media post, and will hopefully remain available for a longer time.
Back to the Basics - What is a container?
Ok, I partially lied with the title: there are many good articles on the web about containers, so I won't duplicate that information once again, as it is one kagi search away. I will nevertheless summarize the fundamental building blocks that I will later use to expand on the security benefits.
Containers, to some extent, can be seen as the combination of a few features:
- Namespaces
- Cgroups
- Seccomp
- Capabilities
In fact, the general idea behind containers is to package an application together with its own dependencies/runtime, and provide an isolated view of the system when running it, so that there are no conflicts between libraries, dependencies, etc.
Aha! So it is not a security feature!
While providing a "separate" view of the system, however, we also get some side benefits that improve the security of the system. Let's go in order though.
Namespaces
There are a bunch of namespaces in Linux. A namespace is a feature that allows
processes to see a set of resources as the only resources in the
system. There are mount
namespaces, PID
namespaces, UTS
namespaces, etc.
What this means concretely is that we can add a bunch of processes to a specific mount
namespace, which means those processes will have a perspective of the
filesystem which is completely different from the "real" perspective outside of it.
Let's go through an example to see namespaces in practice.
In one shell we can run the following:
$ docker run alpine sleep 10000
This will create a sleep process that will give us the opportunity to explore the system from the perspective of processes inside the container.
We can then track down the PID of the sleep process and use nsenter to enter the "container" namespaces:
$ ps aux | grep sleep
root 10689 0.2 0.0 1608 4 ? Ss 21:03 0:00 sleep 10000
$ sudo nsenter -m -u -i -n -p -t 10689 /bin/sh
/ #
The nsenter command essentially means the following: enter the namespaces of the target (-t) PID 10689 and run /bin/sh, doing so for the mount (-m) namespace, the UTS (-u) namespace (hostname), the IPC (-i) namespace, the network (-n) namespace and the PID (-p) namespace.
Essentially I am entering all the namespaces that Docker uses for the container by default. Note that there is one notable namespace missing: the user namespace. This is because on my local machine, where I am doing these tests, I don't have docker configured to use user namespaces, which is a non-default feature that requires explicit configuration.
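For reference, enabling user namespace remapping for Docker typically boils down to a daemon.json entry like the following and a daemon restart (shown here as a sketch; "default" makes Docker create and use a dockremap user):
$ cat /etc/docker/daemon.json
{
  "userns-remap": "default"
}
$ sudo systemctl restart docker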
Exploring Namespaces
Now that we have a shell inside the container namespaces, let's have a look around. First, we can check some simple stuff:
/ # cat /etc/hostname
59f18340216d
Obviously the hostname is the ID of the container, not the one of the host system. This is thanks to the UTS namespace (the -u flag of my nsenter command).
In fact, had I not included -u in the nsenter command, the output of the hostname command would have been the hostname of my host machine (try it!).
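A quick way to see it, reusing the same PID as before (the host name shown here is obviously made up):
$ sudo nsenter -m -i -n -p -t 10689 /bin/sh
/ # hostname
my-host-machine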
Let's have a look at the network:
/ # ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
12: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.4/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
Inside the container I have just a loopback interface and some virtual interface
which is actually managed by docker
(notice the 172.17.0.0/16
address).
Obviously, in the case of the network, docker does a lot of other things besides creating a separate namespace, which is why, for example, I have internet access by default.
Similar considerations can be made looking around the filesystem:
/ # cat /etc/passwd
root:x:0:0:root:/root:/bin/ash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
[...]
vpopmail:x:89:89::/var/vpopmail:/sbin/nologin
ntp:x:123:123:NTP:/var/empty:/sbin/nologin
smmsp:x:209:209:smmsp:/var/spool/mqueue:/sbin/nologin
guest:x:405:100:guest:/dev/null:/sbin/nologin
nobody:x:65534:65534:nobody:/:/sbin/nologin
This file is not my system's /etc/passwd
, it is a completely separate one.
Also, I can't see any of the host mounts if I run mount.
Namespaces Wrapping Up
How can we check which namespaces the container runs in? The simplest way (that I know of) is to look at the namespace IDs under the /proc/PID/ns/* path.
For example, we can compare the namespaces of the process running inside our container,
with those of our init
process on the host system:
# for f in $(ls /proc/10689/ns); do readlink "/proc/10689/ns/$f"; done
cgroup:[4026531835]
ipc:[4026535492]
mnt:[4026535490]
net:[4026535495]
pid:[4026535493]
time:[4026531834]
user:[4026531837]
uts:[4026535491]
# for f in $(ls /proc/1/ns); do readlink "/proc/1/ns/$f"; done
cgroup:[4026531835]
ipc:[4026531839]
mnt:[4026531840]
net:[4026532000]
pid:[4026531836]
time:[4026531834]
user:[4026531837]
uts:[4026531838]
We can see that only the cgroup
, time
and user
namespaces are the same.
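Another handy tool, if your util-linux is recent enough, is lsns, which prints one row per namespace (its ID, type and the command) for a given PID:
$ sudo lsns -p 10689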
Cgroups
The second feature that we need to discuss, even if much more briefly compared to namespaces, is cgroups. There is a lot of complexity around cgroups (there are two different versions, many different options, etc.), but on a fundamental level, cgroups allow us to restrict certain resources for a group of processes.
In concrete terms, cgroups are what docker uses when we run a container with the --memory, --cpus or --pids-limit flags.
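As a quick sketch (image and limits picked arbitrarily), the following asks docker to set up the corresponding cgroup limits for us:
$ docker run --memory=256m --cpus=0.5 --pids-limit=64 alpine sleep 10000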
We can make a very simple example using the pids
controller.
Let's spawn a single shell in a window and check its own PID:
$ echo $$
687
Let's now create a new cgroup
(in another window):
# cd /sys/fs/cgroup
# mkdir test
# echo 687 > test/cgroup.procs
At this point the PID 687
is part of the test
cgroup.
Let's check the number of PIDs it currently contains:
# cat test/pids.current
1
To see this number change, we can run sh twice in a row in the shell we spawned.
Checking again we get:
# cat test/pids.current
3
Now we can set a limit of, for example, 5 PIDs:
# echo 5 > test/pids.max
So what happens if we keep spawning shells?
sh: 1: Cannot fork
At some point we get an error, because we have exhausted the PIDs available in this cgroup.
Seccomp
Seccomp is another very complex topic, especially because under the hood it uses BPF
filters.
However, for the purpose of this post, I will simplify it down to the core. The kernel exposes a number of functions, which we call syscalls. Some of these syscalls are dangerous or just risky, and there is no need for all applications to have access to all of them. Seccomp allows us to limit the syscalls available to programs.
In the context of containers, it is possible to create Seccomp profiles which essentially
are JSON specs that define which syscalls we want to allow or deny for a given container.
In fact, by default docker
already uses a
default profile!
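For comparison, the default profile can also be disabled entirely (not something I recommend):
$ docker run --security-opt seccomp=unconfined -it alpine sh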
Let's now create the simplest possible profile, and try to forbid the chmod
syscall:
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64"
  ],
  "syscalls": [
    {
      "name": "chmod",
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
Basically this profile says: "allow everything (default), but throw an error in case the chmod syscall is invoked".
Let's call this profile profile.json
and let's run a container using it:
$ docker run --security-opt seccomp=profile.json -it alpine sh
/ #
You can verify that even though we are the root
user, we can't chmod
anything:
/ # cd tmp/
/tmp # touch test
/tmp # chmod 777 test
chmod: test: Operation not permitted
/tmp # id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
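For comparison, with Docker's default seccomp profile the same chmod goes through, since chmod is in its allowlist:
$ docker run -it alpine sh
/ # touch /tmp/test && chmod 777 /tmp/test && echo OK
OK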
Capabilities
The last feature I want to discuss is Linux capabilities. These have absolutely nothing to do specifically with containers, but containers have decent tooling around them, which makes them easier to use.
Capabilities in short are slices of the root
user's power. There are quite a lot of them,
which you can see by running man 7 capabilities
, and some of them are particularly risky or
interesting, depending on your point of view.
Anyway, why talk about capabilities in the context of containers? Well, because as I said, containers have good tooling around capabilities; in particular, they support a simple way to drop all capabilities inside containers, or to allow only a few of them, in a granular way.
Capabilities are also the feature that allows us to do certain privileged actions within
containers, without running containers as full root
user.
For example, let's run our container dropping all capabilities:
$ docker run --cap-drop=ALL -it alpine sh
Note: dropping all capabilities should be the default for any container which does not require them. Abusing capabilities (i.e., privileges) is one of the simplest ways to escape containers and compromise the host system! If you run in Kubernetes, dropping capabilities can be done via the securityContext spec.
To try some privileged action, we can install tcpdump
:
# apk add tcpdump
If we now try to sniff an interface, even as root, we get:
/ # tcpdump -i eth0
tcpdump: eth0: You don't have permission to capture on that device
(socket: Operation not permitted)
This is because tcpdump uses raw sockets to listen on the network, and raw sockets require the CAP_NET_RAW capability. Let's try adding only this capability to the container.
# docker run --cap-drop=ALL --cap-add=CAP_NET_RAW -it alpine sh
# apk add tcpdump
# tcpdump -i eth0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:09:25.754379 IP 151.101.66.132.443 > 84dd5599ce4e.53112: Flags [FP.], seq 2181022380:2181022404, ack 1351641877, win 294, options [nop,nop,TS val 3705369401 ecr 3549003202], length 24
Everything worked as expected.
There is a whole lot more to discuss about capabilities, specifically about how they are tied to processes, how they are inherited, etc., but I will probably keep that for a book I am writing (maybe, one day...).
If we want to know which capabilities a process is running with, we can use:
/ # cat /proc/$$/status | grep -i cap
CapInh: 0000000000002000
CapPrm: 0000000000002000
CapEff: 0000000000002000
CapBnd: 0000000000002000
CapAmb: 0000000000000000
We can decode that obscure number with capsh
:
capsh --decode=0000000000002000
0x0000000000002000=cap_net_raw
Will we Talk About Security?
So far we have only talked about separate features, one by one, without really putting the puzzle together to answer the very question we opened this post with: do containers help security? Does it help if I run containers as a low-privileged user?
Let's now try to answer this question and motivate the answer. The answer is yes, with a lot of different caveats.
Containers can make it much simpler to run (untrusted) programs in a much more isolated environment, with a much smaller impact in case of compromise, compared to standard native processes. They can, but they don't necessarily do so. Containers are ultimately just a collection of Linux features; they are not really "anything" else. Virtual machines have a hypervisor which helps create a concrete boundary between the host and the guest, where the guest runs its own kernel. Containers, on the other hand, are nothing more than processes constrained by some features, running on the same OS.
It is completely possible to run a container as insecurely as a process outside of it.
For example, run an image with the root
user, with the --privileged
flag, with all
capabilities added, with no seccomp
profile and maybe mounting /
to /host
inside,
and you will have exactly a root
process on the host, with the same level of impact in
case it gets compromised.
Even worse, you can have all this and run a completely untrusted artifact that you pulled from who-knows-where on the internet with docker pull or by building a Dockerfile. I would argue that this is worse than running a binary that was at least installed from your distro's default package manager.
However, if I have to run some "random" software, I am extremely confident in saying that it is an order of magnitude easier to run said software securely in a container than it is to run it natively, for example as a systemd unit.
Before making a final, concrete, example I want to discuss why each of the features discussed until now represents, to some extent, a security feature:
- Namespaces: modifying, and specifically restricting, the view that a process has of the system has huge security advantages. If I cannot see a mount, I cannot access files inside it. If I can't see an interface, I can't sniff it or modify it. Even more, if I am root inside the container but my user on the host is actually a low-privileged user, I can do limited damage to the system. Namespaces allow us to greatly reduce the attack surface available to compromised processes, which means both the probability and the blast radius of a compromise are smaller. A side effect of namespaces is also that the tools available are generally only those inside the container filesystem. If the container uses a minimal image, it might not even have a shell, ls, echo or any binary that can be used to enumerate or exploit the system.
- Cgroups: from the security point of view, cgroups prevent resource exhaustion, so they do provide a certain resilience against DoS attacks, such as an (admittedly old-school) fork bomb. If PIDs are limited, a process won't be able to exhaust them for the whole system. Similarly, CPU and memory limits can prevent a misbehaving process from crashing the whole system or starving other crucial applications.
- Seccomp: this one is fairly obvious, as limiting the syscalls available to a process can prevent certain dangerous actions against the system, such as an exploit that (ab)uses a particular syscall. The truth is that it is hard to build seccomp profiles for applications, though, especially since it is not possible to accurately and completely enumerate the syscalls a program may invoke across all codepaths. Usually applications are traced/profiled, but this is just a statistical methodology (see the sketch right after this list).
- Capabilities: the root user, especially without user namespaces, can do a lot even within a container, because it is talking to the same kernel as the host. Being able to limit the privileges available greatly reduces the agency of an attacker. Some container-escape techniques rely specifically on certain root superpowers, therefore this gives a simple answer to the second question: does it help if I run a container as a low-privileged user? Yes.
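To make the seccomp point above a bit more concrete, this is the kind of (statistical) tracing I mean; ./myapp is just a placeholder, and the resulting list is only as complete as the codepaths you manage to exercise:
$ strace -f -c ./myapp
With -f, strace follows child processes; with -c, it prints a per-syscall summary when the program exits, which can then be used as a starting point for an allowlist.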
I will repeat it once again: it is possible to use all of these features when running a regular process on the host too. Containers are just a very handy way to do all of this at once, with a couple of flags. At least, this is my experience; I have yet to meet anybody who, for every systemd unit, creates a bunch of namespaces, drops all capabilities, configures cgroups and applies a seccomp profile. If someone does that, congratulations, you are a master of systemd :).
PoC or GTFO
Following the dearest traditions of the web, let's end with a concrete example.
Let's say we want to run a very secure web application: a web application with an API endpoint which takes a command and an argument as parameters, and executes it. Some sort of devilish backup utility controlled via API, for example. The reason I am choosing a program that calls other external programs is to avoid cheating with a single, self-contained binary, which works especially well in a container.
The code is the following (don't use it anywhere!):
use rocket::serde::{json::Json, Deserialize};
use std::process::Command;
use std::str;

#[macro_use]
extern crate rocket;

#[derive(Deserialize)]
#[serde(crate = "rocket::serde")]
struct Task {
    command: String,
    args: String,
}

#[post("/backup", data = "<task>")]
fn backup(task: Json<Task>) -> String {
    let _child = Command::new(&task.command)
        .args([&format!("{}", &task.args)])
        .spawn()
        .expect("Could not run the command");
    format!("OK!")
}

#[launch]
fn rocket() -> _ {
    rocket::build().mount("/", routes![backup])
}
To run it locally, the following needs to be included in the Cargo.toml
:
[dependencies]
rocket = { version = "=0.5.0-rc.4", features = ["json"] }
This example is really close to the quick start
in rocket
's documentation.
Essentially what we have is a very simple webserver that listens on a single endpoint, /backup, which we can pretend is used to "run backups".
The program is intentionally vulnerable; it's basically RCE-as-a-service, but it simulates a regular RCE vulnerability. Actually, this is even better behaved than some real vulnerabilities, as Rust's Command does not spawn a shell, so shell metacharacters (such as <, >, etc.) are not supported.
We can run any command we want, even if "by-design" it was only supposed to run
restic
, borg
or similar tools:
# Let the program "curl" itself
$ curl -XPOST localhost:8000/backup -d '{"command": "curl", "args": "localhost:8000"}'
# Create a file on the disk
$ curl -XPOST localhost:8000/backup -d '{"command": "touch", "args": "/tmp/test"}'
$ ls /tmp
[...]
test
Now, assume that we want to run this application "securely" on our system, and let's take into consideration the following options:
- Run as a Systemd unit
- Run in a container
Let's consider also the following requirements:
- Permissions. This is a "backup" application, which means it might need access to the paths to back up. Let's say for the sake of this conversation that we want at a minimum to back up /var/log and /opt/myapp. /opt/myapp is owned by appuser while /var/log is owned by root:syslog (common in many distros).
- Exposed. The application is exposed publicly, or anyway within the network.
- Untrusted. We are installing a new fancy "backup" service, we don't own the code, and we don't know who can commit to it. Basically, it is an untrusted application.
I hope none of these requirements sounds too made-up: it is fairly common for self-hosters to run applications like this, and it is also fairly common that some application needs access to some part of the filesystem.
Scenario 1: Systemd unit
The first thing that immediately comes to mind is that we don't want the application to run as root. So we will probably have to create a dedicated user on the system for it, say wtfbackup. Then we need to figure out the next step, which is how to give this unprivileged user access to the /var/log folder, which is owned by root.
Some files might be world-readable, but others might only be readable by root and the members of some groups. For example, on my system:
-rw-r----- 1 syslog adm 171848 Nov 14 00:00 kern.log
So once again we have a few options:
- Add the wtfbackup user to the adm group, which most likely has other permissions "elsewhere".
- Add the DAC_READ_SEARCH capability to the application binary and to the restic binary. This means that the binary will have read access to everything, including /etc/shadow, /home/**/.ssh/id_rsa, etc.
We have one more option (or a bunch more), which is to use several systemd settings, such as ProtectHome=, ProtectSystem=, PrivateTmp= or even ReadOnlyDirectories=, ReadWriteDirectories=, etc. These options, and others like them, make parts of the filesystem either read-only or completely inaccessible. It will not surprise anybody that, under the hood, our friendly namespaces are used (specifically, mount namespaces).
Let's say that we are super diligent and knowledgeable, and we know all these options for Systemd, so now we have a unit that can only access certain parts of the filesystem.
How do we also limit which binaries and functionalities we can run?
One way is to restrict the capabilities, using the CapabilityBoundingSet= option, NoNewPrivileges=, and other options (see the docs).
We will need to spend quite some time limiting what the application can access, especially because there are quite a few binaries that are "dangerous". For example, if an attacker managed to invoke sh, python, etc., then a reverse shell becomes a possibility, as do other attacks such as a fork bomb. So what do we do?
Well, we keep digging into systemd options: there are options to limit network connections, options to limit resources, and so on.
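Just to give an idea of what this looks like in practice, here is a sketch (the property values are purely illustrative, the wtfbackup path is made up, and this is not a complete hardening recipe) of applying several of these options at once with systemd-run:
$ sudo systemd-run --uid=wtfbackup \
    -p ProtectSystem=strict -p ProtectHome=yes -p PrivateTmp=yes \
    -p NoNewPrivileges=yes \
    -p CapabilityBoundingSet=CAP_DAC_READ_SEARCH \
    -p AmbientCapabilities=CAP_DAC_READ_SEARCH \
    -p MemoryMax=256M -p TasksMax=32 \
    /opt/wtfbackup/wtfbackup
The same properties can of course go into a proper unit file; the point is simply how many knobs you need to know about to get there.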
I don't want to go into even more detail and provide a prototype for the perfect answer, because - to be honest - I don't know the perfect answer myself with systemd. I have not seen a single tutorial, post, page, or systemd unit template that uses more than a handful of these options.
I checked on my own machine and looked at a few random units, such as docker.service or the teamviewer service. I couldn't find even one of the security options listed above.
This in itself does not prove anything, but I think it is a reasonable argument that if some features are obscure, probably not many people are using them. In any case, the point I want to make is a different one: we are talking about systemd security options, not just "systemd".
Scenario 2: Container
Let's look at the container scenario now.
Before even looking at the details, it is important to note a fundamental difference. In the container case, the software author, who knows the software much better than the user, can configure a lot of security features for the program.
For example, the author knows exactly which external binaries are/should be called, and can provide a container with those (and only those), with the proper capabilities already added. The author can also provide a low-privileged user by default for the container, so that not everyone in the world needs to be knowledgeable enough to run the application securely.
This said, we can adopt common Docker best practices while building the image (which, again, is often the responsibility of the software maintainers, not the user), such as the following:
- Use a multi-stage build, with the final base image being as minimal as possible, ideally scratch or distroless.
- Add only the necessary capabilities to the binaries.
- Run as a low-privileged user.
Pretty much everything else comes out-of-the-box. Only the filesystem paths which
are explicitly mounted will be available inside the container, for example. We don't
even need to think about protecting our /etc
, because unless it is mounted specifically,
by default it's not accessible to the processes inside the container.
A Dockerfile for such a build could look like the following:
FROM rust:1.73 AS builder
ENV USER=root
WORKDIR /code
RUN apt-get update && apt-get install -y pkg-config restic libcap2-bin
COPY Cargo.toml /code/Cargo.toml
COPY Cargo.lock /code/Cargo.lock
COPY src/ /code/src
RUN cargo build --release
# Set DAC_READ_SEARCH for both restic and wtfbackup.
# This is required to read all files to backup, which
# can be owned by arbitrary users.
RUN setcap cap_dac_read_search+eip /usr/bin/restic
# Set both DAC_READ_SEARCH and NET_BIND_SERVICE for wtfbackup
# (the latter so it can bind to a port). Note that setcap
# replaces the file's capability set, so both capabilities
# must be set in a single invocation.
RUN setcap 'cap_dac_read_search,cap_net_bind_service+eip' /code/target/release/wtfbackup
# Note that building without docker buildx will result
# in the capabilities being lost, because COPY does not
# preserve extended file attributes by default
FROM gcr.io/distroless/cc-debian12
USER nonroot
COPY --from=builder /code/target/release/wtfbackup /
COPY --from=builder /usr/bin/restic /
ENTRYPOINT ["/wtfbackup"]
This image can be built with the following command:
docker buildx build -t wtfbackup:v0.1 --output type=docker --platform=linux/amd64 .
Note that restic is included in the image, assuming it is one of the tools we are supposed to be able to invoke.
We can then run the container with the following command:
docker run -p 8000:8000 -v $PWD/Rocket.toml:/Rocket.toml -it --cap-add=CAP_NET_BIND_SERVICE --cap-add=CAP_DAC_READ_SEARCH wtfbackup:v0.1
Note that we have to explicitly add the capabilities we want to run the
container with, or better, that we want to allow within the container at all.
Even if the files have the capability added (we use setcap
), if the container
is launched without specifically permitting them, we will get an error like the
following:
exec user process caused: operation not permitted
So, without any configuration from the user, how much damage can we do? Let's try a few commands:
$ curl -XPOST localhost:8000/backup -d '{"command": "id", "args": ""}'
# Nope
Could not run the command: Os { code: 2, kind: NotFound, message: "No such file or directory" }
$ curl -XPOST localhost:8000/backup -d '{"command": "/restic", "args": "-h"}'
# Prints help from restic
$ curl -XPOST localhost:8000/backup -d '{"command": "/bin/ls", "args": "/"}'
# Nope
$ curl -XPOST localhost:8000/backup -d '{"command": "/bin/sh", "args": ""}'
# Nope
$ curl -XPOST localhost:8000/backup -d '{"command": "cat", "args": "/proc/1/status"}'
# Nope
$ curl -XPOST localhost:8000/backup -d '{"command": "echo", "args": "test > /test"}'
# Nope
And we could continue for a while. We can look inside the image using
dive
to see what's available.
dive wtfbackup:v0.1
Looking inside the layers, we can see that in the image there are basically no
binaries. There are some libraries (like libc
), there are timezone files and
that's basically it.
What damage can be done if this application is exposed through the web? Well, at
the moment it would be possible to somehow exhaust resources (or try to),
potentially mess with backups by using restic
, but not much more.
Sure, it is possible that a memory corruption vulnerability is found in restic or in some rust library, and that it can be exploited in some way. For example, restic could be used to "restore" a shell from an untrusted (remote) repository, and maybe that could be a way to gain access; but even if that were the case, escaping the container and doing damage to the host is going to be very complex.
Wrapping Up
If we look critically at what we have been discussing, we can now finally reach a conclusion and provide an answer to the original question. Do containers help security? Yes, they do.
They enable users of applications to use certain features of Linux which reduce the attack surface and reduce the impact if said applications are vulnerable and compromised. In addition, they have the advantage that a lot of the security work can be done by the software writer, rather than by the user.
Are containers the only way to achieve said security? Of course not. Systemd has a wide range of security options which allow users to reach the same level of isolation. Flatpak and similar systems also use namespaces, cgroups, seccomp and capabilities to isolate applications and their dependencies. However, I think it is pretty obvious that the features containers use under the hood are security features, even if containers themselves might not have been designed specifically as a security tool (and who can really say what they were designed for?). Containers are just a very convenient way to use those features transparently (and sometimes unknowingly).
Obviously it is still important to follow good security practices, run images from trusted sources, expose publicly only what's really needed and so on, but this applies universally.
If you find an error, want to propose a correction, or you simply have any kind of comment and observation, feel free to reach out via email or via Mastodon.