Kubernetes Runtime Security for the Pessimist

On Monday, as I usually do, I opened the Cloudseclist email that I get every week. In this issue, there was one article that caught my attention, as it concerns one area where I am quite invested professionally: Kubernetes runtime security.

The article in question is this one, and I want to stress up front that I have absolutely nothing against the author and that this article is no "worse" than the general material that can be found online. However, I find it representative of a tendency in the industry (the tech industry as much as the security industry) to barely scratch the surface and to build a personal brand instead of putting quality, in-depth information out there.

For this reason, I want to take some time to critique the trend I observe, and then discuss some of the content from a technical point of view.

Pop-Security

There are a lot of very good people out there who write about technical topics, who are real experts, and who provide us mortals with vital information we can learn from. However, around certain topics in particular, there is also a lot of fluff. Specifically, two kinds of fluff are very common:

Marketing material disguised as information/blogging/etc. - usually this is pushed by one organization directly.
Self-promotion material: guides/tutorials that are extremely basic, little more than a rewording of the official documentation, and aimed mostly at increasing the reputation of the writer.

As is clear from the introduction, in this case we are talking about the second kind.

Again, I have nothing against the author of this article, and I perfectly understand why people do this. I am 100% sure that the author had nothing but good intentions in writing that piece, and I could have referenced dozens of similar articles instead.

One common theme across such articles is the superficiality with which they treat topics that are extremely complex and nuanced. One objection that I am sure many will raise while reading my words is that such material is aimed at beginners, that it is meant to introduce people to certain topics, and that I am trying to gatekeep security. Well, my hot take is that certain topics are inherently not appropriate for beginners, and treating them as such is harmful, especially for beginners.

For this reason, I will refer to such articles with the term "pop-security". What I mean by this term is that they discuss topics in a technically accurate way, but they don't provide any kind of contextualization or insight, to the point that they become misleading or useless. It's the kind of material that can make a person feel like "they know security" after reading 1000 words about a super complex topic, but if that person actually started working on that topic, they would essentially have to start from scratch.

Talking about this article in particular, the first line in the introduction tells us:

This blog post provides a comprehensive guide to securing Kubernetes, covering runtime security and system hardening.

If I look above, under the name of the writer I see:

10 min read

The whole article is 2000 words, and it touches on:

strace
Apparmor
Falco
seccomp
Kubernetes Audit API
Immutable Pods

On average, that is about 330 words per topic.

The same article is introduced in the newsletter with:

This blog post provides a guide to securing Kubernetes, covering runtime security and system hardening.

I want to focus on two words: comprehensive and guide.

"Comprehensive" should indicate that the content is broad in scope and possibly covers most, if not all, aspects of the topics it touches. Being a "guide", instead, makes me think that by following it I should be able to harden my systems and my cluster.

I will argue instead that this article, like most pop-security self-promotion material, is neither. It's an article that can be considered an (incomplete) index of keywords to research in the context of runtime security.

Let's see why I think so.

Strace

To be honest, I have no idea why the article starts with this section, but I find it extremely lacking:

Strace, short for system call trace, acts as a sidecar on the system call interface

It doesn't. strace uses ptrace as far as I know, which lets it "intercept" each system call. For information on how strace can be used, I usually refer to this talk by Michael Kerrisk. Note that this is a 40-minute talk on strace alone, speaking of "comprehensive".

Among the practical applications of strace is the ability to inspect whether secrets stored in etcd are accessible.

I really don't think strace is the tool for this job, even though it technically can do it. It's also unclear what "accessible" means. Accessible to whom? To etcd itself? To someone who gains physical access to the box?

I don't want to comment on the whole article Reddit-style by quoting every line, but I think this already shows the superficiality I referred to. What I believe the author wanted to mention is that by default in Kubernetes secrets are stored unencrypted. There are ways to encrypt them so that etcd will only store encrypted blobs, while the encryption keys are stored in a KMS or - alternatively - in an encryption configuration (which doesn't solve the problem).

But why would we need strace if we can simply check the data files of etcd (or even comfortably use etcdctl) to verify that our secret is present in cleartext? Surely at some point etcd will write the content to the file and therefore the secret will be in the payload of a write call (or something similar), but it's definitely not "the way" to do it. In fact, the article itself ultimately doesn't use strace: it simply lists the open file descriptors for the etcd process and then checks the content of the db file for strings.
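As a minimal sketch of the etcdctl route, the check boils down to reading the raw value stored for a Secret and looking at its prefix. This assumes the default `/registry` key prefix and relies on the fact that values written through an encryption provider are stored with a `k8s:enc:` prefix; the helper names are hypothetical, and the connection flags (endpoints, certificates) are left to the caller.

```python
import subprocess
from typing import Sequence

# Values written through an encryption provider start with this prefix,
# e.g. b"k8s:enc:aescbc:v1:<key name>:...". Plaintext values do not.
ENCRYPTED_PREFIX = b"k8s:enc:"


def etcd_key_for_secret(namespace: str, name: str) -> str:
    """Key under which the API server stores a Secret (default /registry prefix assumed)."""
    return f"/registry/secrets/{namespace}/{name}"


def is_encrypted_at_rest(raw_value: bytes) -> bool:
    """True if the raw etcd value carries the encryption-provider prefix."""
    return raw_value.startswith(ENCRYPTED_PREFIX)


def read_raw_secret(namespace: str, name: str, etcdctl_args: Sequence[str] = ()) -> bytes:
    """Fetch the raw stored bytes with etcdctl (v3 API).

    etcdctl_args is where endpoint and TLS flags would go; they depend on the cluster.
    """
    cmd = ["etcdctl", *etcdctl_args, "get",
           etcd_key_for_secret(namespace, name), "--print-value-only"]
    return subprocess.run(cmd, check=True, capture_output=True).stdout
```

A plaintext result only tells you that the value is readable by anyone who can read etcd or its data files; which of those actors matter is still a threat-modeling question.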

In this case, the article shows something useful (secrets can be stored in plaintext) but in a sterile way. Specifically:

It doesn't contextualize the problem.
It doesn't explain which threats we are considering (e.g., etcd credentials compromised, lack of authentication for etcd, access to the machine where etcd is running).
It doesn't explain how to check in Kubernetes if encryption is enabled.
It doesn't even begin to talk about controls to apply (!): how to model the problem, how to approach the solutions, what to consider when hardening the cluster (e.g., running etcd on a completely dedicated set of machines).

Until this point, the article seems to be just a way for the author to mention a couple of things and show a couple of nice shell tricks, but a beginner will get absolutely nothing out of this, and of course it is completely pointless for more expert readers.

Falco

The next section of the article touches on Falco. The first line of this section was already confusing:

Falco is a CNCF project that was created to track all actions taken by Kubernetes administrators.

Tracking all actions taken by administrators is the responsibility of the auditing system. Falco is instead a tool that monitors syscall usage and alerts on it according to preconfigured rules. The section then improves slightly compared to both the previous one and its own opening. Still, I have a number of observations:

First, a small thing: the author runs Falco as a systemd unit, but then checks the logs with tail -f /var/log/syslog | grep falco instead of journalctl -u falco. This is purely a matter of preference, but it's one of those things that made me question whether they actually worked with this tool extensively before writing about it.
Second, the author briefly mentions rules, but doesn't give a single concrete example, doesn't show how to write a custom rule, where to find the syntax to use, or which use-cases Falco can satisfy and which it cannot.
Third, the author doesn't mention exceptions at all. Any professional who ever has to work with Falco will definitely work in a context in which at least some applications need exceptions to the rules.

In addition to the above, and to reiterate on exceptions, the article doesn't really cover the subject in a real-world fashion. This can almost be mentioned as another theme for pop-security articles, in which topics are presented in the simplest form, but without any real-world context, where they usually are much, much, much, much, much harder to apply, with concrete problems to solve.

For example, in the Falco case, as I mentioned before, there is no mention of exceptions. These are not only essential if you want to run a real cluster, but they are also a vector for rule bypass. Lots of precautions have to be taken to make exceptions specific enough, and to combine rules so that one rule catches the bypass of another. There is also no mention of a strategy to manage rules properly, since going to a machine and editing a YAML file is not how a professional should aim to implement changes in such a sensitive component.

All in all, operating Falco is a complex task which takes a lot of time and effort. Getting it running and seeing some events live is easy, but configuring it as a beneficial runtime security tool is much harder, and is what I would expect from a "guide".
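To make the bypass concern concrete, here is a toy model in Python. It is not Falco's engine or rule syntax: the rule, the image name and the matching logic are all hypothetical, and only the field names (`proc.name`, `proc.pname`, `container.image.repository`) are borrowed from Falco. It only illustrates how an exception keyed on a single broad field silences more than the one legitimate case it was written for.

```python
from typing import Dict, List

Event = Dict[str, str]
# An exception is a set of field -> value pairs that must ALL match.
RuleException = Dict[str, str]


def rule_fires(event: Event) -> bool:
    """Hypothetical rule: alert when a shell process is spawned."""
    return event.get("proc.name") in {"sh", "bash"}


def is_excepted(event: Event, exceptions: List[RuleException]) -> bool:
    """An event is excepted if any single exception matches on all of its fields."""
    return any(
        all(event.get(field) == value for field, value in exc.items())
        for exc in exceptions
    )


def alerts(event: Event, exceptions: List[RuleException]) -> bool:
    return rule_fires(event) and not is_excepted(event, exceptions)


# Broad: everything running in this image is exempt from the rule.
BROAD = [{"container.image.repository": "registry.example/backup-job"}]
# Narrow: only the known parent/child pair in this image is exempt.
NARROW = [{
    "container.image.repository": "registry.example/backup-job",
    "proc.pname": "backup.sh",
    "proc.name": "sh",
}]

legit = {"proc.name": "sh", "proc.pname": "backup.sh",
         "container.image.repository": "registry.example/backup-job"}
attacker = {"proc.name": "bash", "proc.pname": "python3",
            "container.image.repository": "registry.example/backup-job"}
```

With `BROAD`, both events are silenced, so a shell spawned by an attacker inside the exempted image goes unnoticed. With `NARROW`, the legitimate job stays quiet and the attacker's shell still alerts. Even the narrow version matches on names an attacker may control, which is why a second rule watching for that kind of spoofing is usually needed.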

Immutable Pods

Once again, in this section the author starts with a very confusing statement:

One way to achieve immutability in Kubernetes is to use the startupProbe. This probe is designed to detect when a pod has finished starting up and is ready to be monitored by Kubernetes.

This is absolutely not true. startupProbe, like livenessProbe, has absolutely nothing to do with pod immutability. It's purely a construct for legacy applications that are slow to start.

Pod immutability can be ensured in a couple of ways:

First, by configuring the securityContext properly and setting readOnlyRootFilesystem to true, which the author does mention later.
Second, by using Falco to monitor the creation of new files or the modification of existing ones. Realistically, this is worth doing for executables only, but technically, it can be done.

The article once again fails to contextualize the problem, doesn't explain why immutable pods are important. Actually, it does:

This makes them more secure and stable, as it prevents attackers from making changes to the pod that could compromise its security or functionality.

The last part makes me once again question whether the author knows this topic really well. Immutable pods are "more secure" because changes at runtime can be an indication of compromise. But what changes is the actual filesystem of the running container, not the pod itself. I will give the author the benefit of the doubt, but the last sentence almost sounds like the author is hinting that it's the pod (as in the Kubernetes resource) that is immutable, which is not the case.

The author also misuses the startupProbe field in the example, executing one-shot commands rather than liveness-checking commands, and suggesting that this could be used to secure the pod by removing the shell. The proper way to approach this problem is from an image point of view (minimal images, distroless images, etc.), or using Falco, or restricting the capabilities of the running processes, etc. (by the way, the article does not mention what the "default" securityContext for pods should be, which is one of the easiest and most effective runtime controls). I hope I did not misunderstand the article, but I couldn't find another interpretation for the example they used.
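As a rough sketch of what a "default" securityContext could look like, the function below audits a container spec (as a plain dict, the way it would look after parsing a manifest) against a strict baseline. The baseline is my own assumption, loosely following the "restricted" Pod Security Standard plus a read-only root filesystem; the function name is hypothetical, and for simplicity it only looks at the container-level securityContext, ignoring values inherited from the pod level.

```python
from typing import Any, Dict, List


def audit_container(container: Dict[str, Any]) -> List[str]:
    """Return the list of deviations from an (assumed) strict baseline."""
    sc = container.get("securityContext") or {}
    findings: List[str] = []

    if sc.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem is not true")
    if sc.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation is not false")
    if sc.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not true")

    dropped = (sc.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        findings.append("capabilities.drop does not include ALL")

    seccomp_type = (sc.get("seccompProfile") or {}).get("type")
    if seccomp_type not in ("RuntimeDefault", "Localhost"):
        findings.append("seccompProfile.type is not RuntimeDefault or Localhost")

    return findings


hardened = {
    "name": "app",
    "securityContext": {
        "readOnlyRootFilesystem": True,
        "allowPrivilegeEscalation": False,
        "runAsNonRoot": True,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
}
```

In practice this kind of check belongs in an admission controller or policy engine rather than in an ad-hoc script, so that it is enforced at deploy time instead of being reviewed after the fact.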

Kubernetes Auditing

This section is OK. The article itself adds very little value to the tutorial present in the official documentation, though.

In my opinion, the superficiality comes out once again in the fact that the author doesn't attempt to explain what an exhaustive audit policy should log, and why those resources/actions in particular are important to monitor. There is no strategic vision of what this tool enables; the presentation is limited to showing how the tool works.
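To give an idea of the kind of reasoning I would expect, here is a toy model of how an audit policy is evaluated: rules are checked in order and the first matching rule sets the audit level. The rule shape is deliberately flattened compared to the real `audit.k8s.io/v1` Policy (no API groups, users or namespaces), and the specific choices of what to log at which level are my own assumptions, not a recommendation from the article or the documentation.

```python
from typing import Any, Dict, List

# Order matters: the first matching rule wins.
POLICY_RULES: List[Dict[str, Any]] = [
    # Read traffic on events is pure noise for security purposes.
    {"level": "None", "verbs": ["get", "list", "watch"], "resources": ["events"]},
    # Never log request/response bodies for secrets: the audit log would
    # then contain the secret values themselves.
    {"level": "Metadata", "resources": ["secrets", "configmaps"]},
    # Changes to who can do what deserve the full payload.
    {"level": "RequestResponse",
     "verbs": ["create", "update", "patch", "delete"],
     "resources": ["rolebindings", "clusterrolebindings"]},
    # Catch-all: keep at least metadata for everything else.
    {"level": "Metadata"},
]


def audit_level(rules: List[Dict[str, Any]], verb: str, resource: str) -> str:
    """Return the level of the first rule matching the request."""
    for rule in rules:
        if "verbs" in rule and verb not in rule["verbs"]:
            continue
        if "resources" in rule and resource not in rule["resources"]:
            continue
        return rule["level"]
    # Assumption of this sketch: a request matching no rule is not logged.
    return "None"
```

The interesting part is not the mechanism but the comments: each rule should be traceable to a question the security team wants to be able to answer after an incident.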

Apparmor & Seccomp

I will group these two sections because in my opinion they suffer from the same problems. Apparmor and seccomp are somewhat similar tools. They both can be used to create profiles and limit the actions that processes can perform.

The article shows basic usage for both, even in this case taking examples that are almost identical to those in the official Kubernetes documentation. In both cases though, the article once again doesn't make any effort to explain:

What kind of use-cases Apparmor/Seccomp should be used for.
How to build a profile for a use-case.
How to manage this at scale, especially in the context of Seccomp.

Both these topics are extremely complex and advanced. For both tools the main problem lies in being able to answer the very simple question: "What actions does my application need to take?". For seccomp in particular the question is: "What syscalls will my applications execute?". These questions are exactly what makes the use of these tools extremely advanced and demanding.

To the best of my knowledge, the only way to understand what syscalls an application invokes is by running it and profiling it. Obviously there is nothing that guarantees that all code paths are taken, especially those that might require some specific condition (such as production load). In addition, such an experiment should in theory be conducted every time an application is modified, making the process absolutely crucial to automate.

Finally, while the tutorial uses the same deployment mechanism for profiles that is shown in the documentation tutorials, it's clear to anybody that node-local profiles do not scale and are a pain to manage. Additional tooling is needed to solve the very simple problem of ensuring that wherever applications are scheduled, profiles are available. While I don't care about specific mentions of tools, such as the security-profiles-operator, this problem is one of the most relevant in using these tools, and the fact that the article does not mention it at all shows once again that we are stuck on the surface.
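A minimal sketch of what automating this could look like: parse the output of a profiling run with strace, collect the syscall names, and emit an allow-list profile in the JSON format used by container runtimes. The sample trace is made up, the regex only handles the common strace line shapes, and the single hard-coded architecture is an assumption; a profile built this way is only as complete as the run it was derived from, and it does not include the syscalls the container runtime itself needs before the application starts.

```python
import json
import re
from typing import Any, Dict, Iterable, Set

# Matches "name(", optionally preceded by "[pid 123] " or "123 " (strace -f).
# Signal lines ("--- SIG... ---"), exit lines ("+++ ... +++") and
# "<... name resumed>" continuations are skipped on purpose.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(lines: Iterable[str]) -> Set[str]:
    """Collect the distinct syscall names seen in strace output."""
    names: Set[str] = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def seccomp_profile(names: Iterable[str]) -> Dict[str, Any]:
    """Deny by default, allow only the observed syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],  # assumption: x86-64 nodes only
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }


# Made-up trace, for illustration only.
SAMPLE_TRACE = r'''execve("/usr/bin/app", ["app"], 0x7ffd12345678 /* 12 vars */) = 0
[pid  4242] openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
4243 read(3, "127.0.0.1 localhost\n", 4096) = 20
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED} ---
<... read resumed>"", 4096) = 0
+++ exited with 0 +++'''

observed = syscalls_from_strace(SAMPLE_TRACE.splitlines())
profile_json = json.dumps(seccomp_profile(observed), indent=2)
```

Running this in CI on every build is the easy part. The hard part remains deciding which workload to replay so that the trace is representative.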

Conclusions

I will maybe take some time to write my own thoughts about Kubernetes runtime security, but I would say that the key points to look at are the following:

The basics

Strict defaults for SecurityContext of pods.
A policy engine/admission controller that enforces security policies (e.g., Kyverno, OPA Gatekeeper).
Network Policies that restrict network traffic to only needed flows.
Strict(er) RBAC policy, that protects critical resources and namespaces as much as possible.

Once the basics are covered

Runtime monitoring tools (e.g., Falco) with dedicated rules for specific use-cases.

When your security team is bored

Security profiles.
MicroVMs/Enclaves.

Obviously there are other areas to cover, such as audit logs and cluster configuration, but those are technically not runtime (at least not under my interpretation), so I am not including them.

Once again I want to repeat that I don't have anything against the author of this article in particular, and I could have taken many other articles I have encountered in the past as similar examples. There is simply a trend in the industry in which everyone has to build their own brand, leading to a lot of fluff being produced with the sole purpose of improving the credentials of the writer, rather than providing useful information to the readers.

The manifestation of this trend is superficial articles that do not provide remotely enough information to understand a topic to the needed level, nor present topics in appropriate contexts. The result is pop-security: lots of content with very little importance that gets shared to "give back to the community", and a complex topic like security being emptied of meaning. Some areas seem (based on my own empirical observations) more affected than others by this trend, and this seems to be correlated with the "hipness" of a certain technology or its relevance in the job market. Coincidence? I don't think so [cit.].