I hate Clouds

It is a Saturday afternoon when I start writing this. The weather is quite poor, with a wet morning and the promise of more rain in the afternoon. My partner is away for a couple of weeks, so I decided to take some time and write about something I have had in mind for a while, and which I have also expressed in multiple shapes and forms over the past years in various conversations: I hate Clouds.

I start writing this a couple of hours before having to prepare my lunch, which I already know is extremely optimistic, as I am sure this post is going to be much longer than I anticipate right now. Given the likely length of the text, I ask native English speakers for forgiveness in advance: there are going to be mistakes. I will try my best, but sorry nevertheless.

This post is going to be both technical and political. I know that a lot of my fellow tech workers are very uninterested in the political aspects of our job/industry, with the idea that we just "solve problems", but I hope you will bear with me. Either way, to make it simpler for the reader, I will divide this post into multiple sections: for technical sections I will use the 🪛 symbol, whereas for the political sections I will use the ✊ symbol. I know that this is going to be a controversial topic, so I will try to anticipate some objections, written as quotes and marked with the 🤌 symbol. I know that this is unnecessary, and that people can easily understand what is what, but damn it I want to have some fun and this is my blog after all!

Let's begin.

Introduction

There is no very clear definition of what "Cloud" is. I hope it's obvious to everybody that I do not mean the white fluffy things in the sky, but besides this, "Cloud" is effectively a buzzword. There are some who will smirk at you and say that it's just someone else's computer, and others who will yawn and say that back in our days (TM) we already had Clouds, it was called hosting, and so on. In this post, when I talk about Cloud I will refer to something specific:

A model of renting compute power from a provider, where each aspect of the computation is unpacked into separate pieces (e.g., hardware resources, IOPS, network traffic, etc.).
The above service is offered through many higher-level abstract features that integrate together.
The above service is offered by one of the big tech corporations.

In other words, I refer to "Cloud" specifically as the services offered by Google (GCP), Amazon (AWS), Microsoft (Azure) and a handful more players (IBM, Oracle).

🤌 If your critique is about the model, what does this have to do with which provider offers the service?

Exactly because my critique is not necessarily against the model, but about the specific blend of technical features and political dynamics that are implemented and created by the big players, which represent the vast majority of what "Cloud" is.

Now that the scope of this post is better defined, we can start discussing the merits of the argument. The first part of the post is going to cover the technical aspects, so that those who don't want to think about political issues at all will not even have to make the effort of scrolling further, and can just stop after the first half of the post. The second part is going to be my perspective on the political implications of the use of Clouds. Each of these parts is going to be divided into multiple sections.

Needless to say, the opinions in this post are going to be extremely subjective. I don't expect everything I am going to say to be universal or to apply to every human/organization on the planet.

⚠ Edit 08/07/2024: A very kind Lemmy user pointed out that there are quite a lot of acronyms and abbreviations in the text, which might not be very friendly for someone who is not a specialist in this area. To help with this, rather than just expanding them on first use, I added a small Appendix at the end where all acronyms and abbreviations are expanded and explained.

🪛 Technical Side

All of the following sections are somewhat related to each other; there is no clear separation. I am discussing them independently simply to better organize my thoughts and make sure I don't forget anything, but they should all be seen as part of a bigger argument rather than as separate issues.

🪛 Complicated

The first reason why I hate Clouds is that they are complicated. What I mean by complicated is not that they are complex, but that they have a level of complication built-in. This can be observed for example in the way the different providers implement IAM features (on this topic: IAM is the worst). It can also be observed in the way Clouds refuse to use any standard naming for their products (hello, security groups!), and in the way products are split into hundreds of different services that need to be configured, managed (and often paid for, of course) separately.

Then there are of course the random limitations that make things unnecessarily complicated. Take AWS and Network ACLs (NACLs): there is a limit on the number of IPs (you might be able to ask support to raise it, but it's a manual process). Each security group can have a maximum of N rules, which are counted with their own logic. In general, engineers who have worked a lot with Clouds find themselves at some point in their career fighting with a problem that is seemingly impossible to debug or explain, until they realize that they hit one of these magic limits in some sub-feature of a sub-service of a product, and this broke something else.

🤌 The complication is necessary to make sure that different needs can be covered!

Yes, that's true and I understand it. However, this doesn't change the fact that this complication exists and forces people to do a lot of joyless and uninteresting work (more on this later).

Take S3 encryption: data encryption is probably one of the simplest concepts around, one we have applied for decades, and yet to achieve it you either need another service and have to rely on IAM (rather than on encryption), or you have to encrypt the data yourself before saving it on S3.

It is not a coincidence that the documentation for the S3 service is a 3449-page PDF. My main point here is that individuals and organizations that require all the flexibility that Cloud services offer are a (tiny) minority. This means that for the majority of us, all the complexity necessary to provide this flexibility ends up being purely a complication or worse, a liability.

🪛 Risky

The above example about S3 encryption is a perfect starting point for the next point: using Clouds is risky. I am a security specialist, not a DevOps/Cloud Specialist, hence risk is the way I look at things. Well, right off the bat I would say that complexity hides security issues, and this is a known fact.

If you don't understand exactly how S3 encryption works, you will end up granting access to the data to applications or people that were supposed at most to get access to the files (or buckets). If you don't understand exactly what's going on with your CloudFront service, or with Cognito identity pools (BTW, the documentation about Amazon Cognito is a 1182-page PDF), you might end up publicly exposing a private bucket. Let's add to the above that Cloud documentation is generally complete, but extremely fragmented and scattered around a mix of pages about single services, tutorials and guides, unless you really want to deal with thousands of pages of PDF.

These are just some examples, most of them from AWS, but the way complexity (let alone complication) hides and causes security issues is a cross-cutting problem. In the GCP world, for example, there are quite a number of privilege escalation techniques (see for example this from DEFCON) that abuse certain services or obscure features/interactions.

Ultimately, the matter is fairly simple: Clouds have an enormous attack surface. Having all the different resources sitting within one trust domain (Cloud API) makes issues with certain services (which you are likely not even using!) relevant for others (much more critical perhaps).

When using Cloud providers, many organizations get this cozy feeling that since it's Google, or Amazon, or Microsoft who is taking care of the security of the "Cloud", everything is safe. In reality, not only is this wishful thinking, but what for the Cloud provider might be a regular, documented feature might be for a customer an unexpected vector with which attackers can take over their whole infrastructure or steal their data.

Let's repeat it one more time: complexity hides and creates security issues.

🪛 Boring

Based on the points above, calling all of this "boring" seems a contradiction. However, this is my personal feeling about Clouds, and I want to elaborate my perspective.

A complex system with many parts is generally something that takes a lot of effort to master, and its experts have a deep understanding of the area where the system operates (e.g., the Linux kernel, piloting an airplane or a ship, etc.). For me, the Cloud is nothing like that.

The way the Cloud is designed is essentially an attempt to sell companies the idea of outsourcing the complicated work needed to spin up and maintain digital infrastructure, so that they can focus on their products. This premise has the consequence that Cloud systems are a big puzzle. The pieces of the puzzle are the Cloud products. Engineers working with Cloud systems essentially need to understand the abstractions, but not necessarily the underlying mechanism that makes those abstractions work. For example, a Cloud expert might know everything about the difference between NACLs and Security Groups, all the details about how to configure them, their limitations etc., but the main idea is that such an expert doesn't need to know anything below that (e.g., how the traffic is filtered).

Ultimately my perspective, and I appreciate it's a very personal one, is that building and working with the Cloud makes me feel like a glorified application administrator. My job becomes researching how the Cloud solved the problem that I need to solve, and composing the solution in the way the Cloud provider imagined it should be solved, rather than solving the problem. If something doesn't work, click again through some more web interfaces and see what information the Cloud provider made available to you.

Being an engineer, I get absolutely no intellectual stimulation from this type of work. I feel like there is very little creative input and problem-solving from my end. Partially, I understand this is exactly why Clouds have been built: abstract so that digital infrastructure can be consumed rather than built. However, I can't help but feel that working with Clouds ultimately takes away the technical aspects that make you (me) feel rewarded: understanding a system so thoroughly that when I solve a problem with it, I have full power, which comes from a full understanding of the domain.

This point is further illustrated by comparing the training available in different areas. If I want to study Linux security, I can work through system administration courses, diving deep into any subject until I reach the physical hardware if needed. This allows me both to use the system and to fully understand it. Similar considerations can be made for other topics like network security, web application security and probably many other topics besides security. When it comes to Clouds (or Cloud security), you will learn borderline static information (e.g., service X should have this option flagged to make it secure) or the collection of services that exist to perform a certain function. Essentially it's as if Linux security were limited to CIS benchmarks and a list of tools that can perform different functions. Or as if network security were just a deep-dive into nmap flags rather than an understanding of why certain flags exist and their relationship to network protocols. I have observed this by studying for Cloud certifications: the entry-level certifications are essentially a dictionary whose job is to translate a technical need (e.g., a firewall to filter network traffic) into the relevant product, with a very brief overview of these products (which, as discussed before, in reality have a gazillion features). The more advanced certifications are essentially just a deeper dive into the many characteristics of those products.

I find all of this extremely boring, and I guess I still did not manage to find anything that would show that there indeed can be Cloud engineers and not simply Cloud users.

🪛 Unnecessary

I will start this section with a visual summary of my argument, using a highly scientific diagram made by Yours truly:

The tech world unfortunately is full of marketing and hype. It's no coincidence that in tech there is a yearly "revolution" (LLMs, Metaverse, Blockchain, etc.). In this regard, Cloud providers managed to sell their narrative to basically everyone, including governments - which now host plenty of public services in the Cloud.

So, why do I think Clouds are technically unnecessary? Because in reality the demands of most digital products are very simple and can be addressed very simply without them. In the diagram above, I have simplified by highlighting two categories of organizations:

Those whose digital needs are so small that for them it is not worth solving any technical problem directly, so that they can skip even having a technical department. To this group I would also add another small case, which is building proofs of concept. You might not know yet if something is viable, and to build a Proof of Concept you might not want to invest in the infrastructure yet. In this case it makes sense to have a short-term quick-and-dirty Cloud deployment.
Those whose digital needs are both complex and variable. This includes a need to scale rapidly and in multiple geographic areas, and whose load is very variable (10-100x or more, depending on conditions).

For those who fit in the first case, Cloud services make sense because the requirements are probably very lax, the number of products used is probably 0.n% of what Cloud providers offer, and the configuration of those products is probably extremely basic too. Examples of such organizations could be someone who has to run a particular tool for which an AMI exists, or who needs a static site hosted and can use S3 for it, or those who need a couple of machines to run their website(s). I don't want to make up any rule-of-thumb, but essentially if the needs are so tiny that a handful of services are all you need, in their most basic (default) configuration, then Cloud services can actually spare you the cost of personnel who would probably be underloaded otherwise (more on this later).

The ones fitting in the second case are on the other end of the spectrum. For these complex organizations any other deployment model would be extremely expensive and wasteful. The biggest advantage of Cloud providers in this case is the ability to spin up and decommission computing power on demand, in a very elastic fashion. If you have very complex needs, but you don't have such elastic load, you don't fit into this category.

🤌 You are forgetting about time-to-market! You are forgetting about my company which does X and the Cloud makes it much easier.

Many of the marketing points of Clouds (save on personnel, save on infrastructure, scale easily, focus on the product etc.) are covered (at least partially) in the next sections. In this section I want to focus on a simple fact, which is that most organizations will use at most a handful of Cloud services, and most likely in a basic way: compute instances, firewall rules, load balancers, databases, IAM. Maybe some will use object storage and a few more, but the point stands. Their needs are fundamentally simple, and they are also, for the most part, a solved problem. They don't need the vast majority, almost the totality, of what the Clouds offer.

Those who deserve a special mention, for me, are the ones who use the managed Kubernetes services (EKS, GKE, AKS). The whole point of Kubernetes is to take a bunch of machines, join them and make your own "Cloud", where "disks" or "storage" are called "PersistentVolume", DNS names are called "services", firewall rules are called "networkPolicies", and so on. Sure, managing a Kubernetes cluster is not a joke, but tooling has improved enormously, and tasks that back in the day were like playing with explosives at a gas station (like upgrades) are now fairly easy.
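To make the "firewall rules are called networkPolicies" mapping concrete, here is a minimal sketch in Python that builds a NetworkPolicy manifest as a plain dictionary and renders it as JSON (which `kubectl apply -f` accepts alongside YAML). The policy name, namespace, labels and port are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical example: "allow-web-to-db", the labels and port 5432 are
# made-up values. Only the manifest structure comes from the Kubernetes API.
network_policy = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-web-to-db", "namespace": "default"},
    "spec": {
        # Pods this "firewall rule" protects.
        "podSelector": {"matchLabels": {"app": "db"}},
        "policyTypes": ["Ingress"],
        "ingress": [
            {
                # Only pods labelled app=web may connect...
                "from": [{"podSelector": {"matchLabels": {"app": "web"}}}],
                # ...and only on this TCP port.
                "ports": [{"protocol": "TCP", "port": 5432}],
            }
        ],
    },
}


def render(manifest: dict) -> str:
    """Serialize a manifest to JSON text."""
    return json.dumps(manifest, indent=2)


if __name__ == "__main__":
    print(render(network_policy))
```

Conceptually this is the same thing as a security group rule ("allow TCP 5432 from the web tier to the database tier"), just expressed with label selectors instead of a provider-specific abstraction.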

🤌 To manage a Kubernetes cluster or a fleet of them you need a whole operations/platform team!

OK, show me an organization which uses a managed Kubernetes service and doesn't need an operations or platform team anyway. All you are changing is that instead of carefully upgrading the cluster, you are carefully updating node-groups. The maintenance effort saved is just not there. If you are an organization that benefits from using Kubernetes, you still have plenty of maintenance left to make sure your cluster doesn't implode, doesn't get compromised, is used efficiently and more. All you are saving is deploying a bunch of Go binaries (the controllers), and the maintenance of etcd (and perhaps some one-off tasks like CNI installation). All these tasks are by now heavily automated anyway, and the tooling around them is very mature.

🪛 Reliable?

I am sure that many people reading the points above are thinking something along the lines of:

🤌 Cloud providers allow you to have a very resilient and reliable system without comparable effort to what would be needed by building the system yourself!

Yes, this is true. Generally Cloud providers have a level of resilience which is quite good. However, when things go wrong, they can go very wrong.

In addition to more exceptional cases (like the UniSuper case), Cloud providers also go down. They don't exist on another planet, they exist on the same planet where datacenters go offline or burn down. However, they might have a geographical redundancy that is harder to achieve in other ways, and software processes that can move workloads easily across different zones.

Obviously resilience is ultimately a matter of cost. Every 9 we add to the uptime increases the cost substantially. Ironically, the fact that Cloud providers sometimes have outages is one of the reasons behind a recent trend that I will touch on later as well: multi-cloud deployments. Similarly to how in the past there might have been distributed systems spanning multiple datacenters or providers, now there are deployments that span multiple Cloud providers.

Another problem with reliability is that it ultimately depends on how you configure your resources. How many engineers can honestly say that they are confident that if a whole availability zone in AWS goes down their stuff will keep working? How many have actually tested it? What I mean by this is that given the complexity of the services offered and the nuances of each of those services, it is completely possible that some services are configured (or not configured) in a way that doesn't make them as resilient as they can be. In other words, if you want resilience, you still need to invest in it deliberately; using a Cloud service is not sufficient to guarantee it. This also means you need to pay twice: once for your personnel to invest in it, and a second time in the cost of the service (which already factors in the cost of offering this resilience).
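The "would we survive losing a zone?" question can at least be sanity-checked on paper before testing it for real. The following is a minimal sketch, assuming a hypothetical service described only by how many instances it runs per zone and how many it needs to stay up; the zone names and numbers are made up.

```python
def survives_single_zone_loss(instances_per_zone: dict, required: int) -> dict:
    """For each zone, report whether the remaining instances still meet
    the required capacity if that one zone disappears."""
    total = sum(instances_per_zone.values())
    return {
        zone: (total - count) >= required
        for zone, count in instances_per_zone.items()
    }


if __name__ == "__main__":
    # Hypothetical layout: 6 instances in total, 3 needed to serve traffic.
    layout = {"zone-a": 4, "zone-b": 1, "zone-c": 1}
    for zone, ok in survives_single_zone_loss(layout, required=3).items():
        print(f"lose {zone}: {'still up' if ok else 'DOWN'}")
```

In this made-up layout the service runs in three zones and still goes down when the wrong one fails, which is exactly the kind of configuration detail that using a Cloud does not fix for you. A real check would also have to cover databases, load balancers and every other stateful dependency.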

A separate post should then be written about the fact that most services shouldn't need the uptime that they currently require. The fact that even the app which tells you how hard the toilet paper you wipe your ass with is needs 99.9999% uptime is partially the result of Cloud marketing, which convinced clueless managers that if you have a 2h downtime a month (99.7% uptime) it's the end of the world. I would really like to see how companies calculated that paying a huge premium on their infrastructure to go from 99.9% to 99.99whatever is actually a net positive for them.
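To get a feeling for what each extra 9 actually buys, here is a small sketch that converts an availability percentage into allowed downtime, and back. It assumes a 30-day month (720 hours), which is a simplification.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month, 720 hours


def allowed_downtime_minutes(availability_percent: float,
                             period_hours: float = HOURS_PER_MONTH) -> float:
    """Minutes of downtime permitted in the period at a given availability."""
    return period_hours * 60 * (1 - availability_percent / 100)


def availability_percent(downtime_hours: float,
                         period_hours: float = HOURS_PER_MONTH) -> float:
    """Availability (in %) that corresponds to a given downtime in the period."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    for target in (99.7, 99.9, 99.99, 99.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes of downtime per month")
    print(f"2h down per month -> {availability_percent(2):.2f}% uptime")
```

Under this assumption, 99.9% allows about 43 minutes of downtime a month, 99.99% about 4 minutes, and 99.9999% under 3 seconds. Whether the step from 43 minutes to 4 is worth the premium is the calculation worth seeing.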

🪛 Vendor-lock

If you go to a GCP shop (or AWS, or Azure), enter the open office space (which unfortunately is what you will most likely find) and shout: "we need to move everything to another Cloud", you had better put nets under the windows first. This is not even my own opinion, I think it is a well-known fact that nobody will disagree with: Clouds lock you in - aggressively. Thinking cynically, you could see all the complexity I have talked about as also being - partially - a form of vendor-lock. You need to buy so hard into a specific Cloud provider's abstractions, tooling, set of services etc., that once you have your application running, moving away is essentially equivalent to starting from scratch.

This is exactly the definition of vendor-lock. Cloud providers do everything in their power to lock you within their platform, from long term rentals that provide a more affordable price to deals with organisations where personnel gets "certificates" (I spoke about this before, yuck). The objective is essentially having the engineers in an organization think about Clouds in the sense of that specific Cloud.

I don't think there is anything more to say about this really, except that it has an enormous cost. This form of vendor-lock makes you completely dependent on the will of one organization. In the case of the Cloud I am talking about here, not even a nice one. The prices increase? The service degrades? Some services you were using get deprecated? Some alternative would work more efficiently for you? Tough luck. If you want to move you will need to face huge costs: first the technical development and additionally the reskilling of your workforce. Yes, because vendor-lock is not something that has only to do with infrastructure. It also has to do with the skills of the engineers involved. Cloud knowledge, for the most part, is not portable. You are a wizard of IAM policies in GCP? Good job, this is completely useless if you go to Azure. Oh, you are a guru of VPCs and private endpoints? Well done, this is completely useless if you move to a different Cloud.

Concepts transfer well (e.g., a firewall is a firewall), but the abstractions provided by Cloud providers do not. You can imagine that this is related to the previous section (called Boring), and it will come back when I talk about deskilling labor later on. Vendor-lock also removes power from you as a customer and removes the incentive for the company to provide a good service. I will come back to this when talking about support.

🪛 Expensive

Finally the funniest section! Let's talk about money, and how ridiculously expensive running things on Cloud is.

Just recently I happened to read an article about the fact that many companies are moving away from the Cloud to save money. I will also use the experience of the companies I worked for as anecdotal evidence for this argument.

The summary is very simple: running things on Cloud is expensive as fuck.

🤌 You need to take into account all the development costs and personnel savings!

I will talk about personnel specifically in the next section. However, even accounting for all of this, in the medium and long term, running on Cloud is incredibly expensive. I don't think it is random that Cloud exploded exactly during the period in which money was thrown at every tech company as soon as a 3-slide deck was put together. I think it makes perfect sense instead. Clouds can be used to shorten the time to market. They are the perfect fit for a company which needs to move fast and continuously grow, without bothering to save money or be profitable. This was possible when money was not a problem, and companies were allowed to be unprofitable forever, so that plenty of money could be wasted on digital infrastructure, because the series E, F, G, H, Omega was always there to replenish the coffers.

Now the music has changed, and I think many companies find themselves in the situation where their infrastructure (Cloud) costs are simply unbearable, but due to the vendor-lock which we discussed, there is nothing that they can do about it, since they still need to keep the lights on to sell the product that generates revenue for them. I don't need to guess this, I can take the example of my own organization for this. We had 3 layoffs in a year, drastic reductions everywhere, all the while we pay tens of millions a year to the dear Cloud providers, with no significant chance to reduce this cost, despite the efforts.

"Despite the efforts" points to something ironic: many companies which run on Cloud, at some point, will have one or more teams whose main purpose is understanding how they are spending money in the Cloud and reducing those costs. If this sounds conflicting with the idea of reducing personnel, well, it is. The digital infrastructure of my organization is not that huge. Give or take 2000 compute instances (some very small). Something that 200 servers could easily provide. Cloud bills are more than $15 million/year. I checked a server builder for example, and an absolute beast (something like 2x Xeon Platinum processors, 200TB of NVMe disks, 1TB of RAM etc.) would still stay comfortably under $250k. 100 servers this powerful would probably provide a multiple of our computing power, and cost almost a third if we consider a lifetime of 3 years, which is very low. A more realistic estimation of 5 years leads to a saving of ~$50 million over those 5 years. Completely insane! This is of course if you want to buy hardware. Renting powerful servers runs you $500-1000/month. Assuming a cost of $1000/month, my company could rent more than 1000 powerful servers and still save money compared to Cloud costs, leaving plenty for additional services such as networking, storage, premium support (remote hands) or actual engineers' salaries.
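The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch using the figures quoted in the text at their upper bounds ($250k per purchased server, $1000/month per rented one); it deliberately ignores colocation, power, networking and staff, which the following paragraphs discuss separately.

```python
# Figures quoted in the text, taken at their upper bounds.
CLOUD_BILL_PER_YEAR = 15_000_000   # "more than $15 million/year"
SERVER_PRICE = 250_000             # price ceiling for one very large server
SERVERS_BOUGHT = 100
RENT_PER_SERVER_MONTH = 1_000      # upper end of the $500-1000 range
SERVERS_RENTED = 1_000


def purchase_saving(years: int) -> int:
    """Cloud spend over `years` minus the one-off hardware purchase."""
    return CLOUD_BILL_PER_YEAR * years - SERVER_PRICE * SERVERS_BOUGHT


def yearly_rent() -> int:
    """Yearly cost of renting the whole fleet."""
    return RENT_PER_SERVER_MONTH * 12 * SERVERS_RENTED


if __name__ == "__main__":
    print(f"Saving when buying, over 5 years: ${purchase_saving(5):,}")
    print(f"Renting 1000 servers, per year:   ${yearly_rent():,}")
    print(f"Left over vs. the Cloud bill:     ${CLOUD_BILL_PER_YEAR - yearly_rent():,}")
```

With these inputs, buying leaves $50 million over 5 years, and renting 1000 servers costs $12 million a year, which is $3 million a year below the Cloud bill.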

There is also another consideration about economies of scale. When you purchase or rent servers, you usually get the computing resources, their disks, and a certain amount of network traffic (often unlimited for premium ones). With Clouds, there are a number of costs which scale with your usage (e.g., EBS). That means that when renting or buying a server, your costs won't change whether you use 10% of the server or 80% of it, allowing you to grow and generate more revenue without increasing infrastructure costs. With Clouds, depending on your service usage, your bills could grow together with your customer base, eating up some of the marginal gains.

To make a concrete example, I wanted to look at an EKS cluster estimation. For this I used an official tool. One cluster with 10 cheap instances (t2.medium - 4GB RAM and 2 vCPUs) and some cheap storage (10000 IOPS of General Purpose SSD, with 500MBps and 1TB) would run you approximately $21000 without snapshots, $28500 with a daily snapshot (assuming 10GB changes per day). This is an upfront cost with a 3-year commitment to save money. A more realistic cluster for even a medium organization is easily going to cost above $50000/year, without considering network-related costs and other services, of course. A much better dedicated server (6 cores, 64GB RAM, 1TB NVMe disk) might cost you around $150/month. Let's say $200 to round up, without considering discounts and so on. A cluster of 10 of these nodes would cost $24000/year, and would probably offer 5-6 times more computational resources. Imagine now the usual scenario, where an organization uses many clusters (development, staging, production, different products, different customers, etc.), and you can see how a well-organized team which manages a fleet of powerful machines is going to pay for itself many times over very quickly. In fact, another company I worked for used to run websites serving millions of users on bare-metal Kubernetes clusters. The operations team was less than 10 people, and the yearly infrastructure costs, including everything, were less than $200000/year. A team of 5 to manage the equivalent in EKS clusters would probably cost 3-4 times as much, easily.
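For the dedicated-server side of this comparison, the arithmetic is simple enough to sketch. The snippet below only uses the numbers quoted above (10 nodes, $200/month, 6 cores and 64GB per dedicated node, 2 vCPUs and 4GB per t2.medium); treating a vCPU and a physical core as comparable units is a rough assumption, so the ratios are indicative at best.

```python
NODES = 10
DEDICATED_MONTHLY_USD = 200                 # rounded up from ~$150/month
DEDICATED_CORES, DEDICATED_RAM_GB = 6, 64
T2_MEDIUM_VCPUS, T2_MEDIUM_RAM_GB = 2, 4


def dedicated_cluster_yearly_cost(nodes: int = NODES,
                                  monthly_usd: int = DEDICATED_MONTHLY_USD) -> int:
    """Yearly rental cost of a cluster of dedicated servers."""
    return nodes * monthly_usd * 12


def spec_ratios() -> dict:
    """Raw per-node spec ratios, dedicated server vs. t2.medium.
    Assumption: one vCPU is counted like one core, which is generous
    to the Cloud instance."""
    return {
        "cpu": DEDICATED_CORES / T2_MEDIUM_VCPUS,
        "ram": DEDICATED_RAM_GB / T2_MEDIUM_RAM_GB,
    }


if __name__ == "__main__":
    print(f"Dedicated cluster: ${dedicated_cluster_yearly_cost():,}/year")
    for resource, ratio in spec_ratios().items():
        print(f"{resource}: {ratio:.0f}x per node")
```

The raw ratios differ a lot by dimension (3x on CPU count, 16x on RAM), so a single "N times more resources" figure is necessarily a judgment call that depends on the workload.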

The main problem is that once again, Cloud providers don't work with magic, they purchase hardware, build datacenters, hire engineers, develop products and then sell them to you. When you buy them, you are paying for all of that, plus the profit of the company (a generous amount, I am sure). Even worse, if your needs are not very advanced, you are paying the cost of a "premium" service (including all the features available that you don't need), while using it only minimally.

🪛 Saving on personnel!

I tried very hard not to let the previous section overflow into this one, because I want to talk about the myth of "saving on personnel" specifically. This is fundamentally a mirage, an industry commonplace.

First of all, I have already mentioned that at some point, depending on your usage, you might need people whose job is to understand and optimize your Cloud bills. You will need development effort, for example, to turn environments off and on based on schedules, so that you are not paying during the weekend, while still meeting the needs of everyone, including those in different time zones. These teams are usually relatively small, but they are a good starting point. The main problem is, however, that the large amount of complication that exists within the Clouds, the millions of foot-guns with which you can shoot yourself, and the hundreds of features, switches, configuration flags, permissions etc. mean that ultimately you still need a platform team. The company I work for has for years hired solely senior engineers. It is a very mature Cloud organization (multi-cloud right now, more on this later) and yet to function it requires a platform team of about 60 engineers (out of 250). Entire teams are dedicated to Cloud networking and to the maintenance of managed Kubernetes services! It's not that the people working for my company are dumb, it's that the services ultimately still require a lot of effort, development, customization, plumbing and maintenance, even if you are essentially outsourcing much of the work to the Cloud provider. To me this is completely insane. We are essentially paying twice: once for the people we employ, and a second time for the people working at the Cloud provider.
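The "turn environments off on the weekend, but not for colleagues in other time zones" logic is a good example of the plumbing that ends up being written in-house. Here is a minimal sketch of the decision function only, assuming hypothetical teams with fixed UTC offsets (daylight saving time is ignored) and made-up working hours; actually starting and stopping resources is provider-specific and left out.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical teams and offsets, for illustration only (no DST handling).
TEAM_UTC_OFFSETS_HOURS = {"europe": 1, "us_east": -5, "india": 5.5}
WORK_START = time(8, 0)   # assumed start of the working day, local time
WORK_END = time(19, 0)    # assumed end of the working day, local time


def should_be_running(now_utc: datetime,
                      offsets: dict = TEAM_UTC_OFFSETS_HOURS) -> bool:
    """True if it is a weekday within working hours for at least one team.

    `now_utc` must be timezone-aware.
    """
    for offset_hours in offsets.values():
        local = now_utc.astimezone(timezone(timedelta(hours=offset_hours)))
        is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
        if is_weekday and WORK_START <= local.time() < WORK_END:
            return True
    return False


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("keep running" if should_be_running(now) else "safe to shut down")
```

Even this toy version already has to deal with the case where it is Sunday night in one office and Monday morning in another, which hints at why such "simple" cost-saving automation needs real engineering time.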

Sure, I can't tell exactly how many more people (if any) would need to be hired if we did not have Clouds. When 37signals left the Cloud, they did not have to expand the team at all. However, even assuming that we had to, if we take the amount of savings I mentioned earlier with some quick calculations, we could easily spend $20 million in salaries over 5 years. That is $4 million/year, a team of 20 highly paid engineers (in Europe), and still have more than $20 million in savings. My calculations are 10, 20 or 30% off? Great, we are still dealing with a saving of tens of millions over 5 years.
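The "10, 20 or 30% off" sensitivity check is easy to run explicitly. The sketch below takes the ~$50 million gross saving over 5 years and the $20 million salary budget from the text as given, and uses integer arithmetic to avoid rounding noise.

```python
GROSS_SAVING_5Y = 50_000_000    # gross 5-year saving estimated earlier
SALARY_BUDGET_5Y = 20_000_000   # extra salaries over the same 5 years
ENGINEERS = 20
YEARS = 5


def net_saving(error_percent: int = 0) -> int:
    """Net 5-year saving if the gross estimate is too optimistic
    by `error_percent` percent."""
    return GROSS_SAVING_5Y * (100 - error_percent) // 100 - SALARY_BUDGET_5Y


def yearly_cost_per_engineer() -> int:
    """Salary budget per engineer per year."""
    return SALARY_BUDGET_5Y // YEARS // ENGINEERS


if __name__ == "__main__":
    print(f"Budget per engineer: ${yearly_cost_per_engineer():,}/year")
    for off in (0, 10, 20, 30):
        print(f"estimate {off}% off -> net saving ${net_saving(off):,}")
```

Even with the estimate 30% too optimistic, the net saving after paying 20 extra engineers stays at $15 million over 5 years.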

Even if the difference was not so huge, there are two more considerations to make:

First, you would have invested into your own people. They would grow their skills and become able to solve more and more complex problems in the future.
Second, you would have paid people. Your money would go to actual salaries for employees, who pay taxes in their countries and spend that money there. You create high-quality jobs that help the economy and you would be satisfying at least a basic part of the social function that companies should perform. The alternative is instead to pay for shareholder profits for your Cloud provider.

In addition to the above, and to conclude, your Cloud costs are not going anywhere, ever. As long as you keep your product running, you are going to pay. There is no investment, there is no capitalization, it's all purely rent. Always and forever. This means that - to make a very random example - if the economy goes in the bin and interest rates increase, and you cannot get money for free anymore, you will find yourself with a huge bill to pay forever, and you will be forced to cut personnel, over and over again. Sounds familiar?

🪛 "Support"

How could I talk about Cloud providers without a special dishonorable mention of the support they offer? I need to be honest, not all providers are the same on this, and I think generally Amazon is a little better than Google, which is a little better than Microsoft. Yes, because you might think: I spend $10 million a year with this company, so if I need something I will get a Turing Award winner 24/7, with a response time measured in nanoseconds, ready to assist me. Right? Well, wrong!

Cloud support in general is essentially composed of 3 main parts. First there is the LLM layer, the obnoxious first contact that always assumes you have absolutely no idea what you are talking about and generally is just a glorified search bar for the documentation. Then there is the first line of support (allegedly human). If you get here, you already deserve some congratulations. However, this service is generally designed using years of your behavioral data-points, carefully collected to push your buttons in the most effective way. You might experience ignored questions, lack of understanding of your problem, solutions repeated even after you have proven they do not work, and much more. I would be happy to gather material for a Cloud support horror gallery. Finally, at some point, if you want the issue to get resolved, you will need to escalate it further, where actual humans (hopefully) will look into it, and you will painfully get some assistance, usually still through the intermediation of the first line, to keep the experience slow and frustrating anyway.

Just recently, I had the fortune to see two separate support cases with Microsoft. The first ran for more than 25 days and concerned a critical bug in Azure that randomly bricked some Kubernetes nodes. In these 25 days I have seen answers copy-pasted from the wrong support ticket, twice (the same answer!), I have seen never-ending "catch-ups" with other teams, support missing the call they scheduled themselves, and more. Each message rigorously ended with "Thank you for choosing Microsoft!", which I assume was added by a majority shareholder of whoever produces Xanax. Microsoft support in particular has a fetish for letting you know their team structure and which team they are involving, as if it made the slightest difference to you.

In the second support ticket, we discovered an undocumented Kubernetes role in AKS. There is some documentation, but it is not specific enough and mentions the inability to perform "sensitive actions", while the role we noticed could perform quite sensitive actions indeed and was not explicitly mentioned. So we asked about this JIT access with a single question along the lines of "Can you confirm that this role is the one used by AKS support, and that the support has to go through the JIT process?" (essentially we wanted to confirm that this is nothing malicious and that, even though the role exists, nobody can assume it unless explicitly granted permissions). This question had to be copy-pasted 3 times because it kept getting ignored, and finally got answered: "You don't need to worry about the JIT process because this is managed by Microsoft and it's secure." I shit you not.

My experience is of course very anecdotal, but if this is the support received by a company that spends literally millions a year, what does this tell us? Well, it tells us a very simple fact: they don't care. Remember when we discussed the vendor-lock? Well, they know about it too. They know you are not going anywhere. What is the first thing you say to a company when you see a support request not being handled properly? "I am considering alternatives". Well, if you do this with a Cloud provider you will just be served a minute of elevator music performed exclusively with wet farts. Even if you do go somewhere else, they are such big companies that they would rather save on humans and support across the board than retain you individually as a customer.

If you thought that outsourcing meant being able to fully delegate all problems to the Cloud provider, think again.

🪛 Multicloud - Are you joking?

I mentioned multicloud (or multi-cloud, or cluster-fuck) already a couple of times. Multi-cloud essentially means running your product on more than one Cloud provider, usually in a live fashion (i.e., not one being a stand-by to be activated if the other(s) go down).

I don't even know where to start describing how stupid and idiotic this idea is, but since it's gaining popularity in the industry, I have to start somewhere. Being multi-cloud reduces the vendor-lock effect to some extent, even though this really depends on many factors: you might simply get locked into multiple providers instead of one. Ironically, this model is supposed to be tolerant to Cloud provider failures. The problem is that in practice this will hardly be the case. Look, if you or someone you know built a multi-cloud platform which is perfectly symmetrical and really eliminated every single point of failure, hats off to you or them. What I imagine usually happens to mortals instead is one (or more) of the following:

We have N Clouds, but one is more special than the others. If this one goes down, some service that the other Clouds rely on too fails and disrupts services everywhere (hello, DNS).
We have everything in N Clouds, except for this crucial piece of infrastructure that provides access everywhere, or deploys code everywhere, or does X everywhere, which is running in a single Cloud. If this goes down, we have an incident.
The network plumbing that connects applications running across the many Clouds is so complicated, and ultimately relies on a concentration point, that if it fails it breaks down the networking between all of them.

There are probably more scenarios, but you get the idea.
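The first two scenarios can be made concrete with a toy model. This is a minimal sketch in Python, and the whole inventory (component names, which Cloud each one runs in, who depends on whom) is made up for illustration. The idea is simply to ask: if one Cloud disappears, what is still usable once dependencies are taken into account?

```python
# Hypothetical inventory: component -> set of Clouds it runs in.
PLACEMENT = {
    "web": {"aws", "azure"},
    "api": {"aws", "azure"},
    "dns": {"aws"},               # the "special" Cloud
    "deploy_pipeline": {"azure"},
}

# Hypothetical dependencies: component -> components it needs to work.
DEPENDS_ON = {
    "web": {"api", "dns"},
    "api": {"dns"},
    "dns": set(),
    "deploy_pipeline": set(),
}


def surviving(failed_cloud: str) -> set:
    """Components still usable when one Cloud is down, following dependencies."""
    # Start with everything that has at least one replica outside the failed Cloud.
    alive = {c for c, clouds in PLACEMENT.items() if clouds - {failed_cloud}}
    # Repeatedly drop components whose dependencies are no longer alive.
    changed = True
    while changed:
        changed = False
        for component in list(alive):
            if not DEPENDS_ON[component] <= alive:
                alive.discard(component)
                changed = True
    return alive


if __name__ == "__main__":
    for cloud in ("aws", "azure"):
        print(cloud, "down ->", sorted(surviving(cloud)))
```

In this made-up setup, "web" and "api" run in both Clouds and still go down together with the one Cloud that hosts DNS, while losing the other Cloud "only" costs you the ability to deploy.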

In addition to the dubious amount of resiliency gained, the system gets extremely expensive and incredibly complex. Fucking something up in one of the services in one of the Clouds is almost certain. Your personnel will almost certainly not be experts in all the quirks of all services for all Clouds, so the chance of misconfiguration is huge. But don't worry, you have almost merged 2 or more huge attack surfaces, so that anything that can go wrong in one Cloud can be a liability for all the others too. Great job!

In all seriousness, the risks introduced would alone outweigh the resiliency you gain from a multi-cloud design. Does this mean you shouldn't have redundancy? Well, no. It means you should have backups, it means you should prepare for the UniSuper scenario where the Cloud provider disappears overnight and, if you have high needs for uptime (maybe you are a Critical Infrastructure), you should have contingencies in place (like a hot site). Doubling the complexity and the attack surface is not the answer. Your contingency plan should be a low-complexity design that can be set in motion quickly, not a liability. In any case, your resilience requirements should be the determining factor, not a marketing pitch (we are resilient, we use 4 Clouds). If your organization runs something critical to civilization, then OK, you might want to throw money at this problem using more than one Cloud provider. If you have those resources you should probably simply use 10-20 sites scattered around the continent and be done with it, but anyway...

✊ Political Side

By the time I am writing this it is already Sunday; I ended up ranting more than I expected in the previous part. Now it's time to move to the even more controversial part! Here I want to put aside all the technical discussions, even though there are some relationships between the two (similarly to how I dropped hints of political arguments in some of the previous sections).

I want to focus now purely on political arguments, from the point of view of a citizen (possibly, European) and a tech worker.

✊ Deskilling

The first argument I want to make is that working with Clouds is a form of de-skilling. Yes, the Clouds are complicated, but ultimately the distance between them and the actual engineering details that make things work is much greater. This means that people who have only worked with Clouds don't have the skills to actually build a comparable solution without them. It's pretty much the difference between someone who chooses some options in the GUI to customize some OS detail, the people who write the code to do that, the people who build the libraries to implement those functions and those who write the actual OS code.

I guess this is a common feature of abstractions: the higher you go, the more dependent you are on those abstractions and the less able you are to create one yourself. I am sure that, compared to 60 years ago, the percentage of programmers who could write low-level code is decreasing, as more and more programmers work only with very high-level abstractions for their whole life. This is not inherently a bad thing. However, I think we should put it in context: tech workers are still relatively expensive. Sure, starting from 2-3 years ago there has been a continuous series of layoffs that are eroding the contractual power of many of us, but ultimately the skills are still expensive. Now imagine this: how much training and skill is needed to learn AWS (or GCP, Azure) services to spin up a basic infrastructure, and how much training and skill is needed to understand how all that works so that you can build it yourself? I hope you will agree that learning how to glue Cloud services together to get something working (once again, this doesn't mean you fully know everything your services do and that they are tuned perfectly, just that they kind-of-work) is much faster. In fact, Cloud companies offer dirt-cheap (or free, if you are a partner) certifications aimed at having you learn exactly that.

The idea is fairly simple: stop paying skilled engineers who spent years studying/working to have a full understanding of the systems you will be working with. Instead, hire cheaper engineers whose job is just to glue pieces together, and delegate all the complexity to us good Cloud providers. Maybe we are not at that point yet, but if your skills are not hard to acquire, your job is going to be devalued soon enough, because you won't have any contractual power. Cloud providers are trying to remove the need for advanced skills from the industry (they will be the only ones needing people able to build stuff) in the same way as the companies pouring money into LLMs right now hope to transform programming into an activity that simply requires being able to ask for what you want in a natural language. The flatter the learning curve is, the more compressed salaries can be, as the pool of people will be larger and larger. I am not saying that we should gatekeep the knowledge and remove access to training/education to keep the pool of skilled workers under control, but we should still recognize that companies act for their own benefit, and most companies would benefit if they could pay their Cloud engineers the way they pay all other people in their company (i.e., less and less), rather than having to offer bigger salaries.

✊ Centralization

Going hand-in-hand with the previous point, I hope it's clear to everyone that the Cloud model is fundamentally a centralization model. Leaving aside the fact that eventually the only people able to build stuff will be the ones working for the Cloud providers, the centralization happens at all levels. The more companies use Cloud providers, the harder it is for other companies selling/renting hardware to find customers, and the more expensive it gets for them to maintain their datacenters.

🤌 Isn't that the same thing that happens with other models?

No it's not, for a few reasons. First, Clouds build their marketing on the range of services they offer. It doesn't matter that the vast majority of companies will never use more than 1% of them: if you are picking a provider, you choose the one that offers the most. Also, big companies benefit from economies of scale and can build and stock datacenters more cheaply, in multiple locations, while traditional providers are generally less distributed and more local.

In general, the barrier to entry for the Cloud business is so high that even huge corporations like IBM and Oracle are extremely minor players in the industry. In fact, this whole argument could be made as a simple observation of what actually happened. We have had more than 10 years of Cloud already (the first AWS services launched around 2006), and 2/3 of the market share is firmly in the hands of 3 players, with AWS and Azure holding more than half of it.

If this trend continues, eventually if you want to run something you will need to pay rent to one of the Cloud providers (i.e., there won't be viable alternatives). Moreover, you will necessarily have to pick one of the 6-7 companies available, more likely one of 3. This also means that those companies are the ultimate owners of computational resources, with all the consequences that this entails. For example, they might decide that investing in AI is the way to go, and they will be the only ones able to provide services around that (because it requires immense computational resources for training). Incidentally, to hell with it if this means missing the environmental targets. They might also decide that your business/organization doesn't fit their guidelines, so you are out of luck (for example, organizations working for the democratization of the means of computation).

Essentially a handful of companies (mostly US or China-based) will have in their hands the resources that are essential for a world that is more and more based on digital services.

✊ Profits in the wrong pockets

In addition to a handful of companies having the monopoly on digital resources, we can also make considerations specifically on which companies these are.

We are talking about Amazon, Microsoft and Google (in order of market share). I mentioned before how, even if the cost of running in the Cloud and on a different model were the same, with the latter you would at least be paying salaries to people, creating quality jobs, paying local taxes and ultimately helping the local economy. Cloud services are the complete opposite:

Cloud companies dodge taxes as much as they can. In Europe this is done by using the Irish/Dutch/etc. entities as the ones making the profits, which then "license" intellectual property to the other national branches. It's a legal scheme, but we can still judge it for what it is. Even in the US, I read that Amazon is a master of tax avoidance, paying something like a 12% effective tax rate from 2010 to 2018, and things are not improving. Essentially these big companies do not have any intention to contribute to society, and their goal of maximizing shareholder value is pushed beyond the boundary of moral behavior.
Unless you are in the US, paying Cloud providers means siphoning money out of the country, hurting the local economy and weakening your country's digital independence. This would be true even if those companies paid taxes, since most of the money would leave the local real economy.
By working with Cloud providers, you are working on the deskilling of the local workforce, reducing the number of quality jobs and losing skills locally that could be applied to provide more competition in the digital space (and competition is good even according to the mainstream narrative, right?).

Even without all of the above, these companies are responsible (especially Google) for the continued degradation of the digital space, for pushing an ad-driven vision of the internet (with all the consequences that this entails), for privacy violations and for actively fighting privacy rights, for major environmental pollution and probably for hundreds more negative impacts on society. For you(r company) it might be just spinning up some infrastructure to do your job, but you are effectively supporting all of the above when you use one of these companies.

Conclusions

I am sure that many people will disagree with me, and many will want to correct imprecisions and more. My main consolation is that very few people read this blog and that I write mostly not to go insane, as my showers are not long enough to articulate all the above arguments.

Anyway, I want to wrap up this whole post by saying that I don't object to the idea of Cloud as a way to rent resources and services. I think there are use-cases that benefit from it, and I also think that there are companies offering such services in a reasonable way (this whole site is hosted on Hetzner Cloud!). My point is that working with Clouds is extremely boring; it definitely feels as if all the creative aspect of engineering is sucked out, transforming a super cool Lego set into a boring puzzle for a toddler. At the same time, the Cloud hides an incredible amount of complexity that can bite you in the back at any point and that is the result of a flexibility in the services that in most cases you don't need and that simply represents a cost and a liability. As an engineer, I find it borderline demeaning having to simply look at GUIs (or Terraform code) to make sure that some settings are "correct", and I found the training material for Cloud certifications so demotivating that I would rather go read the manual for my washing machine. In addition to all that, companies using Cloud services hoping to save on personnel are absolutely shortsighted, and now workers are paying the price for that by being laid off. And if that was not enough, all of this is done while supporting other terrible companies that are working hard to achieve a monopoly (or well, oligopoly) of computational resources. Seriously, fuck all this, I hate Clouds.

Appendix: Acronyms and Abbreviations

This table has some information on all acronyms and abbreviations used in the text, and can be used as a quick reference for those who might not be familiar with all the terms used.

Term	Expansion	Comment
AKS	Azure Kubernetes Service	Microsoft managed Kubernetes offering
API	Application Programming Interface	A layer through which computers and programs can communicate. In the case of a Cloud API, this is the interface that can be used to programmatically provision resources, make changes and generally manage the Cloud, as opposed to the console, where such management can be done via a human interface (the web application)
AWS	Amazon Web Services	Amazon Cloud
CIS	Center for Internet Security	CIS is an organization which, among other things, writes standards about tools and systems that can be used to ensure a minimum level of security
CNI	Container Network Interface	A Kubernetes network plugin that is responsible for managing the address space within the cluster
DNS	Domain Name System	The protocol that is used to translate mnemonic names (e.g., loudwhisper.me) into IP addresses
EBS	Elastic Block Store	Amazon storage service that can be used to provide block devices to systems
EKS	Elastic Kubernetes Service	Amazon managed Kubernetes offering
GCP	Google Cloud Platform	Google Cloud
GKE	Google Kubernetes Engine	Google managed Kubernetes offering
IAM	Identity and Access Management	The system or functionality that is used to manage users and their permissions within a certain scope. Cloud IAM is essentially the service that is used to manage access to the various Cloud services.
IOPS	Input/Output Operations per Second	IOPS refer to read/write operations that are performed in a second.
IP	Internet Protocol	Usually this is a short form for IP address, the common addresses that are used to communicate on the internet, such as 192.168.0.1
JIT	Just in Time	Usually refers to a procedure that happens when it's needed, not in advance. JIT access means granting access on-the-fly when the access is needed (and then revoking it)
(N)ACL	(Network) Access Control List	An ACL is a system in which some permissions are defined (who is allowed to do something). In the case of network ACLs, these can act as a firewall for the network.
OS	Operating System
S3	Simple Storage Service	Amazon object storage service
VPC	Virtual Private Cloud	A Cloud abstraction of a separate network

If you find an error, want to propose a correction, or you simply have any kind of comment and observation, feel free to reach out via email or via Mastodon.
