Configuring cloud infrastructure can be painful. Especially if you’re doing it by hand. Especially if you’re doing it frequently and at scale.
It can take a long time to set up the required resources in the cloud, even for a medium-sized project. And you need to be really careful. Otherwise, you risk setting everything up just to realize that you’ve defined an incorrect CIDR block for a virtual network or made a typo in a resource name. And now you have to start all over again because these settings cannot be changed after creation. Ouch.
There should be a better way than doing it manually, and this is where automation and code come into play. We can create scripts to set up our cloud infrastructure for us. This approach is called Infrastructure as Code, or IaC.
Benefits of IaC
Programming our infrastructure provisioning sounds complex. Why bother? For one, deploying your infrastructure automatically is much faster. This means that you can quickly make changes to your configuration or even recreate it completely when you need to. And you can use it to quickly provision multiple environments, e.g. development, test, and production.
But it doesn’t end there. When your infrastructure is code, you can treat it as code. This means checking it into version control, performing code reviews, running static code analysis, and setting up CI/CD pipelines. This opens up new capabilities in your infrastructure setup.
- Rollbacks. In case of a regression, we can quickly revert changes to the infrastructure back to the last working state.
- Repeatability. We can create new environments with the same configuration, without the risk of accidental misconfiguration.
- Preventing environment drift. Often, when you maintain multiple environments by hand, for example, development and production, or installations in different regions, you end up introducing subtle differences into one or the other, and they start to drift apart. This can lead to problems with development and testing since various environments can behave differently. When everything is provisioned from the same code base, this risk is mitigated.
- Code reviews. You can merge all changes to the infrastructure using pull requests, run them through your standard code review process, and bring in stakeholders from outside your team who might have a say in the infrastructure setup. It’s a great way to spot defects early and share knowledge.
- Static code analysis. In addition to human code reviews, you can run your code through various code analysis tools to scan for best practices, bugs, and security issues.
- Audit. When you check changes into version control, you gain full traceability for your infrastructure, which is crucial in regulated industries such as finance.
- On-demand environments. Since bringing up new infrastructure is faster with IaC, you can fully leverage the cloud’s consumption-based payment model by provisioning infrastructure when you need it and destroying it afterward.
- Security. This approach lets you tighten the security around your infrastructure by limiting the number of users with write permissions and making sure that all changes go through a well-defined, automated change management pipeline.
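To make the static code analysis point a bit more concrete, here is a minimal Python sketch of the kind of check that could run before anything is deployed. The resource dictionaries, the field names (`name`, `cidr`), and the naming rule are all assumptions made up for this illustration; real analysis tools work on their own formats and rule sets.

```python
import ipaddress
import re

# Hypothetical naming convention for this sketch: lowercase letters, digits, hyphens.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{2,62}$")


def lint_networks(networks):
    """Return a list of human-readable problems found in network definitions."""
    problems = []
    parsed = []
    for net in networks:
        name = net["name"]
        if not NAME_PATTERN.match(name):
            problems.append(f"{name}: name violates the naming convention")
        try:
            # strict=True rejects blocks with host bits set, e.g. 10.0.2.5/24
            block = ipaddress.ip_network(net["cidr"], strict=True)
        except ValueError as err:
            problems.append(f"{name}: invalid CIDR block ({err})")
            continue
        parsed.append((name, block))

    # Overlapping address ranges are easy to miss by eye.
    for i, (name_a, block_a) in enumerate(parsed):
        for name_b, block_b in parsed[i + 1:]:
            if block_a.overlaps(block_b):
                problems.append(f"{name_a} overlaps {name_b}")
    return problems


# Made-up definitions containing the kinds of mistakes described earlier.
networks = [
    {"name": "app-subnet", "cidr": "10.0.1.0/24"},
    {"name": "db-subnet", "cidr": "10.0.1.128/25"},  # overlaps app-subnet
    {"name": "Ops_Subnet", "cidr": "10.0.2.5/24"},   # bad name, host bits set
]

for problem in lint_networks(networks):
    print(problem)
```

Catching an invalid CIDR block or a misnamed resource at this stage costs seconds, compared to tearing down and recreating a network after the fact.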
Approaches to Automation
There are many ways to do IaC and many tools available, but you can roughly classify them using two criteria.
- A tool can be either specific to a particular cloud provider or cloud-agnostic. Each major cloud provider comes with its own set of tools to deploy infrastructure, but some third-party products offer a single solution that spans multiple cloud providers. While cloud-agnostic tools give you the obvious benefit of using a single tool for multiple clouds, they may lag behind the native tools feature-wise, since they are implemented by third parties.
- A tool can have either a declarative or an imperative API. Just as with programming languages, some tools implement the imperative paradigm, which is convenient for scripting, and others the declarative one, which lets you define the desired state and leaves the rest to the tool. Declarative solutions are often easier to use since you don’t have to worry about things like provisioning order, dependencies, error handling, and other usual scripting concerns. However, it can be tricky to add special logic to declarative solutions, for example creating a resource based on a condition or in a loop.
| |Cloud-specific|Cloud-agnostic|
|---|---|---|
|Declarative|Azure ARM Templates, AWS CloudFormation, Google Cloud Deployment Manager|Terraform, Pulumi|
|Imperative|Azure: CLI, PowerShell, SDK; AWS: CLI, CDK; Google Cloud: Console, Client Libraries|?|
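One of the concerns a declarative tool takes off your hands is provisioning order. As a rough sketch of the idea (not how any particular tool implements it), the following Python snippet derives a valid order from declared dependencies using the standard library’s `graphlib` module, available since Python 3.9. The resource names and the `depends_on` mapping are invented for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical resources, each mapped to the resources it depends on.
depends_on = {
    "resource_group": set(),
    "virtual_network": {"resource_group"},
    "subnet": {"virtual_network"},
    "database": {"resource_group", "subnet"},
    "web_app": {"subnet", "database"},
}


def provisioning_order(graph):
    """Return resources in an order where dependencies always come first."""
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = provisioning_order(depends_on)
print(order)
```

With an imperative script, you would have to encode this order by hand and revisit it every time a dependency changes.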
For most cases I’ve seen so far, declarative tools work best. When using the imperative approach, you always need to consider the order in which your resources must be provisioned, handle potential race conditions and eventual consistency, deal with errors when your script fails halfway, and so on. And once you get the initial setup working, you need to think about how to make changes to already existing infrastructure, since most of these tools have separate APIs for creating and updating resources. With declarative tools, it’s simpler. You just list the resources you need, and it’s up to the tool to make it happen. And when you want to make a change, you just update your definition, and the tool decides how to reconcile the actual and desired state.
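The reconciliation step can be pictured as a diff between two dictionaries. The sketch below is a deliberately simplified Python illustration, not the algorithm of Terraform or any other tool, and the resource names and properties are hypothetical.

```python
def plan(desired, actual):
    """Compare desired and actual state and return the actions needed."""
    actions = []
    for name, props in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != props:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# What is currently deployed (hypothetical).
actual = {
    "vnet": {"cidr": "10.0.0.0/16"},
    "vm-old": {"size": "small"},
    "db": {"tier": "basic"},
}

# What the definition says should exist.
desired = {
    "vnet": {"cidr": "10.0.0.0/16"},
    "db": {"tier": "standard"},
    "cache": {"size": "small"},
}

for action, name in plan(desired, actual):
    print(f"{action}: {name}")
```

Running it reports that `db` needs an update, `cache` must be created, and `vm-old` should be deleted, while the unchanged `vnet` is left alone. Note that the same definition covers both the initial setup and later changes, which is exactly where imperative scripts need separate create and update paths.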
The choice between a native and a cloud-agnostic tool can be more difficult. If you have a multi-cloud or polycloud strategy, it’s a no-brainer: go for a tool that will work everywhere. But if you focus on a single cloud provider, teams might be hesitant to add another third-party tool to their arsenal. In such a case, you’ll need to do your own feature comparison and see if it’s worth it to you. Personally, I’ve come to the conclusion that Terraform’s plan feature and great developer experience outweigh the cost of learning a new tool. But from time to time you will encounter a situation where a particular feature is not implemented, and you need to look for a workaround.
If you prefer to use your favorite programming language, then take a look at Pulumi. It’s a newer kid on the block but allows you to code in five different languages. And although Pulumi uses general-purpose languages for infrastructure definition, it still creates a declarative desired state model when you run it, similar to how an ORM generates an SQL statement.
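To illustrate the idea without relying on any real SDK, here is a toy Python sketch. The `Stack` class and its `add` method are hypothetical and are not Pulumi’s API. The point is only that ordinary loops and conditions run first, and what comes out is a plain declarative description of the desired state.

```python
import json


class Stack:
    """Toy stand-in for an IaC SDK: collects resource declarations."""

    def __init__(self, environment):
        self.environment = environment
        self.resources = {}

    def add(self, kind, name, **props):
        # Nothing is provisioned here; we only record the desired state.
        self.resources[f"{self.environment}-{name}"] = {"type": kind, **props}


def define_infrastructure(environment):
    stack = Stack(environment)

    # A loop: one subnet per tier.
    for index, tier in enumerate(["web", "app", "data"]):
        stack.add("subnet", f"{tier}-subnet", cidr=f"10.0.{index}.0/24")

    # A condition: only production gets a read replica.
    stack.add("database", "db", tier="standard")
    if environment == "prod":
        stack.add("database", "db-replica", tier="standard", replica_of="prod-db")

    return stack.resources


print(json.dumps(define_infrastructure("dev"), indent=2))
```

The conditional and the loop, which are awkward in purely declarative formats, are trivial here, yet the result is still just data that a tool could compare against the actual state.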
Look around your environment, see what is most valuable to you, and choose the tool that contributes to this value.
Doing IaC is quite the paradigm shift and comes with its own set of challenges. You’ll need to learn a new tool and adopt a new change management process for your infrastructure. Once you start your journey, there will be several tricky questions that you’ll need to answer at some point.
- How you initially adopt the process depends heavily on whether you are in a greenfield or a brownfield environment. In a greenfield environment, you don’t need to worry about supporting already deployed infrastructure; you can just start everything from scratch. If, however, you already have an application running in production, you need to think about how to bring it under code management while keeping it running the whole time. If you have the luxury of downtime, you can just recreate the complete environment anew using IaC, but in most cases, you’ll need to find an incremental approach.
- You’ll need to come up with a workflow to apply changes. Ideally, this should be done using automated CI/CD pipelines, not from a developer’s workstation, to rule out human error and differences between local setups. The pipeline should have roughly the same steps as a CI/CD pipeline for a regular application: running static code analysis, deploying to the development environment, and promoting to other environments, with testing and monitoring along the way.
- If you’re working in a larger company, chances are you are not the only person or team responsible for your cloud setup. If you need to coordinate your changes with someone responsible for the overall enterprise cloud architecture, networking, and security, you’ll need to get them on board with your processes. This can involve getting them to review pull requests and approve deployments.
- As with a regular application, you’ll need to properly structure your IaC into modules and maintain dependencies between them. Conway’s law can be a helpful guide here, since structuring your code similarly to your organization can make communication much easier.
- With great power comes great responsibility. Making a mistake in your IaC can lead to terrible consequences. Who knew that renaming a database can force it to be recreated?! Or what if your script fails halfway and leaves your production infrastructure in a state of limbo? Thinking about failure scenarios and mitigation tactics is important. And this is why safety features, like Terraform’s plan, are so valuable. Make sure you test your deployments thoroughly in the development environment and have a good understanding of the tool you use, so you can get yourself out of a jam quickly.
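One way to guard against surprises like an accidental database recreation is a quality gate in the pipeline that inspects the planned changes before applying them. The following Python sketch assumes a simplified, made-up plan format (a list of dictionaries with `resource`, `kind`, and `action` keys). Real tools produce their own plan output, which you would need to parse instead.

```python
# Actions that can destroy data if applied to a stateful resource.
DESTRUCTIVE_ACTIONS = {"delete", "replace"}
# Hypothetical set of resource kinds we consider stateful.
PROTECTED_KINDS = {"database", "storage"}


def gate(planned_changes):
    """Return the changes that need manual approval before deployment."""
    return [
        change
        for change in planned_changes
        if change["action"] in DESTRUCTIVE_ACTIONS
        and change["kind"] in PROTECTED_KINDS
    ]


# Hypothetical plan: renaming the database shows up as a replacement.
planned_changes = [
    {"resource": "web-app", "kind": "app", "action": "update"},
    {"resource": "orders-db", "kind": "database", "action": "replace"},
    {"resource": "temp-vm", "kind": "vm", "action": "delete"},
]

blocked = gate(planned_changes)
if blocked:
    for change in blocked:
        print(f"needs approval: {change['action']} {change['resource']}")
else:
    print("plan is safe to apply automatically")
```

In a pipeline, a non-empty result would typically stop the run and ask a human to confirm, while routine updates continue to flow through automatically.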
Taking the First Steps
To emphasize once again, IaC is a radically different approach to managing infrastructure, so start slow and safe. Look at the tools available on the market and identify the features that you need most, whether this is safety, complete resource coverage, support for your favorite language, or something else. With a tool in mind, find a smaller project you can experiment on and give it a go there. Don’t push it onto your business-critical application just yet. Once you start to feel comfortable with the tool, shift your focus to the deployment process and start working on an automated pipeline and quality gates. And don’t forget to share your progress with the rest of the organization to get them on board as well.