Terraforming AWS: Part I

Note: Be sure to clone the companion project to follow along!

When you hear the phrase "Cloud Computing", the first thing that likely comes to mind is Amazon Web Services. As one of the oldest and largest cloud providers, it would be difficult not to encounter them on your path to DevOps Nirvana. Aside from record reliability and one of the largest infrastructure footprints in the world, they provide a diverse service catalog serving as an erector set any engineer can use to build ready-made infrastructures.

Similarly, HashiCorp's Terraform has become a staple of the DevOps toolbox. With a declarative DSL-based frontend reminiscent of configuration management tooling such as Ansible or Salt, and a diverse provider (plugin) ecosystem supporting every major cloud provider, Terraform has gone from relative newcomer to ubiquitous automation standard within a few years.

Naturally, pairing AWS and Terraform is a winning combination. This multi-part series will walk you through common tasks you'll face when automating AWS with Terraform. We'll look at the major primitives (network, compute, storage), and provide tips and tricks to make your automation cleaner and more reusable.

If you've used Terraform before, jump right in for the tips and tricks. If you are just getting started, be sure to check out the official introduction, this online tutorial (with Linode examples, but concepts that apply to any provider), a commonly recommended book and perhaps even an online course.

AWS Network Concepts

As a starting point, we're going to bootstrap the network required to host a simple website. If you're used to physical networking, you might immediately jump to thinking about routers, switches and subnets... which is good, because we'll need at least one of those. With an IaaS, many of these infrastructure components have been abstracted away behind an API. However, there are also new concepts to grasp before we can take full advantage of cloud capabilities.

For AWS – other providers adopted similar, often identically named, concepts – the key things we need to know about when it comes to networking are regions, availability zones, virtual private clouds (VPCs), subnets, route tables and internet gateways (IGWs). Let's quickly define each of these, and then jump into code (or Terraform's HCL).

A region, as the name implies, can be thought of as a geographic region such as "US East" or "EU West". Within a given region, Amazon operates multiple availability zones (AZs), each backed by one or more physically separate data centers. There are usually three or more AZs per region, and three is a conveniently odd number for any consensus-based protocol. Amazon accepts reality (systems fail all the time, the Internet can break, maintenance has to be done), so it provides high availability by encouraging critical applications to be spread across availability zones within a region. For example, if you are hosting an important website within US East, you would minimally deploy instances in (and replicate data across) us-east-1a, us-east-1b and us-east-1c. This ensures that a failure within any single physical site does not cause downtime.

For anything truly critical, you would also want to select one or more sites in other regions such as us-east-2 or us-west-1. For a full list of available regions and availability zones, consult the AWS documentation.

AWS Regions and AZs facilitate HA

Within a given region, you can have one or more VPCs. A VPC is a virtual network construct similar to combining VLANs and an overlay protocol such as VXLAN which allows it to span physical network boundaries. Think CIDR range which can span all of the availability zones within a region. Within availability zones (data centers), you have subnets (smaller CIDR ranges) which are allocated from the VPC range. So you could have a VPC allocated 10.0.0.0/16, and subnets consisting of 10.0.0.0/24, 10.0.1.0/24, etc.

AWS Network Concepts

When allocating address space, overlap is not allowed within the VPC, and each subnet is contained within a single availability zone. If needed, you can associate additional CIDR blocks (secondary ranges) with the VPC to support growth, but you cannot change the size of an existing allocation. Read this detailed guide for more on AWS networking.

The last two concepts we'll work with here have to do with public and private subnets. In AWS, a "public" subnet is simply defined as one which can route to the Internet. We allow that by deploying an Internet gateway, a route table, and a route table association, which is just a way of tying a route table and a subnet together.

Note that you do not need routes for subnets within a VPC to communicate. Those "local" routes are automatically added by AWS. However, you will need to consider security groups, which act as firewalls for the resources they are attached to. By default, a new AWS security group denies all inbound (ingress) and allows all outbound (egress) communication. However, Terraform removes that default allow-all egress rule from the security groups it creates to enhance security. This means you will need to specifically allow ingress and egress traffic for Terraform-controlled resources.
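
To make the ingress and egress point concrete, here is a minimal sketch of a security group with explicit rules in both directions. The resource name web, the group name and the choice of port 80 are illustrative assumptions, not part of the companion project; the block also assumes the aws_vpc.vpc resource and the env_name variable defined further down in this article.

```hcl
# Hypothetical security group for a public-facing web tier.
# Assumes aws_vpc.vpc and var.env_name exist elsewhere in the project.
resource "aws_security_group" "web" {
  name   = "${var.env_name}-web-sg"
  vpc_id = aws_vpc.vpc.id

  # Explicitly allow inbound HTTP from anywhere.
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Re-create the allow-all egress rule that Terraform removes by default.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

A protocol of "-1" means all protocols, which is why both ports are set to 0 in the egress rule. In a real deployment you would likely narrow both rules to what the tier actually needs.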

In our project we will see each of these in more detail to provision a public and private subnet. In a typical multi-tier application, you would have a public-facing load balancer as an entry point for Internet traffic. In turn, it would distribute load across multiple backends housed on private subnets (caches, web servers) which consume data sources similarly insulated from the public. This is what we'll work toward:

Simple N-Tier Deployment

If you are completely new to Terraform, the last thing I'll say before jumping in is that many of the code samples in this series are condensed for easier reading. A Terraform project or module typically consists of multiple files (main.tf housing the code, variables.tf providing inputs or definitions used by the code and outputs.tf providing CLI output for reference or consumption as inputs by other code). This project is no different, and we'll come to see more of the structure as we go along. Feel free to browse the project repository for more context.

Plumbing the Network

Note: All examples are in v0.12 format.

Good news... With all those definitions out of the way, we're ready to start building the network portion of this simple web stack! Worth noting: any new AWS account comes with a default VPC in each region, which in turn contains a few starter subnets. You start with more than one subnet because AWS best practices tell us to distribute services across availability zones for HA, and subnets cannot span availability zones. Most tutorials will simply leverage the default VPC and subnets to build something useful in fewer steps. You can always do that, but I wanted to go a bit deeper here so you can understand all of the moving parts. In the real world, you will almost certainly want to provision VPCs and subnets for each of your projects.

We're going to pick some random RFC1918 ranges. When deploying a new project in your company, it makes sense to think about existing resources and pick ranges which don't overlap or otherwise collide and cause problems. Here we'll simply use:

  • VPC: 10.1.0.0/16
  • Public subnet: 10.1.1.0/24
  • Private subnet: 10.1.2.0/24

First, you'll need to pull the appropriate provider into main.tf:

provider "aws" {
  region = var.region
}
main.tf

var.region tells us the value comes from a variable called region defined in variables.tf, allowing us to easily deploy in different regions. We can accept the default, or provide overrides via environment or command line. Let's add env_name as well, so we can use it as a convenient tag or prefix for related resources:

variable "env_name" {
  description = "Short descriptive name to help identify resources we create"
  type        = string
}

variable "region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-2"
}
variables.tf
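
As a rough illustration of the override options mentioned above, one approach is a terraform.tfvars file, which Terraform loads automatically from the working directory. The values below are made up for the example:

```hcl
# terraform.tfvars (hypothetical values for illustration)
env_name = "demo"      # required, since env_name has no default
region   = "us-west-2" # overrides the us-east-2 default
```

The same values can also be supplied through environment variables named TF_VAR_env_name and TF_VAR_region, or on the command line with -var 'region=us-west-2'.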

Now we can define our VPC:

resource "aws_vpc" "vpc" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true

  tags = {
    "Name" = "${var.env_name}-vpc"
  }
}
main.tf
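
The VPC block reads var.vpc_cidr, which the condensed listings here do not show. A plausible definition, assuming the default is the VPC range chosen earlier, might look like this (check the companion project for the actual one):

```hcl
# variables.tf (assumed definition; not shown in the condensed listing)
variable "vpc_cidr" {
  description = "CIDR range allocated to the VPC"
  type        = string
  default     = "10.1.0.0/16"
}
```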

enable_dns_hostnames tells AWS to assign DNS names to resources we provision. This will come in handy for reaching the load balanced VIP later. In production, you would typically integrate with something like a Route53 Hosted Zone. Without this, you can still reach things you provision via IP address.

With that, we have a container spanning all availability zones within our specified region that is ready for subnets. Let's see how one of those might look:

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.vpc.id
  cidr_block              = "10.1.1.0/24"
  availability_zone       = var.availability_zone
  map_public_ip_on_launch = true

  tags = {
    "Name" = "${var.env_name}-public-subnet"
  }
}
main.tf

This shows some useful concepts. We use aws_subnet to define the subnet, give it a name and associate it with our VPC. Since this is going to be our public subnet, we let AWS assign publicly routable IP addresses to resources launched on this network.

While it might look configurable, we've also got some problems. We've allocated our entire public CIDR range to a single subnet, but we know subnets can only exist within one AZ. We also know we want to utilize all AZs within a region for HA. We could duplicate this block to create public (and private!) subnets within each AZ, but that would be hard to maintain. How can we stick to best practices while DRYing it up?

Avoiding Duplication

Lucky for us, Terraform provides a number of functions we can use to simplify common tasks. Let's extend main.tf with a data source and use several functions to achieve our goal:

# Look up the availability zones in the configured region.
data "aws_availability_zones" "all" {}

# One public subnet per AZ, each a smaller block carved from var.public_cidr.
resource "aws_subnet" "public_subnets" {
  count                   = length(data.aws_availability_zones.all.names)
  vpc_id                  = aws_vpc.vpc.id
  cidr_block              = cidrsubnet(var.public_cidr, 2, count.index)
  availability_zone       = element(data.aws_availability_zones.all.names, count.index)
  map_public_ip_on_launch = true

  tags = {
    "Name" = "${var.env_name}-public-subnet${count.index}"
  }
}

# One private subnet per AZ; no public IPs are mapped on launch.
resource "aws_subnet" "private_subnets" {
  count             = length(data.aws_availability_zones.all.names)
  vpc_id            = aws_vpc.vpc.id
  cidr_block        = cidrsubnet(var.private_cidr, 2, count.index)
  availability_zone = element(data.aws_availability_zones.all.names, count.index)

  tags = {
    "Name" = "${var.env_name}-private-subnet${count.index}"
  }
}
main.tf

That's a lot to take in, so let's walk through it...

First, we use a data source to query the availability zones within our region. Resources like aws_subnet allow us to create IaaS entities, while data sources allow us to invoke read-only queries against the provider's API. This lets us obtain a list of AZs within the desired region, so we can distribute subnets and other resources across them.
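
As an optional variation (not what the project code above does), the same data source accepts a state argument, so a lookup could be limited to zones currently reported as available. The names available and az_names below are illustrative:

```hcl
# Variation on the AZ lookup: only return zones marked "available".
data "aws_availability_zones" "available" {
  state = "available"
}

# Print the zone names after apply so you can see what the query returned.
output "az_names" {
  value = data.aws_availability_zones.available.names
}
```

Because data sources are read-only, adding a block like this creates nothing in AWS; it only changes which zone names are handed to the rest of the configuration.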

We've added the private subnet provisioning here, the main difference being that we do not include map_public_ip_on_launch. We'll talk about one other important difference below, but first the functions!

count is an often-used idiom in Terraform. Specifying a count causes Terraform to create multiple instances of a resource, similar to a loop. In this case, we use the number of returned availability zones (the length of the containing list) to provision one subnet per AZ. Since a subnet can only exist in a single AZ, and we have a single /24 to work with for each tier, we use cidrsubnet to carve out smaller blocks of addresses (adding 2 bits to the prefix yields up to four /26 blocks), which are mapped to individual AZs using element. While HCL is not as expressive as a full programming language, reading through the documented functions will certainly give you lots of ideas. The community is also full of creative ways you can reduce the amount of code you must maintain to accomplish common tasks.
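
To see what cidrsubnet produces, here is a small sketch that evaluates it for the three indexes a three-AZ region would generate. The variable definitions are assumptions based on the ranges chosen earlier, and the output name public_blocks is purely illustrative:

```hcl
# Assumed definitions for the two tier ranges (defaults from the list above).
variable "public_cidr" {
  type    = string
  default = "10.1.1.0/24"
}

variable "private_cidr" {
  type    = string
  default = "10.1.2.0/24"
}

# cidrsubnet(prefix, newbits, netnum): adding 2 bits to a /24 yields /26 blocks.
output "public_blocks" {
  value = [for i in [0, 1, 2] : cidrsubnet(var.public_cidr, 2, i)]
  # => ["10.1.1.0/26", "10.1.1.64/26", "10.1.1.128/26"]
}
```

Two new bits only allow four blocks per tier, so a region with more than four AZs would need a larger newbits value (and a larger starting range if you want to keep 64 addresses per block).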

We could further simplify this (reducing even more duplication) by encapsulating all our code and configuration associated with provisioning subnets within a module, then simply re-using the module as needed. That is a best practice, and we'll learn how to work with modules later!

Final Steps

So we've got six subnets (three public and three private, one pair per AZ in our default region) defined with only two blocks of code, and avoided any messiness like hard-coding AZs, which would make our automation brittle. We've also seen one difference between public and private subnets, where we avoided mapping public addresses to private resources.

The final step to make our public subnets useful is defining an Internet gateway and associating a route table that allows it to be used. Compared to our subnet definitions, that involves less magic:

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.vpc.id

  tags = {
    "Name" = "${var.env_name}-igw"
  }
}

resource "aws_route_table" "public_route" {
  vpc_id = aws_vpc.vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }

  tags = {
    "Name" = "${var.env_name}-public-route"
  }
}

resource "aws_route_table_association" "public_rta" {
  count          = length(data.aws_availability_zones.all.names)
  subnet_id      = element(aws_subnet.public_subnets[*].id, count.index)
  route_table_id = aws_route_table.public_route.id
}
main.tf

Assigning an IGW to our VPC gives us a router we can use to reach the Internet. Using techniques similar to those above, we iterate over our public subnets and associate a route table allowing public resources to leverage our IGW. Pretty neat, huh?
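
With the network resources in place, the outputs.tf file mentioned earlier becomes useful. The following is only a sketch of what it could contain; the output names are illustrative and may differ from the companion project:

```hcl
# outputs.tf (hypothetical; output names are illustrative)
output "vpc_id" {
  value = aws_vpc.vpc.id
}

# The splat expression collects the id of every subnet created via count.
output "public_subnet_ids" {
  value = aws_subnet.public_subnets[*].id
}

output "private_subnet_ids" {
  value = aws_subnet.private_subnets[*].id
}
```

Terraform prints these values after an apply, and other code (such as the compute layer) can consume them as inputs.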

Next Steps

We've covered a lot of ground, ramping up on AWS networking basics and fully plumbing a custom network infrastructure. We avoided duplication and created more flexible automation by leveraging native functions and a bit of creativity. We've laid the foundation needed for our simple web stack in a little over 50 lines of code, and adhered to best practices ensuring a secure and highly available deployment.

In the next part of this series we'll move on to managing compute resources. Rather than simply managing EC2 instances directly, we'll use higher-level concepts including target groups and launch configurations. We'll also take advantage of the latest Application Load Balancing functionality, and talk about why it's the successor to the now-legacy Elastic Load Balancing.

Thanks for reading... see you next time!
