Note: Be sure to clone the companion project to follow along!
In a past series we used Terraform to build a simple LAMP stack (actually an N-tier, Linux-based, MySQL-backed web stack atop AWS... there is technically no Apache or Perl/PHP/Python if we're being pedantic). We fixed some bugs along the way, and even extended the original design to provide more real-world networking for our private subnets.
In this article, I want to again (ab)use our simple N-tier project to explore DNS and TLS within the AWS ecosystem. Of course this is a DevOps-heavy blog, so we'll continue the automation obsession and implement what we learn using Terraform. Combined with some refactoring that's happened along the way, this will add polish needed to turn our suspiciously lab-smelling proof of concept into automation potentially useful in the real world!
Since we've already invested a lot of time plumbing the network, getting auto scaling and user data just right, figuring out how to spin up a database and tying all those pieces together... Here we can focus specifically on DNS and TLS. Each of those are often shrouded in mystery (perhaps because getting either one wrong can do very bad things to availability and security!) and could consume entire books in themselves. To avoid writing a book as a blog post, I'm going to focus on AWS-specific aspects. Where helpful I'll touch on generic concepts... but this won't be a DNS or TLS tutorial since good resources for those already exist (see this, this, this and this).
In a lot of my posts I present patterns that can be extended to meet your specific needs. Patterns are common across many industries, and often seen as a pairing of problem and solution. It's important to remember that patterns are more like tuples which include a third element: context! Absorbing patterns without awareness of their context can lead to confusion in the best case (I don't even want to think about the worst case 😫).
Before exploring the pieces of the AWS LEGO set we need to manage DNS, I want to highlight the context... We'll need DNS in at least two places: routing traffic to our web cluster with a friendly name, and supporting TLS certificate validation. I'll assume you have an existing (registered) domain you control, and that it is not serving mission-critical traffic (so experimenting with it is relatively safe). In reality you might need to register a new domain, cut over an existing domain serving production traffic, etc. Each of these will have unique considerations. Hopefully you can still extract useful concepts and automation snippets to help you on your journey, adjusting as needed to fit your context.
We've already touched on why we need DNS and our assumptions. Since we will be migrating an existing, unused domain, we follow the steps in this guide (at a high level). If you are moving a production domain, read this guide instead.
DNS on AWS Route53 leverages Hosted Zones. A Hosted Zone can be public or private. Public zones serve the Internet, while private zones are associated with VPCs (your private AWS networks). A Hosted Zone is simply a container holding DNS resources. While it looks similar, it's not technically the same as a DNS domain. Hosted Zones are an AWS vs. DNS concept.
For example, you can create a Hosted Zone without owning the associated domain. Nothing will happen until you delegate the DNS domain to the Delegation Set (a set of four AWS name servers associated with the Hosted Zone that will handle traffic for the DNS domain). You can also do interesting things with Hosted Zones, such as using Alias Records to map the zone apex directly to AWS resources. Pure DNS, by contrast, does not allow a CNAME at the apex. There is also a financial difference: Alias Records are free, CNAME queries are not!
This means getting a public domain working with AWS will require a number of steps:
- Create public Route53 Hosted Zone
- Obtain Delegation Set from AWS
- Update our registrar's DNS servers using the Delegation Set
- Create any records we need (using Alias Records where applicable)
TLS is the modern, fast, secure SSL. If you are hosting content on the Internet (or anywhere), you probably have it already; if not, you need it. With all the great features comes added complexity: serving TLS-protected content requires managing certificates.
You have many options... You can purchase certificates from third-party Certificate Authorities (e.g. DigiCert or Thawte). You may integrate your service directly with modern alternatives such as Let's Encrypt. Depending on your use case, you might choose to cut costs and self-sign certificates using a Private Certificate Authority. There are times when an internal CA makes sense, and AWS even has PCA support so you don't have to carry the burden alone. If you are managing your own PCA, be sure to account for the hidden costs from the additional effort of securing, signing, storing, revoking, etc. all of your certs.
Terraform can help in these cases, letting you inject certificate files directly into resources and configuration. Since you will have both private and public keys (certificates), you'll also need to think about secret vaulting (using something like HashiCorp's Vault) to keep private keys secure (obviously never committing them to version control).
Considering all the options is one reason TLS can seem overwhelming. Luckily, AWS can help simplify our life and save us money at the same time. We'll use AWS Certificate Manager (ACM) to manage our TLS certificates. This provides a self-service, performant, highly-secure (certainly better than we could roll ourselves) certificate service that is nicely integrated with key parts of the AWS ecosystem (read: our ALB). Unlike third-party CAs which exist to issue certs, AWS gives you certs for free since they make money on the other resources that get used with them (ALBs, EC2 instances, etc.).
To validate certificates (you wouldn't want just anyone to be able to issue certs for your domain), ACM can use a DNS or email-based workflow. DNS is highly preferred, since it can be fully automated and managed by Terraform.
Now that we know our goal and what pieces we need to accomplish it, we can start translating requirements into code. First, here's a simple picture of what we're tying together:
Looking at the boxes and solid lines, we point the DNS server (NS) entries for our domain to the Route53 Hosted Zone Delegation Set. This allows us to create resource records, which we can alias to AWS resources. Since we're going to start leveraging TLS, our ALB now listens and serves traffic on 443/tcp. Internally, we still talk to our EC2 instances using 80/tcp (while we could easily encrypt this traffic as well, offloading TLS overhead to load balancers is a common approach and simplifies our deployment). ACM integrates with our Hosted Zone, publishing special DNS records which allow any certs issued for our domain to be validated.
From a user's perspective, a request to our service will still require a DNS request to the registrar (hopefully cached) where our delegation will redirect to Route53's name servers for resolution. In our case we'll use a www record aliased to our ALB, which terminates the connection securely using the TLS certificate obtained from ACM.
Let's build it! Creating a Hosted Zone is easy enough, but leveraging a few related resources should improve our quality of life:
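Here's a minimal sketch of what those resources could look like (resource names like `main` and the `var.domain_name` variable are illustrative, not from the companion project):

```hcl
# Reusable Delegation Set: keeps our four AWS name servers stable
# even if the Hosted Zone is ever re-created.
resource "aws_route53_delegation_set" "main" {
  reference_name = "main"

  # Guard against accidental destruction -- losing these name
  # servers would break the delegation at our registrar.
  lifecycle {
    prevent_destroy = true
  }
}

# Public Hosted Zone attached to the Delegation Set above.
resource "aws_route53_zone" "main" {
  name              = var.domain_name
  delegation_set_id = aws_route53_delegation_set.main.id
}
```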
Aside from the `aws_route53_zone` itself (if you had an existing Hosted Zone, you could use the Data Source to get the Zone ID needed below), we also create an `aws_route53_delegation_set`. When you create a Hosted Zone, AWS will assign a random set of geographically diverse name servers (a DNS best practice). Since you must update your registrar with these values for Internet users to make it to your site (or for ACM to validate certs by conducting DNS lookups), you don't want future updates to potentially change these values and break the delegation. That's why we use the special `lifecycle` block to ensure Terraform does not destroy the Delegation Set.
Notice I said "should" improve our quality of life above... While this looks like an ideal representation of our desired state, it has at least one major problem. As you can read in this issue (please 👍 it!), `prevent_destroy` does not always work satisfactorily. Instead of simply not deleting the specified resource, future updates throw errors (preventing useful things such as cleaning up your project, unless you hand-edit the configuration and lose the desired benefit).
One workaround is using resource targeting, but that is a toilsome hack at best. A common solution is better segregation of managed resources. That is generally a good thing, but in this case results in another repo, Terraform module, state file, etc. that manages a single resource (might as well just use create-reusable-delegation-set!). Just be aware of this pitfall, and take heart knowing HashiCorp is working on a real solution. In the meantime, this might be a reason to think carefully about how much of your infrastructure you place under Terraform's control.
We also add an `output` so we can easily grab the Delegation Set needed to update our registrar (simply run `terraform output name_servers` after executing `terraform apply`). Since it varies greatly and often involves clicking through a UI, I won't show that step here – simply take the provided list of name servers and update your domain registrar's DNS servers accordingly.
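The output itself is a one-liner (assuming the illustrative `aws_route53_delegation_set.main` resource name):

```hcl
# Expose the Delegation Set's name servers so we can update the
# registrar: terraform output name_servers
output "name_servers" {
  value = aws_route53_delegation_set.main.name_servers
}
```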
As mentioned above, we're going to use a Route53 Alias Record to point to our ALB. This is very similar to creating any other Route53 resource with Terraform, but since Alias Records always use a 60-second TTL, we must omit `ttl`. We also replace `records` with an `alias` block (you must have one or the other) referencing the desired resource, and take care of the zone apex since no one has time to type "www" in 2020:
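A sketch of both records follows (the `aws_lb.web`, `aws_route53_zone.main` and `var.domain_name` names are illustrative assumptions):

```hcl
# Alias the zone apex directly to the ALB -- something plain DNS
# (CNAME) cannot do. Note: no ttl, and an alias block in place of
# records.
resource "aws_route53_record" "apex" {
  zone_id = aws_route53_zone.main.zone_id
  name    = var.domain_name
  type    = "A"

  alias {
    name                   = aws_lb.web.dns_name
    zone_id                = aws_lb.web.zone_id
    evaluate_target_health = true
  }
}

# Same idea for the www hostname.
resource "aws_route53_record" "www" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "www.${var.domain_name}"
  type    = "A"

  alias {
    name                   = aws_lb.web.dns_name
    zone_id                = aws_lb.web.zone_id
    evaluate_target_health = true
  }
}
```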
`evaluate_target_health` is another advantage of Alias Records, since it will only route traffic to targeted resources if they are actually in a healthy state. Failing fast and shedding load before it consumes additional resources is a best practice (one of many useful patterns described in Michael Nygard's Release It!).
Now for the trickiest part... Requesting the certificate from ACM. The actual request is simple, but since we also want to leverage DNS-based validation we need to chain several Terraform resources together. Thankfully, the Terraform docs are great at covering this. In the docs they assume you only have a single hostname to contend with, so here's a more real-world example:
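One way to chain those resources looks like this (a sketch assuming a 2.x AWS provider, where `domain_validation_options` is a list – provider 3.0 changed it to a set that favors `for_each` – and illustrative `var.alt_names` and `aws_route53_zone.main` names):

```hcl
locals {
  # All hostnames covered by the cert. count must be known at plan
  # time, so we derive it here instead of from the cert resource.
  fqdns = concat([var.domain_name], var.alt_names)
}

resource "aws_acm_certificate" "web" {
  domain_name               = var.domain_name
  subject_alternative_names = var.alt_names
  validation_method         = "DNS"
}

# One validation record per hostname -- no hard-coded indexes
# beyond count.index, so alt_names can grow freely.
resource "aws_route53_record" "validation" {
  count   = length(local.fqdns)
  zone_id = aws_route53_zone.main.zone_id
  name    = aws_acm_certificate.web.domain_validation_options[count.index].resource_record_name
  type    = aws_acm_certificate.web.domain_validation_options[count.index].resource_record_type
  ttl     = 60
  records = [aws_acm_certificate.web.domain_validation_options[count.index].resource_record_value]
}

# Blocks until ACM sees the records above and issues the cert.
resource "aws_acm_certificate_validation" "web" {
  certificate_arn         = aws_acm_certificate.web.arn
  validation_record_fqdns = aws_route53_record.validation[*].fqdn
}
```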
`domain_name` is your Common Name (CN) for the cert, and you can include any number of SANs via the `subject_alternative_names` list. You can probably think of ways to improve on this, but avoiding hard-coded indexes into `domain_validation_options` ensures we can easily extend the `alt_names` list in configuration to include any number of SANs. Wildcards are supported as well, though typically frowned upon by your local InfoSec representative. 🤨 Note the use of `local.fqdns` to obtain `count`. We can't simply use `domain_validation_options` for that, since it's only known at apply time.
Almost there... We've got a zone, DNS records for our users, and a TLS certificate that has been validated. So far, this does nothing! We need to adjust our ALB listener to use the shiny new certificate. As part of that, we will update our `port` (since the TLS standard is 443/tcp vs. the 80/tcp we had before) and most importantly `protocol` (we were leveraging the default HTTP, but now want HTTPS). ALBs only support HTTP or HTTPS. If you need TCP or other protocols, it's time to refactor using a Network Load Balancer (also required if you want to associate static or Elastic IP addresses).
Let's quickly update our listener:
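A sketch of the updated listener (the `aws_lb.web`, `aws_lb_target_group.web` and `aws_acm_certificate_validation.web` names are illustrative assumptions):

```hcl
# HTTPS listener terminating TLS at the ALB using the ACM cert.
# Referencing the validation resource ensures the cert is issued
# before the listener is created.
resource "aws_lb_listener" "web" {
  load_balancer_arn = aws_lb.web.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate_validation.web.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
```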
The one other thing to pay attention to here is `ssl_policy`, which is required when using HTTPS. These are predefined policies provided by AWS. While `2016-08` sounds a bit dated, it is the latest policy available at the time of writing. For a full list and details on what each policy contains, refer to the official documentation.
As a final note, there is a bit of a chicken-and-egg problem when building from scratch... You need to run apply, watch the output until the Delegation Set is created, run to the Route53 console, click into your Hosted Zone, grab the name servers, and update your registrar while the apply is running (likely while waiting 10-15 minutes for RDS to provision), then wait on the delegation (usually quick, but it could take hours) so DNS resolution works and ACM can actually perform the requisite DNS queries to validate certs.
If you are too slow or that fails, the site will never come up since the dependency graph looks something like ALB => listener => validated cert => DNS (no DNS, no cert, no listener!). For our simple project we're trying to keep all the pieces organized in a single repo to make it easier to reason about. In the real world you would typically solve this by breaking out modules (and state) to manage your networks, DNS, etc. separate from the application stack.
By now we are starting to see a theme... For common use cases (and many uncommon!), AWS has an answer. By leveraging AWS services we can greatly reduce maintenance and overhead while increasing security and making it easier to automate provisioning. Despite a potential pitfall, Terraform also shines again... In ~50 lines we've added substantial functionality, easily managing DNS and TLS (both often considered opaque and persnickety) in a repeatable, auditable way.
With that, I'd like to put our over-used web project aside for a while and spend some time in coming articles exploring AWS Lambda. Much like containers (and cloud, VMs, NAS, VLANs... many things before!), Lambda has gone from "suspiciously new" to "widely adopted". It provides a lightweight way to build cloud native applications, allows easy extension of many AWS services, supports a number of common languages (we'll focus on Node.js since it has such a great community), and despite all those advantages, it can actually save you a lot of money as well. Too good to be true? Stay tuned as we find out!