Do you remember what you were doing on February 28, 2017?
I know where I was: in a war room, frantically trying to figure out how to deal with an S3 outage that hit the us-east-1 region. At the time, I was at a company whose main service depended on files stored in S3 — in us-east-1, where all their infrastructure was located. They had a copy of the S3 bucket in a different region as a backup, but there was no easy way for their infrastructure to access it. Initially, we hoped the incident will blow over quickly, but the outage lasted almost four hours. That’s an entire month’s error budget if you want 99.9% availability.
The S3 outage was a serious event, bringing down many major sites and causing several hundred millions of dollars in damage globally. Mainstream news outlets reported on it. It was a wake-up call for the industry — in case of a major regional outage in the cloud, you want to be in a position where you can do something about it.
At Contentful, we store structured content and serve it via a fast and highly available delivery API. Our clients include large enterprises and Fortune 500 companies, who trust us enough to build their revenue generating sites on the platform we provide. To earn this trust, we offer contractual SLAs on the availability of our APIs.
As an infrastructure engineer working for Contentful, I’m responsible for operating the platform these API services run on. We wanted to make our APIs resilient to regional outages to increase availability, so our team set out to implement multi-region active-active file delivery on S3.
Why multi-region? Regions are the limits of the blast radiuses for big outages. Availability zones are also outage limits — one size smaller than regions. But cloud providers usually offer multi-AZ support for their services out of the box, so most setups are resilient to outages of one single AZ. The next blast limit is the region limit, so it’s sensible to have infrastructure in multiple regions. But to have infrastructure in multiple regions, and to get it to cooperate and cluster, takes work.
The next question is, why active-active? Why isn’t failover not enough?
Imagine you have a failover solution, and the primary region fails. You probably have some kind of health check that detects this failure, maybe after two bad signals, to make sure that it’s not picking up just noise or internet weather. Then it initiates the failover solution, some signals pass between some systems (for example an SNS message triggering a lambda) and finally a routing change. Even if fully automated, all of these need some time to propagate, and likely take longer than the approximately four minutes that you have if you’re aiming for 99.99% availability. Finally, when the incident is over, you’ll likely have to switch back manually. Failover systems are not blue/green deployment systems — the backends are not symmetrical and the switch is optimized and automated in one direction only.
Even with this, if you have a simple backend like S3, this might be good enough. But with more complex backends, such as services running on Kubernetes or on EC2 nodes in an autoscaling group, there are other difficulties.
Regional outages are not so common, so a failover only occurs very rarely. In the meantime, you have two different instances of your stack that you’re trying to maintain on the same architecture but with a different load and traffic. This creates room for errors, such as something being deployed only to the primary region and then going unnoticed because there’s no traffic on the failover. Or the failover infrastructure simply can’t scale up fast enough when traffic hits. To make sure the failover works, you need to run regular tests and drills — everyone’s favorite pastime.
Active-active has multiple advantages over failover: adjustments are faster, it works more reliably and it’s easier to maintain. You always have a signal on the health of both backends, and you don’t need to scale from 0 to 100 in a few minutes.
The part of the Contentful API that serves the images, files and other assets in the content is based on S3’s file service. To serve files from S3, we use a very standard setup: a CloudFront distribution with an S3 backend to serve as CDN. This is how the asset delivery infrastructure looked roughly a year ago:
A request goes to Route53 to resolve the URL of a CloudFront distribution in front of S3 in us-east-1. The bucket in S3 is replicated to us-west-2.
We didn’t want to overhaul the whole thing — we wanted to extend this to be multi-region active-active. One easy solution would be a round robin using a weighted DNS entry as a backend for CloudFront:
For this, we would place a weighted DNS entry in Route53 that resolves to each of the buckets with 50-50 weight; if Cloudfront uses this entry as a backend, the requests would end up in both buckets with 50-50% chance.
Unfortunately this doesn’t work due to a security limitation in S3 — as a file server, it will only serve requests that have the exact bucket name and region in the host header. And since neither the bucket name nor the region are the same for the original bucket and the replica (bucket names need to be globally unique), the same request cannot be valid for two buckets. With this solution, half the requests would fail. Not good.
But we know exactly what’s wrong — the request header. Once we know where to send the request, can’t we just fix the header? Is there a way we can change it on the fly?
There is a way to inject some logic into CloudFront and modify the request! This mechanism is called Lambda@Edge, a lambda function that is attached to the CloudFront distribution and can be used to modify the requests and responses in any way we want. In this case, it’s very simple: just rewrite the host header in the request.
All we need to do is get the information on where to send the request, which we can uncover by doing a DNS resolution in Route53 (for the DNS entry with the weighted backends). It returns an address (either of the two buckets), we update the host header in the request with that information and then send the request to the right backend. This way, one CloudFront distribution can have both S3 buckets as origin servers. The idea came from this article, where they used the same logic to modify requests to implement AB testing.
When does this lambda run exactly? CloudFront actually allows you to inject lambdas at four points on the path: once, as the request comes in; once, for uncached requests, just before the request is sent back to the origin server; and again at the same points for the response. To get this behavior, our lambda is injected at the origin request:
We tried it, implemented it, and built a small POC. But how can we tell which region was a request served from? We had no such information in the response (for example a custom header). To differentiate between responses from the two regions, we uploaded two different versions of the same image to the original bucket and the replica under the same name, and voilà — as we made repeated requests, we saw the two images alternating (with about 50-50%) in our browser.
In the end it was only 20 lines of code, hardly more than a custom CloudFront configuration. It felt like writing nginx config.
At this point, it was just a technical solution — not a production-ready, reliable and resilient component.
This is edge computing. It comes with all the risks of edge computing. For me, the biggest risk is that there’s nothing between the user and the service — there is no way to route traffic around it. If the service fails for any reason (more likely a programming error than, say, resource starvation), there is no way to divert traffic around the service. Of course, every system has an edge, but in our implementation, the API is directly on it. Also Lambda@Edge (and Lambda in general) was new to our stack. We didn’t know how it behaves or how it fails, and we had no automation around it.
All in all, there was a lot of risk around this solution, which gave us pause. Our goal was to increase availability, which is not quite achieved by introducing a system that increases the probability of causing an outage ourselves. We needed to manage this risk, and to reduce it to an acceptable level.
I’ll explore this fully in part two, where I’ll cover how we took our 20-line solution, managed these risks and turned it into a production-ready feature!
Reliability engineer at Contentful working with highly available systems serving mass customer traffic. Julia is especially interested in the concepts that allow high availability—everything from horizontal scalability to the process around incident response. She sees systems, people, and the tooling that allows interaction between them as one whole that should be managed as such.