Last month, I wrote about how we built a solution for multi-region active-active delivery on S3. Using Lambda@Edge, we were able to modify requests passing through our CloudFront CDN to route them to S3 buckets in two different AWS regions. Our solution was only 20 lines of code, hardly more than a custom CloudFront configuration. The real difficulty lay in everything around it.
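To give a feel for how small that core logic is, here is a minimal sketch of a Lambda@Edge origin-request handler that routes each request to one of two regional buckets. The bucket names, the hash-based routing rule, and the simplified event shape are all assumptions for illustration, not our production code:

```typescript
// Sketch: route each request deterministically to one of two regional
// S3 origins. Event shape is trimmed to the fields this function touches.
interface CloudFrontRequest {
  uri: string;
  origin: { s3: { domainName: string } };
  headers: Record<string, { key: string; value: string }[]>;
}

const ORIGINS = [
  "my-assets-eu-west-1.s3.eu-west-1.amazonaws.com", // assumed bucket name
  "my-assets-us-east-1.s3.us-east-1.amazonaws.com", // assumed bucket name
];

// Deterministically pick a bucket per URI so both regions serve traffic.
function pickOrigin(uri: string): string {
  let hash = 0;
  for (const ch of uri) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return ORIGINS[hash % ORIGINS.length];
}

function handler(request: CloudFrontRequest): CloudFrontRequest {
  const domain = pickOrigin(request.uri);
  request.origin.s3.domainName = domain;
  // The Host header must match the new origin for S3 to accept the request.
  request.headers["host"] = [{ key: "Host", value: domain }];
  return request;
}
```

Twenty-odd lines, and most of them are type declarations; the part that actually changes behavior is a handful of assignments.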
The problem: not only is this edge computing, which comes with risks of its own; Lambda@Edge was also totally new to our stack. Which brings us to the other side of the story: our journey of reducing risk and turning a technical solution into a reliable service.
We wanted to use multi-region active-active delivery to increase availability, but high availability isn’t achieved by introducing a system with a higher probability of causing an outage on our end.
A few things about the Lambda@Edge-based solution were obviously problematic. The solution itself wasn't wrong; it just came with challenges we needed to solve:
1. Bare-bones code deployment. AWS offers a web-based editor for the Lambda code with minimal syntax highlighting. When you hit the save button, the code is deployed. The whole thing feels like writing Yahoo Mail online around 1996, except it changes the configuration at the edge of your infrastructure, and if you make a mistake, your whole site is down for everyone. We needed to create deployment automation.
2. Lack of production experience. Before this, we didn't have production experience with Lambda. We had no experience with its error states or with our options for action.
3. Computing on the edge. One factor increased all these risks: there was no way to route traffic around the service in case of failure; there was nothing between the user and the service. If someone made a small programming error, the request would invariably fail, and since it came from the user, there was no retry we could implement. Every system has an edge, but in our case, the API was directly on it. A typo would cause a full outage. We needed to understand how to manage this characteristic.
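One pattern for containing that last risk is to make the function fail open: if the routing logic throws, forward the request unchanged rather than failing it for the user. This is a sketch of the idea, not necessarily the right default for every use case, since serving from the default origin may also be wrong:

```typescript
// Sketch: wrap routing logic so an unexpected error forwards the request
// unchanged (CloudFront then uses the default origin) instead of turning
// a bug into a user-facing failure. Types are simplified for illustration.
type Request = { uri: string };
type RouteFn = (req: Request) => Request;

function failOpen(route: RouteFn): RouteFn {
  return (req) => {
    try {
      return route(req);
    } catch {
      // In production you would also emit a metric or log line here,
      // so the fallback path is visible instead of silently absorbing bugs.
      return req; // fail open: serve from the default origin
    }
  };
}
```

Whether a silent pass-through is acceptable depends on what the default origin serves; the important part is that the failure mode is a deliberate choice, not an accident.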
This API is on the critical path. We have contractual SLAs, and there’s no point in raising availability a little while increasing the risk of a complete outage by a lot. But how could we do better? How could we map out what needs to be done to manage those risks, and how would we know when it’s good enough?
Since we had no better metric, we decided to make it our goal to raise confidence. To engineers this might sound… quite unscientific, right? To base our decisions on gut feelings, and feelings of confidence? This isn’t how engineering should work!
Here’s how I see it: we felt that something was off, and that is a valuable signal. It’s a familiar feeling: something seems amiss, and then you realize that what’s missing is the weight of the phone in your jeans pocket. Some of the things you do become muscle memory.
I feel like I'm forgetting something.
As the infrastructure engineers, we have years of experience operating these kinds of systems. We develop gut feelings about them. It may not be much, but it’s better than not having a signal at all. So we set out to turn our gut feelings into actionable tasks, and improve things little by little until we felt confident.
It proved to be a good starting point because it presented the following question: What would make us feel more confident? We reformulated it as a question that’s easier to answer: what gives us trust in our current software platform?
Logging, monitoring, CI, CD, and alerting
Tooling for development and deployment
Version control and tests for code and config
Understanding performance, dependencies, risk matrix
This step made it obvious that there was a bunch of tooling we could build; the question was now, what do we choose to build and how do we make sure we’re not overlooking something important?
I’m the kind of person who deals with the chaos of the universe by writing elaborate lists, so I suggested that we visualize success by writing down how things would work if they worked well — if they felt the same way our current production systems feel while we operate them. We created a production readiness criteria list.
Lists like these exist for services in a microservices-based architecture. There are books, talks, and articles on this. We didn’t find a comparable resource for infrastructure elements, probably because they are quite different from services in this regard.
First, we wrote our requirements on an abstract level. We imagined that someone gave us a new infrastructure component in a black box that will be added to our existing infrastructure. What does it need to make us feel confident and familiar in operating it, or even to be on call for emergencies? For example (in no particular order):
How is it monitored?
How is it deployed?
How should it scale?
How can I change its configuration?
How can we tweak it during an incident?
What does it do if one of its dependencies fails?
What happens if the component itself fails?
What kind of metrics do we want to have?
We made a list of all of these, and then grouped them by topic. The list ended up being some 70 items long, covering everything from code ownership to security criteria. Then we took this abstract list and translated the requirements to the current solution. We didn’t just evaluate Lambda@Edge in general but this specific case. This was important because risk is extremely dependent on context.
After a few more iterations, we had a specific and detailed list. This basically gave us a gap report between the current and the ideal state. But building wasn’t the next step.
We remembered that our production readiness criteria list was an ideal, a menu to pick and choose from. We needed to pick enough items to bring our confidence up to an acceptable level. So for each item in the list, we carefully considered risk, cost and benefit. We marked what we wanted to build, what was already done, and what we would not build.
For example, having logs was one abstract requirement. Most of our app logs are in Splunk. For this specific case, do we want the logs in Splunk, or is having them in CloudWatch enough for our needs?
We documented all these decisions, especially the ones we decided not to pursue. It was an exercise in risk management informed by our gut feelings. I think it’s actually fine to take informed risks and make calculated bets, but you need to be aware and explicit about them so that it’s easy to revisit the choices that didn’t work out, or if circumstances change.
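A decision log like this can also be kept as structured data, so that the "will not build" choices stay visible and easy to revisit. A hypothetical sketch; the topics, items, and statuses here are invented for illustration:

```typescript
// Hypothetical sketch: production readiness criteria with an explicit
// decision per item. The gap report is whatever is still marked "build".
type Status = "done" | "build" | "wont-build";

interface Criterion {
  topic: string;
  item: string;
  status: Status;
  rationale?: string; // especially valuable for "wont-build" decisions
}

const criteria: Criterion[] = [
  { topic: "observability", item: "Logs queryable in Splunk", status: "wont-build",
    rationale: "CloudWatch is enough for current needs" },
  { topic: "deployment", item: "Automated, versioned deploys", status: "build" },
  { topic: "ownership", item: "Code ownership documented", status: "done" },
];

// Everything still marked "build" is the remaining gap.
const gaps = criteria.filter((c) => c.status === "build");
```

The format matters less than the habit: every skipped item carries a rationale, so revisiting the bet later doesn't require archaeology.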
As we were doing all of this, we actually built enough tooling for Lambda@Edge that we turned it into a reliable software platform. One where you can run things with more complicated logic than just adjusting request headers.
We found an existing TypeScript library with the needed AWS CloudFront components. The specific events we wanted to handle were missing, so we ended up contributing to the library.
While working on our specialised production readiness criteria list, we realized that we were missing something that has a big impact on our confidence but isn’t a characteristic of the system: operational experience.
Operational experience is a collection of experiences:
Having the right alerts, dashboards, and runbooks
Having a sense of the possible error states
Knowing where to look when there’s a problem
Knowing what options you have for action
It’s about being familiar enough with the area to be creative. This experience is not something you get by waiting around until all this accumulates. You can take an active approach and run some simple drills on operations that you think you’ll need. Realizing we lacked enough operational experience, we started to actively work towards getting more before we moved critical components to the new platform. By the time we finished, we felt confident to safely release our code to production.
Today we heavily rely on Lambda@Edge as a software platform. We have mission-critical, customer-facing business logic running on it that’s being actively developed.
So what’s it like, life with Lambda@Edge? First of all, it scales like a dream: the Lambdas can take a 20x traffic spike without a hiccup. It’s also cost-effective: the functions finish in just a few milliseconds, well under AWS’s minimum billable unit of 50 milliseconds. And since the code runs in CloudFront, which is a global service, it’s immune to regional outages.
There are some downsides, of course. We learned that it’s harder to test code extensively before running it on Lambda@Edge, because you can’t emulate the full environment. You can run some unit tests in CI, but you need dedicated testing environments for integration tests.
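The unit-testable part needs no emulation at all: a hand-built event and plain assertions are enough. A sketch, using a trivial example handler and an event trimmed to the fields it touches:

```typescript
// Sketch: unit-testing edge logic locally with a hand-built event.
// The handler here is a toy example that strips a trailing slash.
interface EdgeRequest { uri: string }

function normalize(req: EdgeRequest): EdgeRequest {
  if (req.uri !== "/" && req.uri.endsWith("/")) {
    return { uri: req.uri.slice(0, -1) };
  }
  return req;
}

// A CI unit test needs no AWS environment, just an assertion:
function testNormalize(): void {
  const out = normalize({ uri: "/docs/" });
  if (out.uri !== "/docs") throw new Error(`expected /docs, got ${out.uri}`);
}
testNormalize();
```

Keeping the handler a pure function of its event is what makes this possible; everything that touches the real CloudFront environment stays in the integration-test tier.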
If you want to treat Lambda@Edge as a software platform, you need to deploy it as you deploy software. We still deploy Lambda as infrastructure configuration because, when we started out, the first 20 lines of code felt more like configuration. But we might change this soon. Infrastructure and software change at different paces, and they need different deployment pipelines with different characteristics.
Finally, when we needed to look into the logs to investigate a problem, we realized that they weren’t easy to find. Because CloudFront is a global service, CloudWatch collects logs in each region based on the origin of the request. This means that the logs for a given request will be in the AWS region closest to the sender and are a bit hard to find at times.
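A small helper can take some of the guesswork out of the hunt. Lambda@Edge replica functions write to a log group named after the `us-east-1` "home" function, one copy per region that served traffic; a sketch that enumerates the candidate (region, log group) pairs to query (the region list here is a made-up subset):

```typescript
// Sketch: Lambda@Edge logs land in the region that served the request,
// under a log group named after the us-east-1 home function. Enumerate
// the pairs to check, then query each region's CloudWatch Logs in turn.
const EDGE_REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]; // illustrative subset

function edgeLogGroups(functionName: string): { region: string; logGroup: string }[] {
  return EDGE_REGIONS.map((region) => ({
    region,
    logGroup: `/aws/lambda/us-east-1.${functionName}`,
  }));
}
```

Feeding these pairs to the CloudWatch Logs API, region by region, beats clicking through the console trying to guess where a given user's request landed.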
Starting from our routing problem, we turned Lambda@Edge into a reliable software platform. Creating the traffic manipulation logic was the smaller part of the task, but getting to a state where CloudFront with Lambda@Edge was a reliable platform in our stack required a systematic approach in mapping out what needed to be built to ensure operability. We used our engineering gut feelings to spot the need, and then production readiness criteria lists to build a gap analysis and manage the special risks that edge computing poses.
Using TypeScript and deployment automation removed a lot of the possibility for human error from the system, while purposefully building operational experience ensured we can leverage human knowledge and creativity when needed.
Julia is a reliability engineer at Contentful working with highly available systems serving mass customer traffic. She is especially interested in the concepts that enable high availability, everything from horizontal scalability to the process around incident response. She sees systems, people, and the tooling that allows interaction between them as one whole that should be managed as such.