When a potential customer evaluates Contentful, their DevOps team is always interested in the scaling and failover capabilities of our platform – this is simply part of the technical evaluation process. "What will happen to latencies in the case of a Super Bowl type of traffic explosion?" and "How do you ensure that I will be up on Black Friday?" are very common questions.
But a relatively unusual one of late was this: "How do you scale down your infrastructure without messing with your users' requests?" Since we’re doing something unconventional and interesting with Lambda and HAProxy to tackle this problem, we're happy to share our techniques with the DevOps community. (If you're not really into the whole DevOps thing, you might not like this post all that much.)
The tech stack
At Contentful, we run the systems that provide our API services on Amazon Web Services. The API traffic coming in from the internet is handled by AWS' Elastic Load Balancer (ELB), which is an easy-to-use, scalable traffic management solution — but it provides very few knobs to twist. So, once the requests are within our platform, we use the open-source HAProxy software to handle traffic routing and load-balancing.
The problem of lost requests
Our HAProxy is configured by Chef, so it's aware of the different application components internal to our platform, and also knows which servers they are currently running on. Last year, after streamlining our instance bootstrap process with cloud-init and custom Amazon Machine Images (AMIs) built with Packer, we switched to using Auto Scaling groups in Amazon EC2, which means application servers can appear and disappear depending on the load across the entire server group. With ELB this isn't a problem, since ELB and Auto Scaling are designed to work together: Auto Scaling honors the ELB's connection-draining settings and allows for a graceful exit of the server being terminated. But with HAProxy, we needed a way to remove an EC2 instance from HAProxy before it was terminated, so that we would not lose customer requests.
Rough sketch of the setup
Hacking the instance lifecycle
This effect could be achieved by interacting locally with every HAProxy process to tell it to disable a backend, but that would require extra commands to be run on every instance. Since our Chef setup already reads the tags of all our EC2 instances to build its map of what should be running where, we realized that we could do this by setting a tag, e.g. remove_from_lb, and adding a little extra Chef code to drop instances with this tag when we periodically build the HAProxy configuration file. By configuring an Auto Scaling Lifecycle Hook on the Auto Scaling group, we could make an instance wait when entering the EC2_INSTANCE_TERMINATING phase long enough for Chef to first remove it and reload HAProxy, thus ensuring no additional requests were sent to this server.
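The tag-based filtering step can be sketched roughly as follows. This is an illustration in JavaScript rather than the actual Chef code, and the instance shape (id, privateIp, tags), the port, and the function name renderBackendServers are assumptions made for the example. Only the tag name remove_from_lb comes from the text above.

```javascript
// Tag name taken from the text; everything else here is a hypothetical stand-in
// for what the Chef run would do when building the HAProxy configuration.
const REMOVE_TAG = 'remove_from_lb';

// Returns the HAProxy "server" lines for one backend,
// skipping every instance that carries the removal tag.
function renderBackendServers(instances, port) {
  return instances
    .filter((inst) => !(REMOVE_TAG in (inst.tags || {})))
    .map((inst) => `    server ${inst.id} ${inst.privateIp}:${port} check`);
}

// Made-up inventory: the second instance has been tagged for removal.
const instances = [
  { id: 'i-0aaa', privateIp: '10.0.1.10', tags: { role: 'api' } },
  { id: 'i-0bbb', privateIp: '10.0.1.11', tags: { role: 'api', remove_from_lb: 'true' } },
];

console.log(renderBackendServers(instances, 8080).join('\n'));
```

Because the tagged instance simply disappears from the generated file, the next HAProxy reload stops routing new requests to it without any command being run on the instance itself.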
To carry out the lifecycle state transition, Auto Scaling Lifecycle Hooks must be configured with a notification target to trigger the action required. This can be an SNS topic or an SQS queue. We therefore needed a service that could receive such a notification (a JSON message containing the instance ID and details about the lifecycle state change) and add the tag to that instance. This would provide us with a central location to run this task, but we would need to set up a new web service to listen for notifications and do the job for us, and frankly, we already have quite enough services to take care of.
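To make the shape of that notification concrete, here is a minimal sketch of pulling the instance ID out of a lifecycle message delivered through SNS. The field names follow the Auto Scaling lifecycle notification format; the hook name, group name, and IDs in the sample event are placeholders, and terminatingInstanceIds is a hypothetical helper.

```javascript
// Transition name used for the terminating lifecycle hook.
const TERMINATING = 'autoscaling:EC2_INSTANCE_TERMINATING';

// Given an SNS-triggered event, returns the IDs of instances entering termination.
// SNS delivers the lifecycle message as a JSON string in Records[n].Sns.Message.
function terminatingInstanceIds(snsEvent) {
  return (snsEvent.Records || [])
    .map((record) => JSON.parse(record.Sns.Message))
    .filter((msg) => msg.LifecycleTransition === TERMINATING)
    .map((msg) => msg.EC2InstanceId);
}

// Sample event with placeholder values, trimmed to the fields used here.
const sampleEvent = {
  Records: [
    {
      Sns: {
        Message: JSON.stringify({
          LifecycleHookName: 'drain-before-terminate', // hypothetical name
          AutoScalingGroupName: 'api-servers', // hypothetical name
          LifecycleTransition: TERMINATING,
          EC2InstanceId: 'i-0123456789abcdef0',
          LifecycleActionToken: '00000000-0000-0000-0000-000000000000',
        }),
      },
    },
  ],
};

console.log(terminatingInstanceIds(sampleEvent));
```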
Lambda, the lifesaver
Lambda ensures that no request is left behind
That is where AWS Lambda came in. Lambda is a zero-administration compute platform: it offers a great way to execute stateless code in response to events such as changes in S3 buckets or messages to SNS topics. You configure your Lambda function's triggers and operating parameters (such as memory, which also determines the CPU share, and timeout), upload your code via the AWS API or console, and Lambda will run your code every time it is triggered (you are billed for execution time).
We implemented the tag-on-terminating function in 20 lines of Node.js code, and so far this has worked without a hitch.
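The original function is not reproduced here, but a handler along these lines might look like the sketch below. It is an assumption-laden illustration: the tag value 'true', the makeHandler factory, and the injected createTags function are all hypothetical. In a real deployment createTags would wrap the EC2 CreateTags call from the AWS SDK; it is injected here so the logic can be run locally without credentials.

```javascript
// Sketch of a tag-on-terminating Lambda handler (not the authors' actual code).
// `createTags` is injected; it is expected to accept the request shape of the
// EC2 CreateTags API and return a promise.
function makeHandler(createTags) {
  return async function handler(event) {
    const tagged = [];
    for (const record of event.Records) {
      const msg = JSON.parse(record.Sns.Message);
      // Skip anything that is not a termination notice.
      if (msg.LifecycleTransition !== 'autoscaling:EC2_INSTANCE_TERMINATING') continue;
      await createTags({
        Resources: [msg.EC2InstanceId],
        Tags: [{ Key: 'remove_from_lb', Value: 'true' }], // value is an assumption
      });
      tagged.push(msg.EC2InstanceId);
    }
    return tagged;
  };
}

// Local dry run with a stub in place of the EC2 client.
const calls = [];
const handler = makeHandler(async (params) => { calls.push(params); });
const event = {
  Records: [{
    Sns: {
      Message: JSON.stringify({
        LifecycleTransition: 'autoscaling:EC2_INSTANCE_TERMINATING',
        EC2InstanceId: 'i-0123456789abcdef0',
      }),
    },
  }],
};

const done = handler(event).then((ids) => {
  console.log('tagged', ids);
  return ids;
});
```

Keeping the AWS call behind an injected function is only a convenience for testing; the handler itself stays stateless, which is what makes it a good fit for Lambda.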
Without Lambda we would have had to do this from one of our instances, adding more complexity and yet another service we must monitor to ensure correct operation. Lambda provides an easy-to-use, dependable, low-maintenance platform for acting on events, especially those generated inside the AWS environment itself (S3, SNS, SQS, DynamoDB and Kinesis, with more being added all the time). We have since used it to ship ELB and CloudFront logs to log aggregation services, invalidate CloudFront objects on changes in S3, and translate SNS notifications to Slack messages (in conjunction with Amazon API Gateway).