On a personal level, 2020 was a disastrous year for most of us. But as we take stock of 2020 in terms of technical breakthroughs, there is a lot to cheer about. We helped customers like GoodRx, Loblaw and the Covid Tracking Project weather the deluge of customer traffic; we saw technology partners like XTM, Uniform and Mux build amazing apps unlocking new capabilities; and we saw the use of cutting-edge technologies like GraphQL skyrocket from a couple of thousand to 100,000,000+ API calls a day.
This growth relies on Contentful infrastructure — the behind-the-scenes engineering that is core to how we support the growth of our customers. We’ve done a lot of work over the past year, and I’m excited to share a few of our most impactful initiatives.
Behind-the-scenes work ensures onstage performance
Despite this rapid growth, our team managed to scale Contentful's underlying infrastructure without any major hiccups. Contentful has several teams working on the cloud infrastructure and tooling powering our API-first platform. Each of these teams owns a specific charter defining its areas of responsibility and key performance indicators:
Predictable performance: ensure that our platform lives up to predefined availability levels and stays performant across the globe.
Fleet management: automate repetitive processes and ensure that the platform scales in lockstep with our company growth and customer usage.
PaaS: empower internal engineering teams by abstracting the complexities of cloud infrastructure powering our services.
When it comes to day-to-day work, the teams are free to define problems to be solved, identify the best solutions and pick the right technologies to implement them. We do a lot of research to understand the problems we are solving and typically build a number of prototypes before committing to a specific solution.
While most technical users think of Contentful as a collection of APIs for distributing content, behind the scenes our platform is a composite of dozens and dozens of individual microservices. Some of them authenticate users, others count consumed resources, and still others transform assets before delivering them to the visitor of a website powered by Contentful. To ensure the overall health of our platform, we spend a lot of time fine-tuning the performance of these services.
The most obvious thing to do here is to automate the scaling of the underlying compute resources required by internal microservices. This requires us to clearly separate these services, implement scaling mechanisms, define key metrics and threshold triggers, and implement proper tracking so we can monitor performance in real time and effectively debug any anomalies. The real art lies in delivering consistent performance across geographies, clusters and customers, and that is where our team spends much of its time and focus.
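To make the threshold-trigger idea concrete, here is a minimal sketch of a scaling decision based on a utilization metric. The policy shape, thresholds and function names are illustrative assumptions, not Contentful's actual implementation:

```typescript
// Hypothetical sketch of a threshold-triggered autoscaling decision;
// the thresholds and names are illustrative, not Contentful's.
interface ScalingPolicy {
  scaleUpAt: number;   // utilization fraction that triggers a scale-up
  scaleDownAt: number; // utilization fraction that triggers a scale-down
  minReplicas: number;
  maxReplicas: number;
}

function desiredReplicas(
  current: number,
  utilization: number,
  policy: ScalingPolicy
): number {
  if (utilization > policy.scaleUpAt) {
    return Math.min(current + 1, policy.maxReplicas); // add capacity, capped
  }
  if (utilization < policy.scaleDownAt) {
    return Math.max(current - 1, policy.minReplicas); // shed capacity, floored
  }
  return current; // within the healthy band: leave the fleet alone
}
```

Keeping a gap between the two thresholds prevents the system from flapping, i.e. scaling up and back down on every small fluctuation of the metric.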
We run a lot of database server instances, also known as shards, to store customer data. Individual shards vary in purpose, size and utilization rate, requiring our infrastructure team to do a fair amount of data migrations. For example, we might need to move a customer's spaces to a dedicated shard when they sign an enterprise contract, or swap long-running shards for instances upgraded to the latest version of the database software. Unfortunately, the process of copying data can take hours and calls for a temporary freeze on content changes. This is why SaaS businesses have to introduce regular maintenance windows.
Unhappy with this state of affairs, we experimented with alternative solutions and eventually found a way to perform data migrations without locking the original database. The secret we hit upon was to log all the write operations happening during the migration and replay them against the newly created instance once the content is moved. As the process nears its end, we pull another database jujitsu move by simultaneously applying incoming changes to both instances. Finally, we lock down the original instance and from then on write all updates to the new instance only. If a customer attempts to update their content during this swap, their changes are not lost but merely delayed by a few seconds, as our internal service automatically retries the writes until the new database instance becomes fully operational.
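The phases described above can be modeled in a few lines. This is a toy illustration of the technique, assuming a simple key-value shard; the types, phase names and routing function are hypothetical, not Contentful's internal code:

```typescript
// Toy model of the zero-downtime migration phases: log-and-replay,
// dual-write, then cut-over. Hypothetical, not Contentful's implementation.
type Write = { key: string; value: string };
type Phase = "copying" | "dual-write" | "cut-over";

class Shard {
  private data = new Map<string, string>();
  apply(w: Write) { this.data.set(w.key, w.value); }
  get(key: string) { return this.data.get(key); }
  snapshot(): Write[] { return [...this.data].map(([key, value]) => ({ key, value })); }
}

// Incoming writes are routed according to the current migration phase.
function routeWrite(phase: Phase, w: Write, oldShard: Shard, newShard: Shard, log: Write[]) {
  if (phase === "copying") {
    oldShard.apply(w); // old shard stays authoritative during the bulk copy
    log.push(w);       // remember the write so it can be replayed later
  } else if (phase === "dual-write") {
    oldShard.apply(w); // near the end, both instances receive every write
    newShard.apply(w);
  } else {
    newShard.apply(w); // old instance is locked; the new one takes over
  }
}

// Bulk-copy a point-in-time snapshot, then replay the writes that
// were logged while the copy was running.
function catchUp(snapshot: Write[], newShard: Shard, log: Write[]) {
  for (const w of snapshot) newShard.apply(w);
  for (const w of log.splice(0)) newShard.apply(w);
}
```

The key property is that no phase ever blocks incoming writes: they are either logged for later replay or applied to both instances at once.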
While cloud and database vendors provide useful methods for dealing with such problems (e.g. logical replication, the two-phase commit protocol, resharding), our business model required the infrastructure team to develop a custom solution to this universal problem. While in retrospect it sounds easy, it was a tough problem to crack. But having done so, we can now move customer data across shards without scheduling a downtime window. Hence the name of the feature: Zero-Downtime Resharding (ZDR).
Some projects we work on come from unexpected observations; Bob Ross would call these happy accidents. One such project came from our engineers looking at the load distribution across our database shards. In theory, all clusters host an equal number of customer spaces, but in practice the load on shards is anything but uniformly distributed. Since all our customers run unique queries, reach different audiences, and build their projects using different architectures — some database instances might see a quarter of expected load, while others are quickly approaching 80% of available capacity.
Typically, SaaS companies ignore this variance since traffic spikes eventually subside. Plus, moving customer data involves maintenance windows, which often require advance notice and careful coordination with the customer.
Luckily for us, we had implemented ZDR, allowing us to proactively rebalance the load on our database instances. So when it comes to managing customers on the shared infrastructure, rather than assigning customer spaces to shards chronologically, we look at the actual load they generate. This allows us to mix up active spaces with low-activity ones to achieve an optimal utilization of shared database shards. The best thing? The mix-and-match process can run in the background, as our monitoring algorithms react to real-time activity within the shards and automatically make necessary adjustments.
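A minimal sketch of load-aware placement, assuming per-shard load is measured in arbitrary units (say, requests per second); the greedy best-fit strategy here is a stand-in for whatever heuristic is actually used, not Contentful's algorithm:

```typescript
// Hypothetical load-aware placement: each space lands on the currently
// least-loaded shard, naturally mixing busy spaces with quiet ones.
interface ShardLoad {
  id: string;
  load: number; // measured load, arbitrary units
}

function placeSpaces(shards: ShardLoad[], spaceLoads: number[]): void {
  const heaviestFirst = [...spaceLoads].sort((a, b) => b - a);
  for (const load of heaviestFirst) {
    // greedy best-fit: pick the shard with the lowest current load
    const target = shards.reduce((best, s) => (s.load < best.load ? s : best));
    target.load += load;
  }
}
```

Contrast this with chronological assignment, where a run of newly signed high-traffic customers can all land on the same shard and push it toward capacity.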
We like to think of our customers in two categories: the 99% and the 1%. The 99% are mainstream customers who hardly ever notice any variation in their performance. The 1%-ers are different. Sometimes they sell game titles that set the internet's imagination on fire. Other times they try to push the limits of the possible by combining best-of-breed services into a unified technology stack. And sometimes they just forgot to drink their morning coffee and query our GraphQL API with a high-resolution timestamp, effectively disabling our delivery cache.
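To show why a high-resolution timestamp defeats the cache, here is a hypothetical illustration; the schema, field names and function are made up for the example:

```typescript
// Hypothetical illustration of the cache-busting pattern: embedding a
// millisecond timestamp makes every request body unique, so no two
// requests ever share a cache key. The schema here is invented.
function buildQuery(now: string): string {
  return JSON.stringify({
    query: `{ articleCollection(where: { publishedAt_lt: "${now}" }) { total } }`,
  });
}
```

Two requests issued a millisecond apart produce different bodies, and therefore different cache keys: a guaranteed cache miss on every call. Rounding the timestamp to, say, the nearest minute restores cacheability with a bounded staleness.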
To help internal teams understand the root cause of slow queries, we employ OpenTelemetry solutions, which allow us to sample customer queries and trace each request across all internal microservices. Our internal deployment of open-source tools like Jaeger enables teams to replay suspect customer queries blow-by-blow and establish with a high level of confidence which services and operational factors led to unexpected slowdowns in performance. A lot of the time, we use these insights to refactor internal services or advise the customer on ways to optimize slow-running queries.
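The core idea behind that analysis can be sketched with a toy model of distributed tracing: every service that handles a request records a span carrying the same trace ID, and the slowest span usually points at the culprit. This is a conceptual illustration only, not the actual OpenTelemetry API:

```typescript
// Toy model of distributed tracing: spans sharing a trace ID.
// Conceptual only; not the OpenTelemetry API or Contentful's tooling.
interface Span {
  traceId: string;
  service: string;
  durationMs: number;
}

class Trace {
  readonly spans: Span[] = [];
  constructor(readonly traceId: string) {}

  record(service: string, durationMs: number): void {
    this.spans.push({ traceId: this.traceId, service, durationMs });
  }

  // The span with the longest duration usually points at the service
  // responsible for an unexpectedly slow request.
  slowest(): Span {
    return this.spans.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
  }
}
```

In a real deployment the trace ID is propagated in request headers, so spans recorded by independent services can be stitched back together afterwards.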
Creative infrastructure work continues
I hope these projects piqued your interest and helped you understand a bit more about the day-to-day responsibilities of the infrastructure team. The ability to think creatively about the use of resources in a platform like Contentful and implement automation to ensure its consistent performance is what keeps a lot of us excited.
If you want to dive deeper, I urge you to check out the presentation my colleagues Julia Biro and Yann Hamon gave at the AWS Summit Berlin 2019. And if you are thinking about your next career move, we have plenty of open roles and would love for you to join us and help tackle these challenges.