The journey to make environment cloning faster

The engineering team embarked on an ambitious project to enhance environment cloning for Contentful customers, with terrific results. This is how they did it.
Published: February 28, 2024
Category: Insights

In 2017, Contentful released what was to become one of the foundational features of the product: environments. With environments, customers could create isolated copies of their main content repository with just an API call or the click of a button. This enabled new use cases that were not possible before and was a great success from day one. With environments, and then environment aliases, customers could now apply good practices like CI/CD to their content repositories.

In 2021, we started a journey to further enhance environment cloning and make it scale for the next leg of Contentful’s journey. In 2023, we gradually rolled out the enhancements to most of our customer base (if you are a new customer you also got this improvement) and the results couldn’t have been better. This post is the first part of a deep dive into how we did it and what we learned along the way. Part two is here.

Setting the scene

Conceptually, cloning an environment is a simple task. Take all the records in the source environment and copy them to the target environment. There’s some additional bookkeeping that has to be done but nothing particularly exciting or complex. This copying can be described with the following pseudocode:
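
The sketch below is written as a pl/pgSQL loop over an assumed, simplified entries table; the layout and column names are illustrative, not Contentful's actual schema.

DO $$
DECLARE
  src record;
BEGIN
  FOR src IN
    SELECT * FROM entries WHERE environment_id = 'source-env'
  LOOP
    -- Every record gets a brand-new physical row in the target environment.
    INSERT INTO entries (id, entry_id, environment_id, created_at, updated_at)
    VALUES (gen_random_uuid(), src.entry_id, 'target-env', src.created_at, src.updated_at);
  END LOOP;
END $$;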

There would be other very similar “for” loops like this one for the other entities that are copied as part of environment cloning. We are leaving them out to keep things simple.

This process worked well enough for the first four years, but its performance wasn't keeping pace with the growth of both the customer base and the size of their content repositories. Over time, as customers adopted environment cloning in their CI/CD workflows, some of them started reporting that cloning times were highly variable or simply took longer than expected. Environment cloning also started paging our on-call engineers.

The beginnings and the challenges

Contentful uses Postgres to store all the content from its customers. The JSON payloads that customers send to the APIs are decomposed into rows that are kept in different tables. At query time, we reassemble these entities from the same rows. This means that the above pseudocode has to be expanded into the following:
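
Again as a sketch over the same assumed, simplified schema: every entries row is copied, and every entry_fields row holding its decomposed values is copied and re-pointed at the freshly created entries row.

DO $$
DECLARE
  src       record;
  fld       record;
  new_entry uuid;
BEGIN
  FOR src IN
    SELECT * FROM entries WHERE environment_id = 'source-env'
  LOOP
    new_entry := gen_random_uuid();

    INSERT INTO entries (id, entry_id, environment_id, created_at, updated_at)
    VALUES (new_entry, src.entry_id, 'target-env', src.created_at, src.updated_at);

    -- Copy the rows that hold the decomposed field values and point them at the new entry row.
    FOR fld IN
      SELECT * FROM entry_fields WHERE entry_row_id = src.id AND environment_id = 'source-env'
    LOOP
      INSERT INTO entry_fields (id, entry_row_id, environment_id, name, value)
      VALUES (gen_random_uuid(), new_entry, 'target-env', fld.name, fld.value);
    END LOOP;
  END LOOP;
END $$;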

This procedure was simple to implement and easy to explain. This simplicity served us well in the beginning but eventually created problems.

The copy-of-master environment is a 1:1 replica of the master environment.

Cloning an environment by making a physical copy of each row in the source environment causes lots of I/O on the database. The SQL statements that read from and write to the different tables consume I/O, and the database also implicitly consumes I/O to keep all the necessary indexes consistent. Depending on the machine the database runs on and on how long this elevated I/O consumption lasts, queries may end up fighting for bandwidth or hitting other limits that degrade performance.

The drawbacks of this approach don't end there. On several occasions, after a batch of environment clones had finished, the query planner would start making incorrect choices because the sudden influx of rows from the clones had made the statistics it uses to pick query execution plans stale. Copying every row also had the obvious effect of growing tables and indexes, diminishing the efficiency of the Postgres shared buffers.

In a drive for continuous improvement, we sought a solution that would keep our customers happy while also alleviating pressure on our engineering teams and the on-call rotation. We needed a way forward to improve environment cloning.

Eureka!

We could have gone many different ways in search of a solution to how resource-intensive copying an environment was. But what really framed our thinking was this: our data showed that, most of the time, only a small fraction (< 10%) of the content in an environment copy ever changes. This meant that we were copying millions of records that would never change. Ever.

This was our Eureka! moment because we could exploit this behavior to drastically improve environment cloning.

Copy-on-write

One of the driving goals when Contentful built environments was to provide isolation so that changes from one environment wouldn’t be visible in its copies. Let’s explain this with one example.

Initially, you have only the master environment. Let's say you make a clone to create the ugly-christmas-sweaters environment. Right after creation, they are exactly the same: an entry with a given ID is identical in both environments. Soon after, though, they start to diverge as users modify content in either environment. Because the environments are isolated from each other, changes to master are not visible in ugly-christmas-sweaters, and vice versa.

As explained in the previous sections, this isolation was achieved by keeping separate physical copies of the records in both environments. Changes to records in the master environment would touch different rows than changes made to records in the ugly-christmas-sweaters environment. For records that never change and are only read in both environments, we were paying the price of isolation even though we didn't need it!

And this is where our Eureka! moment becomes relevant. Isolating environments by keeping physical copies of their records, when on average less than 10% of those records ever change, was a waste of resources. On the flip side, if 90% of the records in a cloned environment remain unchanged, we could share the same record across environments and still guarantee their isolation in an efficient way. Enter copy-on-write.

Generally speaking, copy-on-write means that a resource is shared and it’s only copied when one of those sharing the resource wants to write to it. In our context, the resources are the environment records shared across different environments.

Copy-on-write presented us with a solution that would kill multiple birds with one stone. It would make cloning faster, because we wouldn't have to copy all the records up front; instead, we would delay copying to the point when a shared record is changed. It would drastically reduce the I/O required to clone an environment, which would also eliminate the on-call noise. And finally, it would eliminate the main source of bloat in our databases.

How are records shared?

The first thing we need if we want to implement copy-on-write is a mechanism to share a record across multiple environments. As with many other solutions in software engineering, what we need is one more level of abstraction or indirection.

Because, with copy-on-write, one record could belong to many environments, we introduced new tables that would map the “belongs to” relationship between records and environments. This is the new indirection that would support copy-on-write. In a display of originality, we called these the “environment mapping tables” (or mapping tables for short).

Two environments sharing content records via the mapping tables. Notice that the environment_id column has been removed from the entries table.

With the mapping tables, we can now have multiple environments sharing content. For each record in an environment, there is a row in the mapping tables with a pointer to the rows that hold the content (i.e., the Symbol, Number, Text, etc. values). Right after an environment has been copied, all of its pointers point to the same content records as the environment it was copied from.
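
As a rough illustration of this indirection, the new layout can be sketched as follows. The environments_entries table name and the environment_id column appear in a query later in this post; the remaining table and column names are assumptions.

-- Content rows no longer carry environment_id, so they can be shared.
CREATE TABLE entries (
    id         uuid PRIMARY KEY,   -- physical row, potentially shared by many environments
    created_at timestamptz NOT NULL,
    updated_at timestamptz NOT NULL
);

-- One row per (environment, entry): "entry X belongs to environment Y",
-- plus a pointer to the shared content row.
CREATE TABLE environments_entries (
    environment_id text NOT NULL,
    entry_id       text NOT NULL,   -- customer-facing entry ID
    entry_row_id   uuid NOT NULL REFERENCES entries (id),
    PRIMARY KEY (environment_id, entry_id)
);

-- Cloning an environment now only copies mapping rows; the content rows stay shared.
INSERT INTO environments_entries (environment_id, entry_id, entry_row_id)
SELECT 'copy-of-master', entry_id, entry_row_id
FROM environments_entries
WHERE environment_id = 'master';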

When are records copied in copy-on-write?

Prior to copy-on-write, the process of updating a record could be described with this pseudocode:
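
Sketched against the same assumed pre-copy-on-write schema (the real logic lived in application code), the flow was essentially retrieve → update → write:

DO $$
DECLARE
  rec entries%ROWTYPE;
BEGIN
  -- Retrieve the record by its ID within the environment …
  SELECT * INTO rec
  FROM entries
  WHERE entry_id = 'entry-3' AND environment_id = 'master';

  IF FOUND THEN
    -- … then modify it in place and save it. Each environment owns its physical
    -- rows, so the change cannot leak into any other environment.
    UPDATE entries SET updated_at = now() WHERE id = rec.id;
  END IF;
END $$;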

In our application code, we looked up a record with a given ID and if that record existed, we modified it in place and then saved it. Because every environment had an isolated copy of its content, there was no risk of these rewrites leaking to other environments.

Let's look now at some pseudocode of how records are updated with copy-on-write in place:
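
The sketch below uses the assumed mapping-table schema from the earlier snippet; it is illustrative, not our production code.

DO $$
DECLARE
  row_id  uuid;
  refs    integer;
  new_row uuid;
BEGIN
  -- Which physical row backs entry-3 in copy-of-master? Lock its mapping row while we work.
  SELECT entry_row_id INTO row_id
  FROM environments_entries
  WHERE environment_id = 'copy-of-master' AND entry_id = 'entry-3'
  FOR UPDATE;

  -- How many environments still share that physical row?
  SELECT count(*) INTO refs
  FROM environments_entries
  WHERE entry_row_id = row_id;

  IF refs > 1 THEN
    -- Shared: copy the content row first (field rows would be copied the same way,
    -- omitted for brevity), then re-point only this environment's mapping row.
    new_row := gen_random_uuid();

    INSERT INTO entries (id, created_at, updated_at)
    SELECT new_row, created_at, updated_at FROM entries WHERE id = row_id;

    UPDATE environments_entries
    SET    entry_row_id = new_row
    WHERE  environment_id = 'copy-of-master' AND entry_id = 'entry-3';

    row_id := new_row;
  END IF;

  -- The row is now exclusive to this environment and can be modified in place.
  UPDATE entries SET updated_at = now() WHERE id = row_id;
END $$;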

Making changes to records is no longer a straightforward retrieve → update → write process. The application code has to check how many environments have a reference to the record. When the record is present in multiple environments, the application has to make a copy of the record, update the mapping row so that it points to the fresh copy, and finally carry on with the update. Everything is done inside a transaction so consistency is guaranteed.

But this is the magic of copy-on-write: we only have to create new copies of the records that change. Those that remain unmodified can still be shared, saving lots of I/O and CPU time that would otherwise be wasted on records that never change.

The entry with ID entry-3 was updated on the copy-of-master environment. The rest of the shared records are unaffected.

This is easy...?

At this point, we had a good mental model of how copy-on-write would work in our use case. This was the easy part.

You see, if we were starting Contentful today and we picked copy-on-write from the beginning, it would have been easy to implement. But this wasn’t the case. Contentful, being a thriving business loved by many customers, already had tons of content stored in environments created in the old way. This meant that we had to engineer a rollout process to move millions of records to copy-on-write. Across hundreds of databases. For thousands of clients. Without any noticeable impact on them.

Maybe not so easy.

So, we had to implement copy-on-write on a live system, with no one noticing. That is, no downtime for customers, no performance degradation in any functionality that depends on the data storage layer, and most importantly, no data loss.

The implementation required several steps:

  • Prepare the database schema to support copy-on-write — add new tables, foreign keys, backfill data, and keep it up to date.

  • Modify the SQL we use to filter and retrieve content to adopt the copy-on-write changes, while ensuring that performance doesn’t suffer.

  • Identify unique content that has to be kept and delete duplicates. As we wrote above, the old full-copy mechanism led to a lot of content that never changed, which was exactly the same across environments. As part of the copy-on-write migration, we had to remove all this duplication.

  • Demonstrate that copy-on-write didn't open the door to any data races. With copy-on-write, we moved from physically isolated copies to mutable shared state. And any book about concurrent programming will tell you that mutable shared state is the source of all evil (i.e., difficult-to-debug races).

And we wanted to do this process online, concurrently with customers' reads and writes. We started by tackling the steps with the most risks and unknowns: identifying the content to be kept and validating that the new querying performance was on par with the existing one.

Are you a clone?

Because every environment clone made so far was a perfect physical copy, we had a lot of duplication in our data storage (remember that our data showed that 90% of the content in copied environments never changes). And with copy-on-write, we wanted to move to a setup where there was no duplication. So, how did we identify which content to keep and which was a duplicate that could be removed? 

The first approach we considered was to locate all of the duplicates for each record and pick one to keep as the only copy that the mapping rows in the different environments would point to. The gist of this process was:

  1. Run a query to find all the duplicated records. Group records by those columns that would identify them unequivocally within one environment (e.g., creation timestamps, update timestamps, version numbers, customer-facing IDs, etc.) and count how many records had the same values for those columns across environments of the same space. Filter to keep only the groups with a size > 1 (i.e. with duplicates).

  2. Run a query to select one of the records from each group. This would be the “representative” (see the sketch right after this list).
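
A sketch of the kind of grouping query described in step 1 (the space_id column and the exact set of identifying columns are assumptions; the real query grouped on more identifying columns, such as version numbers):

SELECT entry_id,
       created_at,
       updated_at,
       count(*)           AS copies,
       (array_agg(id))[1] AS representative  -- step 2: one row to keep per group
FROM   entries
WHERE  space_id = 'some-space'
GROUP  BY entry_id, created_at, updated_at
HAVING count(*) > 1;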

There were, however, two problems with this approach: the first one was race conditions and the second one was performance.

This process used several queries with application logic in between them, which is the perfect setup if you want data races in your system. The interactions between the migration process and the other actors reading and writing records in the database could result in inconsistencies in the data. That's why we went through all the possible interactions and interleavings we could think of to make sure we had a solid solution. We found some races and fixed them.

Next, we tested the performance of the process, and the results were not positive. In several cases, we saw CPU usage increase by more than 30% and very high latencies, which for some large customers would have stretched the deduplication process to over a week. We looked at this problem from different angles for a couple of days and concluded that the combination of poor performance and the complexity of the procedure made it a bad choice, so we threw it away.

Next up: how could we find duplicates without causing problems?

Blazing fast diffs

“One door closes and another one opens …” goes the saying. And what a door! We were a bit down after our previous setback but our spirits quickly lifted once we tried a new idea. This time, we were going to find duplicate and unique content using plain SQL.

Postgres, like other database systems, offers set operations like EXCEPT or INTERSECT to calculate the difference and the intersection between two sets. And what are the environments if not sets of records?
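
For example, the records of a clone that no longer match a record in its source environment can be computed with a single EXCEPT (identifying columns assumed, as in the earlier sketches):

SELECT entry_id, created_at, updated_at
FROM   entries
WHERE  environment_id = 'copy-of-master'
EXCEPT
SELECT entry_id, created_at, updated_at
FROM   entries
WHERE  environment_id = 'master';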

With this insight, we quickly put together a pl/pgSQL procedure to see how fast we could find duplicate records across two environments. Contrary to the previous approach, which could find all the duplicate records across all environments in one space at once, this new procedure only operates on a pair of environments at a time. Because new environments are always copied from an existing environment, we had to compare every environment (except for the first one in the space) against its base or parent to find the duplicates.

A sidenote here to briefly explain how environment copying works and what's considered an environment's “parent.” Every space is created with one environment inside, the master environment. Editors can create copies of the master environment, and further copies of those copies. We say that such a new environment has master as its parent, or that master is the parent of the newly copied environment. If the user wants to make another copy, she can choose to copy it from master or from the copy she made before.

Back to finding duplicates after this short digression. The core of this new approach, which we called “environment diffing,” was a pl/pgSQL procedure built around these set operations.
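
A minimal sketch of what such a diffing routine can look like, reusing the assumed pre-migration schema and identifying columns from the earlier sketches (illustrative, not the actual procedure):

CREATE OR REPLACE FUNCTION environment_diff(parent_env text, child_env text)
RETURNS TABLE (public_entry_id text, is_duplicate boolean) AS $$
BEGIN
  RETURN QUERY
    -- Entries whose identifying columns are identical in both environments: duplicates.
    SELECT d.entry_id, true
    FROM (
      SELECT e.entry_id, e.created_at, e.updated_at
      FROM entries e WHERE e.environment_id = child_env
      INTERSECT
      SELECT e.entry_id, e.created_at, e.updated_at
      FROM entries e WHERE e.environment_id = parent_env
    ) AS d
    UNION ALL
    -- Entries in the child that changed or were added since the copy: unique content to keep.
    SELECT u.entry_id, false
    FROM (
      SELECT e.entry_id, e.created_at, e.updated_at
      FROM entries e WHERE e.environment_id = child_env
      EXCEPT
      SELECT e.entry_id, e.created_at, e.updated_at
      FROM entries e WHERE e.environment_id = parent_env
    ) AS u;
END;
$$ LANGUAGE plpgsql;

-- One parent/child pair at a time, e.g.:
-- SELECT * FROM environment_diff('master', 'copy-of-master');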

With this tool in our belt, we could now locate duplicate and unique records in environments with more than 1M records in less than 30 minutes! However, it was not all good news. Contrary to the previous approach, this one couldn't be done online: customers changing data concurrently with the procedure could result in an incorrect deduplication.

Instead, we had to break one of our initial aspirations and impose some write downtime on our customers.

A new day, a new obstacle

As you might’ve noticed, the introduction of the mapping tables changed the way we query data. Before, we would only have two tables participating in the query — entries and entry_fields. Both tables had an environment_id column that allowed us to filter the rows belonging to a specific environment before joining the tables' rows.

But with the new structure, rows in those tables can be shared between environments, so we removed that column; environment_id is now present only in the mapping table. Therefore, we always need to join the mapping table first to get only the rows belonging to a specific environment.

Before copy-on-write:

SELECT * FROM entries e JOIN entry_fields ef ON … WHERE e.environment_id = 'abc' AND ef.environment_id = 'abc';

After copy-on-write:

SELECT * FROM environments_entries em JOIN entries ON … JOIN entry_fields ON … WHERE em.environment_id = 'abc';

During the proof-of-concept phase, we compared the performance of the new query and it worked just fine. However, we didn't have a chance to test it at production scale and with production data distribution. So, once the mapping tables were backfilled with real data in production, but were not yet serving read traffic, we tested the new query performance … and it was nowhere near what we expected.

The new queries were roughly two times slower than the original ones.

To join tables, the database first fetches the rows matching the WHERE conditions for every table participating in the query, and only then performs the join in memory. Without the environment_id column on the entries and entry_fields tables, the number of rows the database had to fetch grew dramatically. After joining the rows from the three tables, most of the rows in the final joined set would be discarded by the environment_id filter. That led to a significant increase in I/O and working memory usage, which increased the query runtime.

Of course, this wasn’t acceptable, so we started looking for a solution. 

A glimpse into results

That wraps up the first installment of a two-part blog post series. To not keep you waiting, here are some of the numbers shown in the next post — the average environment copy is now 50 times faster! To be more specific, for some of our largest customers, copying an environment used to take on average 50 minutes and now it’s down to 60 seconds. 

Plus, here is what some of our customers had to say about their experiences with this improvement:

The Contentful maintenance occurred last night and I just tested one of the main benefits of it which is the ability to create new environments … in a matter of minutes (1-2), rather than the 15-20 minutes we're used to seeing. The new environment I created this morning was ready in ~30 seconds, this is a game changer for our development workflows and internal tooling across the board!

– Customer in the payments industry.

Just completed some testing of the new environment cloning process and the performance numbers are fantastic!

– Customer in the social media industry.

Intrigued? Want to hear the full story? Click here to continue reading part two.
