Database migrations that are applied to tables with millions and millions of rows can run for a long time and can also cause upticks or sustained increase in CPU, RAM or IO usage (or all of them combined). In some cases, the migration combined with other concurrent database activity can be enough to bring the database into an unhealthy state (e.g. statements timing out due to the load on the DB).
You can imagine that for a content platform such as Contentful, we want to prevent any performance issue in the databases which store our customers' content. The databases are automatically monitored 24/7 and when they enter an anomalous state, on-call engineers are paged to check the alert and take the appropriate measures. A migration could trigger some of these alerts, so to avoid unnecessary pages and ensure the availability and performance of the databases, the engineer running the migration would have to be ready to stop any migration before alerts were fired and restart it when it was considered safe to do again. This process worked well until our scale made it not sustainable anymore:
The migration could only be rolled out at once to a handful of the database instances because one person can't monitor multiple metrics across hundreds of instances. Even when running the migrations on a handful of instances I was sometimes late in stopping the migration and an alert was fired.
Keeping track of the status of the migration on each database instance is busy work which doesn't add any value.
The migration can only run when there is an engineer overseeing it. Migrations were run during office hours and losing the possibility to run during low traffic periods (when engineers like to sleep).
The gist of the problem was that relying on an engineer to constantly monitor the migration execution wasn’t sustainable and it was something which didn't bring any joy to the engineers (I can attest to that as I ran many of these migrations) or value to the company (my team would rather have me working on something else). Conceptually, the solution to these problems is simple and resembles 100% what an engineer did before: automate the periodical metrics checks and stop the migration if some thresholds are crossed and then only after the metrics have returned to safe values, re-start the migration.
Our migration files are node.js modules (exporting up and down functions) which are executed by a runner in our migrations SDK. The following diagram shows the main components of the old SDK: a thin migration runner which would take the migration and execute it until it finished or failed.
Let's start by enhancing the previous diagram with the key additions to our database migrations sdk. As before, at the top is the database migration the engineer wants to execute on the databases and as before, the execution of the migration will be controlled by the migration runner. The difference is that now the runner relies on two new components: the metrics service and the stoppable migration. These two new components are what helps make sure that database migrations don't cause any harm on the databases’ availability and their performance.
We will cover both components in the following sections.
Read metric. Analyze metric. Signal result.
For the first release of the new version of the database migrations sdk the scope was to automatically monitor the same metrics the engineer was using before (cpu, memory and io usage) but making it possible to extend this set in the future with things like api error rate, execution time of other concurrent jobs in the databases, etc. Due to this requirement we designed the metrics service to delegate most of the work to what we called metric providers and to use the data returned from them to determine if the migration could continue running or not.
All that the metrics service knows about the metric providers is the interface they must implement. This allows us to create providers which are specific to one metrics source (e.g. CloudWatch) without having to change anything on the metrics service.
For example, below there is a snippet from the
analyzeMetric method in the CloudWatch metrics provider. The
analyzeMetric takes the result of the pollMetrics function (not included in the snippet) and the thresholds we have configured for the migration (i.e. which values are considered safe for each migration). If, for example, the measurement for the
CPU usage or the one for the
FreeableMemory are outside of the safe values, a
NoGo signal will be returned to the metrics service, which in turn will forward it to the migrations runner.
The value returned from the
analyzeMetric method in the metrics providers is a
Signal. A Signal can be a
Go (continue execution) or
NoGo (stop execution) decision that a provider made after analyzing the polled metrics. The metrics service also returns a
Signal from its
analyzeAllMetrics method, which derives from the signals returned by the providers. We decided to use a
Signal for two reasons:
The migration runner, the providers and the metrics service only share a very slim type, the
Signal, which reduces the chances of coupling across different layers. At the same time, it is more meaningful to return a primitive like a boolean or a string value.
The logic in both the metrics service and the runner is easier to read and understand since the dependencies with other parts of the sdk are smaller.
The migration runner will schedule a periodical call to the metrics service to determine if the migration has to be stopped or not.
Continuing where we left it in the previous section, the migration runner must stop a migration if it receives a
NoGo signal from the metrics service. This raises the obvious question: how to stop a running migration? What we needed was to prevent a migration from executing more operations on the database and only allow them again once the migration runner received a
Go signal. A way to not execute more database operations could have been to terminate the job that runs the migration but that would have also terminated the metrics service and the migration runner. If that code was terminated, we wouldn't have a way to restart the migration after the metrics values went back to safe value, so we had to find a way that would handle the stop and start of the migration from within the running job.
To better explain how we solved this, I'm going to use an example migration like the following one:
This example migration reads some data from the tables
titles to generate the value of a recently added signature column. These operations will be repeated in a loop until there is no more data to transform (i.e. there are no more rows where the signature column is null). Now imagine that the migration runner receives a
NoGo signal while the migration is counting the number of remaining rows to migrate. Because of that, the migration shouldn't execute the
SELECT query to read the name, surname and title name on the next loop iteration but instead do it only once a
Go signal is received.
Our first attempt was to decorate the database client the migration uses (the
db argument the migration function receives ) and inside it control whether to forward calls to the real client or block them.
createWrappedDBClient returns the decorated client and two helper functions (
disableStop) that are used to enable and disable the forwarding of the calls to the real database client. After
enableStop is called, the next call to the decorated methods will be blocked on the deferrable which will only be resolved, allowing for the call to be forwarded to the real client, after a call to
disableStop. This approach works, the migration runner can use the
enableStop function to prevent the migration from making more queries to the database after a
NoGo signal is received and the
disableStop to allow for the queries to go through again.
Blocking calls this way has, however, a big drawback. To explain why we have to talk about transaction isolation in Postgres. With the
READ COMMITTED isolation level, if a transaction in the application code wants to
delete a row which had been modified by an unfinished transaction in the migration, the call made from the application would have to wait until the transaction in the migration is committed or rolled back. In the example migration, if we blocked after updating the
foo table but before the transaction finished, we could be blocking other concurrent application transactions and this was something we can't permit.
We modified the database client decorator and instead of blocking before forwarding the call to the real db client, we now throw an exception, which would implicitly roll back any inflight transaction and prevent anymore calls being sent to the database. The rest of the decorator is conceptually the same, still returning two functions, now called
disableThrowing which control the two states of the decorator.
Complementary to the changes to the decorator, we added a function that leverages the exception thrown from the decorated client (
RejectedDBClientUsageError) to know when the migration has to be stopped in a controlled way.
Only exit if the migration has been fully executed or an exception different to
RejectedDBClientUsageErroris thrown from the migration code.
RejectedDBClientUsageErrorhas been thrown, the function will block and not execute the migration again until the
resumehelper is used.
Now the migration runner can use the
abortMigration function to prevent the migration from making more queries to the database after a
NoGo signal is received and the
restartMigration to restart the migration again.
Putting all the pieces together
All what we have explained up to now comes together in the migration runner. The runner will start the periodical polling and analysis of the metrics (via the metrics service) and it will also create a stoppable migration. The migration runner will execute the migration and use the
restartMigration functions to control the execution of the migration based on the signals that the metrics service will periodically return.
Did we solve the problems we had? Let's see:
"The migration could only be rolled out at once to a handful of the database instances because one person can't monitor multiple metrics across hundreds of instances." We can now roll out the migration to all database instances at one. On each instance, the migration’s sdk will automatically start and stop the migration as many times as necessary depending on the metrics values.
"Keeping track of the status of the migration on each database instance is busywork which doesn't add any value." No need to do this anymore, the migration sdk knows when the migration finished and when it still has to be running.
"The migration can only run when there is an engineer overseeing it." Now the migration can run 24/7.
This is the first step to improve the developer experience of those engineers who have to run database migrations and to ensure that the availability and performance of the content databases, a core component for Contentful, is not affected. There are still rough edges and many ideas of how to make this more awesome to use. If you would like to work on projects like this go check out our careers page. We are hiring!