Database – How to Safely Run Migrations with Multiple App Instances

database · distributed-system · sql · sql-server

We have an application that has a mix of both fast (< 1 second) and slow database migrations (> 30 seconds). Right now, we're running database migrations as a part of CI, but then our CI tool has to know all of the database connection strings for our app (across multiple environments) which isn't ideal. We want to change this process so that the application runs its own database migrations when it starts up.

Here's the situation:

We have multiple instances of this application – around 5 in production. Let's call them node1, ..., node5. Each app connects to a single SQL Server instance, and we're not using rolling deployments (all apps are deployed simultaneously, as far as I know).

Problem: say we have a long-running migration. In this case, node1 starts, then begins executing the migration. Now, node4 starts, and the long-running migration hasn't finished yet, so node4 also starts running the migration -> possible data corruption? How would you guard against this problem, or is it even important enough to worry about?

I was thinking of solving this problem with a distributed lock (using etcd or something along those lines). Basically, all apps try to acquire the lock, only one of them gets it and runs the migrations, then unlocks. When the rest of the apps start up and enter the critical section, all the migrations have already been run, so the migration script just exits.
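To make the intended flow concrete, here is a minimal sketch of that acquire–check–run–release logic. A `threading.Lock` and an in-memory set stand in for the distributed lock and the database's record of applied migrations; the node names and the `MyMigration42` label are illustrative.

```python
import threading

# Shared state standing in for the database: a set of applied migration labels.
# threading.Lock is a local stand-in for the distributed lock (etcd or similar).
applied_migrations = set()
migration_runs = []          # records which node actually executed the migration
lock = threading.Lock()

def run_migrations(node_name, label="MyMigration42"):
    with lock:                             # only one instance enters at a time
        if label in applied_migrations:
            return                         # already applied: the script just exits
        migration_runs.append(node_name)   # ...slow schema change would happen here...
        applied_migrations.add(label)

# Simulate five app instances starting up concurrently.
nodes = [threading.Thread(target=run_migrations, args=(f"node{i}",)) for i in range(1, 6)]
for t in nodes:
    t.start()
for t in nodes:
    t.join()

print(migration_runs)   # exactly one node name: the one that won the lock
```

Whichever thread wins the lock runs the migration; every other thread finds the label already recorded and exits without doing anything.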

However, my gut is saying "this is overkill, there must be a simpler solution," so I figured I'd ask here to see if anyone else has any better ideas.

Best Answer

Since you mentioned SQL Server: according to this former DBA.SE post, schema changes can (and should) be put into transactions. This gives you the ability to design your migrations just like any other form of concurrent writes to your DB - you start a transaction, and when it fails, you roll it back. That prevents at least some of the worst database corruption scenarios (though transactions alone will not prevent data loss when there are destructive migration steps like deleting a column or table).
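The following sketch illustrates transactional DDL. SQLite stands in for SQL Server here purely for the sake of a self-contained example (both support schema changes inside transactions); the table and column names are made up.

```python
import sqlite3

# isolation_level=None gives explicit control over BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Name TEXT)")

def column_names(conn, table):
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# A migration wrapped in a transaction: if any step fails, the schema
# change is rolled back along with everything else.
try:
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE Users ADD COLUMN Email TEXT")
    raise RuntimeError("simulated failure mid-migration")
except Exception:
    conn.execute("ROLLBACK")
else:
    conn.execute("COMMIT")

print(column_names(conn, "Users"))  # ['Id', 'Name'] - the ALTER was undone
```

Because the ALTER ran inside the transaction, the simulated failure leaves the schema exactly as it was before the migration started.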

Beyond that, I am sure you will also need some migrations table where already-applied migrations are registered, so an application process can check whether a specific migration was already applied or not. Then utilize "SELECT ... FOR UPDATE" (in SQL Server, a SELECT with UPDLOCK and HOLDLOCK table hints) to implement your migrations like this (pseudocode):

  • Start a transaction
  • SELECT * FROM Migrations WHERE MigrationLabel = 'MyMigration42' FOR UPDATE
  • if the former statement returns a row, commit and exit
  • apply the migration (if it fails, log the failure, roll back, and exit)
  • INSERT INTO Migrations (MigrationLabel) VALUES ('MyMigration42')
  • commit the transaction

That builds the locking mechanism directly into the "was the migration already applied" test.
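The steps above can be sketched end to end. SQLite again stands in for SQL Server so the example is self-contained: `BEGIN IMMEDIATE` plays the role of taking the write lock up front (where SQL Server would use UPDLOCK/HOLDLOCK hints on the SELECT), and the table and label names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Migrations (MigrationLabel TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Name TEXT)")

def apply_migration(conn, label, ddl):
    """Apply ddl exactly once, guarded by the Migrations table."""
    conn.execute("BEGIN IMMEDIATE")   # take the write lock at the start
    try:
        already = conn.execute(
            "SELECT 1 FROM Migrations WHERE MigrationLabel = ?", (label,)
        ).fetchone()
        if already:
            conn.execute("ROLLBACK")  # nothing to do: another node got here first
            return False
        conn.execute(ddl)             # the actual (possibly slow) schema change
        conn.execute("INSERT INTO Migrations (MigrationLabel) VALUES (?)", (label,))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise

first = apply_migration(conn, "MyMigration42", "ALTER TABLE Users ADD COLUMN Email TEXT")
second = apply_migration(conn, "MyMigration42", "ALTER TABLE Users ADD COLUMN Email TEXT")
print(first, second)   # True False - the second call is a no-op
```

The second call finds the label already registered and rolls back without touching the schema, which is exactly the "migration script just exits" behavior the question asks for.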

Note this design will - in theory - allow your migration steps to be agnostic about which application instance actually applies them: it is possible that step 1 is applied by app1, step 2 by app2, step 3 by app3, step 4 by app1 again, and so on. However, it is also a good idea not to apply migrations while other app instances are in use. Simultaneous deployment of all instances, as mentioned in your question, may already take care of this constraint.
