Postgres: Column change migrations with zero downtime in Rails

Since online shoppers are known to abandon carts at all hours of the day, we do our best to make sure Jilt is up and running 24/7. However, every developer knows altering a large database without any downtime can get pretty tricky.

A few weeks ago, we had several columns that needed to be changed from an int to a bigint. At first glance, this seemed like a straightforward change: we could use the Rails change_column method and be off to the races.

 change_column :orders, :price, :bigint

Not so fast. Before deploying this to our production app we ran a quick test on a copy of our production database. The result? It took an hour and a half.

Eep. Not willing to put our app into maintenance mode for an hour and a half, we set out to find a workaround. Taking inspiration from this Braintree post, we settled on a solution in two parts: the first adds temporary columns and backfills them, and the second renames the temporary columns and removes the old ones.

Part I

In Part I of our migration we needed to accomplish three things:

  1. Add temporary columns of the proper type
  2. Add a callback to save data to both our old column and our new column
  3. Add a rake task to backfill our new temporary columns

First, we added our temporary columns.

add_column :orders, :temp_price, :bigint

Then, we added a callback to our model to make sure that when a record was updated, our data would write to both columns.

class Order < ActiveRecord::Base
  before_save :save_to_temp_columns
  
  def save_to_temp_columns
    self.temp_price = price
  end
end
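With that callback in place, a regular save keeps the two columns in sync. A quick illustration (the record lookup and value here are just placeholders):

order = Order.find(1)
order.update(price: 1999) # before_save copies price into temp_price
order.temp_price # => 1999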

Tip: Make sure you’ve added logic to handle this in any place in your code where this model could be updated without triggering an Active Record callback (for example, update_columns).
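For calls like update_columns that skip callbacks, that just means writing both columns explicitly; a minimal sketch (new_price is a placeholder):

# update_columns bypasses validations and callbacks, so set both columns
order.update_columns(price: new_price, temp_price: new_price)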

Then, we added a rake task to backfill our temporary columns.

desc 'Copy values from int columns to temporary bigint columns'
task backfill_bigint_columns: :environment do
  Order.find_each do |order|
    columns_to_update = {}
    if order.price != order.temp_price
      columns_to_update[:temp_price] = order.price
    end
    order.update_columns(columns_to_update) unless columns_to_update.empty?
  end
end

After deploying Part I and running this rake task in a detached state on Heroku, we were ready to start on Part II.
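If you’re on Heroku too, the detached run looks something like this (the app name is a placeholder):

heroku run:detached rake backfill_bigint_columns --app our-app-name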

Part II

Part II needed to accomplish three things:

  1. Remove callbacks to save to temporary columns
  2. Drop our original columns
  3. Rename our temporary columns

In one migration, we removed the old columns and renamed the temporary ones, which already had the proper type.

# remove columns that are the wrong data type
remove_column :orders, :price
# rename temporary columns
rename_column :orders, :temp_price, :price
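Wrapped in a migration class, Part II looked roughly like this (the class name is illustrative, and newer Rails versions expect a version tag such as ActiveRecord::Migration[5.2]). Removing the callback from Part I was simply a matter of deleting the before_save line from the model in the same deploy:

class SwapTempBigintColumnsOnOrders < ActiveRecord::Migration
  def up
    # remove the column that has the wrong data type
    remove_column :orders, :price
    # rename the temporary column into its place
    rename_column :orders, :temp_price, :price
  end
end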

Each part of this migration was written on a different branch and deployed one after the other.

Part I of our migration took 0.0279s, while Part II took 1.3389s, saving us an hour and a half of potential downtime. This meant we were able to keep capturing abandoned carts without any interruptions.

3 Comments

  1. I think you have the potential for triggering some errors during the phase 2 deployment. Since your deployment will first run your migration there will be some old web servers still running the old code after the migration has been deployed. Since ActiveRecord caches the columns and uses that cache to build INSERT and UPDATE statements, the old web servers will try to write to the old column and blow up.

    The most common way to avoid that is to do this in 3 steps, and in step 2 you would tell ActiveRecord to ignore the old column by doing

    def self.columns
      super.reject { |column| column.name == 'temp_column' }
    end

  2. We’re struggling with a similar challenge. Have you happened to do this with no downtime when you’re trying to change a primary key from an int to a bigint?

    1. Hey John, this is an excellent and very timely question, especially with Basecamp’s recent and much talked about downtime caused by running out of integer primary keys in their database.

      The short answer to your question is: “no”, we have not yet attempted to retype any of our existing integer primary keys to bigint, and so I’m not certain that performing that migration is even possible to do with no downtime. I suspect some amount of downtime for this operation will be required, though it may well be possible to do it with minimal downtime, on the order of a minute or two. I’ll try to post an article with some additional thoughts and ideas that I have on the subject in the next day or so.
