Opinionated Programmer - Jo Liss's musings on enlightened software development.

The Limits of Continuous Deployment

Abby Fichtner asks, Is Deploying to Production 50x/Day a GOOD Idea? Here’s my take.

From Continuous Integration …

Originally, “continuous integration” was about having a fully automated integration process: Moving from a manual to an automated compile-link-test cycle meant that instead of every week or month, you could have an up-to-date binary every (gasp!) day.

If that sounds normal to you, it’s testimony to how much better we are at making software today. Building and testing tend to be automated, and nobody gives a second thought to running make test && git commit every few hours, or even minutes. So continuous integration has become trivial. What’s the next frontier?

… to Continuous Deployment

That’s right, continuous deployment. For web applications, a build-test-stage-deploy cycle is the logical extension to the build-test cycle we already have. All you need is a script that runs the staging and deployment for you. So how often should you run this cycle? Many shops seem to deploy on the order of once a day. Abby asks the logical question, “given good testing and monitoring, why shouldn’t we deploy 50 times a day?” I’d like to offer some arguments as to why not.
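Mechanically, that cycle is just a chain of steps where any failure aborts everything downstream. A minimal sketch (the four functions are stand-ins for whatever your real build, test, staging, and deployment scripts happen to be):

```shell
#!/bin/sh
# Each function is a placeholder for a real script; && short-circuits,
# so a failing step stops the cycle before anything gets deployed.
run=""
build()  { run="$run build";  }
tests()  { run="$run test";   }
stage()  { run="$run stage";  }
deploy() { run="$run deploy"; }

build && tests && stage && deploy
echo "ran:$run"
```

In a real script each step would be something like make, make test, and a pair of staging/deployment scripts, but the short-circuiting && chain is the whole idea.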

For continuous integration, the limit to how often you can integrate is mostly technical: Can you automate everything, and does it run fast? But for continuous deployment, the real limit is not in the technology but in the process of writing software: It’s the non-zero cost of reverting a change.

The Cost of Backtracking

Let’s say you make a commit, and then later decide that it was a bad idea. How difficult is it to revert to the previous state? In ascending order:

  1. In a private repository, it’s instantaneous: git reset

  2. If you have published your repository, it’s perhaps embarrassing, but still (usually) instantaneous: git revert && git push

  3. If you’ve sent a pull request to another developer, it’s an extra email and a few minutes of wasted time on both sides. Still not too bad.

  4. But once you have deployed to a production site with users and valuable data, reverting is expensive, because you may have to:

    • Migrate the database back to the previous schema and conventions.
    • Think about what happens to users who are using your site right now and having the application changed under their feet (potentially causing links to break, and Ajax requests to fail).
    • If it’s a bug (rather than just a decision you’d like to reverse), you might even have to email users who were affected by the problem, or deal with support requests.
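The cheap end of this scale comes down to two git commands with an important difference: git reset erases a commit that nobody has seen, while git revert adds a new commit that undoes a published one. A self-contained sketch (it builds a throwaway repository so the commands have something to act on; the branch name main is an assumption):

```shell
#!/bin/sh
set -e

# Throwaway repository with a good commit and a bad one
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email demo@example.com
git config user.name demo
echo good > f.txt; git add f.txt; git commit -qm "good idea"
echo bad >> f.txt; git commit -qam "bad idea"

# Case 1: nobody has seen the commit -- make it disappear entirely
git reset -q --hard HEAD~1

# Case 2: the commit was already published -- undo it with a new
# commit on top instead of rewriting shared history
echo bad >> f.txt; git commit -qam "bad idea, already pushed"
git revert --no-edit HEAD

git log --format=%s
```

After the reset, "bad idea" is gone from history; after the revert, "bad idea, already pushed" is still in the log, with a Revert commit sitting on top of it.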

So problems on your production site are more expensive, no matter how awesome and agile your process and tools are. Good test suites may cut down on the number of actual bugs you introduce, but there will always be design decisions that you want to revert because you end up regretting them.

How to Revert Less Often

Agile is all about making it easy to change your mind about things – so if it becomes difficult to change things because they’ve been deployed to the server already, that should (rightly) bother us. Now we have two options:

  1. Up to a point, it’s probably possible to make reverting things on the server a bit easier, by having awesome deployment and rollback scripts. But in the end, it won’t usually be as simple as typing git reset. Which leaves us with option two:
  2. Push fewer commits to the server that you end up wanting to revert.
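For option one, a common shape for such a rollback script is the symlink-swap deploy: each release lives in its own directory and a current symlink points at the live one, so rolling back means re-pointing one link. A sketch (the directory layout is my assumption, loosely modeled on Capistrano-style deploys):

```shell
#!/bin/sh
set -e

root=$(mktemp -d)   # stand-in for something like /var/www/myapp
mkdir -p "$root/releases/2011-01-01" "$root/releases/2011-01-02"

# Deploy: atomically point "current" at the newest release
ln -sfn "$root/releases/2011-01-02" "$root/current"

# Rollback: point it back at the previous release
ln -sfn "$root/releases/2011-01-01" "$root/current"

readlink "$root/current"
```

This makes the code rollback nearly free; it is the database migrations and in-flight user sessions from the list above that stay expensive.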
[Graph: days after commit vs. chance of regret]
My completely unscientific idea of how likely commits are to be reverted over time.

And here is the magic trick: If you are making a bad decision, you are likely to figure it out quickly – because you’re basing new code on it, or maybe just because the next morning, you get a better idea in the shower. This is really just a variation of the common agile theme, “the longer you can delay decisions, the wiser you are when you make them.”

So if you can sit on your hands for a while instead of succumbing to the temptation to be hyper-aggressive and deploy immediately, then you might be able to avoid most of those nasty 5% of changes you really wish you hadn’t pushed to the production server, while losing only an insubstantial amount of early user feedback. I would suggest that reasonable time frames for this “deployment delay” are on the order of five hours to two days.

In other words, you might want to try deploying not the latest stable tree, but rather the latest stable tree from, say, 24h ago. (That should be fairly easy to do provided that you keep your most recent history clean by rewriting bad commits – rebasing, in git terms.) You could still deploy that 24h-old tree 50 times a day (and perhaps not even cause disaster), but in practice it’s probably easiest to just deploy once or maybe twice a day – say, first thing in the morning, and perhaps after lunch – each time bundling many commits into a single deploy.

Two caveats: First, I’m definitely not suggesting that you should never deploy fast. For example, if you think you can collect feedback in a matter of minutes, or if you can selectively push a change to a specific user who requested it, or if you have an urgent bugfix, then there should be nothing technical to stop you from deploying immediately. What I am saying is that for the majority of changes, exercising some patience before deploying them to the server will save you a lot of pain. Second, the speed of 1-2 deploys per day I am suggesting applies per developer. In a larger shop, with a product with many components, you might conceivably accumulate tens of deploys per day, and that’s actually a good sign in my view.

Update: It seems that the folks at IMVU are successfully pushing the size of the changes down to a minimum. Here are some of their writings: Timothy Fitz’s motivation, how this works in practice, Brett Durrett’s slides, and James Birchler giving some more info on QA.