It was a lot of fun to air dirty laundry in my last post on how one of our deployments went really wrong. But part of why that incident stuck out so strongly in my mind is that things so rarely go wrong.

Why is that? I think it’s because we do a lot right: we make it extremely easy for ourselves to keep features from going out to customers until we’re ready, and we give ourselves a lot of time to bang on the exact same version of the software, on the exact servers, that we’ll be pushing out to all of our customers.

How do we do that? Let’s say that I decide that Kiln totally needs an awesome WebGL flyover view of the Electric DAG. I code it up over lunch in about 15 minutes. Now what?

  1. First, as bad-ass as my new feature is, I recognize that (unlikely as this is) I just maybe have a bug or two. So rather than pushing it directly to all my other teammates, I push it to a feature branch called flitesim or the like. If the feature totally stinks (who the feh doesn’t like flyovers? I mean come on!), it’ll never make it out of the branch, and never end up in the main code base. No one except the devs working directly on the feature needs to care about it at this point.

  2. I don’t know OpenGL from an ice cream sundae and coded the feature up mostly via NeHe copypasta, so next up, I’ll use Kiln’s code reviews to ask my team members to review what I wrote. They’ll look for XSS exploits, check for sane coding conventions, check for possible performance bottlenecks or threading errors, and ask me what on Earth I was drinking when I invented this feature. Minus the last bit, it’s kind of like having a friend read over your essay before you hand it in. (Actually, with the last bit, it’s still like having a friend read over your essay before you hand it in.)

  3. At the same time this is going on, I’ll use a feature of our continuous integration system, Mortar, to cut builds off the flitesim branch for the QA team. About five minutes later, they get a licensed (i.e., install-on-your-local-machine) version of Kiln with my spiffy flight sim feature that they can bludgeon the crap out of, develop tests for, and so on. Because they’re licensed copies, they can install on VMware snapshots, which makes diagnosing schema migration issues and the like ridiculously trivial, and nearly guarantees we can get good repros when things go wrong.

  4. At some point, my team approves the code reviews (I hope), and QA agrees the feature is working to specification (weird specification though that may be). Kiln will be getting its own little DAG-oriented flight sim! Take that, $COMPETITOR! So at this point, I merge it into our devel branch, which we use for integration testing. Features merged into devel are destined for eventual deployment, and the devel branch is the one that all developers grab at the beginning of the day to base their work on. So as of now, the rest of my team will see my badassery, and begin playing with it and fixing bugs as they notice them.

  5. On Tuesday morning, provided that the developers feel good about everything that’s currently in the devel branch, we merge devel into dogfood, and, again using Mortar, deploy this version of Kiln to our own servers. At this point, the entire company is hitting Kiln and flying all over our source code in amazing 3D-accelerated goodness, and probably reporting bugs. Meanwhile, the QA team begins hammering Kiln with their full test suite to make sure everything looks okay and we don’t have any regressions.

  6. Once we’ve fixed the couple of bugs that inevitably appear at this point (e.g., “I flew before the beginning of time, halp?”), and developed tests to make sure they don’t reappear, we begin deploying to Fog Creek On Demand. This goes in several phases:

    1. In the first phase, we leak only to so-called “test” accounts, which are purely for QA. If something went horribly wrong, we could always just nuke these accounts.
    2. In the second phase, we leak to alpha accounts. Alpha account holders are exclusively current and former Fog Creek employees who understand the risks.
    3. If things go well there, we can optionally leak to beta customers, who are very nice clients of ours who have agreed to test versions before they go live so that they get new features first.
    4. Finally, provided that everything has been stable for a whole week, we push the leak up to 100%. Fly the rainbow!™
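To make the leak phases concrete, here is a minimal sketch of how a phased rollout like this could be modeled. It is not how Mortar or On Demand actually do it; the `Phase` enum, the account names, and `accounts_in_leak` are all made up for illustration. The only idea taken from the steps above is that each account belongs to a tier, and widening the leak means raising the current phase.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Leak phases in the order described above (names are hypothetical)."""
    TEST = 1       # QA-only accounts that can be nuked if needed
    ALPHA = 2      # current and former employees
    BETA = 3       # customers who agreed to get versions early
    EVERYONE = 4   # the leak is at 100%


# Hypothetical accounts, each tagged with the earliest phase that includes it.
ACCOUNTS = {
    "qa-scratch": Phase.TEST,
    "former-employee": Phase.ALPHA,
    "friendly-client": Phase.BETA,
    "regular-customer": Phase.EVERYONE,
}


def accounts_in_leak(accounts, current_phase):
    """Return the sorted names of accounts that see the new version."""
    return sorted(
        name for name, tier in accounts.items() if tier <= current_phase
    )


if __name__ == "__main__":
    for phase in Phase:
        print(phase.name, accounts_in_leak(ACCOUNTS, phase))
```

Because the beta phase is optional, skipping it in this model just means jumping the current phase from `ALPHA` straight to `EVERYONE`. Nothing is redeployed in between; only the set of accounts that can see the new version changes.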

A couple of things that I think work really well with this system:

  1. Features are incredibly heavily tested before they reach customers. They’ve already been tested in isolation, again at integration, again on our corporate Kiln install, and again on our exact On Demand infrastructure. The exact point release that eventually goes out has already been live for days by the time our customers get it, on the exact servers that our customers use. All we’re toggling is who can see the new version.
  2. Our deployment process is fully automated. It takes one button-click in Mortar to tag a build, and another click to deploy to our own servers. Deploying to On Demand is slightly more involved, since Mortar doesn’t currently allow arbitrary input and we want to be able to specify which accounts to upgrade. But at the end of the day, it’s still just a single script invocation that takes a list of accounts and a generation to deploy them to, and takes care of the rest.
  3. Any individual customer can be upgraded (or downgraded) at will. One thing we can do with Fog Creek On Demand that is comparatively rare in other systems I’ve seen is upgrade any individual customer to any version of Kiln and FogBugz that is currently hot (i.e., a valid deployment target). Many other systems can only upgrade at server-level granularity, which can be problematic if you’re trying to help a single customer out with a bug fix that you’re not ready to deploy to your entire user base.
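The last two points boil down to a small amount of bookkeeping: a mapping from account to generation, plus a check that the target generation is hot. Here is a rough sketch of that idea, assuming a plain in-memory dictionary. The real script's name, arguments, and storage aren't described in this post, so `deploy`, the generation labels, and the account names below are all hypothetical.

```python
def deploy(assignments, hot_generations, accounts, generation):
    """Return a new account -> generation mapping with `accounts` moved.

    assignments:      dict of account name -> generation it currently runs
    hot_generations:  set of generations that are valid deployment targets
    accounts:         list of account names to upgrade or downgrade
    generation:       the generation to move those accounts to
    """
    if generation not in hot_generations:
        raise ValueError(f"generation {generation!r} is not hot")

    unknown = [name for name in accounts if name not in assignments]
    if unknown:
        raise KeyError(f"unknown accounts: {unknown}")

    # Copy first so a failed call never leaves a half-applied change.
    updated = dict(assignments)
    for name in accounts:
        updated[name] = generation
    return updated


if __name__ == "__main__":
    # Hypothetical state: two hot generations, three accounts.
    hot = {"gen-41", "gen-42"}
    current = {"acme": "gen-41", "globex": "gen-41", "initech": "gen-41"}

    # Give a single customer the newer generation with the bug fix.
    current = deploy(current, hot, ["acme"], "gen-42")
    print(current)

    # Downgrading is the same operation pointed at an older hot generation.
    current = deploy(current, hot, ["acme"], "gen-41")
    print(current)
```

The point of the sketch is that upgrades and downgrades are the same operation at account granularity, which is what makes it cheap to help one customer without touching anyone else.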

The process isn’t perfect, but it’s one I’m very happy with, and one I’d like to emulate on other projects.