# Zero Downtime Deployments

Another important part of your application lifecycle is deployment time. There are many strategies for deploying
software, each with its own pros and cons, so I will run through a few options from least complex to most complex. As
you may imagine, the most complex deployment types tend to come with the highest guarantees of uptime and the least
disruption to your customers.

You may be asking why it's important to consider how we deploy our applications, given that the vast majority of an
application's lifecycle is spent in the “running” state, so we could focus our time on strategies that support our
running application's resilience. My answer is: have you ever been on-call? Almost all incidents are caused by code
releases or changes. The first thing I do when I'm on-call and paged to an incident is check what was recently
deployed. I focus my attention on that component, and more often than not it was to blame.

We also need to consider that some of these deployment strategies will require specific code changes or application
architecture decisions in order to support them.

### Rolling Deployments

One of the simplest deployment strategies is a rolling deployment. Here we slowly, one by one (or many by many,
depending on how many instances of a service you have), replace old tasks with new ones. We can check that each new
task is healthy before moving on to the next, so that only a few tasks are unhealthy at any one time.

This is the default deployment strategy in Kubernetes. It actually borrows some characteristics from surge deployments,
covered next: it starts slightly more new tasks and waits for them to be healthy before removing old ones.
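
As a sketch of how this is configured in practice, a Kubernetes Deployment exposes the rolling behaviour through
`maxUnavailable` and `maxSurge` (those field names are real parts of the Deployment API; the service name, image and
values below are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # illustrative name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%     # how many old Pods may be down during the roll
      maxSurge: 25%           # how many extra new Pods may be started early
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: example.com/my-service:v2   # illustrative image
          readinessProbe:                    # gates the roll on health
            httpGet:
              path: /healthz
              port: 8080
```

The readiness probe is what lets the controller check "the new deployments are healthy" before continuing the roll.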

### Surge Deployments

Surge deployments are exactly what they sound like: we start a large number of new tasks before cutting traffic over to
them and then draining traffic from the old tasks. This is a good strategy for high-usage applications that may not
cope well with any reduction in availability. Usually surge deployments can be configured to run a certain percentage
more tasks than currently exist, wait for them to be healthy, and then perform the cutover.

The problem with surge deployments is that we need a large amount of spare compute capacity to spin up many new tasks
before rolling over and removing the old ones. This works well where you have very elastic compute, such as AWS Fargate,
where you don't need to provision more compute yourself.
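
In Kubernetes, a surge-style rollout can be approximated by raising `maxSurge` so that a full replacement set of tasks
is started before any old ones are removed; the fragment below is illustrative and would sit inside a Deployment's
`spec`:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 100%      # start a full set of new Pods up front
    maxUnavailable: 0%  # never remove an old Pod before a new one is ready
```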

### Blue/Green

The idea behind a Blue/Green deployment is that your entire stack (or application) is spun up and tested, and then,
once you are happy, you change configuration to send traffic to the entire new deployment. Some companies always keep
both a Blue and a Green stack running. This is a good strategy when you need very fast rollback and recovery to a known
good state: you can leave your “old” stack running for as long as you like once you are running on the new one.
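
One common way to implement the cutover, assuming Kubernetes, is to run the Blue and Green stacks as separate
Deployments and flip the label selector on the Service in front of them; all names and labels below are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-service
    stack: blue   # change to "green" to cut traffic over, or back to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Because the old Deployment keeps running, rollback is just flipping the selector back.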

### Canary

This is possibly one of the most complicated deployment strategies. It involves deploying a small number of instances
of your new application, sending a small portion of traffic to them, checking that nothing has broken by monitoring
application performance and metrics such as 4XX and 5XX error rates, and then deciding whether to continue with the
deployment. In advanced setups, canary controllers can roll back automatically if error thresholds are exceeded.

This approach does involve a lot more configuration, code and effort.

Interestingly, the name comes from coal mining and the phrase "canary in the coal mine." Canary birds have a lower
tolerance to toxic gases than humans, so they were used to alert miners when these gases reached dangerous levels inside
the mine.

We use our metrics and monitoring to decide if our “canary” application is healthy and, if it is, we then proceed with a
larger deployment.
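
The decision logic above can be sketched in a few lines. This is a hypothetical gate, not taken from any real canary
controller; the threshold and minimum sample size are illustrative assumptions:

```python
# Hypothetical canary gate: given observed HTTP status codes from the canary
# instances, decide whether to proceed with the rollout, roll back, or wait
# for more traffic. The defaults are illustrative, not from any real tool.
def canary_verdict(status_codes, max_error_rate=0.01, min_requests=100):
    """Return "proceed", "rollback", or "wait" based on the 5XX error rate."""
    if len(status_codes) < min_requests:
        return "wait"  # not enough traffic yet to judge the canary
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    error_rate = errors / len(status_codes)
    return "rollback" if error_rate > max_error_rate else "proceed"
```

In a real setup the inputs would come from your metrics system rather than a raw list of status codes, and "proceed"
would trigger the next traffic increment rather than a full rollout in one step.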

## Application design considerations

You may have worked out by now that the more advanced deployment strategies require you to have both old and new
versions of your application running at once. This means we need to ensure backwards compatibility with all the other
software running at the time. For instance, you couldn't use a database migration to rename a table or column, because
the old deployment would no longer work.
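
A backwards-compatible alternative is the expand/contract pattern: add the new name first, keep both in sync while old
and new versions coexist, and remove the old name only once the old version is gone. A hypothetical rename of a
`username` column to `login_name` might look like:

```sql
-- Expand: old code keeps using username, new code can start using login_name.
ALTER TABLE users ADD COLUMN login_name TEXT;
UPDATE users SET login_name = username;

-- ...deploy the new version, which reads login_name but writes both columns...

-- Contract: run only after no old instances remain.
ALTER TABLE users DROP COLUMN username;
```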

Additionally, our canary deployment strategy requires our application to have health checks, metrics, and good logging
and monitoring, so that we can detect a problem in the canary deployment specifically. Without these metrics we would
be unable to programmatically decide whether our new application works.
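
As a minimal sketch of the health-check side, the hypothetical endpoint below uses only the Python standard library; a
real service would expose richer readiness and metrics data:

```python
# Minimal /healthz endpoint using only the Python standard library.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging in this sketch

def start_health_server(port=0):
    """Serve the health endpoint on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A load balancer or canary controller can then poll `/healthz` to decide whether an instance should receive traffic.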
| 70 | + |
| 71 | +Both these considerations, along with others, mean that we need to spend extra time both on our application code, |
| 72 | +deployment code and our monitoring and alerting stacks to take advantage of the most robust deployments. |