# Zero Downtime Deployments

Another important part of your application lifecycle is deployment time. There are many strategies for deploying
software, each with its own pros and cons, so I will run through a few options from least complex to most complex. As
you may imagine, the most complex deployment types tend to come with the highest guarantees of uptime and the least
disruption to your customers.

You may be asking why it's important to consider how we deploy our applications, given that the vast majority of an
application's lifecycle is spent in the “running” state, so we could focus our time on strategies that support our
running application's resilience. My answer is: have you ever been on-call? Almost all incidents are caused by code
releases or changes. The first thing I do when I'm on-call and paged to an incident is check what was recently
deployed. I focus my attention on that component, and more often than not it was to blame.

We also need to consider that some of these deployment strategies will require specific code changes or application
architecture decisions in order to support them.

### Rolling Deployments

One of the simplest deployment strategies is a rolling deployment. Here we slowly, one by one (or many by many,
depending on how many instances of a service you have), replace old tasks with new ones. We can check that each new
task is healthy before moving on to the next, so that only a few tasks are unhealthy at any one time.

This is the default deployment strategy in Kubernetes. It actually borrows some characteristics from surge deployments,
covered next: it starts slightly more new tasks and waits for them to be healthy before removing old ones.
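
As a sketch of how this is configured in practice, a Kubernetes Deployment exposes the rolling behaviour through
`maxUnavailable` and `maxSurge` (those field names are real parts of the Deployment API; the service name, image and
values below are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # illustrative name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%     # how many old Pods may be down during the roll
      maxSurge: 25%           # how many extra new Pods may be started early
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: example.com/my-service:v2   # illustrative image
          readinessProbe:                    # gates the roll on health
            httpGet:
              path: /healthz
              port: 8080
```

The readiness probe is what lets the controller check "the new deployments are healthy" before continuing the roll.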

### Surge Deployments

Surge deployments are exactly what they sound like: we start a large number of new tasks before cutting traffic over to
them and then draining traffic from the old tasks. This is a good strategy for high-usage applications that may not
cope well with any reduction in availability. Usually surge deployments can be configured to run a certain percentage
more tasks than currently exist, wait for them to be healthy, and then perform the cutover.

The problem with surge deployments is that we need a large amount of spare compute capacity to spin up many new tasks
before rolling over and removing the old ones. This works well where you have very elastic compute, such as AWS Fargate,
where you don't need to provision more compute yourself.
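
In Kubernetes, a surge-style rollout can be approximated by raising `maxSurge` so that a full replacement set of tasks
is started before any old ones are removed; the fragment below is illustrative and would sit inside a Deployment's
`spec`:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 100%      # start a full set of new Pods up front
    maxUnavailable: 0%  # never remove an old Pod before a new one is ready
```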

### Blue/Green

The idea behind a Blue/Green deployment is that your entire stack (or application) is spun up and tested, and then,
once you are happy, you change configuration to send traffic to the entire new deployment. Some companies always keep
both a Blue and a Green stack running. This is a good strategy when you need very fast rollback and recovery to a known
good state: you can leave your “old” stack running for as long as you like once you are running on the new one.
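
One common way to implement the cutover, assuming Kubernetes, is to run the Blue and Green stacks as separate
Deployments and flip the label selector on the Service in front of them; all names and labels below are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-service
    stack: blue   # change to "green" to cut traffic over, or back to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Because the old Deployment keeps running, rollback is just flipping the selector back.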

### Canary

This is possibly one of the most complicated deployment strategies. It involves deploying a small number of instances
of your new application, sending a small portion of traffic to them, checking that nothing has broken by monitoring
application performance and metrics such as 4XX and 5XX error rates, and then deciding whether to continue with the
deployment. In advanced setups, canary controllers can roll back automatically if error thresholds are exceeded.

This approach does involve a lot more configuration, code and effort.

Interestingly, the name comes from coal mining and the phrase "canary in the coal mine." Canary birds have a lower
tolerance to toxic gases than humans, so they were used to alert miners when these gases reached dangerous levels inside
the mine.

We use our metrics and monitoring to decide if our “canary” application is healthy and, if it is, we then proceed with a
larger deployment.
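
The decision logic above can be sketched in a few lines. This is a hypothetical gate, not taken from any real canary
controller; the threshold and minimum sample size are illustrative assumptions:

```python
# Hypothetical canary gate: given observed HTTP status codes from the canary
# instances, decide whether to proceed with the rollout, roll back, or wait
# for more traffic. The defaults are illustrative, not from any real tool.
def canary_verdict(status_codes, max_error_rate=0.01, min_requests=100):
    """Return "proceed", "rollback", or "wait" based on the 5XX error rate."""
    if len(status_codes) < min_requests:
        return "wait"  # not enough traffic yet to judge the canary
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    error_rate = errors / len(status_codes)
    return "rollback" if error_rate > max_error_rate else "proceed"
```

In a real setup the inputs would come from your metrics system rather than a raw list of status codes, and "proceed"
would trigger the next traffic increment rather than a full rollout in one step.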

## Application design considerations

You may have worked out by now that the more advanced deployment strategies require you to have both old and new
versions of your application running at once. This means we need to ensure backwards compatibility with all the other
software running at the time. For instance, you couldn't use a database migration to rename a table or column, because
the old deployment would no longer work.
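
A backwards-compatible alternative is the expand/contract pattern: add the new name first, keep both in sync while old
and new versions coexist, and remove the old name only once the old version is gone. A hypothetical rename of a
`username` column to `login_name` might look like:

```sql
-- Expand: old code keeps using username, new code can start using login_name.
ALTER TABLE users ADD COLUMN login_name TEXT;
UPDATE users SET login_name = username;

-- ...deploy the new version, which reads login_name but writes both columns...

-- Contract: run only after no old instances remain.
ALTER TABLE users DROP COLUMN username;
```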

Additionally, our canary deployment strategy requires our application to have health checks, metrics, and good logging
and monitoring, so that we can detect a problem in the canary deployment specifically. Without these metrics we would
be unable to programmatically decide whether our new application works.
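
As a minimal sketch of the health-check side, the hypothetical endpoint below uses only the Python standard library; a
real service would expose richer readiness and metrics data:

```python
# Minimal /healthz endpoint using only the Python standard library.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging in this sketch

def start_health_server(port=0):
    """Serve the health endpoint on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A load balancer or canary controller can then poll `/healthz` to decide whether an instance should receive traffic.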
| 70 | + |
| 71 | +Both these considerations, along with others, mean that we need to spend extra time both on our application code, |
| 72 | +deployment code and our monitoring and alerting stacks to take advantage of the most robust deployments. |