---
title: "Infrastructure Week - Day 4: Downtime Both Planned and Not"
layout: post
---

# Downtime Both Planned and Not

Yesterday we looked at how Void monitors the various services and
systems that make up our fleet, and how we're alerted when issues
occur. An alert usually means that whatever has gone wrong needs a
human's attention, but not always: an alert can also trip when
systems are down for planned maintenance. During these windows, we
intentionally take services down to repair, replace, or upgrade
components so that we don't have unexpected breakage later.

## Planned Downtime

When possible, we always prefer for services to go down during a
planned maintenance window. This allows services to come down
cleanly and gives the people involved time to plan for the work. We
take planned downtime when it's not possible, or not safe, to change
a system while it's up. Examples of planned downtime include kernel
upgrades, major version changes of container runtimes, and major
package upgrades.

When we plan for an interruption, the relevant people agree on a
date, usually at least a week out, and talk through what the impacts
will be. Based on those conversations the team decides whether to
announce the interruption in a blog post or on social media. Most of
our changes don't warrant this, but some interrupt services in either
an unintuitive way or for an extended period of time. Rebooting a
mirror server usually doesn't warrant a notification; suspending the
sync to one for a few days would.

## Unplanned Downtime

Unplanned downtime is usually much more exciting because it is, by
definition, unexpected. These events happen when something breaks.
By and large the most common way things break for Void is running out
of disk space. While disk drives are cheap, getting a drive that can
survive years powered on under heavy read/write load is still not a
straightforward ask, especially if you also want high throughput at
low latency. The build servers need large volumes of scratch space
while building certain packages, whether for large caches or for the
piles of object files produced prior to linking. These large, elastic
use cases mean we can have hundreds of gigabytes of free space and
then run out over the course of a single build.

When this happens, we have to log on to the box, work out where space
can be reclaimed, and possibly dispatch builds back through the
system one architecture at a time so that each stays within the space
available. We also have to make sure that when we clean up, we're not
removing files that will be immediately redownloaded; one of the
easiest places to claim space back from, after all, is the cache of
downloaded files. The main complication in this workflow is getting a
build to restart. Builds are sometimes submitted in a specific order,
and when a crash occurs in the middle we may need to re-queue them to
ensure dependencies get built in the right order.
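
As a rough sketch of what that triage looks like (the paths and the
30-day threshold here are hypothetical, not our actual layout):

```sh
# Find the full filesystem and the biggest consumers under the build area.
df -h /build
du -sh /build/cache/* | sort -rh | head -20

# Reclaim the easy space first: cached downloads that haven't been read
# in a while are safe to remove because they can be re-fetched on demand.
find /build/cache -type f -atime +30 -delete
```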

Sometimes downtime occurs due to network partitions. Void runs in
many datacenters around the globe, and incidents ranging from street
repaving to literal ship anchors can disrupt the fiber optic cables
connecting our sites together. When this happens, we can arrive at a
state where people can see both sides of the split, but our machines
can't see each other anymore. Sometimes we can fix this by manually
reloading routes or cycling tunnels between machines, but often it's
easier to just drain services from the affected location and wait out
the issue on our remaining capacity elsewhere.
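
Cycling a tunnel by hand is usually quick. A sketch, assuming a
wg-quick managed WireGuard interface named `wg0` (interface names
vary per machine):

```sh
# Tear the tunnel down and bring it back up; wg-quick re-resolves the
# peer endpoint, which is often enough to route around a damaged path.
wg-quick down wg0 && wg-quick up wg0

# Confirm the peer handshakes are fresh again.
wg show wg0 latest-handshakes
```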

## Lessening the Effects of Downtime

As alluded to with network partitions, we take a lot of steps to
mitigate downtime and the effects of unplanned incidents. A large
part of this effort goes into making as much content as possible
static so that it can be served from minimal infrastructure, usually
nothing more than an nginx instance. This is how the docs, the
infrastructure docs, the main website, and a number of services like
xlocate work. A batch task runs to refresh the information, the
result gets copied to multiple servers, and as long as at least one
of those servers remains up, the service remains up.
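
The pattern is deliberately boring: a scheduled job regenerates the
content, then fans it out to every frontend. A minimal sketch, with
hypothetical host and path names:

```sh
#!/bin/sh
# Rebuild the static content, then push it to each nginx frontend.
# A frontend that is down simply misses one cycle and catches up on
# the next run.
make -C /srv/docs-build html
for host in web1.example.org web2.example.org web3.example.org; do
    rsync -az --delete /srv/docs-build/html/ "$host:/srv/www/docs/"
done
```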

Mirrors, of course, are highly available by being byte-for-byte
copies of each other. Since the mirrors are static files, they're
easy to make available redundantly. We also configure every mirror to
serve under any mirror's name, so during an extended outage the DNS
entry for a given name can be repointed and the traffic serviced by
another mirror. This lets us present the illusion that the mirrors
never go down when we perform longer maintenance, at the cost of some
complexity in the DNS layer. The mirrors don't just host static
content, though. We also serve the <https://man.voidlinux.org> site
from the mirrors, which requires a CGI executable and a collection of
static man pages to be available. The nginx frontends on each mirror
are configured to first seek out their local services, but if those
are unavailable they will reach across Void's private network to find
an instance of the service that is up.
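
In nginx terms, that local-first behavior looks roughly like the
sketch below; the addresses and ports are illustrative, and the
`backup` marker tells nginx to use the remote instance only when the
local one fails:

```nginx
upstream mandoc {
    # Prefer the CGI service running on this mirror...
    server 127.0.0.1:8000;
    # ...and only reach across the private network when it is down.
    server 10.0.1.5:8000 backup;
}

server {
    server_name man.voidlinux.org;
    # TLS configuration omitted for brevity.

    location / {
        proxy_pass http://mandoc;
        proxy_next_upstream error timeout http_502;
    }
}
```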

This private network is a mesh of wireguard tunnels that spans all
our machines across our different providers. You can think of it like
a multi-cloud VPC, which lets us ignore a lot of the complexity that
would otherwise manifest when operating in a multi-cloud design. The
private network also allows us to run distributed service instances
while still fronting them through relatively few points. This
improves security because very few people and places need access to
the certificates for voidlinux.org, as opposed to the certificates
having to be present on every machine.
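
Each node carries a configuration along these lines, with one
`[Peer]` stanza per remote machine in the mesh (the keys and
addresses here are placeholders):

```ini
# /etc/wireguard/wg0.conf
[Interface]
Address = 10.0.1.5/24
PrivateKey = <this node's private key>
ListenPort = 51820

[Peer]
# One stanza like this for each other machine in the mesh.
PublicKey = <peer public key>
Endpoint = peer.example.org:51820
AllowedIPs = 10.0.1.6/32
PersistentKeepalive = 25
```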

For services that are containerized, we have an additional set of
tricks that can lessen the effects of a downed server. As long as the
task in question doesn't require access to specific disks or data
that are unavailable elsewhere, Nomad can reschedule the task onto
another machine and update its entry in our internal service catalog
so that other services know where to find it. This lets us move
things like our IRC bots and some parts of our mirror control
infrastructure around when hosts are unavailable, rather than those
services being down for the duration of a host-level outage. If we
know the downtime is coming in advance, we can instruct Nomad to
smoothly drain services from the machine in question and relocate
them elsewhere. When the relocation is handled as a deliberate event
rather than as the result of a machine going away, the service
interruption is measured in seconds.
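
For planned work, that instruction is a single command (the node ID
here is a placeholder); Nomad migrates the allocations off the node
before we touch it:

```sh
# Gracefully move all eligible workloads off the node, giving them up
# to ten minutes to reschedule elsewhere before being force-stopped.
nomad node drain -enable -deadline 10m 1ac9a897

# After the maintenance is done, let the node accept work again.
nomad node drain -disable 1ac9a897
```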

## Design Choices and Tradeoffs

Of course there is no free lunch, and these choices come with
tradeoffs. Some of our design choices have to do with the difference
in effort between testing a service locally and debugging it
remotely. Containers help a lot here, since it's possible to run the
exact same image, with the exact same code in it, as the production
instance. They also insulate Void's infrastructure from breakage
caused by a bad update, since each service is encapsulated and can be
updated individually. We review each service's behavior as it is
updated, which gives us a clean migration path from one version to
the next with little doubt about whether it will work. If we do
discover a problem, the infrastructure is checked into git and the
old versions of the containers are retained, so we can easily roll
back.
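
A rollback is then just a matter of moving the repository back and
resubmitting the job. A sketch, assuming a Nomad job file named
`service.nomad`:

```sh
# Revert the job definition to the last known-good version...
git revert --no-edit HEAD

# ...and resubmit it; Nomad redeploys the still-retained old image.
nomad job run service.nomad
```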

We leverage containers to make the workflows easier to debug in the
general case, but of course the complexity doesn't go away. It's
important to understand that container orchestrators don't remove
complexity; quite to the contrary, they increase it. What they do is
shift and concentrate the complexity from one group of people
(application developers) to another (infrastructure teams). This
shift means fewer people need to care about the specifics of running
applications or deploying servers, since developers truly can say
"well, it works on my machine" and be reasonably confident that the
same container will work when deployed on the fleet.

The last major tradeoff we weigh when deciding where to run something
is how hard it will be to move later if we become unhappy with the
provider. Void is, at the time of writing, migrating our email server
from one host to another due to IP reputation issues at our previous
hosting provider. To make the migration easier, we originally
deployed the mail server as a container via Nomad, which means that
standing up the new mail server is as easy as moving the DNS entries
and telling Nomad to drain the old one of its workload.

Our infrastructure only works as well as the software running on it,
but we spend a lot of time making sure that the experience of
developing and deploying that software is as easy as possible.

---

This has been day four of Void's infrastructure week. Tomorrow we'll
wrap up the series with a look at how we make distributed
infrastructure work for our distributed team. This post was authored
by `maldridge`, who runs most of the day-to-day operations of the
Void fleet. Feel free to ask questions on [GitHub
Discussions](https://github.com/void-linux/void-packages/discussions/45140)
or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux).