| 1 | +--- |
| 2 | +title: Infrastructure Week - Day 1: What is Infrastructure? |
| 3 | +layout: post |
| 4 | +--- |
| 5 | + |
| 6 | +This week we'll be taking a look into the infrastructure that runs and |
| 7 | +operates Void. We'll look at what we run, where it runs, and why it |
| 8 | +is set up the way it is. By the end of the week you should |
| 9 | +have a better understanding of what actually makes Void happen at a |
| 10 | +mechanical level. |
| 11 | + |
| 12 | +## So What Is Infrastructure, Like Really? |
| 13 | + |
| 14 | +Infrastructure, as the term is used by Void, refers to systems, |
| 15 | +services, or machines that are owned or operated by the Void Linux |
| 16 | +project to provide the services that make Void a reality. These |
| 17 | +services range from our build farm and mirror infrastructure, to email |
| 18 | +servers for maintainers, to hosted services such as the Fastly CDN |
| 19 | +mirror. Void runs in many places on many kinds of providers, so let's |
| 20 | +take a deeper look into the kinds of hosts that Void makes use of. |
| 21 | + |
| 22 | +### Owned Physical Hardware |
| 23 | + |
| 24 | +The easiest-to-understand infrastructure that Void operates is |
| 25 | +physical hardware. These are computers in either server form factors, |
| 26 | +small form factor systems, or even just high performance consumer |
| 27 | +devices that are used by the project to provide compute resources for |
| 28 | +the software we need to run. Our hardware resources are split across |
| 29 | +a number of datacenters, but the point of commonality of owned |
| 30 | +physical hardware is that someone within the Void project actually |
| 31 | +bought and owns the device we're running on. |
| 32 | + |
| 33 | +Owning the hardware is very different from a cloud model where you pay |
| 34 | +per unit of time that the resources are consumed. Instead, hardware like |
| 35 | +this is usually installed in a datacenter and co-located with many |
| 36 | +other servers. If you've never had the opportunity to visit a |
| 37 | +datacenter, they are basically large warehouse style buildings with |
| 38 | +rows of cabinets each containing some number of servers with high |
| 39 | +performance network and cooling available. The economy of scale of |
| 40 | +getting so many servers together in one location makes it more cost |
| 41 | +effective to provide extremely fast networks, high performance air |
| 42 | +conditioning, and reliable power usually sourced from multiple |
| 43 | +different grid connections and on-site redundant supplies. |
| 44 | + |
| 45 | +Void currently maintains owned machines in datacenters in the US and |
| 46 | +Europe. Since we don't always have maintainers who live near enough |
| 47 | +to just go to the datacenter, when things go wrong and we need to go |
| 48 | +"hands-on" to the machines, we have to make use of "Remote Hands". |
| 49 | +Remote Hands, sometimes called "Smart Hands", refers to the process |
| 50 | +wherein we open a ticket with the datacenter facility explaining what |
| 51 | +we want them to do, and which machine we want them to do it to. |
| 52 | +There's usually a security verification challenge-response that is |
| 53 | +unique to each operator, but after some shuffling, the ticket is |
| 54 | +processed and someone physically goes to our machines and interacts |
| 55 | +with them. Almost always the ticket will be for one of two things: some |
| 56 | +component has failed, and we would like to buy a new one and have it |
| 57 | +installed (hard drives, memory), or the machine has locked up in such |
| 58 | +a way that we just need them to go hold in the power button. Most |
| 59 | +of our hardware doesn't have remote management capability, so we need |
| 60 | +someone to go physically push the buttons. |
| 61 | + |
| 62 | +Occasionally, the problems are worse though, and we need to actually |
| 63 | +be able to interact with the machine. When this happens, we'll |
| 64 | +request that a KVM/iKVM/Spider/Hydra be attached, which provides a |
| 65 | +kind of remote desktop style of access, wherein an external device |
| 66 | +presents itself as a mouse and keyboard to the machine in question, |
| 67 | +and then streams the video output back to us. We can use these |
| 68 | +devices to be able to quickly recover from bad kernel updates, failed |
| 69 | +hardware, or even initial provisioning if the provider doesn't |
| 70 | +natively offer Void as an operating system choice, since most KVM |
| 71 | +devices allow us to remotely mount a disk image to the host as though |
| 72 | +a USB drive were plugged in. |
| 73 | + |
| 74 | +Owned hardware is nice, but it's also extremely expensive to initially |
| 75 | +set up, and is a long-term investment where we know we'll want to use |
| 76 | +those resources for an extended period of time. We have relatively |
| 77 | +few of these machines, but the ones we do have are very large |
| 78 | +capacity, high performance servers. |
| 79 | + |
| 80 | +### Donated/Leased Hardware |
| 81 | + |
| 82 | +Owning hardware is great, but a specific set of circumstances has to |
| 83 | +occur for that to be the right choice. The vast majority of Void's |
| 84 | +hardware is either leased, or leased by a sponsor and donated for our use. |
| 85 | +This is hardware that operates exactly the same as the owned physical |
| 86 | +hardware above, but we usually commit to these machines in increments |
| 87 | +ranging from several months to a year, and can renew or change the |
| 88 | +contract for the hardware more easily. Most of the build machines |
| 89 | +fall into this category, since it allows us to upgrade them regularly |
| 90 | +and ensure we're always making good use of the resources that are |
| 91 | +available to us. |
| 92 | + |
| 93 | +Interacting with this hardware is usually a little different from |
| 94 | +hardware we physically own. Since this hardware comes from facilities |
| 95 | +where many machines of the same kind are available, usually more |
| 96 | +automation exists to be able to remotely manage the systems. In |
| 97 | +particular, we use a lot of hardware from Hetzner in Germany and |
| 98 | +Finland, where the Hetzner Robot lets us remotely reboot machines |
| 99 | +and change their boot image. For hardware that is donated |
| 100 | +to us, we usually have to reach out to the sponsor and ask them to |
| 101 | +file the ticket on our behalf since they are the contract holder with |
| 102 | +the facility. In some cases they're able to delegate this access to |
| 103 | +us, but we always keep them in the loop regardless. |
| 104 | + |
| 105 | +### Cloud Systems |
| 106 | + |
| 107 | +For smaller machines, usually having fewer than 4 CPU cores and less |
| 108 | +than 8GB of memory, the best option available to us will be to get the |
| 109 | +machine from a cloud. We currently use two cloud hosting providers |
| 110 | +for machines that are on all the time, and have the ability to spin up |
| 111 | +additional capacity in two other clouds. |
| 112 | + |
| 113 | +We run a handful of machines at DigitalOcean to provide our control |
| 114 | +plane services that allow us to coordinate the other machines in the |
| 115 | +fleet, as well as to provide our single-sign-on services that let |
| 116 | +maintainers use one ID to access all Void related services and APIs. |
| 117 | +DigitalOcean has been a project sponsor for several years now, and |
| 118 | +they were the second cloud provider to get dedicated Void Linux |
| 119 | +images. |
| 120 | + |
| 121 | + |
| 122 | +For cloud machines that need to have a little more involved |
| 123 | +configuration, we run on top of the Hetzner cloud where our existing |
| 124 | +relationships with Hetzner make it easier for us to justify our |
| 125 | +requirements, and our longer account standing shows that we're not |
| 126 | +going to do dumb things on the platform, like run an open forwarder. |
| 127 | +Running a mail server on a cloud is itself somewhat challenging, and |
| 128 | +will be talked about later this week in more detail, so make sure to |
| 129 | +check back for more in this series. |
| 130 | + |
| 131 | +Though we do not actively run services on AWS or GCP, we do maintain |
| 132 | +cloud images for these platforms. Sadly, on GCP it is not possible for |
| 133 | +us to make our images broadly available, but it is relatively easy for |
| 134 | +you to create your own image if you desire to run Void on GCP. |
| 135 | +Similarly, you can run Void on AWS and make use of their wide service |
| 136 | +portfolio. In the past we have evaluated providing a ready-to-run AMI |
| 137 | +for AWS, but ultimately concluded the trade-off wasn't worth it. If |
| 138 | +you're interested in having a Void image on AWS, let us know. |
| 139 | + |
| 140 | +## How We Provision the Fleet |
| 141 | + |
| 142 | +Void's fleet spans multiple technologies and architectures, which |
| 143 | +makes provisioning it a somewhat difficult process to follow. In |
| 144 | +order of increasing complexity, we have manually managed provisioning, |
| 145 | +automatically managed OS provisioning and application provisioning, |
| 146 | +and full environment management in our cloud hosting environments. |
| 147 | + |
| 148 | +### Manually Managed |
| 149 | + |
| 150 | +This is the approach most familiar to the average Void user; we just perform |
| 151 | +these steps remotely. We'll power on a machine, boot from the live |
| 152 | +installer, and install the system to disk (almost always with a RAID |
| 153 | +configuration). Once the machine is installed and configured, we can |
| 154 | +then manage it remotely like any other machine in the fleet using our |
| 155 | +machine management tools. |
| 156 | + |
| 157 | +### Imaged Resources |
| 158 | + |
| 159 | +Some systems we run use a Void Linux image to perform the |
| 160 | +installation. These are usually smaller VMs being hosted by the |
| 161 | +members of the Void project in slack space on our own servers, and so |
| 162 | +the automation of a large hosting company doesn't make sense. These |
| 163 | +are usually systems running `qemu`, where the system gets unpacked |
| 164 | +from an image that the administrator will have prepared in advance |
| 165 | +containing the qemu guest agent and possibly other software required |
| 166 | +to connect to the network. |
| 167 | + |
| 168 | +### Cloud Resources |
| 169 | + |
| 170 | +Cloud managed resources are probably the most exciting of the systems |
| 171 | +we operate. These are generally managed using HashiCorp Terraform as |
| 172 | +fully managed environments. By this we mean that the very existence |
| 173 | +of the virtual server is codified in a file, checked into git, and |
| 174 | +applied using terraform to grow or shrink the number of resources. |
| 175 | + |
| 176 | +We can only do this, though, in places that provide the APIs needed to |
| 177 | +manage resources in this way. We currently have the most resources at |
| 178 | +DigitalOcean managed with terraform, where our entire footprint on the |
| 179 | +platform is managed this way. This works out extremely conveniently |
| 180 | +when we want to add or remove machines, since it's just a matter of |
| 181 | +editing a file and then re-running terraform to make the changes real. |
| 182 | +Beyond machines though, we also host the DNS records for voidlinux.org |
| 183 | +in a DigitalOcean DNS Zone. This enables us to easily track changes |
| 184 | +to DNS since it's all in the console, but managed via a git-tracked |
| 185 | +process. |
| 186 | + |
| 187 | +Having support for terraform is actually a major factor in deciding if |
| 188 | +we'll use a commercial hosting service or not. Remember that Void has |
| 189 | +developers all over the world in different time zones speaking |
| 190 | +different languages with different availability to actually work on |
| 191 | +Void. To make it easier to collaborate, we can apply the exact same |
| 192 | +workflows of distributed review and changes that make void-packages |
| 193 | +work to void-infrastructure and make feature branches for new servers, |
| 194 | +send them out for review, and process changes as required. |
| 195 | + |
| 196 | +## Making the Hardware Useful: Provisioning Services |
| 197 | + |
| 198 | +For services like DigitalOcean's hosted DNS or Fastly's CDN, once we |
| 199 | +push terraform configuration we're done and the service is live. This |
| 200 | +works because we're interfacing with a much larger system and just |
| 201 | +configuring our small partition of it. For most of Void's resources |
| 202 | +though, Void runs on machines either physical or virtual, and once the |
| 203 | +operating system is installed, we need to apply configuration to it to |
| 204 | +install packages, configure the system, and make the machines do more |
| 205 | +than just idle with SSH listening. |
| 206 | + |
| 207 | +Our tool of choice for this is Ansible, which allows us to express a |
| 208 | +series of steps as YAML documents that, when applied in order, |
| 209 | +configure the machine to have a given state. These files are called |
| 210 | +"playbooks" and we have multiple different playbooks for different |
| 211 | +machine functions, as well as functions common across the fleet. |
| 212 | +Usually upon provisioning a new machine, our first task will be to run |
| 213 | +the `base.yml` playbook which configures single sign-on, DNS, and |
| 214 | +installs some packages that we expect to have available everywhere. |
| 215 | +After we've done this base configuration step, we apply `network.yml` |
| 216 | +which joins the machines to our network. Given that we run in so many |
| 217 | +places where different providers have different network technologies |
| 218 | +that are, for the most part, incompatible and proprietary, we need to |
| 219 | +operate our own network based on WireGuard to provide secure |
| 220 | +connectivity machine to machine. |
| 221 | + |
| 222 | +When we have internal connectivity to Void's private network |
| 223 | +available, we can finalize provisioning by running any remaining |
| 224 | +playbooks that are required to turn the machine into something useful. |
| 225 | +These playbooks may install services directly, or install a higher |
| 226 | +level orchestrator that dynamically coordinates services. More on the |
| 227 | +services themselves later in the week. |
| 228 | + |
| 229 | +## Why Does Void Run Where it Does? |
| 230 | + |
| 231 | +Alternatively, why doesn't Void make use of cloud `<x>` or hosting |
| 232 | +provider `<y>`? The simple answer is that we have sufficient |
| 233 | +capacity with the providers we're already in and it takes a |
| 234 | +non-trivial amount of effort to build out support for new providers. |
| 235 | + |
| 236 | +The longer answer has to do with the semi-unique way that Void is |
| 237 | +funded, which is entirely by the maintainers. We have chosen not to |
| 238 | +accept monetary donations since this involves non-trivial |
| 239 | +understanding of tax law internationally, and for Void, we've |
| 240 | +concluded that's more effort than it's worth. As a result, our |
| 241 | +hosting providers are either hosts that have reached out |
| 242 | +and were willing to provide us with promotional credits on the |
| 243 | +understanding that they were interacting with individuals on behalf of |
| 244 | +the project, or places that Void maintainers already had accounts and |
| 245 | +thought the services were good quality and good value to run resources |
| 246 | +for Void. |
| 247 | + |
| 248 | +We do regularly re-evaluate, though, where we're running and what |
| 249 | +resources we make use of, from both a reliability standpoint and |
| 250 | +a cost standpoint. If you're with a hosting provider and want to see |
| 251 | +Void running in your fleet, drop us a line. |
| 252 | + |
| 253 | +--- |
| 254 | + |
| 255 | +This has been day one of Void's infrastructure week. Check back |
| 256 | +tomorrow to learn about what services we run, how we run them, and how |
| 257 | +we make sure they keep running. This post was authored by `maldridge` |
| 258 | +who runs most of the day-to-day operations of the Void fleet. Feel |
| 259 | +free to ask questions on [GitHub |
| 260 | +Discussions](https://github.com/void-linux/void-packages/discussions/45072) |
| 261 | +or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux). |