---
title: "Infrastructure Week - Day 1: What is Infrastructure?"
layout: post
---

This week we'll be taking a look into the infrastructure that runs and
operates Void. We'll look at what we run, where it runs, and why it is
set up the way it is. By the end of the week you should have a better
understanding of what actually makes Void happen at a mechanical
level.

## So What Is Infrastructure, Like Really?

Infrastructure, as the term is used by Void, refers to systems,
services, or machines that are owned or operated by the Void Linux
project to provide the services that make Void a reality. These
services range from our build farm and mirror infrastructure, to email
servers for maintainers, to hosted services such as the Fastly CDN
mirror. Void runs in many places on many kinds of providers, so let's
take a deeper look into the kinds of hosts that Void makes use of.

### Owned Physical Hardware

The kind of infrastructure Void operates that is easiest to understand
is physical hardware. These are computers in server form factors,
small form factor systems, or even just high performance consumer
devices that are used by the project to provide compute resources for
the software we need to run. Our hardware resources are split across
a number of datacenters, but the point of commonality of owned
physical hardware is that someone within the Void project actually
bought and owns the device we're running on.

Owning the hardware is very different from a cloud model where you pay
per unit of time that the resources are consumed; instead, hardware
like this is usually installed in a datacenter and co-located with
many other servers. If you've never had the opportunity to visit a
datacenter, they are basically large warehouse-style buildings with
rows of cabinets, each containing some number of servers, with high
performance networking and cooling available. The economy of scale of
getting so many servers together in one location makes it more cost
effective to provide extremely fast networks, high performance air
conditioning, and reliable power, usually sourced from multiple
different grid connections and on-site redundant supplies.

Void currently maintains owned machines in datacenters in the US and
Europe. Since we don't always have maintainers who live near enough
to just go to the datacenter, when things go wrong and we need to go
"hands-on" with the machines, we have to make use of "Remote Hands".
Remote Hands, sometimes called "Smart Hands", refers to the process
wherein we open a ticket with the datacenter facility explaining what
we want them to do, and which machine we want them to do it to.
There's usually a security verification challenge-response that is
unique to each operator, but after some shuffling, the ticket is
processed and someone physically goes to our machines and interacts
with them. Almost always the ticket will be for one of two things:
some component has failed and we would like to buy a new one and have
it installed (hard drives, memory), or the machine has become locked
up in some way and we just need them to go hold down the power button.
Most of our hardware doesn't have remote management capability, so we
need someone to go physically push the buttons.

Occasionally the problems are worse, though, and we need to actually
be able to interact with the machine. When this happens, we'll
request that a KVM/iKVM/Spider/Hydra be attached, which provides a
kind of remote desktop style of access, wherein an external device
presents itself as a mouse and keyboard to the machine in question
and then streams the video output back to us. We can use these
devices to quickly recover from bad kernel updates or failed hardware,
or even to do initial provisioning if the provider doesn't natively
offer Void as an operating system choice, since most KVM devices allow
us to remotely mount a disk image to the host as though a USB drive
were plugged in.

Owned hardware is nice, but it's also extremely expensive to set up
initially, and is a long-term investment where we know we'll want to
use those resources for an extended period of time. We have
relatively few of these machines, but the ones we do have are very
large capacity, high performance servers.

### Donated/Leased Hardware

Owning hardware is great, but a specific set of circumstances has to
happen for that to be the right choice. The vast majority of Void's
hardware is leased, or is leased hardware which is donated for our
use. This is hardware that operates exactly the same as the owned
physical hardware above, but we usually commit to these machines in
increments ranging from several months to a year, and can renew or
change the contract for the hardware more easily. Most of the build
machines fall into this category, since it allows us to upgrade them
regularly and ensure we're always making good use of the resources
that are available to us.

Interacting with this hardware is usually a little different from
hardware we physically own. Since this hardware comes from facilities
where many machines of the same kind are available, usually more
automation exists to remotely manage the systems. In particular, we
use a lot of hardware from Hetzner in Germany and Finland, where the
Hetzner Robot lets us remotely reboot machines and change their boot
image. For hardware that is donated to us, we usually have to reach
out to the sponsor and ask them to file the ticket on our behalf,
since they are the contract holder with the facility. In some cases
they're able to delegate this access to us, but we always keep them in
the loop regardless.

### Cloud Systems

For smaller machines, usually having fewer than 4 CPU cores and less
than 8GB of memory, the best option available to us will be to get the
machine from a cloud. We currently use two cloud hosting providers
for machines that are on all the time, and have the ability to spin up
additional capacity in two other clouds.

We run a handful of machines at DigitalOcean to provide our control
plane services that allow us to coordinate the other machines in the
fleet, as well as to provide our single-sign-on services that let
maintainers use one ID to access all Void-related services and APIs.
DigitalOcean has been a project sponsor for several years now, and
they were the second cloud provider to get dedicated Void Linux
images.

For cloud machines that need a little more involved configuration, we
run on top of the Hetzner cloud, where our existing relationships with
Hetzner make it easier for us to justify our requirements, and our
longer account standing shows that we're not going to do dumb things
on the platform, like run an open forwarder. Running a mail server on
a cloud is itself somewhat challenging, and will be talked about in
more detail later this week, so make sure to check back for more in
this series.

Though we do not actively run services on AWS or GCP, we do maintain
cloud images for these platforms. Sadly, in GCP it is not possible
for us to make our images broadly available, but it is relatively easy
for you to create your own image if you desire to run Void on GCP.
Similarly, you can run Void on AWS and make use of their wide service
portfolio. We have evaluated in the past providing a ready-to-run AMI
for AWS, but ultimately concluded the trade-off wasn't worth it. If
you're interested in having a Void image on AWS, let us know.

## How We Provision the Fleet

Void's fleet spans multiple technologies and architectures, which
makes provisioning it a somewhat difficult process to follow. In
order of increasing complexity, we have manually managed provisioning,
automatically managed OS and application provisioning, and full
environment management in our cloud hosting environments.

### Manually Managed

This is the process most familiar to the average Void user; we just
perform the same steps remotely. We'll power on a machine, boot from
the live installer, and install the system to disk (almost always with
a RAID configuration). Once the machine is installed and configured,
we can then manage it remotely like any other machine in the fleet
using our machine management tools.

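If you haven't done a RAID install before, the core of the process
looks roughly like the following sketch. The device names, partition
layout, and repository URL are illustrative, and several steps
(partitioning, fstab, the bootloader) are omitted; this is not a
verbatim copy of our procedure.

```sh
# Illustrative sketch only: devices and layout are made up, and
# partitioning, fstab, and bootloader setup are omitted.

# Mirror two partitions into a RAID1 array for the root filesystem.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# Put a filesystem on the array and mount it as the install target.
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt

# Install the base system into the target from a Void repository.
xbps-install -S -R https://repo-default.voidlinux.org/current -r /mnt \
    base-system mdadm
```
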
### Imaged Resources

Some systems we run use a Void Linux image to perform the
installation. These are usually smaller VMs hosted by members of the
Void project in slack space on our own servers, so the automation of a
large hosting company doesn't make sense. These are usually systems
running `qemu`, where the system gets unpacked from an image that the
administrator has prepared in advance, containing the qemu guest agent
and possibly other software required to connect to the network.

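As a rough sketch (the image names, sizes, and network setup here are
hypothetical, not Void's actual configuration), bringing up one of
these guests from a prepared image can look something like this:

```sh
# Hypothetical example; names, sizes, and networking are illustrative.

# Create a copy-on-write disk backed by the prepared template image.
qemu-img create -f qcow2 -b void-guest-template.qcow2 -F qcow2 \
    guest01.qcow2 50G

# Boot the guest with KVM acceleration and a virtio-serial port for
# the qemu guest agent, so the host can query and manage the VM.
qemu-system-x86_64 \
    -machine q35,accel=kvm -cpu host -smp 4 -m 8G \
    -drive file=guest01.qcow2,if=virtio \
    -netdev bridge,id=net0,br=br0 -device virtio-net-pci,netdev=net0 \
    -chardev socket,path=/run/qga-guest01.sock,server=on,wait=off,id=qga0 \
    -device virtio-serial \
    -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 \
    -nographic
```
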
### Cloud Resources

Cloud managed resources are probably the most exciting of the systems
we operate. These are generally managed using HashiCorp Terraform as
fully managed environments. By this we mean that the very existence
of the virtual server is codified in a file, checked into git, and
applied using Terraform to grow or shrink the number of resources.

We can only do this, though, in places that provide the APIs needed to
manage resources in this way. We currently have the most resources at
DigitalOcean managed with Terraform, where our entire footprint on the
platform is managed this way. This works out extremely conveniently
when we want to add or remove machines, since it's just a matter of
editing a file and then re-running Terraform to make the changes real.
Beyond machines, we also host the DNS records for voidlinux.org in a
DigitalOcean DNS Zone. This enables us to easily track changes to
DNS, since it's all visible in the console but managed via a
git-tracked process.

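To give a feel for what this looks like (the names, sizes, regions,
and records below are an illustrative sketch, not taken from our
actual configuration), a machine and the DNS record that points at it
are just a couple of blocks of HCL using the DigitalOcean provider:

```hcl
# Illustrative only: names, regions, sizes, and images are made up.
resource "digitalocean_droplet" "sso_example" {
  name   = "sso-example.voidlinux.org"
  region = "nyc3"
  size   = "s-2vcpu-4gb"
  image  = "custom-void-image-id"
}

# The DNS record lives in the same configuration, so one review covers
# both the machine and the name that points at it.
resource "digitalocean_record" "sso_example" {
  domain = "voidlinux.org"
  type   = "A"
  name   = "sso-example"
  value  = digitalocean_droplet.sso_example.ipv4_address
}
```

Adding or removing a machine is then just adding or deleting a block
like this and re-running `terraform apply`.
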
Having support for Terraform is actually a major factor in deciding
whether we'll use a commercial hosting service or not. Remember that
Void has developers all over the world, in different time zones,
speaking different languages, with different availability to actually
work on Void. To make it easier to collaborate, we can apply the
exact same workflows of distributed review and changes that make
void-packages work to void-infrastructure: make feature branches for
new servers, send them out for review, and process changes as
required.

## Making the Hardware Useful: Provisioning Services

For services like DigitalOcean's hosted DNS or Fastly's CDN, once we
push Terraform configuration we're done and the service is live. This
works because we're interfacing with a much larger system and just
configuring our small partition of it. For most of Void's resources,
though, Void runs on machines, either physical or virtual, and once
the operating system is installed, we need to apply configuration to
it to install packages, configure the system, and make the machines do
more than just idle with SSH listening.

Our tool of choice for this is Ansible, which allows us to express a
series of steps as YAML documents that, when applied in order,
configure the machine to a given state. These files are called
"playbooks", and we have multiple different playbooks for different
machine functions, as well as for functions common across the fleet.
Usually upon provisioning a new machine, our first task will be to run
the `base.yml` playbook, which configures single sign-on, DNS, and
installs some packages that we expect to have available everywhere.
After we've done this base configuration step, we apply `network.yml`,
which joins the machine to our network. Given that we run in so many
places where different providers have different network technologies
that are, for the most part, incompatible and proprietary, we need to
operate our own network based on WireGuard to provide secure
machine-to-machine connectivity.

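For the curious, a playbook is just an ordered list of tasks. The one
below is a heavily simplified sketch in the spirit of `base.yml` and
`network.yml`; the package names, file paths, and runit service
directory are invented for this post rather than copied from our real
playbooks.

```yaml
# Simplified sketch only; not Void's actual playbooks.
- hosts: all
  become: true
  tasks:
    - name: Install packages we expect to have everywhere
      package:
        name: [chrony, wireguard-tools]
        state: present

    - name: Template the WireGuard tunnel for this host
      template:
        src: wg0.conf.j2
        dest: /etc/wireguard/wg0.conf
        mode: "0600"

    - name: Enable a (hypothetical) runit service for the tunnel
      file:
        src: /etc/sv/wg0
        dest: /var/service/wg0
        state: link
```
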
When we have internal connectivity to Void's private network
available, we can finalize provisioning by running any remaining
playbooks that are required to turn the machine into something useful.
These playbooks may install services directly, or install a
higher-level orchestrator that dynamically coordinates services. More
on the services themselves later in the week.

## Why Does Void Run Where it Does?

Alternatively, why doesn't Void make use of cloud `<x>` or hosting
provider `<y>`? The simple answer is that we have sufficient capacity
with the providers we're already in, and it takes a non-trivial amount
of effort to build out support for new providers.

The longer answer has to do with the semi-unique way that Void is
funded, which is entirely by the maintainers. We have chosen not to
accept monetary donations, since this involves a non-trivial
understanding of international tax law, and for Void we've concluded
that's more effort than it's worth. As a result, the selection of
hosting providers is either hosts that have reached out and were
willing to provide us with promotional credits, on the understanding
that they were interacting with individuals on behalf of the project,
or places where Void maintainers already had accounts and thought the
services were good quality and good value to run resources for Void.

We do regularly re-evaluate where we're running and what resources we
make use of, though, both from a reliability standpoint and a cost
standpoint. If you're with a hosting provider and want to see Void
running in your fleet, drop us a line.

---

This has been day one of Void's infrastructure week. Check back
tomorrow to learn about what services we run, how we run them, and how
we make sure they keep running. This post was authored by `maldridge`,
who runs most of the day-to-day operations of the Void fleet. Feel
free to ask questions on [GitHub
Discussions](https://github.com/void-linux/void-packages/discussions/45072)
or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux).
