|
| 1 | +--- |
| 2 | +title: "Infrastructure Week - Day 5: Making Distributed Infrastructure Work for Distributed Teams" |
| 3 | +layout: post |
| 4 | +--- |
| 5 | + |
| 6 | +Void runs a distributed team of maintainers and contributors. Making |
| 7 | +infrastructure work for any team is a confluence of goals, user |
| 8 | +experience choices, and hard requirements. Making infrastructure work |
| 9 | +for a distributed team adds on the complexity of accessing everything |
| 10 | +securely over the open internet, and doing so in a way that is still |
| 11 | +convenient and easy to set up. After all, a light switch that is |
| 12 | +difficult to use is likely to lead to lights being left on. |
| 13 | + |
| 14 | +We keep several design criteria in mind when designing new systems |
| 15 | +and services that make Void work. We also periodically re-evaluate |
| 16 | +systems that have been built to ensure that they still follow good |
| 17 | +design practices, that we are still able to maintain them, and that |
| 18 | +they do what we want. Let's dive into some of these design practices. |
| 19 | + |
| 20 | +## No Maintainer Facing VPNs |
| 21 | + |
| 22 | +VPNs, or Virtual Private Networks, are ways of interconnecting systems |
| 23 | +such that the network in between appears to vanish beneath a layer of |
| 24 | +abstraction. WireGuard, OpenVPN, and IPsec are examples of VPNs. With |
| 25 | +OpenVPN and IPsec, a client program handles encryption and decryption |
| 26 | +of traffic on a tunnel or tap device that translates packets into and |
| 27 | +out of the kernel network stack. If you work in a field that involves |
| 28 | +using a computer for your job, your employer may make use of a VPN to |
| 29 | +grant your device connectivity to their corporate network environment |
| 30 | +without you having to be physically present in a building. VPN |
| 31 | +technologies can also be used to make multiple physical sites appear |
| 32 | +to all be on the same network. |
| 33 | + |
| 34 | +Void uses WireGuard to provide machine-to-machine connectivity for our |
| 35 | +fleet, but only within our fleet. Maintainers always access services |
| 36 | +without a VPN. Why do we do this, and how do we do it? First, the |
| 37 | +why. We operate in this way because corporate VPNs are often |
| 38 | +cumbersome, require split-horizon DNS (where you get different DNS |
| 39 | +answers depending on where you resolve from), and require careful |
| 40 | +planning to make sure no subnet overlap occurs between the VPN, the |
| 41 | +network you are connecting to, and your local network. If there were |
| 42 | +an overlap, the kernel would be unable to determine where to send the |
| 43 | +packets since it has multiple routes for the same subnets. There are |
| 44 | +cases where this is a valid network topology (ECMP), but that is not |
| 45 | +what is being discussed here. We also have no reason to use a VPN. |
| 46 | +Most of the use cases that still require a VPN have to do with |
| 47 | +transporting arbitrary TCP streams across a network, which Void does |
| 48 | +not need to do: all our services are either HTTP-based or are |
| 49 | +transported over SSH. |
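
The subnet overlap problem described above can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the network names and address ranges are hypothetical examples, not Void's actual address plan.

```python
import ipaddress


def find_overlaps(networks):
    """Return every pair of named networks whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    names = sorted(parsed)
    overlaps = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if parsed[first].overlaps(parsed[second]):
                overlaps.append((first, second))
    return overlaps


# Hypothetical address plan: the home LAN sits inside the office range,
# so a host attached to both could not tell which route to use.
plan = {
    "home_lan": "192.168.1.0/24",
    "office_lan": "192.168.0.0/16",
    "vpn_tunnel": "10.8.0.0/24",
}

print(find_overlaps(plan))
```

Every maintainer's home network would have to pass a check like this against every network reachable through the VPN, which is part of why avoiding a maintainer-facing VPN is attractive.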
| 50 | + |
| 51 | +For almost all our systems that we interact with daily, either a web |
| 52 | +interface or HTTP-based API is provided. For the devspace file |
| 53 | +hosting system, maintainers can use SFTP via SSH. Both HTTP and SSH |
| 54 | +have robust, extremely well tested authentication and encryption |
| 55 | +options. When designing a system for secure access, defense in depth |
| 56 | +is important, but so is trust that the cryptographic primitives you |
| 57 | +have selected actually work. We trust that HTTPS works, and so there |
| 58 | +is no need to wrap the connection in an additional layer of |
| 59 | +encryption. The same goes for SSH, for which we exclusively use |
| 60 | +public-key authentication. This choice is sometimes challenging |
| 61 | +to maintain, since it means that we need to ensure highly available |
| 62 | +HTTP proxies and secure, easily maintained SSH key implementations, but |
| 63 | +we have found it works well for us. In addition to the static files that |
| 64 | +all our tier 1 mirrors serve, the mirrors are additionally capable of |
| 65 | +acting as proxies. This allows us to terminate the externally trusted |
| 66 | +TLS session at a webserver running nginx, and then pass the traffic |
| 67 | +over our internal encrypted fabric to the destination service. |
| 68 | + |
| 69 | +For SSH, we simply make use of `AuthorizedKeysCommand` to summon keys |
| 70 | +from NetAuth, allowing authorized maintainers to log onto servers or |
| 71 | +SSH-enabled services wherever their keys are validated. For the |
| 72 | +devspace service, which has a broader ACL than our base hardware, we |
| 73 | +can enhance its separation by running an SFTP server distinct from the |
| 74 | +host sshd. This allows us to ensure that it is impossible for a key |
| 75 | +validated for devspace to inadvertently authorize a shell login to the |
| 76 | +underlying host. |
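
To illustrate the general shape of an `AuthorizedKeysCommand` helper, here is a minimal sketch. sshd runs the configured command with the login name (the `%u` token) and accepts any keys the command prints in `authorized_keys` format. The in-memory directory, group names, and placeholder key strings below are all assumptions made for illustration; this is not NetAuth's API or Void's actual helper.

```python
# Hypothetical directory data standing in for a real identity service.
# The key strings are placeholders, not real key material.
DIRECTORY = {
    "alice": {
        "groups": {"shell-access", "devspace"},
        "keys": ["ssh-ed25519 PLACEHOLDER-KEY-1 alice@laptop"],
    },
    "bob": {
        "groups": {"devspace"},
        "keys": ["ssh-ed25519 PLACEHOLDER-KEY-2 bob@desktop"],
    },
}


def authorized_keys_for(username, required_group, directory=DIRECTORY):
    """Return the authorized_keys lines a service should accept for a user."""
    entry = directory.get(username)
    if entry is None or required_group not in entry["groups"]:
        # Printing nothing means sshd finds no acceptable key.
        return []
    return list(entry["keys"])


# A real helper would print the keys for the single username sshd passes in.
for user in ("alice", "bob", "mallory"):
    print(user, authorized_keys_for(user, "shell-access"))
```

In this sketch `bob` holds a key that is valid for the hypothetical `devspace` group but not for `shell-access`, which mirrors the separation described above: a key accepted by the SFTP service does not authorize a shell login on the host.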
| 77 | + |
| 78 | +For all other services, we make use of service-level |
| 79 | +authentication as and when required. We use combinations of native |
| 80 | +NetAuth, LDAP proxies, and PAM helpers to make all access seamless for |
| 81 | +maintainers via our single sign-on system. Removing the barrier of a |
| 82 | +VPN also means that during an outage, there's one less component we |
| 83 | +need to troubleshoot and debug, and one less place for systems to |
| 84 | +break. |
| 85 | + |
| 86 | +## Use of Composable Systems |
| 87 | + |
| 88 | +Distributed systems are often made up of complex, interdependent |
| 89 | +sub-assemblies. This level of complexity is fine for dedicated teams |
| 90 | +who are paid to maintain systems day in and day out, but is difficult |
| 91 | +to pull off with an all-volunteer team that works on Void in their |
| 92 | +free time. Distributed systems are also best understood on a |
| 93 | +whiteboard, and this doesn't lend itself well to making a change on a |
| 94 | +laptop from a train, or reviewing a delta from a tablet between other |
| 95 | +tasks. While substantive changes are almost always made from a full |
| 96 | +terminal, the ratio of substantive changes to items requiring only |
| 97 | +quick verification is significant, and it's important to maintain a |
| 98 | +level of understandability. |
| 99 | + |
| 100 | +In order to keep the infrastructure understandable at a level that |
| 101 | +permits a reasonable time investment, we make use of composable |
| 102 | +systems. Composable systems can best be |
| 103 | +thought of as infrastructure built out of common sub-assemblies. Think |
| 104 | +Lego blocks for servers. This allows us to have a common base library |
| 105 | +of components, for example webservers, synchronization primitives, and |
| 106 | +timers, and then build these into complex systems through joining |
| 107 | +their functionality together. |
| 108 | + |
| 109 | +We primarily use containers to achieve this composability. Each |
| 110 | +container performs a single task or a well-defined sub-process in a |
| 111 | +larger workflow. For example, we can look at the workflow required to |
| 112 | +serve <https://man.voidlinux.org/>. In this workflow, a task runs |
| 113 | +periodically to extract all man pages from all packages, then another |
| 114 | +process runs to copy those files to the mirrors, and finally a process |
| 115 | +runs to produce an HTTP response to a given man page request. Notice |
| 116 | +that it's an HTTP response, but the man site is served securely |
| 117 | +over HTTPS. This is because across all of our web-based services we |
| 118 | +make use of common infrastructure such as load balancers and our |
| 119 | +internal network. This allows applications to focus on their |
| 120 | +individual functions without needing to think about the complexity of |
| 121 | +serving an encrypted connection to the outside world. |
| 122 | + |
| 123 | +By designing our systems this way, we also gain another neat feature: |
| 124 | +local testing. Since applications can be broken down into smaller |
| 125 | +building blocks, we can take just the single building block under |
| 126 | +scrutiny and run it locally. Likewise, we can upgrade individual |
| 127 | +components of the system to determine if they improve or worsen a |
| 128 | +problem. With some clever configuration, we can even upgrade half of |
| 129 | +a system that's highly available and compare the old and new |
| 130 | +implementations side by side to see if we like one over the other. |
| 131 | +This composability enables us to configure complex systems as |
| 132 | +individual, understandable components. |
| 133 | + |
| 134 | +It's worth clarifying, though, that this is not necessarily a |
| 135 | +microservices architecture. We don't really have any services that |
| 136 | +could be defined as microservices in the conventional sense. Instead, |
| 137 | +this architecture should be thought of as the Unix Philosophy as |
| 138 | +applied to infrastructure components. Each component has a single, |
| 139 | +well-understood goal and that's all it does. Other goals are |
| 140 | +accomplished by other services. |
| 141 | + |
| 142 | +We assemble all our various composed services into the service suite |
| 143 | +that Void provides via our orchestration system (Nomad) and our load |
| 144 | +balancers (nginx), which allow us to present the various disparate |
| 145 | +systems as though they were one to the outside world, while still |
| 146 | +maintaining them as separate service "verticals" side by side |
| 147 | +internally. |
| 148 | + |
| 149 | +## Everything in Git |
| 150 | + |
| 151 | +Void's packages repo is a large git repo with hundreds of contributors |
| 152 | +and many maintainers. This package bazaar contains all manner of |
| 153 | +different software that is updated, verified, and accepted by a team |
| 154 | +that spans the globe. Our infrastructure is no different, but |
| 155 | +involves fewer people. We make use of two key systems to enable our |
| 156 | +Infrastructure as Code (IaC) approach. |
| 157 | + |
| 158 | +The first of these tools is Ansible. Ansible is a configuration |
| 159 | +management utility written in Python which can programmatically SSH |
| 160 | +into machines, template files, install and remove packages, and more. |
| 161 | +Ansible takes its instructions as collections of YAML files called |
| 162 | +roles that are assembled into playbooks (composability!). These |
| 163 | +roles come either from the main void-infrastructure repo or as |
| 164 | +individual modules from the void-ansible-roles organization on GitHub. |
| 165 | +Since this is code checked into Git, we can use ansible-lint to ensure |
| 166 | +that the code is consistent and lint-free. We can then review the |
| 167 | +changes as a diff, and work on various features on branches just like |
| 168 | +changes to void-packages. The ability to review what changed is also |
| 169 | +a powerful debugging tool to allow us to see if a configuration delta |
| 170 | +led to or resolved a problem, and if we've encountered any similar |
| 171 | +kind of change in the past. |
| 172 | + |
| 173 | +The second tool we use regularly is Terraform. Whereas Ansible |
| 174 | +configures servers, Terraform configures services. We can apply |
| 175 | +Terraform to almost any service that has an API, as most popular |
| 176 | +services that Void consumes have Terraform providers. We use |
| 177 | +Terraform to manage our policy files that are loaded into Nomad, |
| 178 | +Consul, and Vault; we use it to provision and deprovision machines on |
| 179 | +DigitalOcean, Google, and AWS; and we use it to update our DNS records |
| 180 | +as services change. Just like Ansible, Terraform has a linter, a |
| 181 | +robust module system for code re-use, and a really convenient system |
| 182 | +for producing a diff between what the files say the service should be |
| 183 | +doing and what it actually is doing. |
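
The idea of diffing declared state against observed state can be sketched in a few lines. This is a simplified illustration of the concept only, not Terraform's actual planning algorithm; the record names use the reserved `example.org` domain and documentation address range, and are hypothetical.

```python
def plan(desired, actual):
    """Compare declared state with observed state and list the changes needed."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name, desired[name]))
        elif name not in desired:
            changes.append(("delete", name, actual[name]))
        elif desired[name] != actual[name]:
            changes.append(("update", name, desired[name]))
    return changes


# Hypothetical DNS records: what the files in Git say versus what is live.
desired = {"www.example.org": "192.0.2.10", "man.example.org": "192.0.2.20"}
actual = {
    "www.example.org": "192.0.2.10",
    "man.example.org": "192.0.2.99",
    "old.example.org": "192.0.2.30",
}

for change in plan(desired, actual):
    print(change)
```

Because the list of changes is computed before anything is applied, it can be attached to a review and approved asynchronously, which is what makes this style of tool a good fit for a distributed team.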
| 184 | + |
| 185 | +Perhaps the most important use of Terraform for us is the formalized |
| 186 | +onboarding and offboarding process for maintainers. When a new |
| 187 | +maintainer is proposed and has been accepted through discussion within |
| 188 | +the Void team, we'll privately reach out to them to ask if they want |
| 189 | +to join the project. If the candidate accepts the offer to join |
| 190 | +the group of pkg-committers, the action that formally brings them on |
| 191 | +to the team is a patch applied to the Terraform that manages our |
| 192 | +GitHub organization and its members. We can then log approvals, |
| 193 | +welcome the new contributor to our team with suitable emoji, and grant |
| 194 | +access all in one convenient place. |
| 195 | + |
| 196 | +Infrastructure as Code allows our distributed team to easily maintain |
| 197 | +our complex systems with a written record that we can refer back to. |
| 198 | +The ability to defer changes to an asynchronous review is imperative |
| 199 | +to manage the workflows of a distributed team. |
| 200 | + |
| 201 | +## Good Lines of Communication |
| 202 | + |
| 203 | +Of course, all the infrastructure in the world doesn't help if the |
| 204 | +people using it can't effectively communicate. To make sure this |
| 205 | +issue doesn't occur for Void, we have multiple forms of communication |
| 206 | +with different features. For real-time discussions and even some |
| 207 | +slower ones, we make use of IRC on Libera.chat. Though many |
| 208 | +communities appear to be moving away from synchronous text, we find |
| 209 | +that it works well for us. IRC is a great protocol that allows each |
| 210 | +member of the team to connect using the interface that they believe is |
| 211 | +the best for them, and allows our automated systems to connect as |
| 212 | +well. |
| 213 | + |
| 214 | +For conversations that need more time or are generally going to be |
| 215 | +longer, we make use of email or a group-scoped discussion on GitHub. |
| 216 | +This allows for threaded messaging and a topic that can persist for |
| 217 | +days or weeks if needed. Maintaining a long-running thread can help |
| 218 | +us tease apart complicated issues or ensure everyone's voice is heard. |
| 219 | +Long-time users of Void may remember our forum, which has since been |
| 220 | +supplanted by a subreddit and most recently GitHub Discussions. These |
| 221 | +threaded message boards are also examples of places that we converse |
| 222 | +and exchange status information, but in a more social context. |
| 223 | + |
| 224 | +For discussion that needs to pertain directly to our infrastructure, |
| 225 | +we open tickets against the infrastructure repo. This provides an |
| 226 | +extremely clear place to report issues, discuss fixes, and collate |
| 227 | +information relating to ongoing work. It also allows us to leverage |
| 228 | +GitHub's commit message parsing to automatically resolve a discussion |
| 229 | +thread once a fix has been applied by closing the issue. For really |
| 230 | +large changes, we can also use GitHub projects, though in recent years |
| 231 | +we have not made use of this particular organization system for |
| 232 | +issues (we use tags). |
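
As a rough illustration of the commit message parsing mentioned above, the sketch below looks for closing keywords followed by an issue reference. The keyword list (close, fix, resolve, and their inflections) follows GitHub's documented closing keywords, but the regular expression is only an approximation of what GitHub does, and the commit message and issue numbers are made up.

```python
import re

# Approximation of GitHub's closing keywords followed by "#<number>".
CLOSING_REFERENCE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def issues_closed_by(message):
    """Return the issue numbers a commit message would close."""
    return [int(number) for number in CLOSING_REFERENCE.findall(message)]


# Hypothetical commit message and issue numbers.
message = "nginx: renew proxy config\n\nFixes #12 and resolves #34."
print(issues_closed_by(message))
print(issues_closed_by("See #56 for background"))
```

A plain mention such as `See #56` links the commit to the issue without closing it, so the choice of wording in a commit message decides whether the discussion thread stays open.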
| 233 | + |
| 234 | +No matter where we converse, though, it's always important to make sure |
| 235 | +we converse clearly and concisely. Void's team speaks a variety of |
| 236 | +languages, though we mostly converse in English, which is not known for |
| 237 | +its intuitive clarity. When making hazardous changes, we often push |
| 238 | +changes to a central location and ask for explicit review of dangerous |
| 239 | +parts, and call out clearly what the concerns are and what requires |
| 240 | +review. In this way we ensure that all of Void's various services |
| 241 | +stay up, and our team members stay informed. |
| 242 | + |
| 243 | +--- |
| 244 | + |
| 245 | +This post was authored by `maldridge`, who runs most of the day-to-day |
| 246 | +operations of the Void fleet. On behalf of the entire Void team, I |
| 247 | +hope you have enjoyed this week's dive into the infrastructure that |
| 248 | +makes Void happen, and have learned some new things. We're always |
| 249 | +working to improve systems and make them easier to maintain or provide |
| 250 | +more useful features, so if you want to contribute, join us in IRC. |
| 251 | +Feel free to ask questions about this post or any of our others this |
| 252 | +week on [GitHub |
| 253 | +Discussions](https://github.com/void-linux/void-packages/discussions/45165) |
| 254 | +or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux). |