---
title: "Infrastructure Week - Day 3: Are We Down?"
layout: post
---

# Are We Down?

So far we've looked at a relatively sizable fleet of machines
scattered across a number of different providers, technologies, and
management styles. We then looked at the myriad services running on
top of the fleet and the tools used to deploy and maintain them. At
its heart, Void is a large distributed system with many parts working
in concert to provide the set of features that end users and
maintainers engage with.

Like any machine, Void's infrastructure has wear items, parts that
require replacement, and components that break unexpectedly. When
this happens, we need to identify the problem, determine the cause,
formulate a plan to return to service, and execute a set of workflows
to either permanently resolve the issue or temporarily bypass it to
buy time while we work on a more permanent fix.

Let's go through the different systems and services that allow us to
work out what's gone wrong, or what's still going right. We can
broadly divide these systems into two kinds of monitoring solutions.
In the first category we have logs. Logs are easy to understand
conceptually because they exist all around us on every system.
Metrics are a bit more abstract, and usually measure specific
quantifiable qualities of a system or service. Void makes use of both
logs and metrics to determine how the fleet is operating.

## Metrics

Metrics quantify some part of a system. You can think of metrics as a
wall of gauges and charts that measure how a system is working, much
like the dashboard of a car that reports the speed of the vehicle, the
rotational speed of the engine, the coolant temperature, and the fuel
level. In Void's case, metrics refers to quantities like available
disk space, the number of requests per minute to a webserver, the time
spent processing a mirror sync, and other similar items.

We collect these metrics at a central point on a dedicated machine
using Prometheus, a widely adopted metrics monitoring system.
Prometheus "scrapes" all our various sources of metrics by downloading
data from them over HTTP, parsing it, and adding it to a time-series
database. From this database we can then query how a metric has
changed over time, in addition to whatever its current value is. On
the surface this may not sound that interesting, but it turns out to
be extremely useful. Humans are good at pattern recognition, but
machines can take it further: we can have Prometheus predict trend
lines, compute and compare rates, and line up many different metrics
on the same graph so we can see what different values were reading at
the same time.
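
To make the scraping model a little more concrete, here is a minimal
sketch of what a Prometheus scrape configuration can look like. The
job name, hostnames, and interval below are placeholders for
illustration, not our actual fleet configuration.

```yaml
# prometheus.yml fragment (illustrative sketch; hostnames and the
# scrape interval are placeholders, not our real configuration)
scrape_configs:
  - job_name: node
    scrape_interval: 60s
    static_configs:
      - targets:
          - "build01.example.org:9100"    # node_exporter's default port
          - "mirror01.example.org:9100"
```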

The metrics that Prometheus fetches come from programs that are
collectively referred to as exporters. These exporters export the
status information of the system they integrate with. Let's look at
the individual exporters that Void uses and some examples of the
metrics they provide.

### Node Exporter

Perhaps the most widely deployed exporter, the `node_exporter`
provides information about nodes. In this case a node is a server
somewhere, and the exporter provides a lot of general information
about how the server is performing. Since it is a generic exporter,
we get a great many metrics out of it, not all of which apply to the
Void fleet.

Some of the metrics that are exported include the status of the disks,
memory, CPU, and network, as well as more specialized information such
as the number of context switches and various kernel-level values from
`/proc`.
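
As an illustration of the kind of queries these metrics enable, here
is a sketch of two Prometheus recording rules built on standard
`node_exporter` metric names; the rule names and label filters are
illustrative rather than taken from our configuration.

```yaml
# Example recording rules over node_exporter metrics (the rule names
# and label filters are illustrative)
groups:
  - name: node-examples
    rules:
      # Fraction of each filesystem that is still free
      - record: instance:filesystem_free:ratio
        expr: node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}
      # Fraction of CPU time spent busy over the last 5 minutes
      - record: instance:cpu_busy:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```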

### SSL Exporter

The SSL exporter provides information about the various TLS
certificates in use across the fleet. It does this by probing the
remote services to retrieve the certificate and then extracting values
from it. Having these values allows us to alert on certificates that
are expiring soon and have failed to renew, as well as to ensure that
the target sites are reachable at all.
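
As an example of what can be done with this data, an expiry alert can
be written directly against the certificate timestamp the exporter
exposes. The sketch below assumes the `ssl_cert_not_after` metric from
the common ssl_exporter and an illustrative 14-day threshold; our
actual rules may differ.

```yaml
groups:
  - name: tls-examples
    rules:
      - alert: CertificateExpiringSoon
        # ssl_cert_not_after is a Unix timestamp; fire if the certificate
        # expires within 14 days (threshold chosen for illustration)
        expr: ssl_cert_not_after - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 14 days"
```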

### Compiler Cache Exporter

Void's build farm makes use of `ccache` to speed up rebuilds when a
build needs to be stopped and restarted. This is rarely useful for
most packages, since software has already had a test build by the time
it makes it to our systems. It matters, however, for large packages
such as Chromium, Firefox, and Boost, where a build can fail partway
through due to an out-of-space condition or memory exhaustion. Having
the compiler cache statistics allows us to determine whether we're
using the cache efficiently.
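
A typical question to ask of this data is what the cache hit rate
looks like. The metric names below are hypothetical, since they depend
on which ccache exporter is deployed; the shape of the calculation is
the point.

```yaml
groups:
  - name: ccache-examples
    rules:
      # Hypothetical metric names -- the exact names depend on the
      # ccache exporter in use; the hit-rate calculation is the point.
      - record: builder:ccache_hit:ratio
        expr: ccache_cache_hits_total / (ccache_cache_hits_total + ccache_cache_misses_total)
```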

### Repository Exporter

The repository exporter is custom software that runs in two different
configurations for Void. In the first configuration it checks our
internal sync workflows and repository status. The metrics that are
reported include the last time a given repository was updated, how
long it took to copy from its origin builder to the shadow mirror, and
whether the repository is currently staging changes or the data is
fully consistent. This status information allows maintainers to
quickly and easily check whether a long-running build has fully
flushed through the system and the repositories are in a steady state.
It also provides a convenient way for us to catch problems with stuck
rsync jobs, where the rsync service may have hung mid-copy.

In the second deployment the repo exporter looks not at Void's repos,
but at all of the mirrors. The information gathered in this case is
whether the remote repo is still synchronizing with the current
repodata, and how far behind the origin the remote repo is. The
exporter can also work out how long a given mirror takes to sync if
the remote mirror has configured timer files in its sync workflow,
which can help us alert a mirror sponsor to an issue at their end.
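
Since this exporter is our own software, the metric name below is
hypothetical, but it sketches how mirror staleness can be turned into
an alert; the 24-hour threshold is likewise just an example.

```yaml
groups:
  - name: mirror-examples
    rules:
      - alert: MirrorBehindOrigin
        # Hypothetical metric: a Unix timestamp of the mirror's last
        # successful sync; fire if it is more than a day old.
        expr: time() - mirror_last_sync_timestamp_seconds > 24 * 3600
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Mirror {{ $labels.instance }} has not synced in over 24 hours"
```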

## Logs

Logs in Void's infrastructure are conceptually not unlike the files on
disk in `/var/log` on a Void system. We have two primary systems that
store and retrieve logs within our fleet.

### Build Logs

The build system produces copious amounts of log output that we need
to retain effectively forever, so that when a problem occurs in a more
recent version of a package we can look back and see whether it has
always been present. Because of this, we use buildbot's built-in log
storage to store a large volume of logs on disk with locality to the
build servers. These build logs aren't searchable, nor are they
structured. It's just the output of the build workflow and xbps-src's
status messages written to disk.

### Service Logs

Service logs are a bit more interesting, since these are logs that
come from the broad collection of tasks that run on Nomad and may
themselves be entirely ephemeral. The sync processes are a good
example of this workflow: the process only exists as long as the copy
runs, and then the task goes away, but we still need a way to
determine whether any faults occurred. To achieve this, we stream the
logs to Loki.

Loki is a complex distributed log processing system which we run in
all-in-one mode to reduce its operational overhead. The practical
benefit of Loki is that it handles the full-text searching and label
indexing of our structured log data. Structured logging simply means
that the logs are more than raw text: they carry some organizational
hierarchy such as tags, JSON fields, or similar metadata that enables
fast and efficient cataloging of the text data.
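
To show what that labelling looks like in practice, here is a sketch
of a log shipping configuration using Promtail, one common agent for
pushing logs into Loki. Our actual pipeline may use different
components, and the URL, labels, and log path below are placeholders.

```yaml
# promtail.yml sketch (the Loki URL, labels, and log path are placeholders)
clients:
  - url: http://loki.example.org:3100/loki/api/v1/push
positions:
  filename: /tmp/positions.yaml
scrape_configs:
  - job_name: sync-jobs
    static_configs:
      - targets: [localhost]
        labels:
          job: mirror-sync          # label used later to query these logs
          __path__: /var/log/sync/*.log
```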

## Using this Data

Just collecting metrics and logs is one thing; actually using them to
draw meaningful conclusions about the fleet and what it's doing is
another. We want to be able to visualize the data, but we also don't
want to have to constantly watch graphs to determine when something is
wrong. We use different systems to access the data depending on
whether a human or a machine is going to watch it.

For human access, we make use of Grafana to display nice graphs and
dashboards. You can actually view all our public dashboards at
<https://grafana.voidlinux.org>, where you can see the mirror status,
the builder status, and various other at-a-glance views of our
systems. We use Grafana to quickly explore the data and query logs
when diagnosing a fault because it's extremely well suited to this use
case. We're also able to edit dashboards on the fly to produce new
views of the data, which can help explain or visualize a fault.

For machines, we need some other way to observe the data. This kind
of workflow, where we want the machine to observe the data and raise
an alarm or alert if something is wrong, is actually built into
Prometheus. We just load a collection of alerting rules which tell
Prometheus what to look for in the pile of data at its disposal.

These rules look for things like a prediction that the amount of free
disk space will reach zero within 4 hours, the system load being too
high for too long, or a machine thrashing through too many context
switches. Since these rules use the same query language that humans
use to interactively explore the data, a one-off graph can quickly
become an alert if we decide an intermittent issue is something we
should keep an eye on long term. These alerts then raise conditions
that a human needs to validate and potentially respond to, but that
isn't something Prometheus does itself.
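
The first of those rules is a good example of how `predict_linear`
turns a graph into an alert. The sketch below uses standard
`node_exporter` metrics; the lookback window, filesystem filter, and
severity are illustrative rather than copied from our rule set.

```yaml
groups:
  - name: capacity-examples
    rules:
      - alert: DiskWillFillIn4Hours
        # Extrapolate the last hour of free-space samples 4 hours ahead
        # and fire if the prediction crosses zero.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is predicted to run out of disk space within 4 hours"
```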

Fortunately, for managing alerts we can simply deploy the Prometheus
Alertmanager, and that is what we do. This dedicated software takes
care of receiving, deduplicating, and grouping alerts, and then
forwarding them to other systems that actually summon a human to do
something about the problem. In larger organizations, an Alertmanager
configuration would also route different alerts to different teams of
people. Since Void is a relatively small organization, we just need
the general pool of people who can do something to be made aware.
There are lots of ways to do this, but the easiest is to just send the
alerts to IRC.
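
For completeness, here is a sketch of what a minimal Alertmanager
routing configuration along these lines can look like. The receiver
URL stands in for whatever relays alerts into IRC, and the grouping
and repeat intervals are illustrative.

```yaml
# alertmanager.yml sketch (the webhook URL and intervals are placeholders)
route:
  receiver: irc
  group_by: ["alertname", "instance"]
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: irc
    webhook_configs:
      # A relay bot listens here and forwards the alert to an IRC channel
      - url: http://127.0.0.1:8080/alerts
```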

This involves an IRC bot, and fortunately Google already had one
publicly available that we could run. The alertrelay bot connects to
IRC on one end and Alertmanager on the other, and passes alerts to an
IRC channel where all the maintainers are. We can't acknowledge the
alerts from IRC, but most of the time we're just generally keeping an
eye on things and making sure no part of the fleet fails in a way that
automatic recovery can't handle.

## Monitoring for Void - Altogether

Between metrics and logs we can paint a complete picture of what's
going on anywhere in the fleet and the status of key systems. Whether
it's a performance question or an outage in progress, the tools at our
disposal allow us to introspect systems without having to log in
directly to any particular machine.

---

This has been day three of Void's infrastructure week. Check back
tomorrow to learn about what we do when things go wrong and how we
recover from failure scenarios. This post was authored by `maldridge`,
who runs most of the day-to-day operations of the Void fleet. Feel
free to ask questions on [GitHub
Discussions](https://github.com/void-linux/void-packages/discussions/45123)
or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux).