Commit cfd949e
_posts: Add day 3 of infra week
1 parent 701ee56 commit cfd949e
1 file changed: 224 additions & 0 deletions

---
title: "Infrastructure Week - Day 3: Are We Down?"
layout: post
---

# Are We Down?

So far we've looked at a relatively sizable fleet of machines
scattered across a number of different providers, technologies, and
management styles. We've then looked at the myriad of services
running on top of that fleet and the tools used to deploy and
maintain them. At its heart, Void is a large distributed system with
many parts working in concert to provide the set of features that end
users and maintainers engage with.

Like any machine, Void's infrastructure has wear items, parts that
require replacement, and components that break unexpectedly. When
this happens we need to identify the problem, determine the cause,
formulate a plan to return to service, and execute a set of workflows
to either permanently resolve the issue or temporarily bypass it to
buy time while we work on a more permanent fix.

Let's go through the different systems and services that allow us to
work out what's gone wrong, or what's still going right. We can
broadly divide these systems into two kinds of monitoring solutions.
In the first category we have logs, which are easy to understand
conceptually because they exist all around us on every system. In
the second we have metrics, which are a bit more abstract and usually
measure specific quantifiable qualities of a system or service. Void
makes use of both logs and metrics to determine how the fleet is
operating.

## Metrics

Metrics quantify some part of a system. You can think of metrics as a
wall of gauges and charts that measure how a system is working,
similar to the dashboard of a car that reports the speed of the
vehicle, the rotational speed of the engine, the coolant temperature,
and the fuel level. In Void's case, metrics refers to quantities like
available disk space, the number of requests per minute to a
webserver, the time spent processing a mirror sync, and other similar
items.

We collect these metrics to a central point on a dedicated machine
using Prometheus, which is a widely adopted metrics monitoring system.
Prometheus "scrapes" all our various sources of metrics by downloading
data from them over HTTP, parsing it, and adding it to a time-series
database. From this database we can then query for how a metric has
changed over time, in addition to whatever its current value is. On
the surface this is not that interesting, but being able to see how a
value has moved over time turns out to be extremely useful. Humans
are really good at pattern recognition, but machines can do it
tirelessly and at scale: we can have Prometheus predict trend lines,
compute rates and compare them, and line up a bunch of different
metrics on the same graph so we can compare what different values
were reading at the same time.

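To make that concrete, here are the kinds of queries this enables,
written in PromQL, Prometheus's query language. The metric and label
names below are illustrative rather than taken from our dashboards,
but the functions are standard:

```
# Per-instance request rate over the last 5 minutes (hypothetical webserver job)
sum by (instance) (rate(http_requests_total{job="webserver"}[5m]))

# Linear extrapolation of free disk space four hours into the future
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600)
```
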
The metrics that Prometheus fetches come from programs that are
collectively referred to as exporters. These exporters expose the
status information of the system they integrate with. Let's look at
the individual exporters that Void uses and some examples of the
metrics they provide.

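Wiring an exporter into Prometheus is just a matter of adding a
scrape job to its configuration. A minimal sketch, using made-up
hostnames and the `node_exporter`'s default port, looks something
like this:

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "builder.example.org:9100"  # hypothetical build host
          - "mirror.example.org:9100"   # hypothetical mirror host
```
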
### Node Exporter

Perhaps the most widely deployed exporter, the `node_exporter`
provides information about nodes. In this case a node is a server
somewhere, and the exporter provides a lot of general information
about how that server is performing. Since it is a generic exporter,
we get a great many metrics out of it, not all of which apply to the
Void fleet.

Some of the metrics that are exported include the status of the disk,
memory, CPU, and network, as well as more specialized information
such as the number of context switches and various kernel-level
values from `/proc`.

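To give a flavor of the output, a scrape of the `node_exporter`
returns plain text in the Prometheus exposition format. A few
representative series, with invented values, might read:

```
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 7.23e+10
node_memory_MemAvailable_bytes 1.61e+10
node_cpu_seconds_total{cpu="0",mode="idle"} 382190.44
node_context_switches_total 9.87e+09
```
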
### SSL Exporter

The SSL Exporter provides information about the various TLS
certificates in use across the fleet. It does this by probing the
remote services to retrieve the certificate and then extracting
values from it. Having these values allows us to alert on
certificates that are expiring soon and have failed to renew, as well
as to ensure that the target sites are reachable at all.

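The resulting alert can be a one-liner. Assuming the commonly used
`ssl_exporter`, which reports a certificate's expiry time as
`ssl_cert_not_after`, a query along these lines would flag anything
with less than two weeks of validity remaining:

```
ssl_cert_not_after - time() < 14 * 86400
```
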
### Compiler Cache Exporter

Void's build farm makes use of `ccache` to speed up rebuilds when a
build needs to be stopped and restarted. This is rarely needed,
because software has usually already had a test build by the time it
makes it to our systems. It matters, however, for large packages such
as chromium, Firefox, and boost, where a build can fail partway
through due to an out-of-space condition or memory exhaustion.
Having the compiler cache statistics allows us to determine whether
we're using the cache efficiently.

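The figure we care about here is essentially a hit rate. The metric
names below are hypothetical placeholders (the real exporter may
label things differently), but the shape of the query is the point:
hits divided by total lookups.

```
ccache_cache_hits_total / (ccache_cache_hits_total + ccache_cache_misses_total)
```
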
### Repository Exporter

The repository exporter is custom software that runs in two different
configurations for Void. In the first configuration it checks our
internal sync workflows and repository status. The metrics that are
reported include the last time a given repository was updated, how
long it took to copy from its origin builder to the shadow mirror,
and whether the repository is currently staging changes or the data
is fully consistent. This status information allows maintainers to
quickly and easily check whether a long-running build has fully
flushed through the system and the repositories are in a steady
state. It also provides a convenient way for us to catch problems
with stuck rsync jobs, where the rsync service may have hung
mid-copy.

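Because this exporter is our own software, the exact metric names are
internal and the one below is hypothetical, but a staleness check of
roughly this shape is how a stuck rsync surfaces: the time since the
last successful update simply keeps growing until it crosses a
threshold.

```
time() - repo_last_update_timestamp_seconds > 2 * 3600
```
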
In the second deployment the repo exporter looks not at Void's repos,
but at all of the mirrors. The information gathered in this case is
whether the remote repo is still synchronizing with the current
repodata, and how far behind the origin it is. The exporter can also
work out how long a given mirror takes to sync, if the remote mirror
has configured timer files in their sync workflow, which can help us
alert a mirror sponsor to an issue at their end.

## Logs

Logs in Void's infrastructure are conceptually not unlike the files on
disk in `/var/log` on a Void system. We have two primary systems that
store and retrieve logs within our fleet.

### Build Logs

The build system produces copious amounts of log output that we need
to retain effectively forever: if a problem appears in a more recent
version of a package, we want to be able to look back and see whether
it has always been present. Because of this, we use buildbot's
built-in log storage to keep a large volume of logs on disk with
locality to the build servers. These build logs aren't searchable,
nor are they structured. It's just the output of the build workflow
and xbps-src's status messages written to disk.

### Service Logs

Service logs are a bit more interesting, since these are logs that
come from the broad collection of tasks that run on Nomad and may
themselves be entirely ephemeral. The sync processes are a good
example of this workflow: the process exists only as long as the copy
runs, and then the task goes away, but we still need a way to
determine whether any faults occurred. To achieve this, we stream the
logs to Loki.

Loki is a complex distributed log processing system which we run in
all-in-one mode to reduce its operational overhead. The practical
benefit of Loki is that it handles the full-text searching and label
indexing of our structured log data. "Structured" simply refers to
the idea that the logs are more than just raw text: they carry some
organizational hierarchy such as tags, JSON fields, or a similar kind
of metadata that enables fast and efficient cataloging of text data.

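Querying Loki feels a lot like querying Prometheus, except that the
selector picks a log stream and the filters match text. A sketch in
LogQL, with hypothetical label names, that pulls error lines out of a
sync task could look like:

```
{nomad_job="mirror-sync"} |= "error"
```
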
## Using this Data

Just collecting metrics and logs is one thing; actually using them to
draw meaningful conclusions about the fleet and what it's doing is
another. We want to be able to visualize the data, but we also don't
want to constantly be watching graphs to determine when something is
wrong. We use different systems to access the data depending on
whether a human or a machine is going to watch it.

For human access, we make use of Grafana to display nice graphs and
dashboards. You can actually view all our public dashboards at
<https://grafana.voidlinux.org>, where you can see the mirror status,
the builder status, and various other at-a-glance views of our
systems. We use Grafana to quickly explore the data and query logs
when diagnosing a fault, because it's extremely well optimized for
this use case. We are also able to edit dashboards on the fly to
produce new views of the data, which can help explain or visualize a
fault.

For machines, we need some other way to observe the data. This kind
of workflow, where we want the machine to observe the data and raise
an alarm or alert if something is wrong, is actually built into
Prometheus. We just load a collection of alerting rules which tell
Prometheus what to look for in the pile of data at its disposal.

These rules look for things like predictions that the amount of free
disk space will reach zero within 4 hours, the system load staying
too high for too long, or a machine thrashing through too many
context switches. Since these rules use the same query language that
humans use to explore the data interactively, a one-off graph can
quickly become an alert if we decide an intermittent issue is
something we should keep an eye on long term. Firing alerts then
raise conditions that a human needs to validate and potentially
respond to, but notifying that human isn't something Prometheus does
itself.

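As an example of the disk prediction mentioned above, an alerting
rule along these lines fires when a linear extrapolation of recent
free-space samples says a filesystem will be full within four hours.
The thresholds and labels are illustrative rather than our exact
configuration:

```yaml
groups:
  - name: disk
    rules:
      - alert: FilesystemFullIn4Hours
        # Extrapolate the last hour of free-space samples four hours ahead
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "filesystem predicted to run out of space within 4 hours"
```
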
Fortunately for managing alerts, we can simply deploy the Prometheus
Alertmanager, and this is what we do. This dedicated piece of
software takes care of receiving, deduplicating and grouping, and
then forwarding alerts to other systems that actually summon a human
to do something about them. In larger organizations, an Alertmanager
configuration would also route different alerts to different teams of
people. Since Void is a relatively small organization, we just need
the general pool of people who can do something to be made aware.
There are lots of ways to do this, but the easiest is to just send
the alerts to IRC.

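For a setup like ours the Alertmanager configuration can stay very
small. A rough sketch, where the receiver name and webhook URL are
placeholders for whatever bridges the alerts into IRC:

```yaml
route:
  receiver: irc
  group_by: ["alertname", "instance"]
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: irc
    webhook_configs:
      - url: "http://localhost:8080/alerts"  # placeholder for the relay endpoint
```
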
Getting alerts into IRC involves an IRC bot, and fortunately Google
already had one publicly available that we could run. The alertrelay
bot connects to IRC on one end and to Alertmanager on the other, and
passes alerts into an IRC channel where all the maintainers are. We
can't acknowledge the alerts from IRC, but most of the time we're
just generally keeping an eye on things and making sure no part of
the fleet fails in a way that automatic recovery can't handle.

## Monitoring for Void - Altogether

Between metrics and logs we can paint a complete picture of what's
going on anywhere in the fleet and the status of key systems.
Whether it's a performance question or an outage in progress, the
tools at our disposal allow us to introspect systems without having
to log in directly to any particular machine.

---

This has been day three of Void's infrastructure week. Check back
tomorrow to learn about what we do when things go wrong, and how we
recover from failure scenarios. This post was authored by
`maldridge`, who runs most of the day-to-day operations of the Void
fleet. Feel free to ask questions on [GitHub
Discussions](https://github.com/void-linux/void-packages/discussions/45123)
or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux).
