Commit 690922e

Add lab builder infrastructure for creating labs with containerlab on AWS (#105)
Add tooling to spin up EC2 instances with nested virtualization, deploy Juniper vJunos-router via containerlab, and collect show command outputs for lab-validation snapshots.

infra/ -- AWS EC2 lifecycle scripts:

- ec2-launch.sh: provisions metal/nested-virt instance with KVM
- ec2-setup.sh: installs Docker, containerlab, KVM tools; loads pre-built Docker images from S3
- ec2-teardown.sh, ec2-status.sh: instance management
- upload-image.sh, build-image.sh: image management for S3
- README.md: end-to-end lab creation documentation
- examples/: two-router eBGP topology with Junos configs

src/lab_builder/ -- Python library for lab operations:

- containerlab deploy/inspect/destroy wrappers
- SSH device interaction via netmiko
- Convergence polling (BGP, OSPF, ISIS)
- Show command collection with lab-validation file naming
- Snapshot packaging into snapshots/ directory layout

----

Prompt:

```
Add tooling so that open source contributors can create new labs using virtual router images and containerlab on AWS EC2, with scripts for instance lifecycle, image management, device interaction, and show command collection.
```
1 parent 5a9ebac commit 690922e

22 files changed

Lines changed: 2025 additions & 0 deletions

.gitignore

Lines changed: 5 additions & 0 deletions
```diff
@@ -146,3 +146,8 @@ networks/
 working/
 CLAUDE.md
 .claude/
+
+# VM images (large, user-provided, not distributed)
+*.qcow2
+*.vmdk
+*.ova
```

infra/README.md

Lines changed: 348 additions & 0 deletions
# Lab Creation Infrastructure

Tools for creating new lab-validation snapshots using
[containerlab](https://containerlab.dev/) on AWS EC2 with Juniper
vJunos-router virtual images.

## Overview

This directory contains everything needed to:

1. Provision an AWS EC2 instance with KVM, Docker, and containerlab
2. Deploy Juniper virtual router topologies
3. Collect device operational data (show command outputs)
4. Package the data as lab-validation snapshots
5. Validate against Batfish

The workflow is designed to be driven by Claude Code or run manually.

## Prerequisites

- **AWS CLI v2.34+** with configured credentials (`aws configure` or the
  `AWS_PROFILE` environment variable)
- **Juniper vJunos-router qcow2 image** -- a free download from
  [Juniper vJunos Labs](https://www.juniper.net/us/en/dm/vjunos-labs.html)
  (non-production use, no time limit)
- **Batfish** running locally for validation (Docker image or built from
  source)

## One-Time Setup

### 1. Download the Juniper Image

Download `vJunos-router-*.qcow2` from Juniper's website and save it to
`infra/images/` (this directory is gitignored):

```bash
ls infra/images/
# vJunos-router-25.4R1.12.qcow2
```

### 2. Upload to S3

```bash
cd infra
AWS_PROFILE=<profile> ./upload-image.sh
```

This creates an S3 bucket named `lab-validation-images-<account-id>` (if it
doesn't exist) and uploads all qcow2 files from `infra/images/`. The script
is idempotent -- it skips files that are already in S3.

### 3. First Launch -- Build the Docker Image

The first EC2 launch finds the qcow2 in S3 but no pre-built Docker image.
The setup script automatically builds the vrnetlab container and uploads the
result to S3 for future launches. This takes ~10 minutes total for the first
launch.

```bash
AWS_PROFILE=<profile> ./ec2-launch.sh
# Wait for setup to complete (~5-10 min)
ssh -i <key> ubuntu@<ip> 'cat /var/log/ec2-setup-complete'
```

### Subsequent Launches

Once the Docker image is in S3, new instances load it directly (~2-3 min):

```bash
AWS_PROFILE=<profile> ./ec2-launch.sh
```

## Creating a Lab

### Step 1: Design the Topology

Create a containerlab topology YAML file and Junos configs. Configs must be
in **curly-brace format** (not `set` format) because vrnetlab concatenates
them with its init.conf and loads them as a config disk.

See `infra/examples/` for working examples:

- `two-router-ebgp.clab.yml` -- minimal 2-router eBGP lab
- `evpn-type5/topology.clab.yml` -- 4-node EVPN Type 5 fabric

**Interface mapping**: containerlab `ethN` maps to Junos `ge-0/0/(N-1)`:

| containerlab | Junos             |
| ------------ | ----------------- |
| eth0         | management (auto) |
| eth1         | ge-0/0/0          |
| eth2         | ge-0/0/1          |
| eth3         | ge-0/0/2          |
| ethN         | ge-0/0/(N-1)      |
96+
### Step 2: Launch EC2 and Upload
97+
98+
```bash
99+
# Launch instance
100+
AWS_PROFILE=<profile> ./ec2-launch.sh
101+
102+
# Upload topology and configs
103+
IP=<from launch output>
104+
KEY=<from launch output>
105+
ssh -i $KEY ubuntu@$IP 'mkdir -p ~/lab/mylab/configs ~/lab/src'
106+
scp -i $KEY -r src/lab_builder ubuntu@$IP:~/lab/src/
107+
scp -i $KEY topology.clab.yml ubuntu@$IP:~/lab/mylab/
108+
scp -i $KEY configs/*.cfg ubuntu@$IP:~/lab/mylab/configs/
109+
```
110+
111+
### Step 3: Deploy Topology
112+
113+
```bash
114+
ssh -i $KEY ubuntu@$IP \
115+
'cd ~/lab/mylab && sudo containerlab deploy -t topology.clab.yml'
116+
```
117+
118+
vJunos-router takes 5-10 minutes to boot. Monitor with:
119+
120+
```bash
121+
ssh -i $KEY ubuntu@$IP 'sudo containerlab inspect -t ~/lab/mylab/topology.clab.yml'
122+
```
123+
124+
Wait until all nodes show `(healthy)`.
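Rather than re-running inspect by hand, the wait can be scripted with a generic polling loop. This is a sketch: the `wait_until` helper is hypothetical, and the `grep healthy` pattern assumes the health state appears verbatim in the inspect output on your containerlab version:

```shell
#!/bin/sh
# Poll a status command until it reports at least WANT matches of "healthy",
# or give up after TIMEOUT seconds. POLL_INTERVAL defaults to 15s.
wait_until() {
  cmd=$1; want=$2; deadline=$(( $(date +%s) + $3 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    got=$(eval "$cmd" | grep -c healthy)
    [ "$got" -ge "$want" ] && return 0
    sleep "${POLL_INTERVAL:-15}"
  done
  return 1
}

# Example (assumes $KEY/$IP from the launch output, 2 nodes, 15 min budget):
# wait_until "ssh -i $KEY ubuntu@$IP 'sudo containerlab inspect -t ~/lab/mylab/topology.clab.yml'" 2 900
```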
### Step 4: Health Check

Verify SSH access and routing protocol convergence:

```bash
ssh -i $KEY ubuntu@$IP \
  'cd ~/lab && PYTHONPATH=src python3 -m lab_builder health-check mylab/topology.clab.yml'
```

This waits for SSH on all nodes, then polls BGP/OSPF/ISIS neighbor status
until sessions are established.

### Step 5: Collect Show Commands

```bash
ssh -i $KEY ubuntu@$IP \
  'cd ~/lab && PYTHONPATH=src python3 -m lab_builder collect mylab/topology.clab.yml --output-dir /tmp/collected'
```

This collects 9 show commands per Junos node (see "Show Commands Collected"
below). Files are named to match the lab-validation parser conventions.
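The naming convention, as visible in the snapshot layout, replaces spaces with underscores and appends `.txt`. A sketch of that convention (the helper is hypothetical, not the actual lab_builder code):

```shell
#!/bin/sh
# Derive the lab-validation file name for a show command: spaces become
# underscores, the pipe character is kept, and ".txt" is appended.
cmd_to_filename() {
  echo "$1" | tr ' ' '_' | sed 's/$/.txt/'
}

cmd_to_filename 'show route | display json'
# -> show_route_|_display_json.txt
```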
### Step 6: Build Snapshot

```bash
ssh -i $KEY ubuntu@$IP \
  'cd ~/lab && PYTHONPATH=src python3 -m lab_builder build-snapshot mylab/topology.clab.yml --name junos_my_feature --collected-dir /tmp/collected --snapshots-dir /tmp/snapshots'
```

### Step 7: Download Snapshot

```bash
scp -i $KEY -r ubuntu@$IP:/tmp/snapshots/junos_my_feature snapshots/
```

### Step 8: Tear Down

```bash
AWS_PROFILE=<profile> ./ec2-teardown.sh
```

### Step 9: Validate Against Batfish

Run locally (requires Batfish running):

```bash
pytest lab_tests/test_labs.py --labname=junos_my_feature -v --tb=short
```

### Step 10: Triage Failures

For each test failure, determine the cause:

- **Parser bug**: the lab-validation parsers can't handle the device output.
  Fix the parser and add unit tests.
- **Batfish modeling discrepancy**: Batfish predicts different routes or
  interfaces than the real device. File a GitHub issue in batfish/batfish or
  batfish/lab-validation and add a sickbay entry.
- **Config error in the lab**: fix the config, re-deploy, re-collect.
- **Expected difference**: management interfaces, pseudo-interfaces, and other
  elements that Batfish intentionally doesn't model. Update the validator's
  exclusion logic.

## Iterating on a Lab

To modify configs without a full redeploy (saving the 5-10 minute boot time):

```bash
# Push new config to a node
ssh -i $KEY ubuntu@$IP \
  'cd ~/lab && PYTHONPATH=src python3 -m lab_builder push-config mylab/topology.clab.yml r1 /path/to/new-config.txt'

# Re-collect just that node
ssh -i $KEY ubuntu@$IP \
  'cd ~/lab && PYTHONPATH=src python3 -m lab_builder recollect mylab/topology.clab.yml r1 --output-dir /tmp/collected'
```

Then re-download and re-validate locally.

## Scripts Reference

| Script            | Where it runs | Purpose                                                   |
| ----------------- | ------------- | --------------------------------------------------------- |
| `ec2-launch.sh`   | Local         | Launch EC2 with KVM, Docker, containerlab, images from S3 |
| `ec2-status.sh`   | Local         | Show all lab-validation instances, warn about orphans     |
| `ec2-teardown.sh` | Local         | Terminate instance and clean up                           |
| `upload-image.sh` | Local         | Upload qcow2 images to S3 (idempotent)                    |
| `build-image.sh`  | EC2           | Build vrnetlab Docker image from qcow2, upload to S3      |
| `ec2-setup.sh`    | EC2 (auto)    | Bootstrap script, runs as user-data                       |

## lab_builder CLI Reference

Run on EC2 as `PYTHONPATH=src python3 -m lab_builder <command>`:

| Command                                                                  | Purpose                              |
| ------------------------------------------------------------------------ | ------------------------------------ |
| `deploy <topo.yml>`                                                      | Deploy containerlab topology         |
| `inspect <topo.yml>`                                                     | Show discovered nodes and IPs        |
| `health-check <topo.yml> [--timeout N]`                                  | Wait for SSH + routing convergence   |
| `collect <topo.yml> --output-dir DIR`                                    | Collect show commands from all nodes |
| `recollect <topo.yml> NODE --output-dir DIR`                             | Re-collect one node                  |
| `push-config <topo.yml> NODE FILE`                                       | Push set-format config and commit    |
| `build-snapshot <topo.yml> --name N --collected-dir D --snapshots-dir S` | Package as snapshot                  |
| `destroy <topo.yml>`                                                     | Tear down topology                   |

## Show Commands Collected

For Juniper (vJunos-router), these are collected automatically:

| Command                                   | Goes to           | Purpose              |
| ----------------------------------------- | ----------------- | -------------------- |
| `show configuration \| display set`       | `configs/<node>/` | Device config        |
| `show route \| display json`              | `show/<node>/`    | Main routing table   |
| `show route protocol bgp \| display json` | `show/<node>/`    | BGP routes           |
| `show interfaces \| display json`         | `show/<node>/`    | Interface properties |
| `show route instance \| display json`     | `show/<node>/`    | VRF info             |
| `show version \| display json`            | `show/<node>/`    | Software version     |
| `show bgp neighbor \| display json`       | `show/<node>/`    | BGP peer status      |
| `show ospf neighbor \| display json`      | `show/<node>/`    | OSPF status          |
| `show isis adjacency \| display json`     | `show/<node>/`    | ISIS status          |

## Snapshot Directory Structure

The output matches the lab-validation framework's expected layout:

```
snapshots/<name>/
├── configs/
│   ├── <node1>/
│   │   └── show_configuration_|_display_set.txt
│   └── <node2>/
│       └── show_configuration_|_display_set.txt
├── show/
│   ├── host_nos.txt          # {"node1": "junos", "node2": "junos"}
│   ├── <node1>/
│   │   ├── show_route_|_display_json.txt
│   │   ├── show_route_protocol_bgp_|_display_json.txt
│   │   ├── show_interfaces_|_display_json.txt
│   │   └── ...
│   └── <node2>/
│       └── ...
└── validation/               # optional
    └── sickbay.yaml          # expected failure entries
```
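If a snapshot needs to be assembled by hand (say, from data collected outside lab_builder), the skeleton can be staged directly. A sketch assuming two nodes named `r1` and `r2` and the snapshot name used in the walkthrough:

```shell
#!/bin/sh
# Create an empty snapshot skeleton in the lab-validation layout.
# NAME and the node list are placeholders for your lab.
NAME=junos_my_feature
mkdir -p "snapshots/$NAME/validation"
for node in r1 r2; do
  mkdir -p "snapshots/$NAME/configs/$node" "snapshots/$NAME/show/$node"
done
# host_nos.txt maps each node to its network OS
printf '{"r1": "junos", "r2": "junos"}\n' > "snapshots/$NAME/show/host_nos.txt"
```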
## EC2 Instance Details

### Instance Types

The default is **m8i.2xlarge** (8 vCPU, 32 GB RAM). The M8i family supports
nested virtualization via `--cpu-options NestedVirtualization=enabled`, which
vrnetlab needs to run VM-based router images inside Docker containers.

| Instance    | vCPU | RAM   | ~$/hr  | Routers |
| ----------- | ---- | ----- | ------ | ------- |
| m8i.xlarge  | 4    | 16 GB | ~$0.23 | 1-2     |
| m8i.2xlarge | 8    | 32 GB | ~$0.46 | 2-4     |
| m8i.4xlarge | 16   | 64 GB | ~$0.92 | 4-8     |

Each vJunos-router needs ~5 GB RAM and 4 vCPUs.
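That footprint gives a quick capacity estimate. A conservative sketch (the helper is hypothetical; it matches the low end of the table, since the upper figures assume some vCPU oversubscription):

```shell
#!/bin/sh
# Conservative router capacity for an instance: limited by whichever
# resource runs out first, RAM (~5 GB per router) or vCPUs (4 per router).
routers_fit() {
  vcpu=$1; ram_gb=$2
  by_ram=$(( ram_gb / 5 ))
  by_cpu=$(( vcpu / 4 ))
  if [ "$by_ram" -lt "$by_cpu" ]; then echo "$by_ram"; else echo "$by_cpu"; fi
}

routers_fit 8 32    # m8i.2xlarge -> 2
```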
For spot pricing (~70% cheaper), add `--spot`.

### ec2-launch.sh Options

```
--instance-type TYPE   EC2 instance type (default: m8i.2xlarge)
--key-name NAME        Use existing EC2 key pair (auto-created if omitted)
--timeout-hours N      Auto-terminate after N hours (default: 4)
--spot                 Request spot instance
```

### Cost Safety

- Auto-terminate alarm after 4 hours (configurable)
- `ec2-status.sh` warns about orphaned instances
- The launch script refuses to create multiple tracked instances

### What Gets Installed (ec2-setup.sh)

- Docker CE
- containerlab (from the netdevops apt repo)
- KVM/QEMU tools (qemu-kvm, libvirt)
- Python 3 with netmiko, paramiko, PyYAML, awscli
- Pre-built Docker images from S3 (or built from qcow2 as a fallback)

## Lab Design Principles

- **Simplicity**: the minimum number of routers that demonstrates the feature (2-3 is typical)
- **Feature isolation**: one feature per lab
- **Corner cases**: misconfigurations, asymmetric settings, boundary values
- **Reproducibility**: deterministic results; avoid time-dependent behavior
- **Documentation**: a README explaining what the lab tests and why

## Supported Vendor Profiles

| containerlab kind       | Vendor              | Default creds     | Boot time | KVM required |
| ----------------------- | ------------------- | ----------------- | --------- | ------------ |
| `juniper_vjunosrouter`  | Junos (MX)          | admin / admin@123 | 5-10 min  | Yes          |
| `juniper_vjunosevolved` | Junos Evolved (PTX) | admin / admin@123 | ~15 min   | Yes          |
| `juniper_crpd`          | Junos cRPD          | root / clab123    | ~1 min    | No           |

## Troubleshooting

**SSH connection refused after deploy**: vJunos-router takes 5-10 minutes to
boot. Wait for `(healthy)` in `containerlab inspect` output before attempting
SSH.

**Startup config not applied**: configs must be in curly-brace format, not
set format. vrnetlab concatenates the config with its init.conf and mounts
it as a USB config disk.

**KVM not available**: verify that the instance type supports nested
virtualization (M8i/C8i/R8i families) and that
`--cpu-options NestedVirtualization=enabled` was used at launch. Check with
`ls /dev/kvm` on the instance.

**Docker image not loaded**: check `docker images | grep vjunos`. If empty,
the S3 bucket may not have the Docker tarball. The setup script falls back to
building from qcow2 if available.
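The setup script's fallback order can be summarized as a small decision function. This is a sketch of the logic described above, not the script itself; the function name and labels are hypothetical:

```shell
#!/bin/sh
# Decide where the vJunos Docker image comes from, mirroring the setup
# script's fallback order: already loaded -> S3 tarball -> build from qcow2.
# Each argument is "yes" or "no".
image_source() {
  loaded=$1; s3_tar=$2; qcow2=$3
  if   [ "$loaded" = yes ]; then echo "already-loaded"
  elif [ "$s3_tar" = yes ]; then echo "docker-load-from-s3"
  elif [ "$qcow2"  = yes ]; then echo "build-with-vrnetlab"
  else echo "error: no image available"; return 1
  fi
}

image_source no yes yes   # -> docker-load-from-s3
```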
**Node names wrong in collected data**: containerlab names containers
`clab-<topology>-<node>`. lab_builder extracts node names by stripping
this prefix. If node names contain hyphens, verify with
`python3 -m lab_builder inspect topology.clab.yml`.
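The prefix stripping can be illustrated with a one-liner. A sketch (lab_builder's actual implementation may differ):

```shell
#!/bin/sh
# Recover the node name from a containerlab container name by stripping
# the "clab-<topology>-" prefix. Removing the full prefix, not just
# "clab-", is what keeps hyphenated topology names from corrupting names.
node_name() {
  container=$1; topology=$2
  echo "${container#clab-"$topology"-}"
}

node_name clab-mylab-r1 mylab                  # -> r1
node_name clab-evpn-type5-leaf-1 evpn-type5    # -> leaf-1
```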
