diff --git a/docs/docs/architecture.md b/docs/docs/architecture.md
index adba648..100fd3b 100644
--- a/docs/docs/architecture.md
+++ b/docs/docs/architecture.md
@@ -36,17 +36,21 @@ Runbook-style pages remain separate:
 
 The lab shape is:
 
-- The `UM760` and all three `MS-02 Ultra` systems run IncusOS directly on bare
-  metal.
-- Those four hosts form one Incus cluster.
+- The MINISFORUM N5 Pro NAS runs IncusOS directly on bare metal and is the
+  genesis Incus cluster member.
+- The `UM760` and all three `MS-02 Ultra` systems remain later IncusOS capacity
+  unless a future prototype removes or repurposes one of them.
 - Talos Linux runs as Incus VMs and provides the Kubernetes nodes.
-- The first platform Talos VM starts on the `UM760`; the final platform cluster
-  expands to three dual-role Talos nodes once the `MS-02` hosts join.
+- A disposable single-node Talos cluster runs inside Incus on the N5 Pro to
+  host the bootstrap controllers.
+- The first platform Talos VM is expected to start on the N5 Pro; the final
+  platform cluster expands to three dual-role Talos nodes once the `MS-02`
+  hosts join.
 - Cluster API Provider Incus, the Talos CAPI providers, and GitOps own normal
   Kubernetes cluster lifecycle.
-- VyOS remains the lab network boundary and provides DHCP, DNS, PXE support,
-  the real platform Kubernetes API TCP frontend, and BGP peering for Kubernetes
-  service VIPs.
+- VyOS remains the lab network boundary and provides DHCP, DNS, PXE support
+  when needed, the real platform Kubernetes API TCP frontend, and BGP peering
+  for Kubernetes service VIPs.
 - Cilium provides the Kubernetes datapath, LoadBalancer IP allocation, and BGP
   advertisements for service VIPs.
 - AWS anchors bootstrap identity, SOPS/KMS access, selected DNS material, and
@@ -61,11 +65,10 @@ This is the desired architecture, not a claim that every piece is already live.
 
 The following items are deliberately still prototype-validation work:
 
-- IncusOS `Operation` image generation and seeding for a final disk image.
-- Writing the chosen IncusOS image through Tinkerbell `image2disk` or
-  `oci2disk`.
-- Running the released `bootstrap-k0s` image on VyOS with the required
-  privileges, mounts, and host-network behavior.
+- IncusOS installation and seeding on the N5 Pro with the `128GB` OS SSD and
+  mirrored `1TB` NVMe data pool.
+- Running a disposable single-node Talos bootstrap cluster as an Incus VM on
+  the N5 Pro.
 - CAPN plus the Talos providers creating the desired Talos VM shape.
 - Exact VLANs, static addresses, ASNs, DNS records, and service VIP pools.
 
@@ -81,3 +84,4 @@ The v1 architecture does not include:
 - Proxmox Backup Server as the default VM backup system.
 - Shared Incus VM storage, Ceph, LINSTOR, or Incus OVN in v1.
 - A manual USB path as the preferred host bootstrap workflow.
+- A VyOS-hosted `k0s` cluster as the active bootstrap strategy.
diff --git a/docs/docs/architecture/bootstrap-and-lifecycle.md b/docs/docs/architecture/bootstrap-and-lifecycle.md
index c884964..301d190 100644
--- a/docs/docs/architecture/bootstrap-and-lifecycle.md
+++ b/docs/docs/architecture/bootstrap-and-lifecycle.md
@@ -5,86 +5,78 @@ description: Bootstrap flow, CAPI pivot, and cluster lifecycle ownership.
 
 # Bootstrap and Cluster Lifecycle
 
-Bootstrap uses the same core tools that should own long-term lifecycle:
-Tinkerbell for bare-metal provisioning, CAPI for cluster lifecycle, CAPN for
-Incus infrastructure, and the Talos providers for Talos Kubernetes nodes.
+Bootstrap uses the same core tools that should own long-term lifecycle: CAPI
+for cluster lifecycle, CAPN for Incus infrastructure, and the Talos providers
+for Talos Kubernetes nodes.
+
+Tinkerbell remains relevant for later bare-metal provisioning, but it is no
+longer the first-host bootstrap anchor.
 
 The design deliberately avoids a separate hand-built host install path that
 would be thrown away after day one.
 
 ## Prototype First
 
-Before touching the real `UM760`, prove the risky assumptions locally:
+Before depending on the real N5 Pro path, prove the risky assumptions in the
+smallest useful slice:
 
-- generate or download a seeded IncusOS USB/IMG `Operation` image
-- write it to a VM's only disk
-- boot that disk as the steady-state IncusOS host
-- confirm Incus initialization, trusted client certificate access, network
-  reachability, and API access
-- exercise CAPN plus the Talos providers against Incus
+- install and seed IncusOS on the N5 Pro
+- confirm the `128GB` OS device and mirrored `1TB` NVMe ZFS data pool shape
+- confirm Incus initialization, trusted client certificate access, OIDC
+  readiness, network reachability, and API access
+- run a single-node Talos VM inside Incus with control-plane scheduling enabled
+- exercise CAPN plus the Talos providers from that bootstrap cluster
 
 This prototype should be disposable. Its purpose is to learn which parts of the
 new path are real before producing exact runbooks.
 
-## Temporary Bootstrap Cluster
-
-The first real bootstrap cluster is a disposable single-node `k0s` cluster on
-VyOS, likely as a host-networked container.
-
-The implementation boundary is:
+## Genesis Bootstrap Cluster
 
-- `platform/bootstrap/k0s` owns the released bootstrap image
-- `infra/network/vyos` pins and runs that released image on the router
-- `infra/compute/incusos` owns the declarative IncusOS operation-image inputs
-  and rendered Tinkerbell `Hardware`, `Template`, and `Workflow` objects
+The first real bootstrap cluster is a disposable single-node Talos Linux
+cluster running as an Incus VM on the N5 Pro NAS.
 
-It exists only to run:
+It exists only to run the controllers needed to create and hand off the real
+platform cluster:
 
-- Tinkerbell
 - Cluster API
 - Cluster API Provider Incus
 - Talos bootstrap provider
 - Talos control-plane provider
 
-The bootstrap image itself is a reusable artifact. Host-specific networking and
-runtime values are injected by `infra` at deploy time rather than baked into
-the image.
+The bootstrap cluster is not the platform cluster. It should be easy to delete
+after CAPI ownership moves into the platform cluster.
+
+The old VyOS-hosted `bootstrap-k0s` image path is historical context from the
+abandoned UM760-first direction. Do not build new NAS bootstrap work around
+that image unless a future cleanup session deliberately reactivates it.
 
 ## Host Bootstrap Flow
 
 The intended host bootstrap sequence is:
 
-1. Start the released `bootstrap-k0s` image on VyOS through the `infra`
-   runtime wiring.
-2. Let that image bring up Tinkerbell and the CAPI providers in the temporary
-   bootstrap cluster.
-3. Generate a seeded IncusOS `Operation` image for the `UM760` from the
-   declarative inputs under `infra/compute/incusos/`.
-4. Use Tinkerbell and HookOS to write that final-disk image directly to the
-   internal `UM760` disk through `image2disk` or `oci2disk`.
-5. Boot the `UM760` into IncusOS as the steady-state host OS.
-6. Initialize Incus on the `UM760` with the first-node defaults needed for the
-   final cluster.
-7. Enable Incus clustering.
-8. Import or publish the Talos nocloud image needed by CAPN.
-9. Use CAPN and the Talos providers to create the first platform Talos VM on
-   the `UM760`.
-
-The same Tinkerbell path provisions the `MS-02` hosts later. Joining nodes must
-use IncusOS seed settings appropriate for joining the existing Incus cluster,
-not for creating independent local Incus defaults.
-
-The first supported rendered host flow is `um760`. Joiner-specific Tinkerbell
-templates and workflows for the `MS-02` nodes come later, after the first-node
-path is proven.
-
-A normal IncusOS `Installation` image must not be treated as equivalent to the
-`Operation` image for the single-disk `UM760` path. The bootstrap assumption is
-that the selected image is already a bootable final-disk artifact.
+1. Install IncusOS on the N5 Pro using the `128GB` NVMe device as the OS disk.
+2. Create the mirrored ZFS data pool from the two `1TB` WD_Black NVMe devices.
+3. Initialize Incus with the first-node defaults needed for the final cluster.
+4. Enable Incus clustering so later hosts can join.
+5. Import or publish the Talos nocloud image needed by CAPN.
+6. Create a disposable single-node Talos VM inside Incus on the N5 Pro.
+7. Bootstrap Talos once and enable scheduling on the control-plane node for
+   bootstrap workloads.
+8. Install CAPI, CAPN, and the Talos providers into the bootstrap cluster.
+9. Use those providers to create the first real platform Talos VM.
+
+Tinkerbell can still provision later bare-metal hosts if that remains the best
+path. Joining nodes must use IncusOS seed settings appropriate for joining the
+existing Incus cluster, not for creating independent local Incus defaults.
+
+The first supported host flow is the N5 Pro. Joiner-specific Tinkerbell
+templates, workflows, or replacement tooling for the `MS-02` and `UM760` nodes
+come later, after the genesis path is proven.
 
 ## Platform Cluster Bring-Up
 
-The platform cluster starts as one Talos VM on the `UM760`.
+The platform cluster starts as one Talos VM created from the disposable Talos
+bootstrap cluster on the N5 Pro.
 
 Day-0 Talos configuration installs only the substrate needed to make the cluster
 reachable and let GitOps take over:
@@ -110,13 +102,6 @@ The gitops repo owns per-cluster version selection and cluster-local desired
 state. Infra and CAPI templates own only immutable day-0 references needed for
 fresh installs and reinstalls.
 
-For the temporary bootstrap plane specifically:
-
-- `platform` publishes `ghcr.io/gilmanlab/platform/bootstrap-k0s:`
-- `infra/network/vyos` consumes that exact released tag on the router
-- `infra/compute/incusos` owns the bootstrap image inputs and the rendered
-  Tinkerbell objects applied to that cluster
-
 ## Bootstrap/Core Artifact Contract
 
 The `platform/bootstrap/` subtree carries both Talos/CAPI day-0 substrate
@@ -143,12 +128,6 @@ platform/
     │   └── render/
     │       ├── bootstrap.yaml
     │       └── full.yaml
-    ├── k0s/
-    │   ├── Dockerfile
-    │   ├── VERSION
-    │   ├── k0s.yaml
-    │   ├── bootstrap-k0s.sh
-    │   └── manifests/
     └── kro/
         ├── Chart.yaml
         ├── Chart.lock
@@ -174,9 +153,9 @@ The contract for each component is:
 `kro` has no Talos/CAPI bootstrap variant, so it does not need
 `bootstrap-values.yaml` or `render/bootstrap.yaml`.
 
-`bootstrap/k0s` is the exception to the chart-wrapper pattern. It is a released
-bootstrap image directory, not a wrapper chart. It is consumed by `infra` as a
-GHCR image tag rather than as a raw manifest or OCI Helm chart.
+`bootstrap/k0s` exists in the repository from the abandoned VyOS-hosted
+bootstrap path. It is not part of the active NAS-first bootstrap contract.
+Leave deletion, archiving, or repurposing to a focused cleanup session.
 
 The release and selection rules are:
 
@@ -184,7 +163,6 @@ The release and selection rules are:
 - Re-render `render/bootstrap.yaml` and `render/full.yaml`.
 - Publish the wrapper chart as an OCI artifact under a component-scoped release
   tag.
-- Publish `bootstrap/k0s` as `ghcr.io/gilmanlab/platform/bootstrap-k0s:`.
 - Select versions per cluster through `gitops/clusters//bootstrap.yaml`.
 - Reference raw Talos/CAPI artifacts by immutable commit SHA, not floating tags.
 - Keep the SHA referenced by Talos/CAPI aligned with the released artifact
@@ -204,10 +182,10 @@ After the `MS-02` hosts join the Incus cluster:
 1. Add two more Talos VMs on the `MS-02` tier.
 2. Run the platform cluster as three dual-role control-plane/worker nodes.
 3. Install the CAPI providers into the platform cluster.
-4. Use `clusterctl move` to transfer ownership from the temporary bootstrap
+4. Use `clusterctl move` to transfer ownership from the disposable bootstrap
    cluster to the platform cluster.
-5. Remove the temporary VyOS bootstrap cluster and any temporary PXE behavior
-   once the platform cluster and Incus cluster are healthy.
+5. Remove the disposable Talos bootstrap VM once the platform cluster and Incus
+   cluster are healthy.
 
 `clusterctl move` is a bootstrap pivot mechanism. It is not a backup or disaster
 recovery model.
@@ -225,14 +203,15 @@ syncs cluster-core and platform API state to them after they exist.
 
 The bootstrap path is not complete until these are proven:
 
-- IncusOS `Operation` image generation and seeding for first-node and
-  joining-node modes
-- Tinkerbell image writing from HookOS to the selected disks
-- VyOS-hosted `bootstrap-k0s` runtime stability with the required privileges,
-  mounts, and host-network behavior
+- IncusOS installation and first-node seeding on the N5 Pro
+- N5 Pro mirrored NVMe ZFS pool behavior under IncusOS and Incus
+- single-node Talos bootstrap VM behavior inside Incus, including
+  control-plane workload scheduling
 - CAPN plus Talos providers creating Talos VMs with the desired boot mode,
   network attachment, storage pool, and endpoint model
-- `clusterctl move` from the temporary cluster to the platform cluster
+- `clusterctl move` from the disposable Talos bootstrap cluster to the platform
+  cluster
+- later Tinkerbell or equivalent provisioning for joining bare-metal hosts
 
 ## References
 
@@ -240,4 +219,5 @@ The bootstrap path is not complete until these are proven:
 - [Tinkerbell image2disk](https://github.com/tinkerbell/actions/tree/main/image2disk)
 - [Tinkerbell oci2disk](https://github.com/tinkerbell/actions/tree/main/oci2disk)
 - [CAPN Talos template](https://capn.linuxcontainers.org/reference/templates/talos.html)
-- [VyOS containers](https://docs.vyos.io/en/latest/configuration/container/index.html)
+- [Talos control plane](https://docs.siderolabs.com/talos/v1.12/learn-more/control-plane)
+- [Talos control-plane scheduling](https://docs.siderolabs.com/talos/v1.12/deploy-and-manage-workloads/workers-on-controlplane)
diff --git a/docs/docs/architecture/hosts-and-substrate.md b/docs/docs/architecture/hosts-and-substrate.md
index 43ddcdf..0450204 100644
--- a/docs/docs/architecture/hosts-and-substrate.md
+++ b/docs/docs/architecture/hosts-and-substrate.md
@@ -17,24 +17,33 @@ Linux host management layer.
 ### VyOS Router
 
 The `VP6630` runs VyOS and remains the lab network appliance. It owns routing,
-DHCP, DNS entrypoints, PXE coordination, bootstrap support, the real platform
+DHCP, DNS entrypoints, PXE reachability when needed, the real platform
 Kubernetes API TCP frontend, and BGP peering with Cilium.
 
-VyOS is intentionally part of the bootstrap path. The lab already depends on it
-for network reachability, so using it for API fronting and temporary bootstrap
-coordination keeps the early system small.
+VyOS is intentionally part of the network bootstrap path. The lab already
+depends on it for network reachability, but it no longer hosts the active
+temporary Kubernetes bootstrap cluster.
 
-In the current implementation split, VyOS consumes the released
-`ghcr.io/gilmanlab/platform/bootstrap-k0s:` image through `infra` and
-also serves the IncusOS operation image on `LAB_PROV` during host bootstrap.
+### N5 Pro NAS
+
+The MINISFORUM N5 Pro runs IncusOS as the first permanent host and the genesis
+Incus cluster member.
+
+It replaces the Synology as the NAS and in-lab storage boundary. The `128GB`
+NVMe device is reserved for IncusOS. The two `1TB` WD_Black SN7100 NVMe
+devices form the initial mirrored ZFS data pool for Incus and NAS duties.
+
+During bootstrap, Incus on the N5 Pro hosts a disposable single-node Talos
+cluster. That cluster exists to run the bootstrap controllers before ownership
+moves into the real platform cluster.
 
 ### UM760
 
-The `UM760` runs IncusOS as the first permanent host and the first Incus
-cluster member.
+The `UM760` remains available as later IncusOS capacity.
 
-During bootstrap it hosts the first platform Talos VM. After the `MS-02` hosts
-join, it remains useful as bootstrap, recovery, and light-duty capacity.
+It is no longer the genesis host. Any future `UM760` role should be proven
+after the N5 Pro path works, rather than carried forward from the abandoned
+UM760-first bootstrap design.
 
 ### MS-02 Ultra Hosts
 
@@ -46,18 +55,18 @@ this tier.
 
 ## Incus Cluster
 
-The final Incus cluster spans:
+The intended Incus cluster starts with the N5 Pro and later expands to:
 
-- `um760`
 - `ms02-1`
 - `ms02-2`
 - `ms02-3`
+- `um760`, if it remains useful after the NAS-first path is proven
 
 The cluster is intentionally heterogeneous. Incus cluster groups should be used
-only as lightweight placement and CPU-boundary labels, for example `amd-um760`
-and `intel-ms02`. Kubernetes remains the main workload scheduler.
+only as lightweight placement and CPU-boundary labels, for example `amd-nas`,
+`amd-um760`, and `intel-ms02`. Kubernetes remains the main workload scheduler.
 
-The `UM760` is not a disposable bootstrap host. It is the first durable Incus
+The N5 Pro is not a disposable bootstrap host. It is the first durable Incus
 cluster member.
 
 ## VM Substrate
@@ -76,6 +85,9 @@ non-Talos VM that holds unique state must bring its own backup story.
 
 Use local ZFS-backed Incus storage on each IncusOS host in v1.
 
+The N5 Pro starts with a mirrored ZFS pool over the two `1TB` NVMe drives. The
+small `128GB` NVMe device is for the OS only.
+
 An Incus cluster is a management cluster, not automatically a replicated storage
 system. For most Incus storage drivers, volumes remain on the member where they
 are created. That is acceptable for Talos VM disks because the
@@ -86,10 +98,11 @@ Do not introduce shared VM storage in v1:
 - no Ceph
 - no LINSTOR
 - no Incus OVN/storage architecture just for VM mobility
-- no NAS-backed default VM disks
+- no remote NAS-backed default VM disks for every host
 
-The NAS remains a durable backup and artifact boundary, not the default block
-storage path for every VM.
+The N5 Pro NAS is the durable backup and artifact boundary. Its local Incus
+pool may host local VMs, but it is not a remote block-storage platform for
+every VM in the lab.
 
 ## Network Attachment
 
@@ -107,10 +120,11 @@ allowing the underlay to remain visible to VyOS.
 
 Before treating the host substrate as implementation reference material, prove:
 
-- the selected IncusOS image mode is a correct final-disk artifact for the
-  single-disk `UM760`
-- first-node and joining-node IncusOS seeds apply the right default Incus
-  settings
+- IncusOS installs cleanly on the N5 Pro with the OS/data disk split described
+  above
+- the first-node Incus seed applies the right default Incus settings
+- the mirrored NVMe ZFS pool is exposed to Incus the way the bootstrap VM path
+  expects
 - joining nodes do not create local networks or storage pools that block cluster
   join
 - CAPN can place Talos VMs against the intended Incus profiles and storage pools
diff --git a/docs/docs/architecture/keycloak-runtime.md b/docs/docs/architecture/keycloak-runtime.md
index 1954b9b..8fe2a90 100644
--- a/docs/docs/architecture/keycloak-runtime.md
+++ b/docs/docs/architecture/keycloak-runtime.md
@@ -133,8 +133,8 @@ is already the durable store for it.
 
 Backups are written nightly to an S3 bucket in the `lab` account. The bucket
 uses SSE-KMS and object lock or versioning so corruptions cannot silently
-overwrite known-good backups. The Synology NAS pulls a secondary copy on its
-own schedule.
+overwrite known-good backups. The N5 Pro NAS pulls a secondary copy on its own
+schedule.
 
 Retention contract:
 
diff --git a/docs/docs/architecture/state-and-recovery.md b/docs/docs/architecture/state-and-recovery.md
index 415e44b..afb6400 100644
--- a/docs/docs/architecture/state-and-recovery.md
+++ b/docs/docs/architecture/state-and-recovery.md
@@ -25,8 +25,8 @@ The lab accumulates state in these tiers:
 - RouterOS configuration history, covered by
   [Network Device Backups](../network-device-backups.md).
 
-The NAS is the main in-lab durable backup and artifact boundary for state that
-must survive host rebuilds.
+The N5 Pro NAS is the main in-lab durable backup and artifact boundary for
+state that must survive host rebuilds.
 
 ## IncusOS Hosts
 
@@ -63,16 +63,17 @@ If a non-Talos VM holds unique state, use an explicit VM-level backup path such
 as Incus snapshots, Incus exports, or copying to another backup Incus server. Do
 not build a VM backup platform before a real non-Talos VM requirement exists.
 
-## Router Boundary Data
+## Boundary Data
 
-VyOS-hosted services that are bootstrap dependencies must have direct backup
-paths because the platform cluster may be unavailable during recovery.
+Boundary services that are bootstrap dependencies must have direct backup paths
+because the platform cluster may be unavailable during recovery.
 
 Examples include:
 
 - local DNS zonefile state
 - Tailscale machine identity where applicable
-- temporary bootstrap artifacts during an active bootstrap
+- temporary bootstrap artifacts on the NAS, router, or operator machine during
+  an active bootstrap
 
 RouterOS device configuration history is operational evidence and reviewable
 change history. It is intentionally handled by the network-device backup flow
diff --git a/docs/docs/hardware.md b/docs/docs/hardware.md
index a1c1076..5b06070 100644
--- a/docs/docs/hardware.md
+++ b/docs/docs/hardware.md
@@ -1,19 +1,21 @@
 ---
 title: Hardware Reference
-description: Physical inventory and identifiers carried forward from the old lab.
+description: Physical inventory, identifiers, and current hardware role notes.
 ---
 
 # Hardware Reference
 
-This document is a rough inventory of physical lab equipment referenced in the
-old lab repository.
+This document is a rough inventory of physical lab equipment referenced by the
+lab.
 
 It is intentionally descriptive rather than prescriptive:
 
-- It captures what hardware appears to exist in the old repo.
-- It records identifiers, model references, concrete specs, and network details where they were explicitly documented.
-- It does not imply current or future architecture.
-- The only operational detail retained here is that the `VP6630` is the lab router running VyOS.
+- It captures hardware carried forward from the old repo plus confirmed newer
+  replacements.
+- It records identifiers, model references, concrete specs, and network details
+  where they were explicitly documented.
+- Current architecture roles are recorded only where explicitly decided,
+  currently the `VP6630` router role and the N5 Pro NAS genesis role.
 
 ## Inventory
 
@@ -81,22 +83,29 @@ It is intentionally descriptive rather than prescriptive:
   - `ms02-node2` / `ms02-2` -> `10.10.10.12`
   - `ms02-node3` / `ms02-3` -> `10.10.10.13`
 
-### Synology DiskStation DS923+
+### MINISFORUM N5 Pro NAS
 
 - Quantity: `1`
-- Repo identifiers: `Synology NAS`, `NAS`, `nas.lab.local`
-- Hardware details referenced in repo:
-  - `DiskStation DS923+`
-  - One ADR references `32GB RAM` shared with DSM
-  - User-confirmed `10GbE` PCIe add-in NIC installed
+- Repo identifiers: `NAS`, `N5 Pro`, `nas.lab.local`
+- Current role:
+  - Replaces the Synology NAS.
+  - Runs IncusOS as the genesis Incus cluster node.
+  - Hosts the disposable single-node Talos bootstrap cluster inside Incus.
+- Hardware details:
+  - MINISFORUM N5 Pro 5-bay desktop AI NAS
+  - AMD Ryzen AI 9 HX PRO 370, `12c/24t`
+  - Crucial `32GB` DDR5 SODIMM kit, `2x16GB`
+  - `128GB` NVMe SSD reserved for the OS
+  - `2x WD_Black SN7100 1TB` NVMe SSDs in the remaining NVMe slots
+  - Planned mirrored ZFS data pool across the two `1TB` NVMe drives
+  - `1x10GbE` and `1x5GbE` onboard networking
+  - `2xUSB4`, HDMI, and OCuLink available but not currently architecture
+    drivers
 - Network details found in repo:
   - No fixed IP found
   - Example hostname reference: `nas.lab.local`
-  - NFS paths referenced in repo:
-    - `/volume1/images`
-    - `/volume1/backups`
-    - `/volume1/media`
-    - `/volume1/iso`
+- Current network note:
+  - Connected to the 10GbE switch on the last port over copper.
 
 ### MikroTik CCR2004
 
@@ -148,7 +157,7 @@ The old repo contains a few places where terminology or specs drifted over time.
 - `MS-02` is confirmed by AMT screenshots as `MS-02-Ultra`.
 - The switch model is `CRS309-1G-8S+IN`; previous `CRS310-8G+2S+IN` references
   were incorrect.
-- The `DS923+` uses an installed `10GbE` PCIe NIC rather than native `10GbE`.
+- The N5 Pro replaces the Synology as the NAS and storage boundary.
 
 ## Primary Sources In The Old Repo