30 changes: 17 additions & 13 deletions docs/docs/architecture.md
Original file line number Diff line number Diff line change
@@ -36,17 +36,21 @@ Runbook-style pages remain separate:

The lab shape is:

- The `UM760` and all three `MS-02 Ultra` systems run IncusOS directly on bare
metal.
- Those four hosts form one Incus cluster.
- The MINISFORUM N5 Pro NAS runs IncusOS directly on bare metal and is the
genesis Incus cluster member.
- The `UM760` and all three `MS-02 Ultra` systems remain later IncusOS capacity
unless a future prototype removes or repurposes one of them.
- Talos Linux runs as Incus VMs and provides the Kubernetes nodes.
- The first platform Talos VM starts on the `UM760`; the final platform cluster
expands to three dual-role Talos nodes once the `MS-02` hosts join.
- A disposable single-node Talos cluster runs inside Incus on the N5 Pro to
host the bootstrap controllers.
- The first platform Talos VM is expected to start on the N5 Pro; the final
platform cluster expands to three dual-role Talos nodes once the `MS-02`
hosts join.
- Cluster API Provider Incus, the Talos CAPI providers, and GitOps own normal
Kubernetes cluster lifecycle.
- VyOS remains the lab network boundary and provides DHCP, DNS, PXE support,
the real platform Kubernetes API TCP frontend, and BGP peering for Kubernetes
service VIPs.
- VyOS remains the lab network boundary and provides DHCP, DNS, PXE support
when needed, the real platform Kubernetes API TCP frontend, and BGP peering
for Kubernetes service VIPs.
- Cilium provides the Kubernetes datapath, LoadBalancer IP allocation, and BGP
advertisements for service VIPs.
- AWS anchors bootstrap identity, SOPS/KMS access, selected DNS material, and
@@ -61,11 +65,10 @@ This is the desired architecture, not a claim that every piece is already live.

The following items are deliberately still prototype-validation work:

- IncusOS `Operation` image generation and seeding for a final disk image.
- Writing the chosen IncusOS image through Tinkerbell `image2disk` or
`oci2disk`.
- Running the released `bootstrap-k0s` image on VyOS with the required
privileges, mounts, and host-network behavior.
- IncusOS installation and seeding on the N5 Pro with the `128GB` OS SSD and
mirrored `1TB` NVMe data pool.
- Running a disposable single-node Talos bootstrap cluster as an Incus VM on
the N5 Pro.
- CAPN plus the Talos providers creating the desired Talos VM shape.
- Exact VLANs, static addresses, ASNs, DNS records, and service VIP pools.

@@ -81,3 +84,4 @@ The v1 architecture does not include:
- Proxmox Backup Server as the default VM backup system.
- Shared Incus VM storage, Ceph, LINSTOR, or Incus OVN in v1.
- A manual USB path as the preferred host bootstrap workflow.
- A VyOS-hosted `k0s` cluster as the active bootstrap strategy.
140 changes: 60 additions & 80 deletions docs/docs/architecture/bootstrap-and-lifecycle.md
@@ -5,86 +5,78 @@ description: Bootstrap flow, CAPI pivot, and cluster lifecycle ownership.

# Bootstrap and Cluster Lifecycle

Bootstrap uses the same core tools that should own long-term lifecycle:
Tinkerbell for bare-metal provisioning, CAPI for cluster lifecycle, CAPN for
Incus infrastructure, and the Talos providers for Talos Kubernetes nodes.
Bootstrap uses the same core tools that should own long-term lifecycle: CAPI
for cluster lifecycle, CAPN for Incus infrastructure, and the Talos providers
for Talos Kubernetes nodes.

Tinkerbell remains relevant for later bare-metal provisioning, but it is no
longer the first-host bootstrap anchor.

The design deliberately avoids a separate hand-built host install path that
would be thrown away after day one.

## Prototype First

Before touching the real `UM760`, prove the risky assumptions locally:
Before depending on the real N5 Pro path, prove the risky assumptions in the
smallest useful slice:

- generate or download a seeded IncusOS USB/IMG `Operation` image
- write it to a VM's only disk
- boot that disk as the steady-state IncusOS host
- confirm Incus initialization, trusted client certificate access, network
reachability, and API access
- exercise CAPN plus the Talos providers against Incus
- install and seed IncusOS on the N5 Pro
- confirm the `128GB` OS device and mirrored `1TB` NVMe ZFS data pool shape
- confirm Incus initialization, trusted client certificate access, OIDC
readiness, network reachability, and API access
- run a single-node Talos VM inside Incus with control-plane scheduling enabled
- exercise CAPN plus the Talos providers from that bootstrap cluster

This prototype should be disposable. Its purpose is to learn which parts of the
new path are real before producing exact runbooks.

## Temporary Bootstrap Cluster

The first real bootstrap cluster is a disposable single-node `k0s` cluster on
VyOS, likely as a host-networked container.

The implementation boundary is:
## Genesis Bootstrap Cluster

- `platform/bootstrap/k0s` owns the released bootstrap image
- `infra/network/vyos` pins and runs that released image on the router
- `infra/compute/incusos` owns the declarative IncusOS operation-image inputs
and rendered Tinkerbell `Hardware`, `Template`, and `Workflow` objects
The first real bootstrap cluster is a disposable single-node Talos Linux
cluster running as an Incus VM on the N5 Pro NAS.

It exists only to run:
It exists only to run the controllers needed to create and hand off the real
platform cluster:

- Tinkerbell
- Cluster API
- Cluster API Provider Incus
- Talos bootstrap provider
- Talos control-plane provider

The bootstrap image itself is a reusable artifact. Host-specific networking and
runtime values are injected by `infra` at deploy time rather than baked into
the image.
The bootstrap cluster is not the platform cluster. It should be easy to delete
after CAPI ownership moves into the platform cluster.

The old VyOS-hosted `bootstrap-k0s` image path is historical context from the
abandoned UM760-first direction. Do not build new NAS bootstrap work around
that image unless a future cleanup session deliberately reactivates it.

## Host Bootstrap Flow

The intended host bootstrap sequence is:

1. Start the released `bootstrap-k0s` image on VyOS through the `infra`
runtime wiring.
2. Let that image bring up Tinkerbell and the CAPI providers in the temporary
bootstrap cluster.
3. Generate a seeded IncusOS `Operation` image for the `UM760` from the
declarative inputs under `infra/compute/incusos/`.
4. Use Tinkerbell and HookOS to write that final-disk image directly to the
internal `UM760` disk through `image2disk` or `oci2disk`.
5. Boot the `UM760` into IncusOS as the steady-state host OS.
6. Initialize Incus on the `UM760` with the first-node defaults needed for the
final cluster.
7. Enable Incus clustering.
8. Import or publish the Talos nocloud image needed by CAPN.
9. Use CAPN and the Talos providers to create the first platform Talos VM on
the `UM760`.

The same Tinkerbell path provisions the `MS-02` hosts later. Joining nodes must
use IncusOS seed settings appropriate for joining the existing Incus cluster,
not for creating independent local Incus defaults.

The first supported rendered host flow is `um760`. Joiner-specific Tinkerbell
templates and workflows for the `MS-02` nodes come later, after the first-node
path is proven.

A normal IncusOS `Installation` image must not be treated as equivalent to the
`Operation` image for the single-disk `UM760` path. The bootstrap assumption is
that the selected image is already a bootable final-disk artifact.
1. Install IncusOS on the N5 Pro using the `128GB` NVMe device as the OS disk.
2. Create the mirrored ZFS data pool from the two `1TB` WD_Black NVMe devices.
3. Initialize Incus with the first-node defaults needed for the final cluster.
4. Enable Incus clustering so later hosts can join.
5. Import or publish the Talos nocloud image needed by CAPN.
6. Create a disposable single-node Talos VM inside Incus on the N5 Pro.
7. Bootstrap Talos once and enable scheduling on the control-plane node for
bootstrap workloads.
8. Install CAPI, CAPN, and the Talos providers into the bootstrap cluster.
9. Use those providers to create the first real platform Talos VM.
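
The nine-step N5 Pro sequence above is strictly ordered, so a small helper can report which step comes next and reject out-of-order progress. This is a minimal sketch only: the step identifiers and the `next_step` function are hypothetical names invented for illustration and do not exist in the repository.

```python
from typing import Optional

# Hypothetical step identifiers mirroring the nine numbered steps above.
# The names are illustrative assumptions, not taken from the repo.
GENESIS_STEPS = [
    "install-incusos",
    "create-mirrored-pool",
    "init-incus",
    "enable-clustering",
    "import-talos-image",
    "create-bootstrap-vm",
    "bootstrap-talos",
    "install-capi-providers",
    "create-platform-vm",
]


def next_step(completed: set) -> Optional[str]:
    """Return the first step not yet completed, or None when all are done.

    Raises ValueError if a step is marked done while an earlier step is not,
    because the sequence only makes sense in order.
    """
    unknown = completed - set(GENESIS_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")

    pending = None
    for step in GENESIS_STEPS:
        if step in completed:
            if pending is not None:
                raise ValueError(f"{step!r} is done but {pending!r} is not")
        elif pending is None:
            pending = step  # remember the earliest incomplete step
    return pending
```

For example, `next_step({"install-incusos", "create-mirrored-pool"})` returns `"init-incus"`, while passing only `{"init-incus"}` raises because the earlier steps are missing.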

Tinkerbell can still provision later bare-metal hosts if that remains the best
path. Joining nodes must use IncusOS seed settings appropriate for joining the
existing Incus cluster, not for creating independent local Incus defaults.

The first supported host flow is the N5 Pro. Joiner-specific Tinkerbell
templates, workflows, or replacement tooling for the `MS-02` and `UM760` nodes
come later, after the genesis path is proven.

## Platform Cluster Bring-Up

The platform cluster starts as one Talos VM on the `UM760`.
The platform cluster starts as one Talos VM created from the disposable Talos
bootstrap cluster on the N5 Pro.

Day-0 Talos configuration installs only the substrate needed to make the
cluster reachable and let GitOps take over:
@@ -110,13 +102,6 @@ The gitops repo owns per-cluster version selection and cluster-local desired
state. Infra and CAPI templates own only immutable day-0 references needed for
fresh installs and reinstalls.

For the temporary bootstrap plane specifically:

- `platform` publishes `ghcr.io/gilmanlab/platform/bootstrap-k0s:<version>`
- `infra/network/vyos` consumes that exact released tag on the router
- `infra/compute/incusos` owns the bootstrap image inputs and the rendered
Tinkerbell objects applied to that cluster

## Bootstrap/Core Artifact Contract

The `platform/bootstrap/` subtree carries both Talos/CAPI day-0 substrate
@@ -143,12 +128,6 @@ platform/
│ └── render/
│ ├── bootstrap.yaml
│ └── full.yaml
├── k0s/
│ ├── Dockerfile
│ ├── VERSION
│ ├── k0s.yaml
│ ├── bootstrap-k0s.sh
│ └── manifests/
└── kro/
├── Chart.yaml
├── Chart.lock
@@ -174,17 +153,16 @@ The contract for each component is:
`kro` has no Talos/CAPI bootstrap variant, so it does not need
`bootstrap-values.yaml` or `render/bootstrap.yaml`.

`bootstrap/k0s` is the exception to the chart-wrapper pattern. It is a released
bootstrap image directory, not a wrapper chart. It is consumed by `infra` as a
GHCR image tag rather than as a raw manifest or OCI Helm chart.
`bootstrap/k0s` exists in the repository from the abandoned VyOS-hosted
bootstrap path. It is not part of the active NAS-first bootstrap contract.
Leave deletion, archiving, or repurposing to a focused cleanup session.

The release and selection rules are:

- Change canonical inputs in `platform`.
- Re-render `render/bootstrap.yaml` and `render/full.yaml`.
- Publish the wrapper chart as an OCI artifact under a component-scoped release
tag.
- Publish `bootstrap/k0s` as `ghcr.io/gilmanlab/platform/bootstrap-k0s:<version>`.
- Select versions per cluster through `gitops/clusters/<cluster>/bootstrap.yaml`.
- Reference raw Talos/CAPI artifacts by immutable commit SHA, not floating tags.
- Keep the SHA referenced by Talos/CAPI aligned with the released artifact
@@ -204,10 +182,10 @@ After the `MS-02` hosts join the Incus cluster:
1. Add two more Talos VMs on the `MS-02` tier.
2. Run the platform cluster as three dual-role control-plane/worker nodes.
3. Install the CAPI providers into the platform cluster.
4. Use `clusterctl move` to transfer ownership from the temporary bootstrap
4. Use `clusterctl move` to transfer ownership from the disposable bootstrap
cluster to the platform cluster.
5. Remove the temporary VyOS bootstrap cluster and any temporary PXE behavior
once the platform cluster and Incus cluster are healthy.
5. Remove the disposable Talos bootstrap VM once the platform cluster and Incus
cluster are healthy.
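
The pivot steps above imply two preconditions before `clusterctl move`: three platform nodes are up, and the CAPI providers are installed in the platform cluster. The following is a rough sketch of that gate, assuming the caller gathers the state some other way. `PlatformState` and `pivot_blockers` are hypothetical names for illustration, not part of `clusterctl` or any provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlatformState:
    """Hypothetical summary of the platform cluster before the pivot."""

    ready_nodes: int           # dual-role Talos nodes reporting Ready
    providers_installed: bool  # CAPI, CAPN, and Talos providers present


def pivot_blockers(state: PlatformState, required_nodes: int = 3) -> List[str]:
    """Return human-readable reasons the pivot should not run yet.

    An empty list means the preconditions described above are met.
    """
    blockers = []
    if state.ready_nodes < required_nodes:
        blockers.append(
            f"only {state.ready_nodes} of {required_nodes} platform nodes are ready"
        )
    if not state.providers_installed:
        blockers.append("CAPI providers are not installed in the platform cluster")
    return blockers
```

An empty result only says the documented preconditions hold. It does not replace checking that the platform cluster and Incus cluster are healthy before removing the bootstrap VM.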

`clusterctl move` is a bootstrap pivot mechanism. It is not a backup or
disaster recovery model.
@@ -225,19 +203,21 @@ syncs cluster-core and platform API state to them after they exist.

The bootstrap path is not complete until these are proven:

- IncusOS `Operation` image generation and seeding for first-node and
joining-node modes
- Tinkerbell image writing from HookOS to the selected disks
- VyOS-hosted `bootstrap-k0s` runtime stability with the required privileges,
mounts, and host-network behavior
- IncusOS installation and first-node seeding on the N5 Pro
- N5 Pro mirrored NVMe ZFS pool behavior under IncusOS and Incus
- single-node Talos bootstrap VM behavior inside Incus, including
control-plane workload scheduling
- CAPN plus Talos providers creating Talos VMs with the desired boot mode,
network attachment, storage pool, and endpoint model
- `clusterctl move` from the temporary cluster to the platform cluster
- `clusterctl move` from the disposable Talos bootstrap cluster to the platform
cluster
- later Tinkerbell or equivalent provisioning for joining bare-metal hosts

## References

- [IncusOS installation seed](https://linuxcontainers.org/incus-os/docs/main/reference/seed/)
- [Tinkerbell image2disk](https://github.com/tinkerbell/actions/tree/main/image2disk)
- [Tinkerbell oci2disk](https://github.com/tinkerbell/actions/tree/main/oci2disk)
- [CAPN Talos template](https://capn.linuxcontainers.org/reference/templates/talos.html)
- [VyOS containers](https://docs.vyos.io/en/latest/configuration/container/index.html)
- [Talos control plane](https://docs.siderolabs.com/talos/v1.12/learn-more/control-plane)
- [Talos control-plane scheduling](https://docs.siderolabs.com/talos/v1.12/deploy-and-manage-workloads/workers-on-controlplane)
60 changes: 37 additions & 23 deletions docs/docs/architecture/hosts-and-substrate.md
@@ -17,24 +17,33 @@ Linux host management layer.
### VyOS Router

The `VP6630` runs VyOS and remains the lab network appliance. It owns routing,
DHCP, DNS entrypoints, PXE coordination, bootstrap support, the real platform
DHCP, DNS entrypoints, PXE reachability when needed, the real platform
Kubernetes API TCP frontend, and BGP peering with Cilium.

VyOS is intentionally part of the bootstrap path. The lab already depends on it
for network reachability, so using it for API fronting and temporary bootstrap
coordination keeps the early system small.
VyOS is intentionally part of the network bootstrap path. The lab already
depends on it for network reachability, but it no longer hosts the active
temporary Kubernetes bootstrap cluster.

In the current implementation split, VyOS consumes the released
`ghcr.io/gilmanlab/platform/bootstrap-k0s:<version>` image through `infra` and
also serves the IncusOS operation image on `LAB_PROV` during host bootstrap.
### N5 Pro NAS

The MINISFORUM N5 Pro runs IncusOS as the first permanent host and the genesis
Incus cluster member.

It replaces the Synology as the NAS and in-lab storage boundary. The `128GB`
NVMe device is reserved for IncusOS. The two `1TB` WD_Black SN7100 NVMe
devices form the initial mirrored ZFS data pool for Incus and NAS duties.

During bootstrap, Incus on the N5 Pro hosts a disposable single-node Talos
cluster. That cluster exists to run the bootstrap controllers before ownership
moves into the real platform cluster.

### UM760

The `UM760` runs IncusOS as the first permanent host and the first Incus
cluster member.
The `UM760` remains available as later IncusOS capacity.

During bootstrap it hosts the first platform Talos VM. After the `MS-02` hosts
join, it remains useful as bootstrap, recovery, and light-duty capacity.
It is no longer the genesis host. Any future `UM760` role should be proven
after the N5 Pro path works, rather than carried forward from the abandoned
UM760-first bootstrap design.

### MS-02 Ultra Hosts

@@ -46,18 +55,18 @@ this tier.

## Incus Cluster

The final Incus cluster spans:
The intended Incus cluster starts with the N5 Pro and later expands to:

- `um760`
- `ms02-1`
- `ms02-2`
- `ms02-3`
- `um760`, if it remains useful after the NAS-first path is proven

The cluster is intentionally heterogeneous. Incus cluster groups should be used
only as lightweight placement and CPU-boundary labels, for example `amd-um760`
and `intel-ms02`. Kubernetes remains the main workload scheduler.
only as lightweight placement and CPU-boundary labels, for example `amd-nas`,
`amd-um760`, and `intel-ms02`. Kubernetes remains the main workload scheduler.

The `UM760` is not a disposable bootstrap host. It is the first durable Incus
The N5 Pro is not a disposable bootstrap host. It is the first durable Incus
cluster member.

## VM Substrate
@@ -76,6 +85,9 @@ non-Talos VM that holds unique state must bring its own backup story.

Use local ZFS-backed Incus storage on each IncusOS host in v1.

The N5 Pro starts with a mirrored ZFS pool over the two `1TB` NVMe drives. The
small `128GB` NVMe device is for the OS only.
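
As a sizing note for the mirrored pool: a mirror stores a full copy on every member, so its usable capacity is roughly that of its smallest member, before ZFS metadata and reservation overhead. The sketch below works through that arithmetic. `mirror_usable_gb` is a hypothetical helper, and the nominal `1000` GB figures are assumptions standing in for the two `1TB` drives.

```python
from typing import Sequence


def mirror_usable_gb(member_sizes_gb: Sequence[float]) -> float:
    """Approximate usable capacity of a single mirror vdev.

    Every member holds a full copy of the data, so the smallest member
    bounds the capacity. Filesystem overhead is ignored here.
    """
    if len(member_sizes_gb) < 2:
        raise ValueError("a mirror needs at least two members")
    return min(member_sizes_gb)


# Two nominal 1TB NVMe devices (assumed 1000 GB each) mirrored together.
n5_pro_pool_gb = mirror_usable_gb([1000, 1000])
```

Under these assumptions the N5 Pro pool offers about `1000` GB of usable space, not `2000` GB, which matters when planning how many local VM disks and NAS datasets it can hold.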

An Incus cluster is a management cluster, not automatically a replicated
storage system. For most Incus storage drivers, volumes remain on the member
where they are created. That is acceptable for Talos VM disks because the
@@ -86,10 +98,11 @@ Do not introduce shared VM storage in v1:
- no Ceph
- no LINSTOR
- no Incus OVN/storage architecture just for VM mobility
- no NAS-backed default VM disks
- no remote NAS-backed default VM disks for every host

The NAS remains a durable backup and artifact boundary, not the default block
storage path for every VM.
The N5 Pro NAS is the durable backup and artifact boundary. Its local Incus
pool may host local VMs, but it is not a remote block-storage platform for
every VM in the lab.

## Network Attachment

@@ -107,10 +120,11 @@ allowing the underlay to remain visible to VyOS.

Before treating the host substrate as implementation reference material, prove:

- the selected IncusOS image mode is a correct final-disk artifact for the
single-disk `UM760`
- first-node and joining-node IncusOS seeds apply the right default Incus
settings
- IncusOS installs cleanly on the N5 Pro with the OS/data disk split described
above
- the first-node Incus seed applies the right default Incus settings
- the mirrored NVMe ZFS pool is exposed to Incus the way the bootstrap VM path
expects
- joining nodes do not create local networks or storage pools that block cluster
join
- CAPN can place Talos VMs against the intended Incus profiles and storage pools
4 changes: 2 additions & 2 deletions docs/docs/architecture/keycloak-runtime.md
@@ -133,8 +133,8 @@ is already the durable store for it.

Backups are written nightly to an S3 bucket in the `lab` account. The bucket
uses SSE-KMS and object lock or versioning so corruptions cannot silently
overwrite known-good backups. The Synology NAS pulls a secondary copy on its
own schedule.
overwrite known-good backups. The N5 Pro NAS pulls a secondary copy on its own
schedule.

Retention contract:
