fix: vertex stuck not reaching Running due to lastScaledAt null validation error after SSA migration #3357

@Koalk

Bug description

After PR #2570 migrated vertex status updates to Server-Side Apply (SSA), vertices can get permanently stuck in a reconcile error loop. They never reach the Running phase, so no pods are created.

Error

The controller logs the following error on every reconcile loop:

Vertex.numaflow.numaproj.io "<vertex-name>" is invalid: status.lastScaledAt: Invalid value: "null": lastScaledAt in body must be of type string: "null"

Root cause

With SSA, the entire VertexStatus struct is serialized to JSON and sent as the apply patch body. A zero-value metav1.Time (i.e. a LastScaledAt that has never been set) serializes to JSON null. The full CRD schema defines lastScaledAt as type: string, format: date-time, without nullable: true, so the Kubernetes API server rejects the patch.

This creates a deadlock:

  1. Vertex created → LastScaledAt is zero → SSA writes null → API rejects
  2. Vertex never reaches Running
  3. Autoscaler skips it: "Vertex not in Running phase, skip scaling"
  4. LastScaledAt never gets set → back to step 1

The trigger condition is any vertex where currentReplicas == desiredReplicas on first reconcile (e.g. a source vertex with min: 1, max: 1, or any vertex during pipeline pause with min: 0, max: 0), because the scale branch that sets LastScaledAt is never taken.
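The loop and its trigger can be illustrated with a toy model. Everything below is hypothetical and heavily simplified: `toyVertex`, `reconcile`, `applyStatus`, and `autoscale` are illustrative names, not Numaflow functions, and the `nullable` flag stands in for whether the CRD schema accepts null for lastScaledAt.

```go
package main

import (
	"fmt"
	"time"
)

// toyVertex is a hypothetical, simplified model of a vertex and its status.
type toyVertex struct {
	currentReplicas int
	desiredReplicas int
	phase           string
	lastScaledAt    time.Time
}

// applyStatus mimics schema validation of the status patch: a zero
// lastScaledAt is sent as null and rejected unless the field is nullable.
func applyStatus(v *toyVertex, nullable bool) error {
	if v.lastScaledAt.IsZero() && !nullable {
		return fmt.Errorf("status.lastScaledAt: Invalid value: \"null\"")
	}
	v.phase = "Running"
	return nil
}

// reconcile models one controller pass. lastScaledAt is only set in the
// scale branch, which is skipped when current == desired.
func reconcile(v *toyVertex, nullable bool) error {
	if v.currentReplicas != v.desiredReplicas {
		v.currentReplicas = v.desiredReplicas
		v.lastScaledAt = time.Now()
	}
	return applyStatus(v, nullable)
}

// autoscale models the autoscaler, which skips vertices that are not Running.
func autoscale(v *toyVertex) bool {
	if v.phase != "Running" {
		return false // "Vertex not in Running phase, skip scaling"
	}
	v.lastScaledAt = time.Now()
	return true
}

// stuckAfter reports whether the vertex is still not Running after n passes.
func stuckAfter(v toyVertex, n int, nullable bool) bool {
	for i := 0; i < n; i++ {
		_ = reconcile(&v, nullable)
		autoscale(&v)
	}
	return v.phase != "Running"
}

func main() {
	pinned := toyVertex{currentReplicas: 1, desiredReplicas: 1}  // e.g. min: 1, max: 1
	scaling := toyVertex{currentReplicas: 0, desiredReplicas: 1} // scale branch is taken

	fmt.Println(stuckAfter(pinned, 5, false))  // prints true: deadlock
	fmt.Println(stuckAfter(scaling, 5, false)) // prints false: lastScaledAt gets set
	fmt.Println(stuckAfter(pinned, 5, true))   // prints false: null is accepted
}
```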

Affected versions

All versions since #2570 was merged (v1.7.x+). The minimal CRD install is not affected, because its status schema uses x-kubernetes-preserve-unknown-fields: true and therefore does not validate lastScaledAt.

Fix

Add // +nullable to LastScaledAt in VertexStatus and MonoVertexStatus so the generated CRD schema includes nullable: true, allowing the API server to accept null for this field.
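A sketch of what the proposed change might look like on the type definition. The surrounding fields, doc comments, and exact struct tags are assumptions and abbreviated here; only the `// +nullable` marker on LastScaledAt is the point. The same marker would go on the corresponding field in MonoVertexStatus, and the CRD manifests would need to be regenerated for nullable: true to appear in the schema.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// VertexStatus is abbreviated: other fields are omitted and the tags shown
// are assumed, not copied from the repository.
type VertexStatus struct {
	// LastScaledAt is the time of the last scaling operation.
	// The +nullable marker asks the CRD generator to add nullable: true,
	// so a never-set (zero) value serialized as null passes validation.
	// +optional
	// +nullable
	LastScaledAt metav1.Time `json:"lastScaledAt,omitempty"`
}
```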
