Skip to content

[codex] Apply generated cpflow GitHub Actions flow#732

Draft
justin808 wants to merge 23 commits intomasterfrom
codex/cpflow-github-actions-flow
Draft

[codex] Apply generated cpflow GitHub Actions flow#732
justin808 wants to merge 23 commits intomasterfrom
codex/cpflow-github-actions-flow

Conversation

@justin808
Copy link
Copy Markdown
Member

@justin808 justin808 commented Apr 30, 2026

Summary

  • Replace the older hand-written Control Plane GitHub Actions with the generated cpflow-* action and workflow set from cpflow generate-github-actions.
  • Keep review apps opt-in, staging deploys branch-gated, production promotion manual, and nightly review-app cleanup in the generated flow.
  • Update Control Plane docs links so they point at the new cpflow-* action/workflow names.

Verification

  • BUNDLE_GEMFILE=/tmp/cpflow-pr-278/Gemfile bin/conductor-exec bundle exec ruby /tmp/cpflow-pr-278/bin/cpflow github-flow-readiness — no blocking readiness issues detected.
  • bin/conductor-exec ruby -e 'require "yaml"; paths = Dir[".github/**/*.yml"] + Dir[".github/**/*.yaml"] + Dir[".controlplane/**/*.yml"]; paths.sort.each { |path| YAML.load_file(path, aliases: true); puts path }'
  • bin/conductor-exec git diff --cached --check
  • bin/conductor-exec bundle exec rubocop

Not Run / Blocked

  • bin/conductor-exec bundle exec rspec could not complete locally because PostgreSQL was not running at /tmp/.s.PGSQL.5432; after the first Rails load failure, Coveralls setup also reported repeated coverage initialization errors.

Note

Medium Risk
Replaces and expands CI/CD automation for staging/review apps and production promotion, so misconfiguration could affect deployments or delete the wrong resources despite added guards.

Overview
Switches Control Plane automation from the bespoke GitHub Actions to the generated cpflow-* composite actions and workflows, updating docs accordingly.

Adds a full generated flow for opt-in review apps (slash-command /deploy-review-app + auto-redeploy after creation), review-app deletion (on PR close or /delete-review-app), staging deploys gated to the staging branch, manual promote-to-production with env-var name parity checks + health polling + best-effort rollback, and a nightly stale review-app cleanup.

Hardens automation around secrets and safety: centralized config validation, safer Control Plane token handling, optional SSH-based Docker builds, and stricter review-app deletion prefix checks to prevent accidental staging/production deletes.

Reviewed by Cursor Bugbot for commit 3331332. Bugbot is set up for automated code reviews on this repo. Configure here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 565a7e06-a2ec-40a5-b4dc-c82729f95dda

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/cpflow-github-actions-flow

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Upstream token not passed to copy-image command
    • Added the required -t upstream token argument to the cpflow copy-image-from-upstream command while keeping the token sourced from secrets via the environment.

Create PR

Or push these changes by commenting:

@cursor push 3606e16506
Preview (3606e16506)
diff --git a/.github/workflows/cpflow-promote-staging-to-production.yml b/.github/workflows/cpflow-promote-staging-to-production.yml
--- a/.github/workflows/cpflow-promote-staging-to-production.yml
+++ b/.github/workflows/cpflow-promote-staging-to-production.yml
@@ -203,14 +203,13 @@
 
       - name: Copy image from staging
         env:
-          # Pass the upstream token via env rather than `-t` so it doesn't appear in /proc/<pid>/cmdline.
           CPLN_UPSTREAM_TOKEN: ${{ secrets.CPLN_TOKEN_STAGING }}
           PRODUCTION_APP_NAME: ${{ vars.PRODUCTION_APP_NAME }}
           CPLN_ORG_PRODUCTION: ${{ vars.CPLN_ORG_PRODUCTION }}
         shell: bash
         run: |
           set -euo pipefail
-          cpflow copy-image-from-upstream -a "${PRODUCTION_APP_NAME}" --org "${CPLN_ORG_PRODUCTION}"
+          cpflow copy-image-from-upstream -a "${PRODUCTION_APP_NAME}" -t "${CPLN_UPSTREAM_TOKEN}" --org "${CPLN_ORG_PRODUCTION}"
 
       - name: Deploy image to production
         env:

You can send follow-ups to the cloud agent here.

Comment thread .github/workflows/cpflow-promote-staging-to-production.yml
@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Code Review: cpflow GitHub Actions Migration

This is a solid migration from hand-written to generated cpflow-* workflows. The overall architecture is a clear improvement. Here's a summary of what stands out.


Security — meaningful fixes

  • Token exposure fixed: The old setup-environment action passed the token as --token "$TOKEN" (visible in /proc/<pid>/cmdline and ps). The new action correctly uses CPLN_TOKEN env var. Same for CPLN_UPSTREAM_TOKEN in the promote workflow.
  • Critical: /deploy-review-app now requires author_association. The old workflow had no author_association check, meaning any GitHub user who could comment on a PR could trigger a deployment. The new check ("OWNER","MEMBER","COLLABORATOR") correctly gates this.
  • Exact command match (== '/deploy-review-app') replaces contains(). Prevents accidental triggers from comments like "I ran /deploy-review-app earlier and it worked".
  • SSH key isolation: The cpflow-build-docker-image action scopes DOCKER_BUILD_SSH_KEY to a dedicated step so it never appears in the build step's environment — even when ACTIONS_STEP_DEBUG=true is set. The EXIT trap cleans up the key file and kills the agent. Well designed.
  • Fork PR handling: Fork PRs correctly skip auto-deploy (which can't access secrets) while still allowing approved collaborators to deploy via /deploy-review-app. The cpflow-review-app-help.yml is explicit about why pull_request_target is used there (no code checkout = safe for write access from forks).

Bugs fixed

  • cancel-in-progress: false on all deploy workflows. The old staging workflow used cancel-in-progress: true, which could interrupt a cpflow deploy-image mid-rollout and leave the workload partially updated. Now fixed across all three deployment workflows.
  • Delete on PR close now uses pull_request_target. The old delete-review-app.yml used pull_request: types: [closed], which runs in a restricted context without secret access when a fork PR closes. The new workflow correctly uses pull_request_target for the close event.
  • --run-release-phase is now conditional. The old staging deploy hardcoded it unconditionally; the new workflows detect release_script: presence at runtime before deciding.
  • Delete safety uses prefix allowlist, not blocklist. The old delete-app.sh used grep -iqE '(production|staging)' to block "dangerous" names — fragile against future app naming. The new delete-app.sh requires the app name to start with REVIEW_APP_PREFIX, which is a stronger guarantee.

Potential issues

  1. cpflow exists in the deploy workflow doesn't distinguish auth errors from "not found" (see inline comment). The delete-app.sh carefully pattern-matches error output to surface API/auth failures, but the deploy workflow's check-app step treats all non-zero exits as exists=false. An intermittent auth error on a push event would silently skip the deploy.

  2. staging deploy job has no needs: validate-branch: The deploy job correctly declares needs: [validate-branch, build] and has an if: guard, but if validate-branch produces no output (e.g., first run where the variable step fails), the if: could evaluate against an empty string. In practice the validation step exits non-zero, so build never runs and deploy is skipped by the needs: dependency. This is fine but the if: guard is slightly redundant with the needs: dependency on validate-branch.

  3. Help command has no contents: read permission: cpflow-help-command.yml declares only issues: write and pull-requests: write. There's no explicit contents: read but GitHub's default GITHUB_TOKEN grants read on public repos, so this works — just worth noting that it relies on the repo's default permissions.


Nits

  • cpflow-deploy-staging.yml uses GITHUB_REF_NAME (simpler), whereas the old workflow used ${GITHUB_REF#refs/heads/} (strip-prefix approach that would silently behave oddly for tag refs). Small but cleaner.
  • The shakacode-team.md link fix (-info org URL corrected to shakacode-open-source-examples-production) is a separate correctness fix bundled in cleanly.

Overall this is well-structured with meaningful security and reliability improvements. The one item I'd prioritize is the cpflow exists error handling consistency noted in the inline comment.

if cpflow exists -a "${APP_NAME}" --org "${CPLN_ORG}"; then
echo "exists=true" >> "$GITHUB_OUTPUT"
else
echo "exists=false" >> "$GITHUB_OUTPUT"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if/else around cpflow exists conflates two different failure modes: "app does not exist" (expected non-zero) and "API/auth error" (unexpected non-zero). Both land in the else branch and set exists=false.

Compare delete-app.sh in this same PR, which carefully pattern-matches the output ("Double check your org", "Unknown API token format", "ERROR", etc.) to distinguish a real failure from a genuine not-found and surface it as a hard error.

If there's a transient auth issue when a PR is pushed, this step sets exists=false and the pull_request path silently writes a step-summary saying the app doesn't exist yet — masking the real problem until the deploy is manually retried.

Suggested approach: mirror delete-app.sh's pattern — capture the output, check exit code, inspect the text for known error tokens, and exit non-zero on a real failure rather than treating it as "not found".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b9095c6 by capturing cpflow exists output and failing hard on auth/API error tokens instead of treating every non-zero exit as app-not-found.


if [[ -f "${HOME}/.ssh/cpflow_build_key" ]]; then
eval "$(ssh-agent -s)"
trap 'ssh-agent -k >/dev/null; rm -f "${HOME}/.ssh/cpflow_build_key"' EXIT
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EXIT trap kills the SSH agent and removes cpflow_build_key, but the ~/.ssh/known_hosts file written in the previous step is not cleaned up here. On ephemeral GitHub-hosted runners this is harmless (the VM is discarded), but if this action is ever adopted on self-hosted runners with a persistent ~/.ssh, the custom known_hosts entries would accumulate across runs.

Consider adding rm -f "${HOME}/.ssh/known_hosts" to the trap, or using a temp file under $RUNNER_TEMP instead of ~/.ssh for the known_hosts.

default: "3.3.1"
cpflow_version:
description: cpflow gem version
required: false
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI version is pinned here as a default, which is good for reproducibility. Worth noting: since sudo npm install -g fetches from the npm registry at runtime, a compromised release of @controlplane/cli@3.3.1 would silently affect all runs. Consider adding a SHA/integrity check (e.g. via npm install --integrity) or at minimum locking to a digest-pinned download, especially since this step runs in workflows that access staging/production tokens.


current_image="${selected_image:-${first_image}}"
current_version="${selected_version:-${first_version}}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The staging env-var parity check only compares names, not values. It will catch a missing DATABASE_URL in production, but it won't catch a case where production has the right key names but wrong or stale values (e.g. a staging secret was rotated but the production counterpart wasn't updated). This is expected behaviour for a name-only check — just worth documenting explicitly so it's clear what the check does and doesn't guarantee.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in b9095c6: the production promotion workflow checks that production has the same environment variable names as staging, but it does not compare secret values.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Code Review: Replace hand-written CI/CD Actions with generated cpflow- flow*

Well-considered overhaul that hardens security, safety, and operational reliability. No blocking issues found.

Security improvements

Token no longer passed via --token flag — old setup-environment exposed it in /proc/cmdline and ps; now uses CPLN_TOKEN env var only. SSH key scoped to a dedicated step so ACTIONS_STEP_DEBUG cannot leak it from the build step. Delete command now gated on author_association (old workflow had no permission check). Fork PR protection blocks syncs before any secret is consumed. Old contains() command matching replaced with exact match.

Safety improvements

cancel-in-progress changed from true to false everywhere — old staging deploy risked partial GVC deployments mid-rollout. Prefix-based delete guard replaces the old substring check for production/staging names. Production rollback logic added with per-workload state capture and restore — old workflow had no rollback at all. Health checks now configurable (HEALTH_CHECK_RETRIES, HEALTH_CHECK_INTERVAL, HEALTH_CHECK_ACCEPTED_STATUSES).

Issue 1: Fork-PR /deploy-review-app comment silently fails with no PR feedback

When a collaborator comments /deploy-review-app on a fork PR, the job condition passes but Validate review app deployment source exits 1 with only a runner log message. No PR comment explains the rejection. Suggest posting an issues.createComment before exit 1, similar to how the unconfigured-secrets path writes to GITHUB_STEP_SUMMARY.

Issue 2: cpflow exists error detection is string-matching fragile

delete-app.sh uses case matching over stderr to distinguish app-not-found from auth/network errors. The inline TODO acknowledges this. If cpflow changes an error message, the pattern will not match and a real failure will be silently treated as app not found, causing delete to return success vacuously. The TODO to switch to a structured exit code is the right long-term fix.

Issue 3: gem install / npm install without integrity pinning

cpflow-setup-environment/action.yml installs @controlplane/cli@3.3.1 and cpflow 4.2.0 without hash verification. A re-tagged or compromised release would install silently. For production-adjacent CI, consider verifying SHA256 after install or using lockfiles with integrity entries.

Issue 4: Nightly cleanup has no failure notification

cpflow-cleanup-stale-review-apps.yml runs on cron with no failure notification. Silent failures accumulate stale apps over days. Consider a step gated on if: failure() that posts a GitHub issue or annotation.

Minor observations

APP_NAME / PR_NUMBER double-assignment is intentional (workflow-level env needed early for concurrency.group, step overwrites with resolved values) but a short comment would clarify intent. cpflow-review-app-help.yml using pull_request_target is correct and well-documented. deploy job in cpflow-deploy-staging.yml re-runs cpflow-setup-environment unnecessarily (only needs cpflow deploy-image) — minor redundancy, acceptable for simplicity.

echo "🗑️ Deleting application: $APP_NAME"
cpflow delete -a "$APP_NAME" --org "$CPLN_ORG" --yes

echo "✅ Successfully deleted application: $APP_NAME"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case pattern matching over stderr is the right approach given cpflow exists doesn't expose a structured signal, and the inline TODO is the right way to track this. One additional defensive measure worth considering: log the full exists_output to the runner even when the "not found" path is taken, so there's an audit trail if unexpected output silently passed through the filter. Right now the output is printed only if non-empty, but it's easy to miss in logs.

Also worth noting: the pattern list doesn't include "timeout", "connection refused", or "ENOTFOUND" — all plausible network failure messages. If cpflow's Ruby HTTP client times out, it may surface Errno::ECONNREFUSED or Net::ReadTimeout, which would match "Net::", so that's covered. But curl-style timeouts would not. Low risk given this is a Ruby gem, but worth keeping in mind when bumping the cpflow version.

shell: bash
run: |
set -euo pipefail
cpflow cleanup-stale-apps -a "${{ vars.REVIEW_APP_PREFIX }}" --org "${{ vars.CPLN_ORG_STAGING }}" --yes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No failure notification exists for this cron job. If cpflow cleanup-stale-apps fails (auth error, API hiccup, etc.), stale review apps accumulate silently across days.

Consider appending a step:

      - name: Notify on failure
        if: failure()
        run: |
          echo "::error::Stale review app cleanup failed. Check the workflow logs."

Or, if the team monitors GitHub issues, use gh issue create to open a tracking issue on failure so it's not lost in workflow history.

} >> "$GITHUB_STEP_SUMMARY"
exit 0
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a collaborator comments /deploy-review-app on a fork PR, execution reaches this exit 1 with only a runner log message — no feedback is posted to the PR thread. The collaborator sees a red workflow run with no explanation in the PR.

Consider posting a comment before exiting. The GH_TOKEN is available from the Resolve PR ref and commit step but not inherited here; you'd need to add GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} to this step's env block. Something like:

gh api "repos/${GITHUB_REPOSITORY}/issues/${PR_NUMBER}/comments" \
  --method POST \
  --field body="Fork PRs cannot use /deploy-review-app because this workflow uses repository secrets. Ask a maintainer to deploy from a branch in the base repository." \
  --silent || true

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1e46bfe: fork PR /deploy-review-app rejections now post a PR comment before exiting, while still keeping the deployment blocked because the workflow uses repository secrets.

run: |
set -euo pipefail

sudo npm install -g @controlplane/cli@${{ inputs.cpln_cli_version }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version-pinned but not hash-verified. A malicious actor who gains write access to the @controlplane/cli npm package (or re-tags a release) could substitute a compromised binary. Same applies to gem install cpflow below.

For production-adjacent pipelines, consider adding a post-install checksum check, or at minimum document the expected SHA256 for each pinned version in a comment so drift is detectable during future version bumps.

run: |
# Best-effort rollback: try every workload, aggregate failures, exit non-zero at the end
# if any failed. A single cpln hiccup shouldn't leave other workloads mid-promotion.
set -uo pipefail
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intentional omission of -e here (using set -uo pipefail instead of set -euo pipefail) is the correct design — it lets rollback continue across all workloads even if one cpln call fails, with rollback_failures tracking the aggregate. Worth a brief comment explaining the omission so a future reader doesn't add -e thinking it was forgotten:

Suggested change
set -uo pipefail
# Intentionally no -e: we want every workload to attempt rollback even if earlier ones fail.
set -uo pipefail

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1e46bfe with a comment explaining that set -e is intentionally omitted so rollback can attempt every workload and aggregate failures.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Token not persisted for subsequent workflow steps
    • The composite action now writes CPLN_TOKEN to GITHUB_ENV so subsequent cpflow and cpln workflow steps inherit authentication without passing the token on the command line.

Create PR

Or push these changes by commenting:

@cursor push f1eee564c7
Preview (f1eee564c7)
diff --git a/.github/actions/cpflow-setup-environment/action.yml b/.github/actions/cpflow-setup-environment/action.yml
--- a/.github/actions/cpflow-setup-environment/action.yml
+++ b/.github/actions/cpflow-setup-environment/action.yml
@@ -63,6 +63,12 @@
           exit 1
         fi
 
+        {
+          echo "CPLN_TOKEN<<__CPLN_TOKEN__"
+          echo "$CPLN_TOKEN"
+          echo "__CPLN_TOKEN__"
+        } >> "$GITHUB_ENV"
+
         create_output=""
         if ! create_output="$(cpln profile create default --org "$ORG" 2>&1)"; then
           if ! echo "$create_output" | grep -qi "already exists"; then

You can send follow-ups to the cloud agent here.

Comment thread .github/actions/cpflow-setup-environment/action.yml
@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Code Review — PR #732: Generated cpflow GitHub Actions Flow

Overview

This is a well-structured replacement of hand-written CI/CD workflows with the generated cpflow-* composite actions and workflows. The new flows are meaningfully safer and more correct than the old ones across several dimensions. The PR is in draft but the code quality is high enough to call out specifics.


Significant Improvements ✅

  • Secret handling: Old code passed CPLN_TOKEN via --token (visible in /proc/<pid>/cmdline and ps output). New code passes it exclusively via environment — good security hygiene across all three cpln/cpflow call sites.
  • Prefix guard on deletion: delete-app.sh now validates that the app name starts with $REVIEW_APP_PREFIX- before deleting, which is a much stronger guard than the old "reject names containing 'production' or 'staging'" heuristic. The old approach would have blocked deletion of any app with those words in the name (too broad) while leaving other non-review apps unprotected (too narrow).
  • Fork PR protection: The deploy workflow explicitly detects fork PRs and blocks them from accessing repository secrets — the old workflow had no such guard.
  • Author association gating: Slash commands now check author_association is OWNER, MEMBER, or COLLABORATOR. The old delete workflow had no collaborator check at all.
  • Non-cancelling concurrency (cancel-in-progress: false): Prevents partial deployments and half-deleted apps — this was cancel-in-progress: true in the old staging workflow.
  • Multi-workload rollback: The promote workflow rolls back all workloads independently on failure, aggregating errors rather than stopping at the first failure.
  • Env-var parity check: Pre-promotion check that production has all staging env var names prevents silent nil variable errors after promotion.
  • Health checks with configurable accepted statuses: The default 200 301 302 is well-reasoned for auth-gated Rails apps; HEALTH_CHECK_ACCEPTED_STATUSES makes it overridable without editing the file.
  • Randomized heredoc delimiter in capture-current: EOF_$(openssl rand -hex 8) correctly prevents heredoc injection if rollback state ever contained a literal EOF line.

Issues

Bug: deploy job may run even when build fails in cpflow-deploy-staging.yml

In GitHub Actions, an explicit if: on a job replaces the implicit success() guard. The deploy job condition is:

if: needs.validate-branch.outputs.is_deployable == 'true'

Because this doesn't check needs.build.result, the deploy job will run even if the build job fails (as long as the branch is deployable). This burns runner minutes and produces a confusing deploy attempt against a missing image.

Fix: add && needs.build.result == 'success' to the condition.

Risk: String-matching on cpflow exists error output is fragile

delete-app.sh (and the inline version in cpflow-deploy-review-app.yml) detect "real" errors by checking for specific strings like "Double check your org", "ERROR", "Traceback", "Net::". The file already has a # TODO comment acknowledging this, but the consequence of missing a new error pattern is silently treating a real failure as "app not found" — which could skip deletion without any alert.

This is acceptable given the stated cpflow limitation, but the heuristic string list should be kept in sync with cpflow's actual output. An alternative interim approach: treat any non-zero exit with non-empty stderr as a failure unless it matches a known "not found" pattern, rather than enumerating known error patterns.

Minor: Workflow-level APP_NAME env var is redundant and overridden in cpflow-deploy-review-app.yml

The top-level env: block in the deploy review app workflow sets APP_NAME from the raw event payload, but the Resolve PR ref and commit step later overwrites it in $GITHUB_ENV with the normalized PR number. The step-level value wins, making the workflow-level declaration dead configuration. It's also the only workflow in this set that has this pattern — the others don't duplicate APP_NAME this way.


Nitpicks

  • cpflow-setup-environment/action.yml: gem install cpflow -v ... without --no-document is slower than needed on CI runners.
  • cpflow-cleanup-stale-review-apps.yml: The Validate required secrets and variables step runs as a plain shell: bash step but has no id: — this is fine, but if you ever want to condition a downstream step on it, you'll need to add one.
  • The old nightly-remove-stale-review-apps.yml hardcoded the app name (qa-react-webpack-rails-tutorial) rather than using ${{ vars.REVIEW_APP_PREFIX }}. The new workflow correctly uses the variable — worth calling out as a subtle bug fix.

Summary

The core logic is solid and the security posture is materially improved. The one real bug (deploy running after build failure) should be fixed before merge. The string-matching fragility is an acknowledged limitation with a TODO already in place.

fi
deploy_args+=(--org "${CPLN_ORG}" --verbose)

cpflow deploy-image "${deploy_args[@]}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: deploy runs even when build fails.

When a job has an explicit if:, GitHub Actions drops the implicit success() guard. Because this condition only checks validate-branch's output, the deploy job will start even if the build job failed, wasting runner minutes and attempting to deploy a missing image.

Suggested change
cpflow deploy-image "${deploy_args[@]}"
if: needs.validate-branch.outputs.is_deployable == 'true' && needs.build.result == 'success'

echo "🗑️ Deleting application: $APP_NAME"
cpflow delete -a "$APP_NAME" --org "$CPLN_ORG" --yes

echo "✅ Successfully deleted application: $APP_NAME"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string-matching heuristic on cpflow exists error output (lines 43–54) is the main reliability risk in this file. The consequence of an unmatched error pattern is silently exiting 0 ("app not found") when the API actually returned an error, meaning the delete is skipped without any visible failure.

An alternative that inverts the logic: treat any non-zero exit with non-empty output as a failure unless it matches a known "not found" pattern:

# instead of matching known error strings, match known "not found" patterns
case "$exists_output" in
  *"not found"*|*"does not exist"*)
    echo "⚠️ Application does not exist: $APP_NAME"
    exit 0
    ;;
  *)
    echo "❌ ERROR: failed to determine whether application exists" >&2
    printf '%s\n' "$exists_output" >&2
    exit 1
    ;;
esac

This is safer by default — unknown output is treated as an error rather than a no-op. The tradeoff is that you'd need to know the exact "not found" wording from cpflow exists, which has the same coupling problem in the other direction. The TODO comment already captures this; just flagging it for awareness.

echo "ready=false" >> "$GITHUB_OUTPUT"
{
echo "Control Plane review app automation is not configured yet."
echo
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This top-level APP_NAME is computed from the raw event payload and is later overwritten in $GITHUB_ENV by the Resolve PR ref and commit step (echo "APP_NAME=..." >> "$GITHUB_ENV"). The step-level write wins, so this declaration is dead configuration and can be confusing. Consider removing it and relying solely on the step-computed value, which is already the authoritative one (it handles all three event types correctly).

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Code Review: cpflow Generated GitHub Actions Flow

This is a meaningful security and reliability upgrade over the handwritten workflows. Several issues from the old setup are fixed here; I've also flagged a few new items worth addressing.


✅ Notable Improvements

  • Token passed via env, not --token flag — prevents leakage in /proc/<pid>/cmdline and ps output (old setup-environment used cpln profile update default --org "$ORG" --token "$TOKEN")
  • Author association gating on /deploy-review-app and /delete-review-app — the old delete workflow accepted the command from any commenter
  • Fork PR protection in the deploy workflow — old workflow ran on all pull_request events without a fork check, allowing fork PRs to attempt using repo secrets
  • Prefix guard in delete-app.sh — prevents accidental deletion of non-review-app resources
  • SSH key isolated to a dedicated step — keeps it out of the main build step's env where ACTIONS_STEP_DEBUG=true would expose it
  • Randomized heredoc delimiters in the promotion workflow — prevents content injection if rollback state contains EOF
  • Old staging workflow fixed — the old deploy-to-control-plane-staging.yml triggered on branches: ['*'] (every push), burning runners on every feature branch; new workflow correctly filters to main/master
  • cancel-in-progress: false on all deployment concurrency groups — prevents partial rollout state from mid-cancelled deploys

🔴 Issues

1. cpflow-delete-review-app.yml uses pull_request_target without an explanation comment

The delete workflow triggers on pull_request_target: [closed]. This event runs with write permissions in the base repo context and can post PR comments on fork PRs. It's the correct choice here (since no PR code is checked out), but it's a well-known footgun — future maintainers may switch it to pull_request thinking it's "safer" and break fork PR cleanup. cpflow-review-app-help.yml has a good comment explaining this; the delete workflow should have one too.

2. cpflow exists error-pattern matching is duplicated

The same fragile string-matching logic ("Double check your org", "ERROR", "Traceback", "Net::", etc.) appears both in delete-app.sh and inline in the Check if review app exists step of cpflow-deploy-review-app.yml. The delete-app.sh has a TODO explaining the fragility; the workflow copy does not. If cpflow changes its error output, both need updating in sync. Consider either extracting this into a shared shell script, or at minimum adding the same TODO comment in the workflow.


🟡 Warnings

3. CPLN_TOKEN is written to GITHUB_ENV and persists to all subsequent steps

In cpflow-setup-environment/action.yml (lines 67–72), the token is written to GITHUB_ENV so later workflow steps can authenticate. This is necessary for cpflow to work, but it means the token is present in the environment of every subsequent step — including any step that might env | sort or log environment state. Consider documenting this tradeoff in a comment.

4. Production health check accepts 301/302 by default

HEALTH_CHECK_ACCEPTED_STATUSES defaults to '200 301 302'. A redirect to a login page or a broken URL returns a redirect, so the health check passes even if the app is misconfigured post-deploy. The docs note the recommended mitigation (set a dedicated /health endpoint and override to "200"). Consider making the default stricter, or adding a warning to the step summary when a redirect is accepted.

5. Nightly cleanup has no failure notification

cpflow-cleanup-stale-review-apps.yml runs at midnight UTC. If it fails (bad token, cpflow error, etc.), there is no alert — stale apps silently accumulate cost. Consider adding a Slack/email notification on failure, or at minimum ensure the repo has GitHub workflow failure notifications enabled for scheduled runs.


ℹ️ Informational

6. /help command has no author_association check

cpflow-help-command.yml responds to /help from any commenter (no association guard). This is probably intentional — help is informational — but on a high-volume repo the bot will reply to every /help comment from any user. Noting for awareness.

7. ruby/setup-ruby@v1 and actions/github-script@v7 use floating major versions

These will pick up any non-breaking updates automatically, which is generally fine. However, actions/checkout@v4 is also a floating major — all three could be pinned to a specific SHA for a more locked-down supply chain posture if that's a concern.

name: Delete Review App

on:
pull_request_target:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull_request_target deserves an explanation comment here.

pull_request_target runs in the context of the base repo (write permissions, access to secrets) and is required for fork PR cleanup — pull_request can't post comments on fork PRs because the token is read-only. This is safe here because no PR code is ever checked out (the actions/checkout@v4 below fetches base-branch code only).

cpflow-review-app-help.yml has a good comment explaining this exact pattern. Without a parallel comment here, a future maintainer may "fix" this to pull_request thinking it's safer, silently breaking fork PR cleanup.

Suggested change
pull_request_target:
# pull_request_target is intentional: it has write permission to comment on PRs from
# forks, where `pull_request` would be read-only. This is safe because no PR code is
# checked out — the job only calls cpflow with base-repo credentials.
# Do not switch this to `pull_request` or add a checkout of the PR ref without re-evaluating.
pull_request_target:

Comment on lines +66 to +72
# Make the token available to later workflow steps without putting it on argv.
token_delimiter="CPLN_TOKEN_$(openssl rand -hex 8)"
{
echo "CPLN_TOKEN<<${token_delimiter}"
printf '%s\n' "$CPLN_TOKEN"
echo "${token_delimiter}"
} >> "$GITHUB_ENV"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing CPLN_TOKEN to GITHUB_ENV propagates it to every subsequent step in the calling workflow — not just the steps that explicitly use it. This is a necessary tradeoff (cpflow picks it up from the environment automatically), but it's worth documenting so future maintainers don't inadvertently add steps that log environment state.

Suggested change
# Make the token available to later workflow steps without putting it on argv.
token_delimiter="CPLN_TOKEN_$(openssl rand -hex 8)"
{
echo "CPLN_TOKEN<<${token_delimiter}"
printf '%s\n' "$CPLN_TOKEN"
echo "${token_delimiter}"
} >> "$GITHUB_ENV"
# Persist CPLN_TOKEN to GITHUB_ENV so all subsequent steps can authenticate
# with cpflow/cpln without putting the token on the command line.
# Trade-off: the token is present in every subsequent step's environment.
# Do not add steps that dump env vars (e.g., `env | sort`) after this point.
token_delimiter="CPLN_TOKEN_$(openssl rand -hex 8)"
{
echo "CPLN_TOKEN<<${token_delimiter}"
printf '%s\n' "$CPLN_TOKEN"
echo "${token_delimiter}"
} >> "$GITHUB_ENV"

Comment on lines +28 to +35
# TODO: replace this string-matching with a structured signal once `cpflow exists` exposes one
# (e.g. a distinct exit code for "not found" vs. API/auth errors, or `cpflow exists --json`).
# Until then, keep this list in sync if `cpflow exists` starts emitting new error patterns —
# any unmatched error string would otherwise be silently treated as "app not found".
exists_output=""
if ! exists_output="$(cpflow exists -a "$APP_NAME" --org "$CPLN_ORG" 2>&1)"; then
case "$exists_output" in
*"Double check your org"*|*"Unknown API token format"*|*"ERROR"*|*"Error:"*|*"Traceback"*|*"Net::"*)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same cpflow exists error-pattern matching appears verbatim in the Check if review app exists step in cpflow-deploy-review-app.yml (line ~195). The TODO comment here is accurate but only covers this file — the workflow copy is uncovered.

If cpflow adds a new error pattern (e.g. "rate limit" or "timeout"), both locations need updating in sync. Consider extracting this into a shared script (e.g. .github/scripts/cpflow-app-exists.sh) that both the action and the workflow call, so the pattern list is maintained in one place.

# repo variable to tighten this for apps that expose a dedicated health endpoint
# (e.g. "200" for a plain /health, or "200 401 403" for apps that auth-gate / without
# redirecting).
HEALTH_CHECK_ACCEPTED_STATUSES: ${{ vars.HEALTH_CHECK_ACCEPTED_STATUSES || '200 301 302' }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepting 301/302 as "healthy" means a redirect to a login page, maintenance page, or even a misconfigured CDN passes the health check. The comment correctly explains why (curl without -L, Rails auth-gating /), but a redirect can pass even when the destination is broken.

Consider adding a step summary warning when a non-200 status is accepted, so operators reviewing the promotion log have visibility:

if [[ "${http_status}" == "${accepted}" ]]; then
  if [[ "${http_status}" != "200" ]]; then
    echo "::warning::Health check passed with HTTP ${http_status} (not 200) — verify the redirect destination is healthy."
  fi
  echo "healthy=true" >> "$GITHUB_OUTPUT"
  exit 0
fi

Also worth documenting in .controlplane/shakacode-team.md that setting HEALTH_CHECK_ACCEPTED_STATUSES=200 and adding a /health endpoint is the recommended production posture.

on:
workflow_dispatch:
schedule:
- cron: "0 0 * * *"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheduled workflows that fail silently are easy to miss — GitHub does send an email for failed scheduled runs but only to the repo owner by default, and only after 7 days of consecutive failures (as of current GitHub behavior).

Consider adding a failure notification step, or at minimum documenting that the team should verify GitHub's "Notify me of failed workflow runs" notification setting is enabled. Stale review apps that aren't cleaned up accrue Control Plane costs.

Suggested change
- cron: "0 0 * * *"
- cron: "0 3 * * *" # 3 AM UTC — avoids exact midnight when many repos run scheduled jobs simultaneously

(The cron timing is a minor suggestion — staggering off midnight reduces GitHub's scheduler contention, making the run more likely to start on time.)

workload_json="$(cpln workload get "${workload_name}" --gvc "${PRODUCTION_APP_NAME}" --org "${CPLN_ORG_PRODUCTION}" -o json)"
# current_image/current_version are summary fields for the first container
# of the selected workload; rollback_state below captures all containers.
workload_image="$(echo "${workload_json}" | jq -r '.spec.containers[0].image')"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation is inconsistent with the rest of the loop body. workload_image is at 10 spaces while workload_containers and workload_version below it are at 12. Bash ignores the difference, but it looks like a copy/paste artifact and will confuse readers into thinking this line is outside the loop.

Suggested change
workload_image="$(echo "${workload_json}" | jq -r '.spec.containers[0].image')"
workload_image="$(echo "${workload_json}" | jq -r '.spec.containers[0].image')"

Comment on lines +148 to +149
staging_vars="$(CPLN_TOKEN="${CPLN_TOKEN_STAGING}" cpln gvc get "${STAGING_APP_NAME}" --org "${CPLN_ORG_STAGING}" -o json | jq -r '.spec.env // [] | .[].name' | sort)"
production_vars="$(CPLN_TOKEN="${CPLN_TOKEN_PRODUCTION}" cpln gvc get "${PRODUCTION_APP_NAME}" --org "${CPLN_ORG_PRODUCTION}" -o json | jq -r '.spec.env // [] | .[].name' | sort)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parity check covers GVC-level env vars only. Workload-level environment variables (set directly on individual workloads rather than the GVC) and Control Plane secret references (cpln://secret/... values) are not compared. A variable added to a specific staging workload won't be caught here and could cause a silent mismatch in production after promotion.

Consider adding a note to the step summary output — e.g. echo "Note: workload-level env vars and cpln:// secret references are not compared." >> "$GITHUB_STEP_SUMMARY" — so operators are reminded of the scope on every promotion run.

shell: bash
run: |
set -euo pipefail
cpflow cleanup-stale-apps -a "${REVIEW_APP_PREFIX}" --org "${CPLN_ORG_STAGING}" --yes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runs destructively with --yes on every scheduled tick with no dry-run path. If REVIEW_APP_PREFIX is ever misconfigured (e.g., set to a prefix that matches staging-adjacent apps), apps are silently deleted. A low-cost safeguard would be to add a workflow_dispatch input for dry-run mode:

Suggested change
cpflow cleanup-stale-apps -a "${REVIEW_APP_PREFIX}" --org "${CPLN_ORG_STAGING}" --yes
cpflow cleanup-stale-apps -a "${REVIEW_APP_PREFIX}" --org "${CPLN_ORG_STAGING}" --yes
echo "Stale review apps under prefix '${REVIEW_APP_PREFIX}' have been cleaned up." >> "$GITHUB_STEP_SUMMARY"

(Or add a separate dry_run dispatch input that swaps --yes for a list-only command.)

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Fork PR slash-command rejection posts no PR comment
    • Added an issue_comment-only rejection comment step for fork PR deploy requests before all deployment-gated steps are skipped.

Create PR

Or push these changes by commenting:

@cursor push 85e534a5ab
Preview (85e534a5ab)
diff --git a/.github/workflows/cpflow-deploy-review-app.yml b/.github/workflows/cpflow-deploy-review-app.yml
--- a/.github/workflows/cpflow-deploy-review-app.yml
+++ b/.github/workflows/cpflow-deploy-review-app.yml
@@ -157,6 +157,18 @@
           echo "Review app deploys from fork pull requests are not allowed for workflow_dispatch because this workflow uses repository secrets." >&2
           exit 1
 
+      - name: Comment on fork PR deploy rejection
+        if: steps.config.outputs.ready == 'true' && steps.source.outputs.allowed == 'false' && github.event_name == 'issue_comment'
+        uses: actions/github-script@v7
+        with:
+          script: |
+            await github.rest.issues.createComment({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              issue_number: Number(process.env.PR_NUMBER),
+              body: `Review app deploys from fork pull requests require a branch in ${process.env.GITHUB_REPOSITORY}.\n\nThis workflow builds Docker images with repository secrets, so comment-triggered deploys only run for branches in the base repository.`
+            });
+
       - name: Checkout PR commit
         if: steps.config.outputs.ready == 'true' && steps.source.outputs.allowed == 'true'
         uses: actions/checkout@v4

You can send follow-ups to the cloud agent here.

Comment thread .github/workflows/cpflow-deploy-review-app.yml
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

@github-actions github-actions Bot temporarily deployed to review/qa-react-webpack-rails-tutorial-732 May 1, 2026 22:26 Inactive
@claude
Copy link
Copy Markdown

claude Bot commented May 1, 2026

Code Review: Generated cpflow GitHub Actions Flow

Overview

This PR replaces hand-written Control Plane GitHub Actions with the generated cpflow-* action and workflow set. It's a significant improvement in both security posture and operational robustness. The PR is well-considered and production-deployable with a few items worth addressing or acknowledging before merge.


Security — Notable Improvements ✅

  • Token no longer leaks into process args: The old setup-environment action called cpln profile update default --org "$ORG" --token "$TOKEN", which exposes the token in /proc/<pid>/cmdline and ps output. The new setup action passes it via CPLN_TOKEN env var only.
  • SSH key isolation: The new cpflow-build-docker-image action handles the SSH private key in a dedicated step with a scoped env, so ACTIONS_STEP_DEBUG=true (which dumps env before commands) cannot expose the key in the build step.
  • Randomized heredoc delimiters: CPLN_TOKEN is written to GITHUB_ENV using CPLN_TOKEN_$(openssl rand -hex 8) as the terminator, preventing a multiline token value from terminating the heredoc early.
  • Prefix guard + production/staging word guard on deletion: The new delete-app.sh checks both that the app name matches REVIEW_APP_PREFIX-* AND that it doesn't contain production or staging as word segments — defense in depth against a misconfigured prefix.
  • author_association check added to slash commands: The old /delete-review-app handler had no author check; the new one requires OWNER, MEMBER, or COLLABORATOR.
  • Old staging workflow triggered on ALL branches: The deleted deploy-to-control-plane-staging.yml had branches: ['*'], meaning every branch push would attempt a staging deploy. The new workflow correctly gates to master only.
  • eval() removed from JavaScript steps: The deleted workflows used eval(process.env.GET_CONSOLE_LINK) — an eval of an env-derived string in JS — which is now gone entirely.

Issues and Observations

🔴 High — Rollback doesn't reverse migration side-effects

The rollback step in cpflow-promote-staging-to-production.yml restores workload container images via cpln workload update --set spec.containers[N].image=..., but any release_script side-effects (database migrations, cache warm-ups, etc.) that ran during the failed promotion are not reversed. The shakacode-team.md docs mention this, but the workflow itself has no warning comment at the rollback step. Teams discovering a failed promotion may not immediately check docs and could assume a clean rollback.

Suggestion: Add a ::warning:: annotation at the top of the rollback step so it's visible in the Actions UI when a rollback runs:

echo "::warning::Rollback restores workload images only. Database migrations and release_script side-effects are NOT reversed. Verify database state manually before re-promoting."

🟡 Medium — issue_comment slash-command changes don't take effect until merged

The cpflow-deploy-review-app.yml workflow is triggered by issue_comment events, which GitHub always runs from the default branch. A comment like /deploy-review-app on a PR that contains changes to this very workflow will execute the default branch version, not the PR branch version. The workaround (gh workflow run cpflow-deploy-review-app.yml --ref <branch> -f pr_number=<pr-number>) is documented inline and in the docs — but developers iterating on the workflow may be surprised the first time. Consider adding a visible note in cpflow-help.md about this constraint.

🟡 Medium — pull_request_target in delete workflow requires ongoing vigilance

cpflow-delete-review-app.yml uses pull_request_target (which has secret access) and adds a correct comment explaining the safe checkout. However, pull_request_target with composite actions loaded from the checked-out base is a pattern that historically has been misused. The current code is correct, but future contributors editing this workflow should be aware of the trust boundary. Consider adding a prominent # SECURITY: comment block at the workflow level (not just inside the checkout step) summarizing the constraint.

🟡 Medium — Serial rollback readiness polling

The "Wait for rollback readiness" step in cpflow-promote-staging-to-production.yml polls each workload sequentially, so the worst-case wait is N_workloads × ROLLBACK_READINESS_RETRIES × ROLLBACK_READINESS_INTERVAL. For single-workload apps this is fine, but multi-workload apps could exceed the 45-minute job timeout if retries are high. Consider either parallelising (requires a matrix job, complex), or documenting the calculation in the defaults comment.

🟡 Medium — Environment variable parity check scope

The "Verify production environment variables" step compares GVC-level env var names only. It does not check:

  • Whether the values match (intentional — secrets can't be compared)
  • Workload-level env vars
  • cpln:// secret references

This is documented in the step summary output and in the docs. However, a renamed secret (old name still in production, new name only in staging) would pass this check since only names are compared. The check provides a useful signal but not a guarantee of parity — make sure the team understands its scope.

🟢 Minor — Two /help workflows may produce duplicate comments

cpflow-review-app-help.yml posts a quick-reference comment when a PR is opened, and cpflow-help-command.yml posts the full cpflow-help.md when /help is commented. On a new PR, both could end up posting within seconds of each other. The two messages are complementary, but consider whether the PR-open message is worth the noise. Alternatively, deduplicate by checking for existing bot comments before posting.

🟢 Minor — validate-branch job holds a 5-minute timeout but may checkout

In cpflow-deploy-staging.yml, the validate-branch job runs a fast branch check first, then conditionally does actions/checkout@v4 within the same 5-minute timeout window. For a large repository the checkout alone can take 1-2 minutes, leaving little headroom. Consider bumping validate-branch's timeout-minutes to 10 or splitting the checkout into the build job's scope.


Summary

The overall direction is excellent. The security improvements are real and meaningful. The main pre-merge items are: adding a warning annotation to the rollback step (so on-call engineers see it in the UI during an incident), and considering the issue_comment workflow-version caveat in the user-facing docs. Everything else is polish or awareness.

# Guard against a misconfigured REVIEW_APP_PREFIX that would otherwise match a
# well-known shared environment. This intentionally rejects review-app prefixes
# containing these word segments; failing closed is safer for deletion.
if echo "$APP_NAME" | grep -iqE '(^|-)(production|staging)(-|$)'; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word-boundary regex (^|-)(production|staging)(-|$) is a significant improvement over the deleted action's grep -iqE '(production|staging)', which would have incorrectly blocked a hypothetical review-app prefix like staging-feature-demos-pr. The new pattern correctly rejects only apps where production or staging appears as a full hyphen-delimited word segment.

One edge case to be aware of: this guard runs on $APP_NAME (the fully-formed app name including prefix + PR number), not on $REVIEW_APP_PREFIX alone. So a review app named qa-app-pr-1234 passes the guard even if someone later misconfigures the prefix. That's the intended "fail closed on the constructed name" behavior — just confirming it's deliberate.


# Persist the token for later cpflow/cpln steps via env. Use a randomized
# delimiter so a multiline token cannot terminate the heredoc early.
delim="CPLN_TOKEN_$(openssl rand -hex 8)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use of a randomized heredoc delimiter. The old setup-environment action called cpln profile update default --org "$ORG" --token "$TOKEN", which would expose the token in /proc/<pid>/cmdline and ps output on any runner with a shared process table. The env-var approach here avoids that, and the randomized delimiter (CPLN_TOKEN_<hex>) prevents a multiline token value from accidentally terminating the heredoc early.

Note: writing CPLN_TOKEN to GITHUB_ENV means it persists for all subsequent steps in the job. That's intentional so downstream cpflow/cpln commands pick it up automatically, but it does mean any step that runs after this one has an implicit dependency on the env var being set. Worth keeping in mind when adding new steps to calling workflows.

run: |
# Best-effort rollback: try every workload, aggregate failures, exit non-zero at the end
# if any failed. A single cpln hiccup shouldn't leave other workloads mid-promotion.
set -uo pipefail
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set -uo pipefail intentionally omits -e so that a failure for one workload doesn't abort the loop before attempting all remaining workloads. This is the correct pattern for best-effort rollback (aggregate failures, then exit non-zero at the end). Worth a brief inline note here to prevent a future reviewer from "fixing" it to set -euo pipefail:

Suggested change
set -uo pipefail
# Intentionally omit -e: we want to attempt rollback on all workloads even if one fails.
set -uo pipefail

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Delete workflow missing config-ready gate on subsequent steps
    • Added the config step id and gated every downstream delete workflow step on the validate action's ready output so missing configuration exits quietly.

Create PR

Or push these changes by commenting:

@cursor push bf3a06d409
Preview (bf3a06d409)
diff --git a/.github/workflows/cpflow-delete-review-app.yml b/.github/workflows/cpflow-delete-review-app.yml
--- a/.github/workflows/cpflow-delete-review-app.yml
+++ b/.github/workflows/cpflow-delete-review-app.yml
@@ -53,6 +53,7 @@
         uses: actions/checkout@v4
 
       - name: Validate required secrets and variables
+        id: config
         uses: ./.github/actions/cpflow-validate-config
         env:
           CPLN_TOKEN_STAGING: ${{ secrets.CPLN_TOKEN_STAGING }}
@@ -66,6 +67,7 @@
           pull_request_friendly: "true"
 
       - name: Setup environment
+        if: steps.config.outputs.ready == 'true'
         uses: ./.github/actions/cpflow-setup-environment
         with:
           token: ${{ secrets.CPLN_TOKEN_STAGING }}
@@ -74,6 +76,7 @@
           cpflow_version: ${{ vars.CPFLOW_VERSION }}
 
       - name: Set workflow links
+        if: steps.config.outputs.ready == 'true'
         uses: actions/github-script@v7
         with:
           script: |
@@ -85,6 +88,7 @@
             );
 
       - name: Create initial PR comment
+        if: steps.config.outputs.ready == 'true'
         id: create-comment
         uses: actions/github-script@v7
         with:
@@ -98,6 +102,7 @@
             core.setOutput("comment-id", comment.data.id);
 
       - name: Delete review app
+        if: steps.config.outputs.ready == 'true'
         uses: ./.github/actions/cpflow-delete-control-plane-app
         with:
           app_name: ${{ env.APP_NAME }}
@@ -105,6 +110,7 @@
           review_app_prefix: ${{ vars.REVIEW_APP_PREFIX }}
 
       - name: Mark GitHub deployment inactive
+        if: steps.config.outputs.ready == 'true'
         uses: actions/github-script@v7
         with:
           script: |
@@ -132,7 +138,7 @@
             }
 
       - name: Finalize delete status
-        if: always()
+        if: always() && steps.config.outputs.ready == 'true'
         uses: actions/github-script@v7
         with:
           script: |

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit fdda08d. Configure here.

Comment thread .github/workflows/cpflow-delete-review-app.yml

{
echo "### Environment parity scope"
echo "This check compares GVC-level environment variable names only."
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good that the step summary explicitly documents the scope limitation. One scenario that can slip through: if a staging secret is renamed (old name removed from staging, new name added), and production still has the old name, the comm -23 check only reports what's missing from production — so a renamed var where staging has NEW_NAME and production has OLD_NAME would not be caught (neither appears as "missing" from the other's perspective; they're just different names).

This is inherent to comparing names without values, and worth noting in the step summary output or the docs. The current wording "compares GVC-level environment variable names only" does cover this, but the rename case is subtle enough that an explicit sentence could help: "Renamed variables (removed from staging, not yet removed from production) are reported as production-only variables, not as missing."

# deploy branches unless `cpflow generate-github-actions --staging-branch BRANCH`
# was used. If STAGING_APP_BRANCH is later changed in repository variables, keep
# this list in sync so pushes to that branch actually trigger the workflow.
branches: ["master"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a meaningful improvement. The deleted deploy-to-control-plane-staging.yml had branches: ['*'], which would trigger a staging deploy attempt on every branch push to the repo — relying solely on the validate-branch job's runtime check to bail out. The new workflow correctly gates at the workflow trigger level, so only pushes to master start any jobs at all, saving runner minutes and reducing noise.

(github.event_name == 'issue_comment' &&
github.event.issue.pull_request &&
github.event.comment.body == '/deploy-review-app' &&
contains(fromJson('["OWNER","MEMBER","COLLABORATOR"]'), github.event.comment.author_association))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition: the old delete-review-app.yml checked github.event.issue.pull_request (ensuring the comment was on a PR) but had no author_association guard, meaning any user who could comment on a PR could trigger a deploy. The new filter restricts slash commands to OWNER, MEMBER, and COLLABORATOR — appropriate for secrets-backed deployments.

One subtlety: COLLABORATOR in GitHub's model means a user explicitly invited to the repo with at least triage access. External contributors on a public repo default to NONE. This is the right default for protecting Control Plane credentials.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

@github-actions github-actions Bot temporarily deployed to review/qa-react-webpack-rails-tutorial-732 May 1, 2026 22:35 Inactive
@claude
Copy link
Copy Markdown

claude Bot commented May 1, 2026

Code Review: Replace hand-written CI/CD with generated cpflow GitHub Actions

Overview

This PR replaces ~1,150 lines of hand-crafted Control Plane GitHub Actions with ~2,025 lines of generated cpflow-* composite actions and workflows. The scope covers four deployment flows: opt-in review apps (slash-command triggered), branch-gated staging deploys, manual staging→production promotion with rollback, and nightly stale review-app cleanup.


Security — Significant Improvements ✅

The new code fixes several real security issues present in the deleted workflows:

Issue Old New
Token in argv --token "$TOKEN" (visible in ps//proc) CPLN_TOKEN env var
Token scope Workflow-level env: (all steps, all jobs) Passed only to setup action, then via GITHUB_ENV with masked value
Slash command auth No author_association check OWNER/MEMBER/COLLABORATOR gate on both deploy and delete
Slash command injection contains(github.event.comment.body, '/deploy-review-app') Exact body match (== '/deploy-review-app')
Fork PR secrets pull_request event: fork PRs could trigger build steps Fork head detection + same_repo guard
PR-close cleanup from forks pull_request (read-only for forks) pull_request_target with base-branch checkout only
SSH key in env N/A Isolated step, trap cleanup, defence-in-depth if: always() removal
delete-app.sh safety guard grep -iqE '(production|staging)' (matches substrings) Word-boundary regex (^|-)(production|staging)(-|$)

Concerns

1. pull_request: opened trigger burns CI minutes on every new same-repo PR

cpflow-deploy-review-app.yml now triggers on [opened, synchronize, reopened]. For opened events, no review app exists yet, so the workflow always reaches the "Skip auto deploy until a review app is created" step — but only after running checkout, Ruby setup, cpln/cpflow install, PR resolve, source validation, and the cpflow exists call. The old workflow used [synchronize, reopened] only.

If the intent is just to inform developers, a lighter approach would be to move the skip check before the setup steps, or gate opened events out of the job-level if condition (note that it's already gated to same-repo PRs, so fork churn is avoided, but same-repo churn is not).

2. CPLN_TOKEN written to $GITHUB_ENV (broader than needed)

In cpflow-setup-environment/action.yml, the CP token is persisted to all subsequent steps in the calling job via GITHUB_ENV. GitHub Actions masks the secret value in logs, so this doesn't expose it in plain text. But it does increase the attack surface: any later step (including third-party composite actions called after setup) inherits the CPLN_TOKEN env var and could exfiltrate it in a compromised supply-chain scenario.

An alternative: cpflow's CLI supports --token in non-interactive use, or the env var could be cleared after the login step using a follow-up echo "CPLN_TOKEN=" >> "$GITHUB_ENV". That said, cpflow commands in downstream steps do rely on this token, so clearing it would require rethinking those steps.

3. cpflow-promote-staging-to-production.ymlcontents: write is job-wide

permissions: contents: write is required only for the gh release create step. Since all jobs share this workflow-level permission, every step in the promote-to-production job (including Ruby scripts reading controlplane.yml) runs with write access to repo contents. Splitting the release creation into a separate job with scoped permissions would reduce the blast radius of a compromised step.

4. Env-var parity check gap (documented but worth flagging)

The "Verify production environment variables" step in the promotion workflow compares GVC-level env var names only. Workload-level env vars and cpln:// secret references are not checked. This is correctly documented in both the code and shakacode-team.md, but it means a new workload-level env var added to staging but missing in production will not block promotion.


Code Quality

The new code is a substantial improvement:

  • set -euo pipefail on all shell scripts (old code used set -e only)
  • Randomized heredoc delimiters for multi-line GITHUB_OUTPUT/GITHUB_ENV values (prevents injection via newlines)
  • Proper cpflow exists exit-code semantics (0 = exists, 2 = not found, other = error)
  • Ruby installed without sudo npm; user-local npm prefix avoids system permissions
  • Ruby auto-detected from .ruby-version/Gemfile instead of hardcoded 3.4.6
  • URL extraction via jq -r '.status.endpoint' (old code used a fragile grep -oP regex)
  • Rollback captures state for all workloads before deploying, handles container-set drift, and aggregates failures
  • Concurrency groups prevent partial deploys from cancellation mid-rollout

Minor Notes

  • cpflow-deploy-staging.yml uses cancel-in-progress: false, which means rapid pushes to master queue up sequentially. Deploys always complete in submission order, so the final state is always the latest push — this is correct behavior, but could cause a slow drain under a burst of merges.
  • The staged PR comment updates (starting → building → deploying → done) give a good UX and avoid the "silent for 5 minutes" problem common in deploy workflows.
  • cpflow-review-app-help.yml correctly uses pull_request_target without a code checkout, with a clear warning comment against adding one.

Summary

This is a high-quality, well-documented overhaul. The security posture is meaningfully better than what it replaces. The three items above (CI minute waste on opened, token scope, and permission scope) are worth considering before merge, but none are blockers. The PR is in draft — once the rspec gap (PostgreSQL not running locally) is resolved and the staging deploy is verified end-to-end, this looks ready.


on:
pull_request:
types: [opened, synchronize, reopened]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding opened to the trigger list means every new same-repo PR runs this workflow even when no review app has been created yet. The job will checkout, install Ruby + cpln + cpflow, resolve the PR, validate the source, and call cpflow exists before reaching the "Skip auto deploy" step. The old workflow used [synchronize, reopened] only.

If the goal is just to inform developers, consider moving the skip earlier (before setup) or removing opened and relying on the help message posted by cpflow-review-app-help.yml instead. At minimum, a comment explaining the intentional design trade-off would help future maintainers.

Suggested change
types: [opened, synchronize, reopened]
types: [synchronize, reopened]

Comment on lines +86 to +89
# Persist the token for later cpflow/cpln steps via env. Use a randomized
# delimiter so a multiline token cannot terminate the heredoc early.
delim="CPLN_TOKEN_$(openssl rand -hex 8)"
{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing CPLN_TOKEN to $GITHUB_ENV makes it available to every subsequent step in the calling job, including any third-party composite actions invoked after this one. GitHub's log masking prevents the value from appearing in plain text, but the env var is still accessible to step code.

This is likely necessary because downstream cpflow and cpln commands depend on it. If it can be scoped more narrowly (e.g., only passed as env: to steps that need it rather than persisted globally), that would reduce the blast radius of a compromised downstream action. Worth a comment here explaining why global persistence is required.

type: string

permissions:
contents: write # Required for `gh release create` in the "Create GitHub release" step.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contents: write is required only for the gh release create step, but this permission applies to the entire promote-to-production job. All steps — including the Ruby script that reads controlplane.yml, the env-var parity check calling cpln gvc get, and the rollback logic — run with write access to repo contents.

Consider splitting the GitHub release creation into a separate job that depends on promote-to-production with only contents: write, and narrowing this workflow permission to contents: read. That limits the exposure if any intermediate step (e.g., a cpln CLI bug or a compromised gem) tries to write to the repo.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

@claude
Copy link
Copy Markdown

claude Bot commented May 1, 2026

Code Review

Overview

This PR replaces bespoke, hand-written GitHub Actions with the generated cpflow-* composite actions and workflows. It adds structured review app lifecycle management (opt-in create, auto-redeploy on push, delete on PR close), staging deploys gated to a branch, manual production promotion with health-check + best-effort rollback, and a nightly stale-app cleanup. The documentation updates are comprehensive and thorough.

Overall this is a significant quality improvement. The security posture is meaningfully better than the code being replaced (details below).


Security Improvements (notable fixes over old code)

  • Token exposure fixed: Old setup-environment/action.yml called cpln profile update default --org "$ORG" --token "$TOKEN", which leaked the token into /proc/<pid>/cmdline and ps output. New code passes the token exclusively via CPLN_TOKEN env var.
  • Fork PR secret exposure fixed: Old deploy-to-control-plane-review-app.yml triggered on pull_request for all PRs including forks, and used contains(github.event.comment.body, '/deploy-review-app') (partial match — any comment containing the substring triggered a deploy). New code uses exact match, restricts to OWNER/MEMBER/COLLABORATOR, and explicitly rejects fork-head builds.
  • PR-close cleanup from forks: Old delete-review-app.yml used pull_request for the closed event type, which has no write access for fork PRs — review apps from forks were silently never cleaned up. New code correctly uses pull_request_target with a clear trust-boundary comment.
  • Deletion guard hardened: Old delete-app.sh checked a simple regex on the app name for "production"/"staging". New code also validates the app name starts with the required REVIEW_APP_PREFIX before any deletion proceeds.
  • SSH key isolation: SSH private key is handled in a dedicated step whose env scope is explicitly limited, with a defense-in-depth always() cleanup step to cover runner cancellations.

Issues to Address

1. CPLN_TOKEN is written into GITHUB_ENV as plaintext

In cpflow-setup-environment/action.yml lines 86-93, the token is written to $GITHUB_ENV with a randomized heredoc delimiter (good: prevents heredoc injection). However, $GITHUB_ENV is a file on the runner's filesystem readable by any subsequent step in the same job. GitHub Actions masks known secret values in logs, but the token is now accessible as an env var named CPLN_TOKEN to every subsequent step — including third-party actions. This is a common pattern, but for a token with production access (CPLN_TOKEN_PRODUCTION in the promote workflow), consider scoping it to only steps that need it via env: on individual steps rather than broadcasting it via GITHUB_ENV.

2. Production promotion env-var check is incomplete

The "Verify production environment variables" step (promote workflow) compares GVC-level env var names only — it explicitly skips workload-level env vars and cpln:// secret references. This is called out clearly in the step summary, but a workload-level secret (e.g. DATABASE_URL set on the rails workload via a cpln:// reference) could be absent from production with no warning. Consider documenting a companion manual checklist or a future enhancement to cover workload-level checks.

3. Rollback does not cover release-phase side effects

The rollback step correctly warns: "Database migrations and release_script side effects are not reversed." However, .controlplane/shakacode-team.md doesn't yet have a "What to do if production promotion fails" section that references this limitation and describes manual recovery steps. Adding this would reduce MTTR during an incident.

4. Nightly cleanup timing

cpflow-cleanup-stale-review-apps.yml runs at 0 0 * * * (midnight UTC). If a review app is actively being tested late at night the nightly sweep could delete it mid-session. Consider whether cpflow cleanup-stale-apps has a --max-age or inactivity threshold that maps to your team's typical PR lifetime.


Minor Notes

  • cpflow-review-app-help.yml posts a comment on every non-bot PR open. For repos with high PR velocity this generates noise. A team-preference label (e.g. skip-review-app-help) could allow opting out per-PR.
  • cpflow-deploy-review-app.yml triggers on synchronize and reopened but not opened. This is the intentional opt-in design, and the PR-open help comment mitigates the discoverability gap well.
  • cpflow-help-command.yml's /help handler does fs.readFileSync('.github/cpflow-help.md', 'utf8') without a try/catch. If the file is accidentally deleted, the step throws an uncaught exception instead of a clean message. Wrapping in try/catch with a fallback would be more user-friendly.
  • Old setup-environment/action.yml used sudo npm install -g for the CP CLI. New code uses a user-local ~/.npm-global prefix — better practice, avoids unnecessary sudo escalation.

Summary

The PR is in good shape. The security regressions in the old code have been properly fixed. The main items to address before merging are: (1) noting the GITHUB_ENV token persistence as an accepted risk or planned follow-up, and (2) adding a brief on-call runbook entry for production promotion failures that references the rollback limitation.

echo "CPLN_TOKEN<<${delim}"
echo "${CPLN_TOKEN}"
echo "${delim}"
} >> "$GITHUB_ENV"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The randomized heredoc delimiter here is a good defence against a malicious token value terminating the heredoc early — nice touch.

One concern worth noting: writing to $GITHUB_ENV broadcasts CPLN_TOKEN to all subsequent steps in the job, including any third-party actions added later. For the staging org this is acceptable, but the promote workflow reuses this action with CPLN_TOKEN_PRODUCTION. If a third-party action is ever added to that job it will have implicit access to the production token.

Alternative: instead of persisting to GITHUB_ENV, pass the token to only the steps that need it via env: on each individual step. This is more verbose but tightly scopes the secret.

# Guard against a misconfigured REVIEW_APP_PREFIX that would otherwise match a
# well-known shared environment. This intentionally rejects review-app prefixes
# containing these word segments; failing closed is safer for deletion.
if echo "$APP_NAME" | grep -iqE '(^|-)(production|staging)(-|$)'; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex (^|-)(production|staging)(-|$) is a good improvement over the old plain grep -i 'production|staging', which would have blocked names like qa-productions-preview unnecessarily. The word-boundary anchors reduce false positives.

One edge case: this check runs after the prefix check on line 10. If someone sets REVIEW_APP_PREFIX to something that passes line 10 but contains staging or production as a segment (e.g. qa-staging-pr), this guard fires and blocks all review app deletions entirely — a confusing failure mode. The comment on lines 17-19 already calls this out ("intentionally rejects review-app prefixes containing these word segments"), which is the right trade-off (fail closed). Just make sure README / team docs note this constraint when choosing a prefix value.

# SECURITY: pull_request_target is intentional for PR-close cleanup from forks.
# This workflow must keep checking out trusted base-branch code only; do not
# change checkout to a PR head ref without re-evaluating secret exposure.
pull_request_target:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull_request_target is the correct event here — pull_request cannot write to secrets, so fork PRs would silently fail to delete their review apps on close. The existing comment (lines 4-6) explaining the trust boundary is important and should not be removed.

One subtlety to keep in mind: pull_request_target runs from the default branch, so this workflow's code at the closed event always reflects what's on master, not what's on the merged/closed PR branch. This is intentional and desirable (it means the workflow itself cannot be tampered with via the PR), but it also means that changes to this workflow file only take effect after they're merged to master — testing workflow changes via a PR won't exercise the PR-close path until after merge.

github.event_name == 'workflow_dispatch' ||
(github.event_name == 'issue_comment' &&
github.event.issue.pull_request &&
github.event.comment.body == '/deploy-review-app' &&
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exact match (==) rather than the old contains() is the right choice — the old approach would have triggered a deploy from any comment that mentioned /deploy-review-app as part of a longer message (e.g. "don't run /deploy-review-app yet"). The exact-match requirement is also documented in the help text, so this is consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant