docs: fix issues found running the guide end-to-end on H100 spot #1

Merged: kvinwang merged 1 commit into main from doctest-fixes-2026-05-27 on May 27, 2026
@kvinwang
Collaborator

Summary

Walked through guide_EN.md end-to-end on a fresh us-central1-a a3-highgpu-1g SPOT instance (deploy → SSH → CC verify → matmul → remove). Booted in ~3.5 min; verified CC status: ON, Confidential Compute GPUs Ready state: ready, and 100x 4096^3 matmul: 0.360s -> 38229 GFLOPs. This PR fixes the things that broke or surprised me during that run.
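
As a sanity check on the throughput figure above, here is a minimal sketch of the arithmetic. It assumes the usual convention of counting 2·n³ floating-point operations per n×n matmul; the function name `matmul_gflops` is hypothetical and not part of the guide.

```python
def matmul_gflops(n: int, iterations: int, seconds: float) -> float:
    """Return sustained GFLOP/s for `iterations` square n x n matmuls.

    Assumes the conventional 2 * n^3 FLOP count per matmul
    (n^3 multiply-add pairs, each counted as two operations).
    """
    flop_per_matmul = 2 * n**3
    return iterations * flop_per_matmul / seconds / 1e9


# 100 matmuls of size 4096 in the reported (rounded) 0.360 s
estimate = matmul_gflops(4096, 100, 0.360)
```

With the rounded 0.360 s this gives roughly 38,177 GFLOPs, within about 0.2% of the reported 38229; the small gap is what one would expect if the script divided by an unrounded time slightly under 0.360 s.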

Changes

| # | Section | Problem | Fix |
| - | ------- | ------- | --- |
| 1 | §2.1 tools | `gsutil` not listed; deploy crashed with `FileNotFoundError: 'gsutil'` because my gcloud SDK install put it under `$(gcloud info ... sdk_root)/bin/` but not on `$PATH` | Added a row with the symlink snippet |
| 2 | §2.3 install note | An older `dstack-cloud` (pre-#15) parses fine but drops `provisioning_model` on the round-trip, so §3.2's `"provisioning_model": "SPOT"` silently regresses to `STANDARD` | Made the "refresh your install" step explicit with the consequence |
| 3 | §3.4 `prelaunch.sh` | The fenced script body ended with `chmod +x prelaunch.sh`, which would only run inside the TEE guest (no `prelaunch.sh` in cwd there). `dstack-cloud new` already creates the script `0755`, so editing preserves the mode | Removed the stray line; added a one-liner noting the file is already executable |
| 4 | §4.1 `ls shared/` | The expected output listed `.user-config`, but `prepare()` only writes 3 files. `.user-config` is copied from the project root into the FAT shared image at deploy time | Corrected the listing; added a one-liner explaining the timing |
| 5 | §4.3 boot log markers | `libspdm_check_crypto_backend: LKCA wrappers found.` does not appear on the serial console that `dstack-cloud logs` taps (verified by `grep -ci "spdm\|lkca" boot.log` → 0). CC mode was still ON, so LKCA was working; the message just doesn't reach the console | Removed the line from the expected output; redirected the LKCA verification to `nvidia-smi conf-compute -f` (§5.3) and clarified that the `... found stubs!` warning is the 0.6.0-only symptom |
| 6 | README | Linked to `guide_CN.md` and `workshop/`, neither of which exists in the repo | Removed `guide_CN.md` link (kept "coming soon" text); rewrote the workshop sentence to point inside `guide_EN.md` |
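
Fix 1 comes down to a tool that is installed but not on `$PATH`. A preflight check along the following lines would surface that before deploy reaches the upload step. This is only a sketch: `find_missing_tools` is a hypothetical helper, not something `dstack-cloud` provides, and the lookup function is injectable purely so the logic can be exercised without the real tools installed.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def find_missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be resolved on PATH, in input order.

    `which` defaults to shutil.which, which returns None when the
    executable is not found on the current PATH.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling `find_missing_tools(["gcloud", "gsutil"])` before deploying returns an empty list when both resolve, and `["gsutil"]` in the partial-SDK layout described in row 1.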

Test plan

  • dstack-cloud pull dstack-cloud-nvidia-0.6.1: works, produces disk.raw + auth_hash.txt.
  • dstack-cloud new gpu-hello-doctest: generates app.json with provisioning_model: STANDARD (after refreshing install per fix #2); manual edit to SPOT works.
  • dstack-cloud prepare: generates .instance_info, .sys-config.json, app-compose.json (3 files; fix #4 reflects this).
  • dstack-cloud -v deploy: fails without gsutil on PATH (fix #1 addresses this); after linking, deploys cleanly. The gcloud args include --provisioning-model=SPOT --instance-termination-action=STOP.
  • §5.2 SSH via openssl s_client ProxyCommand: works.
  • §5.3 nvidia-smi conf-compute -f → CC status: ON; nvidia-smi conf-compute -grs → ready.
  • §5.4 docker logs dstack-pytorch-1 after ~3 min: matmul prints the expected ~38 TFLOPs.
  • §6 dstack-cloud remove + gsutil rm gs://.../dstack-gpu-hello-doctest-shared.tar.gz: clean.
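
Because the regression in fix 2 is silent, it may be worth checking app.json after any tool has rewritten it. The sketch below assumes `provisioning_model` is a top-level key in app.json; the PR only shows the key/value pair, not where it sits in the file, so adjust the lookup if the real layout differs. `provisioning_model_ok` is a hypothetical helper.

```python
import json


def provisioning_model_ok(app_json_text: str, expected: str = "SPOT") -> bool:
    """Return True if the config still carries the expected provisioning model.

    Assumption: "provisioning_model" is a top-level key of app.json.
    A missing key counts as a failure, since that is how the silent
    fallback to STANDARD would show up after a lossy round-trip.
    """
    config = json.loads(app_json_text)
    if not isinstance(config, dict):
        return False
    return config.get("provisioning_model") == expected
```

Passing the file contents, for example `provisioning_model_ok(open("app.json").read())`, gives a quick yes/no before running deploy.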

Walked through the guide end-to-end on a fresh us-central1-a
a3-highgpu-1g SPOT instance (boot to CC verification to remove).

Fixes:
- §2.1: list gsutil explicitly — `dstack-cloud deploy` shells out to
  `gsutil cp` for the boot/shared image upload, and a partial gcloud
  SDK layout that omits gsutil from PATH crashes deploy at the upload
  step with `FileNotFoundError: 'gsutil'`. Include the symlink snippet
  for the common case.
- §2.3: a dstack-cloud copy installed before #15 was merged still
  parses cleanly but drops the `provisioning_model` field on
  serialization, so the SPOT setting from §3.2 silently regresses to
  STANDARD. Spell out the "refresh your install" step.
- §3.4: drop the stray `chmod +x prelaunch.sh` line from inside the
  script body — it would only run inside the TEE guest where the cwd
  has no such file. `dstack-cloud new` already creates the script as
  0755, so editing it preserves the mode bits.
- §4.1: `ls shared/` after `prepare` shows three files, not four —
  `.user-config` stays at project root until deploy's FAT image build.
- §4.3: `libspdm_check_crypto_backend: LKCA wrappers found.` does not
  reach the serial console that `dstack-cloud logs` taps. Confirming
  LKCA is healthy via `nvidia-smi conf-compute -f` (§5.3) is the
  reliable check; keep the stub-fallback warning as a 0.6.0-only
  symptom.
- README: drop dead links to guide_CN.md and workshop/ (neither exists
  in the repo).
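
The §4.3 finding rests on counting how many captured console lines mention SPDM or LKCA. A minimal sketch of that count, assuming the log has already been saved locally (the PR description refers to it as boot.log); `count_marker_lines` is a hypothetical helper that behaves like a case-insensitive, per-line match on either word.

```python
import re
from typing import Iterable

# Case-insensitive match on either marker word.
MARKER = re.compile(r"spdm|lkca", re.IGNORECASE)


def count_marker_lines(lines: Iterable[str]) -> int:
    """Count lines that mention SPDM or LKCA anywhere in the line."""
    return sum(1 for line in lines if MARKER.search(line))
```

A count of zero on the serial-console capture says only that the messages did not reach that console; as the commit message notes, `nvidia-smi conf-compute -f` remains the check for whether CC mode is actually on.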
Copilot AI review requested due to automatic review settings May 27, 2026 07:37
kvinwang merged commit 0ab795d into main on May 27, 2026
1 check passed
Copilot AI left a comment

Pull request overview

Updates the documentation to reflect issues encountered when running the H100 SPOT guide end-to-end, aiming to make the walkthrough reproducible on a fresh GCP a3-highgpu-1g spot instance.

Changes:

  • Document gsutil as a required tool for dstack-cloud deploy, and clarify installation/refresh expectations for SPOT support.
  • Fix guide steps/output expectations around prelaunch.sh, shared/ contents, and boot-log markers/LKCA verification.
  • Remove/adjust README references to non-existent Chinese guide and workshop/ directory.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| ---- | ----------- |
| README.md | Removes broken links and points readers to relevant sections inside guide_EN.md. |
| guide_EN.md | Fixes prerequisites and corrects walkthrough steps/expected outputs based on an end-to-end run on H100 SPOT. |

Comment thread guide_EN.md
| Tool | Tested version | Notes |
| ----------------------------------------------- | -------------- | ----- |
| [gcloud SDK](https://cloud.google.com/sdk/docs/install) | 565+ | `gcloud auth login` against your GCP account |
| `gsutil` | bundled with gcloud SDK | `dstack-cloud deploy` shells out to `gsutil cp` to stage the boot/shared images; make sure it's on `$PATH` (`ln -s "$(gcloud info --format='value(installation.sdk_root)')/bin/gsutil" ~/.local/bin/gsutil` if missing) |
Comment thread guide_EN.md

```
dstack-prepare.sh ... Requesting app keys from KMS
dstack-prepare.sh ... Requesting app keys from KMS: https://kms.tdxlab.dstack.org:13001/prpc
```
Comment thread guide_EN.md
Comment on lines +349 to +350
Note that the SPDM/LKCA handshake messages (`libspdm_check_crypto_backend:
LKCA wrappers found.` etc.) go to the kernel/journal but are **not**