docs: fix issues found running the guide end-to-end on H100 spot #1
Merged
Walked through the guide end-to-end on a fresh us-central1-a a3-highgpu-1g SPOT instance (boot to CC verification to remove). Fixes:
- §2.1: list gsutil explicitly. `dstack-cloud deploy` shells out to `gsutil cp` for the boot/shared image upload, and a partial gcloud SDK layout that omits gsutil from PATH crashes deploy at the upload step with `FileNotFoundError: 'gsutil'`. Include the symlink snippet for the common case.
- §2.3: a dstack-cloud copy installed before #15 was merged still parses cleanly but drops the `provisioning_model` field on serialization, so the SPOT setting from §3.2 silently regresses to STANDARD. Spell out the "refresh your install" step.
- §3.4: drop the stray `chmod +x prelaunch.sh` line from inside the script body; it would only run inside the TEE guest, where the cwd has no such file. `dstack-cloud new` already creates the script as 0755, so editing it preserves the mode bits.
- §4.1: `ls shared/` after `prepare` shows three files, not four; `.user-config` stays at the project root until deploy's FAT image build.
- §4.3: `libspdm_check_crypto_backend: LKCA wrappers found.` does not reach the serial console that `dstack-cloud logs` taps. Confirming LKCA is healthy via `nvidia-smi conf-compute -f` (§5.3) is the reliable check; keep the stub-fallback warning as a 0.6.0-only symptom.
- README: drop dead links to guide_CN.md and workshop/ (neither exists in the repo).
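The §4.3 finding rests on counting console-log lines that mention SPDM or LKCA. The following is a minimal sketch of that count in Python, not part of the guide; the function name `count_marker_lines` and the sample log text are illustrative assumptions. One caveat when reproducing the count with grep: under grep's default basic regular expressions an unescaped `|` is a literal character, so the alternation needs `grep -E` (or `\|` in GNU grep). In Python's `re`, `|` is alternation by default.

```python
import re

# Case-insensitive match on either marker; `|` is alternation in Python's `re`.
MARKER = re.compile(r"spdm|lkca", re.IGNORECASE)


def count_marker_lines(log_text: str) -> int:
    """Count lines that mention SPDM or LKCA (each line counted at most once)."""
    return sum(1 for line in log_text.splitlines() if MARKER.search(line))


# Illustrative sample only: the second line is the marker quoted in the PR.
sample = (
    "dstack-prepare.sh ... Requesting app keys from KMS\n"
    "libspdm_check_crypto_backend: LKCA wrappers found.\n"
)
print(count_marker_lines(sample))  # one line matches, even though it has both words
```

A result of 0 on a real console capture only shows that the markers did not reach that log; per the PR, `nvidia-smi conf-compute -f` remains the check for whether CC mode is on.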
Pull request overview
Updates the documentation to reflect issues encountered when running the H100 SPOT guide end-to-end, aiming to make the walkthrough reproducible on a fresh GCP a3-highgpu-1g spot instance.
Changes:
- Document `gsutil` as a required tool for `dstack-cloud deploy`, and clarify installation/refresh expectations for SPOT support.
- Fix guide steps/output expectations around `prelaunch.sh`, `shared/` contents, and boot-log markers/LKCA verification.
- Remove/adjust README references to the non-existent Chinese guide and `workshop/` directory.
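Because the SPOT regression is silent, it can help to check the generated `app.json` before deploying. The sketch below is one hedged way to do that. The helper names are hypothetical, and since the position of `provisioning_model` inside `app.json` is not stated in this PR, the code searches the whole JSON tree instead of assuming a path.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def provisioning_model_ok(app_json_text: str, expected: str = "SPOT") -> bool:
    """True only if the field is present and every occurrence equals `expected`."""
    values = list(find_key(json.loads(app_json_text), "provisioning_model"))
    return bool(values) and all(v == expected for v in values)


# A config where the field was dropped on serialization fails the check.
print(provisioning_model_ok('{"name": "gpu-hello"}'))
print(provisioning_model_ok('{"provisioning_model": "SPOT"}'))
```

In practice the text would come from reading the project's `app.json` after editing it, and again after any command that rewrites the file.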
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| README.md | Removes broken links and points readers to relevant sections inside guide_EN.md. |
| guide_EN.md | Fixes prerequisites and corrects walkthrough steps/expected outputs based on an end-to-end run on H100 SPOT. |
| Tool | Tested version | Notes |
| --- | --- | --- |
| [gcloud SDK](https://cloud.google.com/sdk/docs/install) | 565+ | `gcloud auth login` against your GCP account |
| `gsutil` | bundled with gcloud SDK | `dstack-cloud deploy` shells out to `gsutil cp` to stage the boot/shared images; make sure it's on `$PATH` (`ln -s "$(gcloud info --format='value(installation.sdk_root)')/bin/gsutil" ~/.local/bin/gsutil` if missing) |
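A preflight check can surface the missing-`gsutil` case before `deploy` reaches the upload step. This is a minimal sketch using the standard library's `shutil.which`; the function name `missing_tools` and the default tool list are assumptions for illustration, not something `dstack-cloud` provides.

```python
import shutil


def missing_tools(tools=("gcloud", "gsutil")):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("not on PATH:", ", ".join(missing))
else:
    print("all required tools found")
```

`shutil.which` performs the same PATH lookup a subprocess call would rely on, so an empty result here suggests the `FileNotFoundError: 'gsutil'` failure should not occur for that reason.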
```
dstack-prepare.sh ... Requesting app keys from KMS
dstack-prepare.sh ... Requesting app keys from KMS: https://kms.tdxlab.dstack.org:13001/prpc
```
Comment on lines +349 to +350
Note that the SPDM/LKCA handshake messages (`libspdm_check_crypto_backend: LKCA wrappers found.` etc.) go to the kernel/journal but are **not**
Summary
Walked through `guide_EN.md` end-to-end on a fresh `us-central1-a` `a3-highgpu-1g` SPOT instance (deploy → SSH → CC verify → matmul → remove). Booted in ~3.5 min; verified `CC status: ON`, `Confidential Compute GPUs Ready state: ready`, and `100x 4096^3 matmul: 0.360s -> 38229 GFLOPs`. This PR fixes the things that broke or surprised me during that run.
Changes
1. §2.1: `gsutil` not listed; `deploy` crashed with `FileNotFoundError: 'gsutil'` because my gcloud SDK install put it under `$(gcloud info ... sdk_root)/bin/` but not on `$PATH`.
2. §2.3: a `dstack-cloud` installed before #15 (pre-#15) parses fine but drops `provisioning_model` on the round-trip, so §3.2's `"provisioning_model": "SPOT"` silently regresses to STANDARD.
3. §3.4: `prelaunch.sh` contained a stray `chmod +x prelaunch.sh`, which would only run inside the TEE guest (no `prelaunch.sh` in cwd there). `dstack-cloud new` already creates the script 0755, so editing preserves the mode.
4. §4.1: the expected `ls shared/` output listed `.user-config`, but `prepare()` only writes 3 files. `.user-config` is copied from the project root into the FAT shared image at deploy time.
5. §4.3: `libspdm_check_crypto_backend: LKCA wrappers found.` does not appear on the serial console that `dstack-cloud logs` taps (verified by `grep -ci "spdm|lkca" boot.log` → 0). CC mode was still ON, so LKCA was working; the message just doesn't reach the console. Fix: point to `nvidia-smi conf-compute -f` (§5.3) as the check and clarify that the `... found stubs!` warning is the 0.6.0-only symptom.
6. README: dead links to `guide_CN.md` and `workshop/`; neither exists in the repo. Fix: dropped the `guide_CN.md` link (kept "coming soon" text) and rewrote the workshop sentence to point inside `guide_EN.md`.
Test plan
- `dstack-cloud pull dstack-cloud-nvidia-0.6.1`: works, produces `disk.raw` + `auth_hash.txt`.
- `dstack-cloud new gpu-hello-doctest`: generates `app.json` with `provisioning_model: STANDARD` (after refreshing install per fix #2); manual edit to SPOT works.
- `dstack-cloud prepare`: generates `.instance_info`, `.sys-config.json`, `app-compose.json` (3 files; fix #4 reflects this).
- `dstack-cloud -v deploy`: fails without `gsutil` on PATH (fix #1 addresses this); after linking, deploys cleanly. Gcloud args include `--provisioning-model=SPOT --instance-termination-action=STOP`.
- `openssl s_client` ProxyCommand: works.
- `nvidia-smi conf-compute -f` → `CC status: ON`; `-grs` → `ready`.
- `docker logs dstack-pytorch-1` after ~3 min → matmul prints expected `~38 TFLOPs`.
- `dstack-cloud remove` + `gsutil rm gs://.../dstack-gpu-hello-doctest-shared.tar.gz`: clean.
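As a sanity check on the matmul figure, the arithmetic can be reproduced under the common convention that an n×n×n matrix multiply costs 2·n³ floating-point operations. That convention, and the helper name `matmul_gflops`, are assumptions here; the benchmark's own formula is not shown in this PR.

```python
def matmul_gflops(n: int, reps: int, seconds: float) -> float:
    """Throughput in GFLOPs for `reps` square n x n matmuls, assuming 2*n^3 flops each."""
    flops = 2 * n**3 * reps
    return flops / seconds / 1e9


# 100 matmuls of size 4096 in the printed 0.360 s
print(round(matmul_gflops(4096, 100, 0.360)))  # 38177
```

The result, about 38177 GFLOPs, is within 0.2% of the reported 38229. The gap is consistent with the elapsed time being closer to 0.3595 s and printed rounded to three decimals, and both values round to the `~38 TFLOPs` noted in the test plan.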