Skip to content

amplifier update: stale editable installs left pointing at deleted cache dirs → silent provider load failures #299

@Joi

Description

@Joi

Symptoms

After running amplifier update, two of four configured providers (openai, gemini) silently failed to load. The only visible signal was:

Partial provider failure: 2/4 loaded. Missing: {'openai': 1, 'gemini': 1}.
Loaded: ['anthropic', 'github-copilot']. Session continuing with available providers
(self-healing NOT triggered for partial failure).

The actual exception (ModuleNotFoundError: No module named 'amplifier_module_provider_openai') is swallowed before reaching the user-visible log. Self-healing did not trigger because the failure was classified as "partial."

The same amplifier update run also produced this clone error at the top, which appears to be correlated:

✗ Bundle: joi: Failed to clone ssh://git@github.com/Joi/amplifier-bundle-joi@main:
fatal: could not open '/Users/joi/.amplifier/cache/amplifier-bundle-joi-<hash>/.git/objects/pack/tmp_pack_dnAj5l' for reading: No such file or directory
fatal: fetch-pack: invalid index-pack output

amplifier update ran git clone into a populated cache dir instead of git pull, failed mid-pack, but left the bundle in its previously healthy state.

Root cause

amplifier update re-clones modules into content-hash-suffixed cache directories when upstream changes. For openai+gemini on the affected run, the re-clone produced new hash dirs, but the editable install (uv pip install -e) was not re-run against the new paths. The old hash dirs were subsequently removed by the updater's GC step, leaving the editable install pointers dangling.

Observed state on the affected machine (uv tool install-managed amplifier venv at ~/.local/share/uv/tools/amplifier/):

Module .pth and .dist-info/direct_url.json pointed to Cache currently contained
openai cache/amplifier-module-provider-openai-e2c4b8c10222ed8c (deleted) …-f62df101b02f7b3e
gemini cache/amplifier-module-provider-gemini-06d1437d03d6b064 (deleted) …-055b52ecbb4ac482
anthropic (control, working) …-5181591dcf06d076 …-5181591dcf06d076

import amplifier_module_provider_openaiModuleNotFoundError because the .pth pointed to a non-existent directory. Anthropic and github-copilot worked because their content hashes hadn't changed and their install pointers were still consistent.

The "tmp_pack_dnAj5l" clone error at the top of the update run is likely related: the bundle clone failure may have caused the updater to abort its post-clone editable-install consistency check, leaving modules whose hash DID change in an inconsistent state.

Surgical fix that worked

Without touching the cache directories:

AMP_PYTHON=~/.local/share/uv/tools/amplifier/bin/python
uv pip install --python "$AMP_PYTHON" -e ~/.amplifier/cache/amplifier-module-provider-openai-<current-hash>
uv pip install --python "$AMP_PYTHON" -e ~/.amplifier/cache/amplifier-module-provider-gemini-<current-hash>

uv pip install cleanly reported the swap from the dead path to the live one:

- amplifier-module-provider-openai==1.0.0 (from file:///…/openai-e2c4b8c10222ed8c)
+ amplifier-module-provider-openai==1.0.0 (from file:///…/openai-f62df101b02f7b3e)

After this, Partial provider failure disappeared and all 4 providers loaded.

The heavier alternative amplifier reset --remove cache -y && amplifier update would also have worked but re-clones everything.

Diagnostic indicators (for other users hit by this)

  1. amplifier session start prints Partial provider failure: … with {'<provider>': N, ...}
  2. <venv>/lib/python3.*/site-packages/_editable_impl_amplifier_module_provider_*.pth points at a ~/.amplifier/cache/... directory that no longer exists on disk
  3. The matching *.dist-info/direct_url.json confirms the stale URL

Suggested fixes

  1. Surface the actual exception in the partial-failure message. Today it just shows a count ({'openai': 1, 'gemini': 1}) — include the underlying ImportError / ModuleNotFoundError so users can diagnose without dropping into the venv.
  2. Treat editable-install consistency as a post-update invariant. After cloning modules to new hash dirs, verify (via direct_url.json or equivalent) that every editable install points at a path that still exists; reinstall against the current path if not. Sub-millisecond check.
  3. Prefer git pull over git clone when the cache dir already contains a valid git repo with a matching origin URL. The tmp_pack_dnAj5l error would not have happened, and the partial pipeline failure would not have left consistency gaps downstream.
  4. Don't treat "partial provider failure" as silently OK. It's the worst kind of failure for a system that routes by role — a session configured with model_role: critical-ops against openai will silently downgrade to anthropic with no signal to the calling agent. Consider non-zero exit and/or hard-fail by default with an explicit --allow-partial-providers flag if quiet downgrade is desired.

Environment

  • macOS 26.5 arm64 (M-series)
  • amplifier installed via uv tool install amplifier
  • Python 3.12 (uv-managed)
  • Bundle: a private bundle composed against microsoft/amplifier-foundation + microsoft/amplifier-bundle-routing-matrix
  • All 4 provider modules sourced from git+https://github.com/microsoft/amplifier-module-provider-*@main

Happy to attach pip list, the affected .pth contents, or the cache directory listing if useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions