Skip to content

feat(config): DJ_STORES env var + DJ_IGNORE_CONFIG_FILE flag + arbitrary .secrets/ attrs — 2.3#1452

Open
dimitri-yatsenko wants to merge 2 commits into
masterfrom
fix/store-secrets-arbitrary-attr
Open

feat(config): DJ_STORES env var + DJ_IGNORE_CONFIG_FILE flag + arbitrary .secrets/ attrs — 2.3#1452
dimitri-yatsenko wants to merge 2 commits into
masterfrom
fix/store-secrets-arbitrary-attr

Conversation

@dimitri-yatsenko
Copy link
Copy Markdown
Member

@dimitri-yatsenko dimitri-yatsenko commented May 19, 2026

Summary

Closes the env-var gap for object-store configuration. Until now the DataJoint platform (and any deployment that ships container images without datajoint.json) could inject DJ_HOST / DJ_USER / DJ_PASS but had no env-var path for stores.<name>.*. Plugin-registered adapters (Databricks Unity Catalog Volumes, custom HTTP-based protocols, lab archive systems) with credential fields like token, api_key, workspace_url couldn't be used on the platform at all.

This PR adds two env vars and broadens the file-based secrets loader, all targeting DataJoint 2.3:

  • DJ_STORES — JSON-encoded copy of the stores dict, same shape as in datajoint.json. Replaces the file's stores block when set. A single env var avoids per-field delimiter / underscore-stripping ambiguity (access_key vs accesskey, raw_data vs rawdata) and accepts arbitrary adapter-defined attr names without negotiating an env-var schema.
  • DJ_IGNORE_CONFIG_FILE (default false) — when true, skips datajoint.json, the project .secrets/, and /run/secrets/datajoint/ entirely. Hard guarantee that no file on disk leaks into config.
  • .secrets/stores.<name>.<attr> now accepts any attribute, not only access_key / secret_key. (This was the original scope of the PR; kept and expanded.)

Motivation

A plugin that wraps Databricks Unity Catalog Volumes authenticates via a Bearer token — its store spec wants a token field, not access_key / secret_key. With the previous whitelist, .secrets/stores.uc.token was silently ignored, and there was no env-var path for token at all. Users had to either rename to a misleading access_key or set the value programmatically at notebook startup — neither of which works on the DataJoint platform, which has no datajoint.json and no .secrets/ on disk.

After this PR, the same adapter is configured identically across:

Deployment Source
Local notebook datajoint.json + .secrets/stores.uc.token
Customer-managed VM datajoint.json + env vars for credentials
DataJoint Platform / Kubernetes / Lambda DJ_IGNORE_CONFIG_FILE=true + DJ_STORES='{"uc": {...}}'

What changed

src/datajoint/settings.py

  • Added Config.ignore_config_file: bool field (env: DJ_IGNORE_CONFIG_FILE, default False) — pydantic-settings auto-binds it via validation_alias.
  • Added a validation_alias placeholder on the stores field so pydantic-settings does not auto-bind DJ_STORES at Config() time. Otherwise its built-in JSON parser intercepts before precedence logic runs and reports SettingsError instead of a clean ValueError.
  • Added Config._apply_stores_env() — parses DJ_STORES JSON, replaces self.stores wholesale, raises ValueError on bad JSON or non-object payloads.
  • Restructured _create_config(): skip file + secrets when ignore_config_file is set; apply DJ_STORES between file load and secrets fill so env wins over file and secrets only fill gaps.
  • Removed the access_key / secret_key whitelist from _load_secrets()'s per-store branch — any stores.<name>.<attr> file is loaded.
  • Updated save_template() docstring to mention DJ_STORES / DJ_IGNORE_CONFIG_FILE.

src/datajoint/storage_adapter.py

Incidental cleanup (unchanged from the prior PR scope): removed a try/except TypeError fallback around entry_points(group=...) for Python <3.10. requires-python = ">=3.10", so the branch was dead code (and was producing a stale mypy error).

tests/unit/test_settings.py

New TestStoreEnv class with 8 tests:

  1. test_dj_stores_sets_stores_dict — happy-path JSON parse with no config file.
  2. test_dj_stores_overrides_config_file — env wins over file.
  3. test_dj_stores_invalid_json_raises — clear ValueError on bad JSON.
  4. test_dj_stores_non_object_raises — clear ValueError on non-object JSON.
  5. test_dj_stores_plus_secrets_dir — secrets fill attrs DJ_STORES omits.
  6. test_ignore_config_file_skips_jsonDJ_IGNORE_CONFIG_FILE=true blocks datajoint.json.
  7. test_ignore_config_file_skips_secrets — same blocks .secrets/.
  8. test_ignore_config_file_default_loads_both — default behavior preserved.

Existing TestStoreSecrets tests (incl. the new test_load_store_arbitrary_attr for the whitelist removal) and all 23 test_storage_adapter tests still pass.

Functional precedence

Source Priority
dj.config["stores"][...] (programmatic) 1 (highest)
DJ_STORES env var 2
stores block of datajoint.json 3
.secrets/stores.<name>.<attr> files 4 (fills missing attrs only)

When DJ_IGNORE_CONFIG_FILE=true, sources 3 and 4 are skipped entirely.

Companion PR

Test plan

  • 80 tests pass in tests/unit/test_settings.py (including the 8 new TestStoreEnv tests)
  • 23 tests pass in tests/unit/test_storage_adapter.py
  • Pre-commit hooks green (ruff, ruff-format, mypy, codespell)
  • Manual: DJ_IGNORE_CONFIG_FILE=true DJ_STORES='{...}' python -c "import datajoint as dj; print(dj.config['stores'])" from an empty directory shows the env-supplied stores and emits no "No datajoint.json found" warning.
  • Manual: with DJ_IGNORE_CONFIG_FILE=true and a datajoint.json adjacent to cwd containing {"database": {"host": "should-not-load"}}, dj.config["database.host"] is the default (localhost), not should-not-load.

Release notes (2.3)

  • DJ_STORES env var — JSON-encoded stores dict
  • DJ_IGNORE_CONFIG_FILE env var — skip file-based config sources (default false)
  • .secrets/stores.<name>.<attr> accepts any attribute, not only access_key / secret_key
  • Storage Adapter API (datajoint.storage entry-point group) formalized as part of the public API; see the spec in the companion docs PR.

Two changes:

1. settings.py — Config._load_secrets()
   Was hardcoded to only auto-load stores.<name>.access_key and
   stores.<name>.secret_key from the secrets directory. That whitelist
   made sense when only the built-in s3 / gcs / azure stores existed,
   but it's wrong for plugin-registered adapters that define their own
   secret fields.

   Concrete example: a plugin that wraps Databricks Unity Catalog
   Volumes authenticates via a Bearer token. Its store spec wants a
   `token` field. Today the file .secrets/stores.uc.token is silently
   ignored, forcing users to either rename to a misleading
   "access_key" or to set the value programmatically at notebook
   startup (defeating the auto-load story).

   This change drops the whitelist. Any stores.<name>.<attr> file under
   the secrets directory is loaded into
   dj.config["stores"][<name>][<attr>]. The existing "don't override
   pre-configured values" precedence is preserved — config file and
   env vars still win.

   New test: TestStoreSecrets.test_load_store_arbitrary_attr confirms
   that `token` and `api_key` fields load via the same path as
   access_key/secret_key.

2. storage_adapter.py — _discover_adapters()
   Removed the try/except wrapping importlib.metadata.entry_points().
   The fallback path was for Python < 3.10 (where entry_points()
   returned a SelectableGroups dict instead of accepting the
   group= kwarg), but DataJoint's minimum supported Python is 3.10
   (per requires-python in pyproject.toml). The dead branch was also
   producing a mypy type error (SelectableGroups.get signature).
@dimitri-yatsenko dimitri-yatsenko requested review from kushalbakshi, mweitzel and ttngu207 and removed request for mweitzel May 19, 2026 22:04
dimitri-yatsenko added a commit that referenced this pull request May 19, 2026
Same fix #1383 applied to the Job table's antijoin in refresh(),
now applied to AutoPopulate._populate_direct's antijoin and the
progress() fallback path. The two-arg subtract `key_source - self`
triggers QueryExpression.__sub__ which calls .restrict(Not(...))
with semantic_check=True by default.

The semantic-check requirement is wrong here: this antijoin is a
plain set-difference, not a join — we ask "which key_source rows
aren't yet in self." Whether the same-named PK attribute carries
the same source-table lineage tag on both sides is irrelevant.

Where it bites: dj.Imported / dj.Computed tables whose primary key
is fully inherited from a single FK, with no own-table PK attributes.
On those, self.proj() returns the PK attribute with lineage=None
(or pointing to self rather than the FK parent), while key_source's
matching attribute carries the parent's lineage tag. The
semantic-check fails with:

    Cannot join on attribute 'X': different lineages
    (schema.parent.X vs None). Use .proj() to rename one of the
    attributes.

This pattern is legitimate ("one row downstream per parent row,
no intermediate ID") but rare in typical Elements / SciOps pipelines,
which extend the inherited PK with own-table attributes (trial_id,
experiment_id, etc.) that anchor proj()'s lineage. That's why the
existing #1405 test suite didn't surface it.

Changes:
- src/datajoint/autopopulate.py
  - Import Not from .condition at module top.
  - _populate_direct: replace `(LHS - self.proj())` with
    `LHS.restrict(Not(self.proj()), semantic_check=False)`.
  - progress(): same swap on the no-common-attrs fallback branch.
- tests/integration/test_autopopulate.py
  - New test_populate_antijoin_fk_inherited_pk regression test:
    Spec(Manual) -> Item(Imported with only -> Spec) — the minimal
    shape that triggers the bug. Without the fix Item.populate()
    raises DataJointError; with the fix it populates correctly,
    progress() reports correct counts, and partial-then-full
    populate works.

Stacked on top of #1452 (the secrets-loading + dead-code fix); rebase
to master after that lands.
ttngu207
ttngu207 previously approved these changes May 20, 2026
… 2.3

Adds env-var configuration for object stores so the DataJoint platform — and
any env-var-only deployment — can configure plugin-registered storage adapters
(Databricks Unity Catalog Volumes, custom HTTP stores, lab archive systems)
without files on disk.

- DJ_STORES (JSON-encoded) carries the entire `stores` dict in the same shape
  used in `datajoint.json`. Replaces the file's `stores` block when set.
- DJ_IGNORE_CONFIG_FILE (default false) skips `datajoint.json`, the project
  `.secrets/`, and `/run/secrets/datajoint/` entirely. Hard guarantee that no
  file on disk leaks into config.

Implementation notes:

- New `Config.ignore_config_file` field (validation_alias DJ_IGNORE_CONFIG_FILE)
  auto-bound by pydantic-settings.
- The `stores` field receives a `validation_alias` placeholder so
  pydantic-settings does NOT auto-bind DJ_STORES at Config() construction.
  Otherwise its built-in JSON parser intercepts before precedence logic runs
  and reports SettingsError instead of a clean ValueError.
- New `Config._apply_stores_env()` parses DJ_STORES JSON, replaces self.stores
  wholesale, raises ValueError on bad JSON or non-object payloads.
- `_create_config()` restructured to: skip file + secrets when
  ignore_config_file is set; apply DJ_STORES between file load and secrets
  fill so env wins over file and secrets only fill gaps.

Functional precedence (high to low): programmatic > DJ_STORES > config file >
`.secrets/stores.<name>.<attr>` (fills missing attrs only).

Tests: new TestStoreEnv class with 8 tests; existing TestStoreSecrets and
test_storage_adapter tests still pass.

Companion docs: datajoint/datajoint-docs#172
@dimitri-yatsenko dimitri-yatsenko changed the title fix: Load any stores.<name>.<attr> secret file (not just access_key/secret_key) feat(config): DJ_STORES env var + DJ_IGNORE_CONFIG_FILE flag + arbitrary .secrets/ attrs — 2.3 May 20, 2026
@dimitri-yatsenko dimitri-yatsenko requested a review from ttngu207 May 20, 2026 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants