
Commit ce4e658

vvnpn-nvclaude and Claude Opus 4.6 authored
managing osmo configs with k8s configmap (#822)
* feat: add ConfigMap-sourced dynamic configuration loader

  Add support for loading dynamic configs from a Kubernetes ConfigMap on
  service startup, enabling GitOps workflows where configs are
  version-controlled in Helm values and applied automatically via ArgoCD.

  Two managed_by modes per config type:
  - seed (default): only apply if config doesn't exist in DB
  - configmap: always overwrite DB from ConfigMap on startup

  Includes:
  - configmap_loader.py: core loader with dependency ordering, advisory
    lock for multi-replica safety, secret file resolution for dataset
    credentials, and per-type error isolation
  - dynamic_config_file field on WorkflowServiceConfig (CLI + env var)
  - Helm chart: ConfigMap template, Secret template for credentials,
    volume mounts, checksum annotation for pod restart on config change

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address bugs and code quality issues in configmap_loader

  - Fix crash when managed_configs is None (empty YAML section)
  - Fix partial state corruption in _resolve_dataset_secret_files by
    validating secret file content before mutating the credential dict
  - Add configmap_loader.py and pyyaml dep to BUILD file
  - Add type annotations for apply_function, model_class params
  - Chain exception in _parse_managed_by (raise from error)
  - Use RETURNING clause in _insert_backend to skip history entry on
    conflict instead of always creating one
  - Add schema reference comment to _insert_backend
  - Warn on unknown keys in managed_configs
  - Update copyright years to 2025-2026

* test: add comprehensive tests for ConfigMap config loader

  Add unit tests (20 tests) and integration tests (18 tests) for the
  configmap_loader module. Unit tests cover file handling, YAML parsing,
  managed_by mode parsing, secret file resolution, safe_apply error
  isolation, advisory lock behavior, and unknown key warnings.
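The two managed_by modes described above reduce to a small decision rule. Here is an illustrative sketch; the function and callback names (apply_managed_config, config_exists, write_config) are hypothetical, not the real loader API:

```python
# Illustrative sketch of the seed vs configmap managed_by modes.
# All names here are hypothetical stand-ins for the loader's internals.
from typing import Callable, Dict


def apply_managed_config(
    name: str,
    desired: Dict,
    mode: str,  # "seed" or "configmap"
    config_exists: Callable[[str], bool],
    write_config: Callable[[str, Dict], None],
) -> bool:
    """Return True if the config was written to the DB."""
    if mode == "configmap":
        write_config(name, desired)  # always overwrite from ConfigMap
        return True
    if mode == "seed":
        if config_exists(name):      # seed only applies if absent
            return False
        write_config(name, desired)
        return True
    raise ValueError(f"unknown managed_by mode: {mode!r}")
```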
  Integration tests verify actual DB interactions including seed vs
  configmap mode for all config types, dependency ordering, per-item
  error isolation, end-to-end loading, and config history entries.

* fix: address review findings, add config watcher, fix test failures

  - Advisory lock: session-scoped -> transaction-scoped
    (pg_try_advisory_xact_lock) to prevent lock leaks on process kill
  - Seed mode check: use config_history table instead of exclude_unset
    to correctly detect whether a config was ever explicitly set
  - Config watcher: background polling thread (30s default) with SHA-256
    hash comparison detects ConfigMap file changes without pod restart
  - Roles: pre-construct RolePolicy objects for pydantic v1 compatibility
  - Dataset: auto-populate credential endpoint from bucket dataset_path
  - Credential logging: removed secret file path from log messages
  - Tests: fixed fetch_from_db return types, semantic action format,
    advisory lock assertions, unused imports, mypy annotations
  - E2E: added test values overlay and dynamic config test file

* simplify: consolidate singleton config functions, fix advisory lock

  - Consolidate _apply_service_config, _apply_workflow_config,
    _apply_dataset_config into a single _apply_singleton_config with a
    config_type parameter and optional pre_apply hook (dataset uses it
    for secret resolution)
  - Fix advisory lock: revert to session-scoped pg_try_advisory_lock
    with explicit unlock in a finally block. The transaction-scoped
    xact lock was released immediately by execute_fetch_command's
    auto-commit, providing zero mutual exclusion.
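The watcher's SHA-256 change detection amounts to hashing the mounted file on each poll and comparing against the last-seen digest; a minimal sketch (ChangeDetector is an illustrative name):

```python
import hashlib


class ChangeDetector:
    """Re-apply configs only when the mounted file's contents change."""

    def __init__(self):
        self._last = None  # hex digest of the last-seen file contents

    def changed(self, data: bytes) -> bool:
        digest = hashlib.sha256(data).hexdigest()
        if digest == self._last:
            return False   # same contents, nothing to re-apply
        self._last = digest
        return True
```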
  - Replace _backend_exists with _named_config_exists (eliminates a
    duplicate)
  - Use SELECT 1 LIMIT 1 instead of COUNT(*) in _singleton_config_exists
  - Remove duplicate _DEFAULT_POLL_INTERVAL_SECONDS constant
  - Add WHY comment for endpoint defaulting from dataset_path
  - Update unit tests for session lock assertions

* feat: add drift reconciliation for configmap-mode configs

  The ConfigMapWatcher now has two-tier polling:
  1. File change detection (existing): SHA-256 hash comparison on the
     mounted ConfigMap file. When the file changes, re-apply everything.
  2. Drift reconciliation (new): for managed_by=configmap singleton
     configs (service, workflow, dataset), compare the last-applied
     ConfigMap values against current DB state on each poll cycle. If
     someone modified a config via CLI, the watcher detects the drift
     and re-applies the ConfigMap values within one poll interval.

  Key design decisions:
  - Only reconciles singleton configs (not named configs like pools) to
    limit DB query load per cycle
  - Compares desired values against the DB BEFORE calling patch_configs,
    so no config_history entries are created when there's no drift
  - Uses a separate advisory lock (configmap-reconcile) to prevent
    multiple replicas from correcting the same drift simultaneously
  - Resolves secret_file references in the cached config so dataset
    drift detection compares resolved credentials, not file paths
  - Seed-mode configs are never reconciled (by design)

  Also refactors the module-level start_config_watcher + global state
  into a ConfigMapWatcher class with clean instance state.
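The session-scoped advisory lock pattern used here (pg_try_advisory_lock plus an explicit unlock in a finally block) might look like the following sketch. The SQL executor is injected so the shape is testable without a database; lock_key and with_advisory_lock are illustrative names, not the PR's API:

```python
# Sketch of the session-scoped Postgres advisory lock pattern:
# try-lock, run body only if we won, always unlock in finally.
import hashlib
from typing import Callable


def lock_key(name: str) -> int:
    # Derive a stable 63-bit key for the advisory lock from a name.
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big") >> 1


def with_advisory_lock(execute: Callable[[str], list],
                       name: str,
                       body: Callable[[], None]) -> bool:
    """Run `body` only if this session wins the lock.

    Returns False if another replica holds the lock. Session-scoped
    locks leak if not released, hence the explicit unlock in finally.
    """
    key = lock_key(name)
    got = execute(f"SELECT pg_try_advisory_lock({key})")[0][0]
    if not got:
        return False  # another replica is doing the work
    try:
        body()
        return True
    finally:
        execute(f"SELECT pg_advisory_unlock({key})")
```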
* feat: add UX improvements — managed_by visibility, immediate
  re-apply, credentialSecretName

  Improvement 1: Management mode visibility
  - Config GET endpoints (service, workflow, dataset) include a
    _managed_by field in responses ('configmap', 'seed', or absent)
  - Managed modes persisted to the configmap_state table on startup
  - get_managed_mode() and get_cached_section() exposed for
    cross-module access

  Improvement 2: Immediate re-apply after CLI write
  - When patch_configs writes to a configmap-managed singleton config,
    the ConfigMap values are immediately re-applied (inside helpers.py)
  - Only triggers for non-configmap-sync writers (prevents infinite loops)
  - Eliminates the need to wait for the 30s drift reconciliation poll

  Improvement 3: Simplified secret wiring (credentialSecretName)
  - Dataset buckets can now use credentialSecretName instead of secret_file
  - Helm template auto-generates volume + volumeMount for referenced secrets
  - Loader resolves credentialSecretName to /etc/osmo/secrets/<name>/cred.yaml
  - No manual extraVolumes/extraVolumeMounts needed

  Also adds the configmap_state table to postgres.py and the test schema.

* feat: reject CLI writes to configmap-managed configs with 409 Conflict

  Replace the accept-then-revert pattern with 409 rejection for all
  config types. When a user tries to modify a config managed by
  ConfigMap in configmap mode, the API returns 409 Conflict with a
  clear error message instead of silently reverting after the write.

  Singleton configs (service, workflow, dataset):
  - PUT/PATCH endpoints reject with 409 before any DB write
  - Removed the accept-then-revert hook from helpers.py patch_configs()

  Named configs (pools, pod_templates, backends, roles, etc.):
  - 19 single-item PUT/PATCH/DELETE endpoints reject with 409
  - Bulk PUT endpoints (put_pools, put_pod_templates, etc.)
    are NOT guarded because the configmap_loader uses them internally

  Benefits over accept-then-revert:
  - Zero race conditions (no concurrent re-apply with the watcher)
  - Zero duplicate history entries (the write never happens)
  - Zero wasted side effects (no backend queue syncs, K8s updates)
  - Clear API feedback ("update Helm values instead")

* refactor: extract configmap_guard module, fix circular import, cover
  all write paths

  Extract the 409 guard logic into configmap_guard.py — a standalone
  module with no imports from helpers.py or configmap_loader.py. This
  eliminates the circular import (helpers -> configmap_loader ->
  helpers) that required import-outside-toplevel violations.

  Architecture:
  - configmap_guard.py: holds managed config state, guard functions,
    constants
  - configmap_loader.py: delegates guard functions to configmap_guard,
    pushes state via set_managed_configs() on load_and_apply()
  - helpers.py: imports configmap_guard at top level (no circular import)
  - config_service.py: imports configmap_guard for _add_managed_by and
    named item guards

  Coverage:
  - Singletons: guarded in helpers.put_configs() and
    helpers.patch_configs() (covers all singletons including rollback)
  - Named items: guarded in all 19 single-item endpoints + 6 bulk
    endpoints + the rollback endpoint in config_service.py
  - configmap-sync bypass: reject_if_managed() checks the username and
    skips the loader's own writes

  Also:
  - Eliminated a triple file read in load_and_apply() (reads once,
    computes the hash and parses YAML from the same bytes)
  - Removed dead get_configmap_state() from postgres.py
  - Updated the drift test to expect 409 rejection instead of
    accept-then-revert

* chore: remove debug print, dead code, stale comment

  - Removed debug print(test_configs) from helpers.py
  - Removed unused re-export delegations from configmap_loader.py
    (get_managed_mode, get_cached_section, etc.
    — these are accessed directly from configmap_guard, not via
    configmap_loader)
  - Removed a stale comment about circular imports

* fix: rollback guard referenced removed configmap_loader.reject_if_managed

  The rollback_config endpoint still called
  configmap_loader.reject_if_managed, which was removed when extracting
  to configmap_guard. Fixed to use configmap_guard.reject_if_managed.
  Also removed the unused configmap_loader import from config_service.py.

* docs: add design doc for ConfigMap-sourced dynamic configuration

  Covers architecture, management modes, 409 rejection, drift
  reconciliation, credentialSecretName, and full E2E test results from
  live dev instance validation.

* fix: guard watcher startup to avoid service boot abort

  Wrap ConfigMapWatcher.start() in try/except so a malformed ConfigMap
  file or a DB error during the initial config load doesn't prevent the
  service from starting. The service logs the error and continues
  without ConfigMap config management.

* feat: extend secret file references to all config credential fields

  Generalize the dataset-only _resolve_dataset_secret_files() into a
  recursive _resolve_secret_file_references() that works for all config
  types. Any dict with a 'secret_file' or 'secretName' key at any
  nesting level gets resolved by reading the mounted K8s Secret file.
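A recursive resolver of this shape could look like the following sketch, with the secret reader injected in place of reading mounted K8s Secret files. resolve_secrets is an illustrative name, and the format auto-detection mirrors the patterns these commits describe (dict merge, simple string value, Docker registry auths):

```python
# Sketch of recursive secret resolution: walk the config tree and
# replace any node carrying secret_file/secretName with the secret's
# contents. read_secret stands in for reading the mounted file.
from typing import Any, Callable, Dict


def resolve_secrets(node: Any, read_secret: Callable[[str], Dict]) -> Any:
    if isinstance(node, list):
        return [resolve_secrets(item, read_secret) for item in node]
    if not isinstance(node, dict):
        return node
    if "secret_file" in node or "secretName" in node:
        name = node.get("secret_file") or node["secretName"]
        content = read_secret(name)
        if set(content) == {"value"}:
            # Simple string secret: replaces the dict entirely.
            return content["value"]
        if "auths" in content:
            # Docker registry secret: extract registry/username/auth.
            registry, entry = next(iter(content["auths"].items()))
            return {"registry": registry, **entry}
        # YAML dict secret: merge keys into the current dict.
        merged = {k: v for k, v in node.items()
                  if k not in ("secret_file", "secretName", "secretKey")}
        merged.update(content)
        return merged
    return {k: resolve_secrets(v, read_secret) for k, v in node.items()}
```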
  Supports two patterns:
  - Dict secrets (credentials): file contents merged into the dict
  - Simple string secrets (tokens, passwords): a file with
    {value: "..."} replaces the dict entirely with the string value

  Helm templates updated:
  - _helpers.tpl: recursive secretName→secret_file transformer +
    secret name collector (4 levels deep)
  - dynamic-config.yaml: resolves secretName in workflow/service config
  - api-service.yaml: auto-generates volume + volumeMount for all
    secretName references across all config types
  - values.yaml: documents the secretName pattern for workflow credentials

  Example Helm values:

    workflow:
      config:
        workflow_data:
          credential:
            secretName: osmo-workflow-data-cred
        workflow_alerts:
          slack_token:
            secretName: osmo-slack-token

* feat: auto-detect Docker registry .dockerconfigjson secret format

  The secret file resolver now auto-detects three formats:
  1. Simple string: {value: "..."} → replaces the dict with a string
  2. Docker registry: {auths: {registry: {username, auth}}} → extracts
     the registry, username, auth fields automatically
  3. YAML dict: merges all keys into the current dict (existing behavior)

  Added a secretKey field alongside secretName to specify which key in
  the K8s Secret to mount (defaults to "cred.yaml"). For Docker
  registry secrets, use secretKey: .dockerconfigjson.

  Example:

    backend_images:
      credential:
        secretName: imagepullsecret
        secretKey: .dockerconfigjson

  This lets users reuse existing imagepull secrets without creating a
  duplicate secret in a different format.

* refactor: redesign ConfigMap config to in-memory + watchdog + global mode

  Major redesign based on team feedback:

  1. Single global toggle (no per-config managed_by seed/configmap)
     - dynamicConfig.enabled: true = ConfigMap mode (all writes blocked)
     - dynamicConfig.enabled: false = DB mode (CLI/API works normally)
  2.
     In-memory config serving (standard K8s pattern)
     - ConfigMap file parsed on startup, cached in a module-level dict
     - All config reads served from memory, not the DB
     - DB only for runtime state (agent heartbeats, k8s_uid) + roles
     - Runtime fields (service_auth) injected from the DB on first load
  3. Watchdog file events replace 30s polling
     - Uses the watchdog library (already a dep) with inotify
     - Watches the parent dir for K8s ConfigMap symlink swaps
     - 2s debounce for atomic swap events
  4. All-or-nothing validation before applying
     - Pydantic model validation for singleton configs
     - Structure validation for named config sections
     - On failure: previous config preserved, error logged
  5. Simplified 409 write protection
     - Single reject_if_configmap_mode(username) function
     - 29 guard calls across config_service.py + helpers.py
     - Bug fix: added missing delete_dataset guard
     - Bug fix: simplified rollback guard

  New: src/utils/configmap_state.py (dependency-free state module)
  Modified: 11 files, ~900 lines added, ~2000 lines removed

* docs: rewrite design doc for v2 architecture

  Updated to reflect the redesigned ConfigMap config system:
  - Global mode (no per-config managed_by)
  - In-memory serving (standard K8s pattern)
  - Watchdog file events (no polling)
  - Config vs runtime data separation
  - All-or-nothing validation

* feat: authz_sidecar reads roles from ConfigMap file instead of DB

  Add a --roles-file flag to authz_sidecar. When set, the sidecar reads
  roles, external role mappings, and pool names from the
  ConfigMap-mounted YAML file instead of PostgreSQL. This eliminates
  the DB dependency from the authz_sidecar in ConfigMap mode.
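The simplified 409 write protection can be sketched as a single module-level check. reject_if_configmap_mode matches the function name above; ConflictError and LOADER_USERNAME are illustrative stand-ins for the service's real exception type and loader identity:

```python
# Sketch of the single global write guard: when ConfigMap mode is
# active, every CLI/API config write is rejected with 409 unless it
# comes from the loader itself.
class ConflictError(Exception):
    status_code = 409


CONFIGMAP_MODE = True            # set at startup when the ConfigMap loads
LOADER_USERNAME = "configmap-sync"  # illustrative loader identity


def reject_if_configmap_mode(username: str) -> None:
    if CONFIGMAP_MODE and username != LOADER_USERNAME:
        raise ConflictError(
            "config is managed by ConfigMap; update Helm values instead")
```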
  New: src/utils/roles/file_loader.go
  - FileRoleStore loads roles + external_roles from YAML
  - In-memory reverse map: externalRole -> []osmoRole
  - ResolveExternalRoles replaces the SyncUserRoles SQL query
  - Poll-based file watching (os.Stat every 30s)

  Modified: authz_server.go
  - NewFileBackedAuthzServer constructor (no pgClient needed)
  - Check(): file-backed ResolveExternalRoles vs DB SyncUserRoles
  - resolveRoles(): file store vs DB for role policy fetches
  - computeAllowedPools(): file store vs DB for pool names
  - MigrateRoles(): skipped in file-backed mode

  Modified: main.go
  - --roles-file flag for ConfigMap mode
  - Conditional init: file-backed vs DB-backed server
  - DB connection only created when needed

  Modified: configmap_loader.py
  - Removed _write_roles_to_db (authz_sidecar reads the file directly)

* feat: authz_sidecar uses --roles-file when dynamicConfig enabled

  The Helm template conditionally passes --roles-file to the
  authz-sidecar container and mounts the dynamic-config ConfigMap
  volume. When dynamicConfig.enabled=true, the postgres args are
  omitted (not needed).

* feat: move product defaults to chart values, auto-derive service_base_url

  Chart values.yaml now ships with:
  - Default workflow limits (max_num_tasks, timeouts)
  - Default pod templates (ctrl, user)
  - Default resource validations (cpu, memory)
  - All 5 RBAC roles with policies (admin, user, default, ctrl, backend)

  service_base_url is auto-derived from services.service.hostname in
  the dynamic-config template — no need to set it per deployment.
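The FileRoleStore's reverse map (externalRole -> []osmoRole) is a plain index inversion; here is a Python sketch of the idea with illustrative data shapes (the real implementation is the Go code described above):

```python
# Sketch of the reverse role map: invert {osmo_role: [external_roles]}
# so a user's IDP groups can be mapped to OSMO roles without SQL.
from collections import defaultdict
from typing import Dict, List


def build_reverse_map(external_roles: Dict[str, List[str]]) -> Dict[str, List[str]]:
    reverse = defaultdict(list)
    for osmo_role, externals in external_roles.items():
        for ext in externals:
            reverse[ext].append(osmo_role)
    return dict(reverse)


def resolve_external_roles(reverse: Dict[str, List[str]],
                           user_groups: List[str]) -> List[str]:
    """Map a user's IDP groups to OSMO roles (the file-backed
    replacement for the SyncUserRoles SQL query)."""
    roles = []
    for group in user_groups:
        for role in reverse.get(group, []):
            if role not in roles:
                roles.append(role)
    return roles
```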
  Per-deployment values only need site-specific configs:
  - Dataset buckets + credentials
  - Backend definitions
  - Workflow image credentials
  - CLI version info

* feat: add default backend and pool to chart defaults

  Matches configure_app() behavior: a default backend with the kai
  scheduler, and a default pool referencing the default backend with
  the default platform.

* docs: update design doc for v3 — authz file-backed, chart defaults

  - Authz sidecar reads roles from the ConfigMap file (FileRoleStore)
  - IDP groups as the source of truth for user role assignments
  - Product defaults (roles, templates, validations, backend, pool) in
    the chart
  - Per-deployment values only for site-specific overrides
  - Updated architecture diagram, component table, E2E test results

* refactor: eliminate remaining DB deps in ConfigMap mode

  - Role.list_from_db/fetch_from_db now check the snapshot (was missing)
  - ConfigMapWatcher.postgres is optional (None when DB not available)
  - _inject_runtime_fields skips the DB read when postgres is None
  - All config types now served from in-memory when ConfigMap mode is
    active

  Remaining DB access in ConfigMap mode is ONLY:
  - Backend heartbeats (agent writes, pool status reads)
  - configure_app() startup (runs before ConfigMap mode activates)

* chore: fix medium review findings — remove stale docs and dead code

  - Remove stale managed_by mode comments from values.yaml (v1/v2
    leftover)
  - Remove the unused dynamic_config_poll_interval field from
    WorkflowServiceConfig (the Python watcher uses watchdog events, the
    Go sidecar has its own constant)

* refactor: flatten ConfigMap YAML — remove managed_configs/config/items nesting

  ConfigMap format now matches API response format
  directly:
  - Removed the `managed_configs:` top-level wrapper
  - Removed the `config:` sub-key from singleton types (service,
    workflow, dataset)
  - Removed the `items:` sub-key from named types (pools, roles,
    templates, etc.)

  Before: snapshot['service']['config']['cli_config']
  After:  snapshot['service']['cli_config']

  Before: snapshot['pools']['items']['default']
  After:  snapshot['pools']['default']

  This means `osmo config show` output can be pasted directly into Helm
  values with no format translation.

  7 files changed: Helm template, Helm values, Python loader,
  postgres.py (15 interceptions), Go file_loader, unit tests,
  integration tests.

* chore: clean up v1/v2 leftovers — remove dead code and stale references

  - Delete the unused set_configmap_state() method from postgres.py
  - Remove dead CONFIGMAP_SYNC_USERNAME/TAGS imports from the loader
    (the loader no longer writes to the DB)
  - Remove managed_by fields from the example values (v3 uses the
    global toggle)
  - Flatten the example values to match the v3 structure (no
    config:/items:)
  - Remove the TestDynamicConfigSeedMode E2E test class (seed mode was
    removed in v3)

* fix: correct YAML indentation in chart values after items: removal

  The children of removed items:/config: wrappers kept their extra
  indentation, causing YAML parse errors in helm template.

* fix: update api-service.yaml for flat config structure

  Remove .config. and .items. from dynamicConfig references in the
  api-service.yaml template (volume mounts for secrets).

* fix: add dynamic-config volume to agent and logger deployments

  The authz-sidecar container mounts the dynamic-config ConfigMap for
  file-backed role reading. The volume was only defined in
  api-service.yaml, but the sidecar is also included in the agent and
  logger deployments.
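The wrapper removal above is a mechanical transform; for illustration, a sketch of flattening an old-format snapshot into the new layout (flatten_snapshot is a hypothetical helper, not part of the PR):

```python
# Sketch: strip the legacy `config:`/`items:` wrappers so an old-format
# snapshot matches the new flat layout described above.
from typing import Dict

SINGLETON_TYPES = ("service", "workflow", "dataset")


def flatten_snapshot(snapshot: Dict) -> Dict:
    flat = {}
    for section, value in snapshot.items():
        if isinstance(value, dict):
            if section in SINGLETON_TYPES and set(value) == {"config"}:
                value = value["config"]   # service/workflow/dataset wrapper
            elif set(value) == {"items"}:
                value = value["items"]    # named-type wrapper (pools, roles, ...)
        flat[section] = value
    return flat
```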
* fix: add checksum annotation to agent/logger, remove duplicate guards

  K8s/Helm review findings:
  - Add the checksum/dynamic-config annotation to agent and logger pods
    so they restart on ConfigMap changes (was only on api-service)
  - Remove duplicate {{- if }} guards in the api-service.yaml secret
    collection (copy-paste artifact)

* refactor: rename dynamicConfig → configs across codebase

  Rename the feature from "dynamicConfig" to "configs" to match
  industry conventions (ArgoCD uses "configs:").

  Helm values: services.dynamicConfig → services.configs
  ConfigMap name: osmo-service-dynamic-config → osmo-service-configs
  Mount path: /etc/osmo/dynamic-config/ → /etc/osmo/configs/
  CLI arg: --dynamic_config_file → --config_file
  Env var: OSMO_DYNAMIC_CONFIG_FILE → OSMO_CONFIG_FILE

  14 files, mechanical rename, no logic changes.

* refactor: replace recursive secret templates with explicit secretRefs

  Pattern 2: secrets declared separately from config, not inline.
  Before (80 lines of manually unrolled 4-level recursion):
  - osmo.resolve-secret-names-in-config: walked the config tree to
    transform secretName → secret_file in the ConfigMap YAML
  - osmo.collect-secret-names: walked the config tree to find all
    secretName references for volume mount generation
  - dynamic-config-secrets.yaml: created K8s Secrets from inline values

  After (simple flat list):
  - secretRefs: list of {secretName} in values.yaml
  - Template iterates the list for volume + volumeMount generation
  - secretName passed through to the ConfigMap as-is
  - Python loader resolves secretName → file path at runtime
  - No inline secret creation (secrets managed out-of-band)

  Deleted: dynamic-config-secrets.yaml, resolve-secret-names-in-config,
  collect-secret-names, resolve-secret-name templates (~100 lines
  removed)

* refactor: rename dynamic-config files to configs

  - dynamic-config.yaml → configs.yaml (Helm template)
  - configmap-dynamic-config.md → configmap-configs.md (design doc)
  - osmo_dynamic_config_values.yaml → osmo_configs_values.yaml (example)
  - test_dynamic_config.py → test_configs.py (E2E test)
  - Updated internal references in the renamed files

* chore: cleanup dead code and stale references from review

  - Remove the unused CONFIGMAP_SYNC_TAGS constant
  - Remove unused typing imports from configmap_guard
  - Fix integration test: remove the stale managed_configs wrapper (#1)
  - Fix docstrings: remove "write roles to DB", "authz reads from DB"
  - Remove the dead configmap_state table from postgres.py and schema.sql
  - Remove a stale comment about deleted secret templates in _helpers.tpl
  - Rename $dc → $cfg in the configs.yaml template

* perf: bypass caches in file-backed authz mode

  In file-backed mode, FileRoleStore.GetRoles() and GetPoolNames() are
  already in-memory map lookups.
  The LRU role cache and pool name cache add overhead (locking, TTL)
  for zero benefit.
  - resolveRoles: direct file store lookup, skip the role cache
  - computeAllowedPools: direct file store lookup, skip the pool name
    cache
  - Caches only used in DB mode, where they avoid DB round-trips

* chore: optimize loader + remove dead code

  configmap_loader.py:
  - Remove an unnecessary copy.deepcopy (yaml.safe_load returns a fresh
    dict)
  - Remove stale comments (circular import note, "kept from original")
  - Fix a remaining "Dynamic config" log message
  - Remove extra blank lines

  file_loader.go:
  - Remove the unused GetAllRoleNames() method (never called)

* perf: don't allocate caches in file-backed authz mode

  roleCache and poolNameCache are unused in file-backed mode (all
  lookups go directly to the in-memory FileRoleStore). Don't create them.
  - NewFileBackedAuthzServer no longer accepts cache params
  - initFileBackedServer no longer creates caches
  - Caches only allocated in DB mode, where they serve a purpose

* fix: address PR review findings #1, #2, #3, #4, #9

  #1 (CRITICAL): Store the watcher on app state to prevent GC from
  killing the file watcher after configure_app() returns.

  #9: Activate ConfigMap mode on a deferred successful reload. If the
  initial load fails but a later watchdog-triggered reload succeeds,
  409 write protection now activates correctly.

  #3: Move reject_if_configmap_mode() before the loops in 6 bulk
  endpoints (put_pools, put_pod_templates, put_group_templates,
  put_resource_validations, put_roles, put_backend_tests). The guard
  doesn't depend on the loop variable.

  #4: Fix a Go race on lastModTime in FileRoleStore. Move the write
  inside mu.Lock in Load(), add mu.RLock for the read in Start().

  #2: Fix import ordering in postgres.py — move configmap_state to the
  first-party import group.
* fix: resolve CI failures after merge + add test coverage

  Fixes:
  - Remove PrettyJSONResponse (removed on main in the pydantic v2
    migration)
  - Fix pylint: rename _config_watcher, narrow an exception catch
  - Fix mypy: use the Any type for ResourceAssertion snapshot returns
  - Fix assertion: compare the scheduler_type enum .value, not the object
  - Remove an accidentally committed ui/node_modules file
  - Remove run/minimal/osmo_configs_values.yaml (rework later)

  Tests (14 new):
  - 8 Go tests for file-backed authz (Check, resolveRoles,
    computeAllowedPools, MigrateRoles, external role resolution)
  - 6 Python tests for snapshot read paths (GroupTemplate, Role,
    Backend list/names)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1545336 commit ce4e658

27 files changed

Lines changed: 2976 additions & 103 deletions

deployments/charts/service/templates/api-service.yaml

Lines changed: 28 additions & 1 deletion
@@ -31,6 +31,9 @@ spec:
   {{- toYaml . | nindent 8 }}
   {{- end }}
   annotations:
+  {{- if .Values.services.configs.enabled }}
+  checksum/configs: {{ .Values.services.configs | toYaml | sha256sum }}
+  {{- end }}
   {{- include "osmo.extra-annotations" .Values.services.service | nindent 8 }}
   spec:
   {{- with .Values.services.service.hostAliases }}
@@ -145,6 +148,10 @@ spec:
   - --default_admin_username
   - {{ .Values.services.defaultAdmin.username | quote }}
   {{- end }}
+  {{- if .Values.services.configs.enabled }}
+  - --config_file
+  - /etc/osmo/configs/config.yaml
+  {{- end }}
   {{- range $arg := .Values.services.service.extraArgs }}
   - {{ $arg | quote }}
   {{- end }}
@@ -188,14 +195,24 @@ spec:
   ports:
   - name: metrics
     containerPort: 9464
-  {{- if or .Values.services.configFile.enabled .Values.global.logs.enabled .Values.services.service.extraVolumeMounts }}
+  {{- if or .Values.services.configFile.enabled .Values.global.logs.enabled .Values.services.configs.enabled .Values.services.service.extraVolumeMounts }}
   volumeMounts:
   {{- end }}
   {{- if .Values.services.configFile.enabled}}
   - mountPath: {{ .Values.services.configFile.path }}
     name: mek-volume
     subPath: mek.yaml
   {{- end }}
+  {{- if .Values.services.configs.enabled }}
+  - name: configs
+    mountPath: /etc/osmo/configs
+    readOnly: true
+  {{- range .Values.services.configs.secretRefs }}
+  - name: secret-{{ .secretName }}
+    mountPath: /etc/osmo/secrets/{{ .secretName }}
+    readOnly: true
+  {{- end }}
+  {{- end }}
   {{- if .Values.global.logs.enabled }}
   - name: logs
     mountPath: /logs
@@ -247,6 +264,16 @@ spec:
     name: mek-config
     name: mek-volume
   {{- end}}
+  {{- if .Values.services.configs.enabled }}
+  - name: configs
+    configMap:
+      name: {{ .Values.services.service.serviceName }}-configs
+  {{- range .Values.services.configs.secretRefs }}
+  - name: secret-{{ .secretName }}
+    secret:
+      secretName: {{ .secretName }}
+  {{- end }}
+  {{- end }}
 ---
 
 apiVersion: v1
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# SPDX-License-Identifier: Apache-2.0
+{{- if .Values.services.configs.enabled }}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ .Values.services.service.serviceName }}-configs
+  labels:
+    app: {{ .Values.services.service.serviceName }}
+data:
+  config.yaml: |
+    {{- $cfg := .Values.services.configs }}
+    {{- if $cfg.service }}
+    service:
+    {{- $service := deepCopy $cfg.service }}
+    {{- if and (not (index $service "service_base_url")) .Values.services.service.hostname }}
+    {{- $_ := set $service "service_base_url" (printf "https://%s" .Values.services.service.hostname) }}
+    {{- end }}
+    {{- toYaml $service | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.workflow }}
+    workflow:
+    {{- toYaml $cfg.workflow | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.dataset }}
+    dataset:
+    {{- toYaml $cfg.dataset | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.pools }}
+    pools:
+    {{- toYaml $cfg.pools | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.podTemplates }}
+    pod_templates:
+    {{- toYaml $cfg.podTemplates | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.resourceValidations }}
+    resource_validations:
+    {{- toYaml $cfg.resourceValidations | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.backends }}
+    backends:
+    {{- toYaml $cfg.backends | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.backendTests }}
+    backend_tests:
+    {{- toYaml $cfg.backendTests | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.groupTemplates }}
+    group_templates:
+    {{- toYaml $cfg.groupTemplates | nindent 6 }}
+    {{- end }}
+    {{- if $cfg.roles }}
+    roles:
+    {{- toYaml $cfg.roles | nindent 6 }}
+    {{- end }}
+{{- end }}
deployments/charts/service/templates/gateway.yaml

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -453,6 +453,9 @@ spec:
         imagePullPolicy: {{ $gw.authz.imagePullPolicy }}
         args:
           - "--grpc-port={{ $gw.authz.grpcPort }}"
+          {{- if .Values.services.configs.enabled }}
+          - "--roles-file=/etc/osmo/configs/config.yaml"
+          {{- else }}
           - "--postgres-host={{ .Values.services.postgres.serviceName }}"
           - "--postgres-port={{ .Values.services.postgres.port }}"
           - "--postgres-database={{ .Values.services.postgres.db }}"
@@ -461,6 +464,7 @@ spec:
           - "--postgres-max-conns={{ $gw.authz.postgres.maxConns }}"
           - "--postgres-min-conns={{ $gw.authz.postgres.minConns }}"
           - "--postgres-max-conn-lifetime={{ $gw.authz.postgres.maxConnLifetimeMin }}"
+          {{- end }}
           - "--cache-ttl={{ $gw.authz.cache.ttl }}"
           - "--cache-max-size={{ $gw.authz.cache.maxSize }}"
           {{- if .Values.global.logs.enabled }}
@@ -483,16 +487,19 @@ spec:
               name: db-secret
               key: db-password
           {{- end }}
-        {{- if or .Values.global.logs.enabled $gw.authz.extraVolumeMounts }}
         volumeMounts:
           {{- if .Values.global.logs.enabled }}
           - name: logs
             mountPath: /logs
           {{- end }}
+          {{- if .Values.services.configs.enabled }}
+          - name: configs
+            mountPath: /etc/osmo/configs
+            readOnly: true
+          {{- end }}
           {{- with $gw.authz.extraVolumeMounts }}
           {{- toYaml . | nindent 10 }}
           {{- end }}
-        {{- end }}
         {{- with $gw.authz.livenessProbe }}
         livenessProbe:
           {{- toYaml . | nindent 10 }}
@@ -508,6 +515,11 @@ spec:
       - name: logs
         emptyDir: {}
       {{- end }}
+      {{- if .Values.services.configs.enabled }}
+      - name: configs
+        configMap:
+          name: {{ .Values.services.service.serviceName }}-configs
+      {{- end }}
       {{- with $gw.authz.extraVolumes }}
       {{- toYaml . | nindent 8 }}
       {{- end }}
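Putting the gateway changes together: a single values key swaps the authz sidecar's source of truth. A minimal override for ConfigMap mode might look like this sketch, with everything else left at chart defaults:

```yaml
# With this override, the rendered authz container receives
#   --roles-file=/etc/osmo/configs/config.yaml
# in place of the --postgres-* flags, plus a read-only "configs"
# volume mount backed by the "<serviceName>-configs" ConfigMap.
services:
  configs:
    enabled: true
```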

deployments/charts/service/values.yaml

Lines changed: 186 additions & 0 deletions
@@ -229,6 +229,192 @@ services:
     ##
     passwordSecretKey: password
 
+  ## Dynamic configuration loaded from a ConfigMap.
+  ## When enabled, all configs are served from memory (parsed from the mounted
+  ## ConfigMap file). CLI/API writes return 409. The authz_sidecar reads roles
+  ## from the same file instead of PostgreSQL. Changes are detected via
+  ## watchdog/inotify.
+  ##
+  configs:
+    ## Enable ConfigMap-based configuration.
+    ## When true: all configs are served from memory (parsed from the ConfigMap),
+    ## CLI/API writes return 409, and the authz_sidecar reads roles from the file.
+    ## When false: configs live in the database and the CLI/API works normally.
+    ##
+    enabled: false
+
+    ## Service config. service_base_url is auto-derived from
+    ## services.service.hostname if not set here.
+    ##
+    service: {}
+
+    ## Workflow config (limits, timeouts, image credentials).
+    ## Use secretName in config fields to reference K8s Secrets.
+    ## Each secretName must also be listed in secretRefs below.
+    ##
+    workflow:
+      max_num_tasks: 100
+      max_exec_timeout: "30d"
+      default_exec_timeout: "7d"
+
+    ## Dataset config (buckets, credentials).
+    ## Use credentialSecretName to reference K8s Secrets for bucket creds.
+    ##
+    dataset: {}
+
+    ## Pod templates applied to workflow containers.
+    ##
+    podTemplates:
+      default_ctrl:
+        spec:
+          containers:
+            - name: osmo-ctrl
+              resources:
+                requests:
+                  cpu: "1"
+                  memory: "1Gi"
+                limits:
+                  memory: "1Gi"
+      default_user:
+        spec:
+          containers:
+            - name: "{{USER_CONTAINER_NAME}}"
+              resources:
+                requests:
+                  cpu: "{{USER_CPU}}"
+                  memory: "{{USER_MEMORY}}"
+                  nvidia.com/gpu: "{{USER_GPU}}"
+                  ephemeral-storage: "{{USER_STORAGE}}"
+                limits:
+                  memory: "{{USER_MEMORY}}"
+                  nvidia.com/gpu: "{{USER_GPU}}"
+                  ephemeral-storage: "{{USER_STORAGE}}"
+
+    ## Resource validation rules.
+    ##
+    resourceValidations:
+      default_cpu:
+        - left_operand: cpu
+          operator: LE
+          right_operand: node_cpu
+          assert_message: "cpu must be <= node_cpu"
+        - left_operand: cpu
+          operator: GT
+          right_operand: "0"
+          assert_message: "cpu must be > 0"
+      default_memory:
+        - left_operand: memory
+          operator: LE
+          right_operand: node_memory
+          assert_message: "memory must be <= node_memory"
+        - left_operand: memory
+          operator: GT
+          right_operand: "0"
+          assert_message: "memory must be > 0"
+
+    ## RBAC role definitions. Policies are product defaults.
+    ## Override external_roles per deployment to map IDP groups.
+    ##
+    roles:
+      osmo-admin:
+        description: "Admin role — full access to all resources"
+        policies:
+          - effect: Allow
+            actions: ["*:*"]
+            resources: ["*"]
+        external_roles: [osmo-admin]
+      osmo-user:
+        description: "User role — standard workflow and data operations"
+        policies:
+          - effect: Allow
+            actions:
+              - "app:*"
+              - "auth:Token"
+              - "credentials:*"
+              - "dataset:*"
+              - "pool:List"
+              - "profile:Read"
+              - "profile:Update"
+              - "resources:Read"
+              - "user:List"
+              - "workflow:Cancel"
+              - "workflow:Create"
+              - "workflow:Delete"
+              - "workflow:Exec"
+              - "workflow:List"
+              - "workflow:PortForward"
+              - "workflow:Read"
+              - "workflow:Rsync"
+              - "workflow:Update"
+            resources: ["*"]
+        external_roles: [osmo-user]
+      osmo-default:
+        description: "Default role — login, profile, health check"
+        policies:
+          - effect: Allow
+            actions:
+              - "auth:Login"
+              - "auth:Refresh"
+              - "auth:Token"
+              - "profile:*"
+              - "system:Health"
+            resources: ["*"]
+        external_roles: []
+      osmo-ctrl:
+        description: "Controller role — internal container communication"
+        policies:
+          - effect: Allow
+            actions: ["internal:Logger", "internal:Router"]
+            resources: ["*"]
+        external_roles: [osmo-ctrl]
+      osmo-backend:
+        description: "Backend role — internal operator communication"
+        policies:
+          - effect: Allow
+            actions: ["internal:Operator", "config:Read", "pool:List"]
+            resources: ["*"]
+        external_roles: [osmo-backend]
+
+    ## Backend cluster definitions. The "default" backend is created by
+    ## configure_app() on first deploy — include it here so ConfigMap mode
+    ## serves it from memory.
+    ##
+    backends:
+      default:
+        description: "Default backend"
+        scheduler_settings:
+          scheduler_type: kai
+          scheduler_name: kai-scheduler
+          scheduler_timeout: 30
+
+    ## Pool definitions. The "default" pool references the "default" backend.
+    ##
+    pools:
+      default:
+        description: "Default pool"
+        backend: default
+        default_platform: default
+        platforms:
+          default: {}
+
+    ## Backend test definitions.
+    ##
+    backendTests: {}
+
+    ## Group templates.
+    ##
+    groupTemplates: {}
+
+    ## K8s Secrets to mount into the service pods.
+    ## Each entry generates a volume + volumeMount at /etc/osmo/secrets/<secretName>/.
+    ## The Python config loader resolves secretName references in configs at runtime.
+    ##
+    ## Example:
+    ## secretRefs:
+    ##   - secretName: my-bucket-cred   # mounts all keys
+    ##   - secretName: imagepullsecret  # mounts all keys
+    ##
+    secretRefs: []
+
   ## Redis cache service configuration
   ## Set enabled to false if using an external Redis deployment
   ##