Add Entra OIDC auth, chat/search UX overhaul, and supporting fixes#45

Merged
rajivml merged 10 commits into feature/darwin from rajiv/add-claude
May 28, 2026

@rajivml rajivml commented May 28, 2026

Brings the rajiv/add-claude branch up to feature/darwin: Microsoft/Entra OIDC auth wired into the OSS code path (no ee/ dependency), a chat & search UX overhaul, a chat-page crash fix, a persona-deletion permission gap close, per-channel Slack-bot model config, and Playwright MCP tooling. 9 commits, 50 files, +909 / −582.

Auth — Microsoft/Entra OIDC

Wired directly into the OSS app (no ee/ dependency):

  • OIDC router registered in backend/danswer/main.py; AuthType.OIDC allowlisted in verify_auth_setting; public auth routes registered.
  • New env vars: OPENID_CONFIG_URL, DEFAULT_ADMIN_EMAILS.
  • Auto-verify OIDC users in oauth_callback; env-driven admin allowlist promotes addresses in DEFAULT_ADMIN_EMAILS.
  • backend/requirements/default.txt: pin bcrypt==4.0.1 (passlib 1.7.4 is incompatible with bcrypt 4.1+; do not bump without also fixing passlib).
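
As a rough illustration of the env-driven admin allowlist, here is a minimal sketch. The helper names `parse_admin_emails` and `should_promote_to_admin` are hypothetical, and the whitespace trimming and case-insensitive matching are assumptions rather than behaviour stated in this PR; only the `DEFAULT_ADMIN_EMAILS` variable and its comma-separated format come from the description.

```python
import os


def parse_admin_emails(raw: str) -> set[str]:
    """Split a comma-separated allowlist into a normalized set."""
    # Assumption: entries are trimmed and lower-cased before comparison.
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def should_promote_to_admin(email: str, allowlist: set[str]) -> bool:
    """Return True when the signed-in address is on the allowlist."""
    return email.strip().lower() in allowlist


# DEFAULT_ADMIN_EMAILS is the env var named in this PR; an unset variable
# yields an empty allowlist, so nobody is promoted.
allowlist = parse_admin_emails(os.environ.get("DEFAULT_ADMIN_EMAILS", ""))
```

An empty or unset variable produces an empty set, which keeps the "no allowlist configured" case from promoting anyone by accident.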

Kubernetes (darwin-kubernetes/)

  • env-configmap.yaml: AUTH_TYPE=oidc, OPENID_CONFIG_URL (Entra discovery), DEFAULT_ADMIN_EMAILS; WEB_DOMAIN/DOMAIN set to the external https:// origin (required for a correct OIDC redirect_uri and a Secure session cookie).
  • api_server-service-deployment.yaml: inject OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, USER_AUTH_SECRET from the danswer-secrets secret via secretKeyRef.
  • secrets.yaml: stub values replaced with documented placeholders + a "do not commit real secrets" header — real values are applied out-of-band.

Chat & search UX

  • Persona scoping is now an outer fence: persona document_sets intersect with user-applied filters server-side in search/preprocessing/preprocessing.py (the input-bar "Scope" chips are gone but the scoping itself is unchanged).
  • Search-mode framing on the default persona; assistant scope chip; starter prompts; sidebar timestamps on chat history; Cmd+K to start a new chat; distinct assistant message styling; 3-step onboarding cards on the empty chat page; larger chat input.
  • Searchable / scrollable knowledge-set picker. FiltersTab.tsx and ChatFilters.tsx were significantly rewritten (−188 / −176 lines) around the new picker. Tag filters were removed.
  • /search hidden from nav (still reachable by URL); removed the redundant top-left assistant selector; highlight applied filters.
  • web/next.config.js: dropped the 308 stream redirects that were stripping the session cookie.
  • Switching assistants now shows an auto-expiring toast explaining that each chat is bound to a single assistant (and that any attached files need to be re-uploaded). Previously this silently created and navigated to a new chat session.
  • Fresh installs default to the chat page (Settings.default_page = CHAT). Existing deployments keep their persisted value — change via Admin → Settings.
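
The "outer fence" rule for persona scoping can be sketched as a plain intersection. This is an illustration only: `effective_document_sets` is a hypothetical name, and the handling of empty inputs (an unscoped persona or a user with no filters) is an assumption, not a description of the code in search/preprocessing/preprocessing.py.

```python
from typing import Optional


def effective_document_sets(
    persona_sets: Optional[list[str]],
    user_sets: Optional[list[str]],
) -> Optional[list[str]]:
    """Combine persona scope with user filters; None means "no restriction"."""
    if not persona_sets:
        # Assumption: an unscoped persona leaves the user's filters as-is.
        return user_sets or None
    if not user_sets:
        # No user filter: the persona's own sets are the whole scope.
        return list(persona_sets)
    allowed = set(persona_sets)
    # The user can narrow the persona's scope but never widen it.
    return [name for name in user_sets if name in allowed]
```

Note that an empty list is a meaningful result here: it says the user asked for sets the persona does not cover, so a caller would need to treat it as "match nothing" rather than "no restriction".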

Bug fixes

  • Chat doc-render crash (web/src/components/search/DocumentDisplay.tsx): fall back to blurb when match_highlights is a non-empty array of only falsy/whitespace strings. Previously sections[0][2] threw Cannot read properties of undefined (reading '2'), crashing the chat doc sidebar and the search page when retrieved docs had empty highlights — more likely with large or many-doc contexts.
  • Slack code fences: strip the language token off opening fences in build_qa_response_blocks. Slack mrkdwn has no info string, so the language was rendering as a literal first line of the block. (Slack still cannot syntax-highlight; that's a platform limit.)
  • Persona "Scope" chips removed from the chat input bar (SelectedFilterDisplay, ChatInputBar) — cosmetic only; server-side scoping unchanged.
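
The Slack code-fence fix can be pictured with a small regex pass. This is a sketch under assumptions: `strip_fence_language` and the exact pattern are hypothetical and are not taken from build_qa_response_blocks. It treats any fence line that carries a language token as an opening fence.

```python
import re

# Built with multiplication so the example never contains a literal fence.
FENCE = "`" * 3

# An opening fence here is a fence followed by a language token, alone on a line.
_OPENING_FENCE = re.compile(r"^`{3}[ \t]*[\w+#.-]+[ \t]*$", re.MULTILINE)


def strip_fence_language(text: str) -> str:
    """Replace fence-plus-language opening lines with a bare fence."""
    return _OPENING_FENCE.sub(FENCE, text)


message = FENCE + "bash\nls -la\n" + FENCE
cleaned = strip_fence_language(message)
```

Closing fences carry no token, so the pattern leaves them alone; text without fences passes through unchanged.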

Security / permissions

  • mark_persona_as_deleted (backend/danswer/db/persona.py) now returns 403 for non-admins on default_persona or ownerless (user_id IS NULL) personas, mirroring the frontend's !default_persona rule. Closes a gap where a basic user could soft-delete a shared default assistant for everyone via a direct API call (get_persona_by_id grants non-admins access to ownerless personas).
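
A minimal sketch of the guard's logic follows. `PersonaRow`, `Forbidden`, and `assert_can_delete` are hypothetical stand-ins (the real check lives in mark_persona_as_deleted and surfaces as a 403); only the rule itself, that non-admins are blocked on default or ownerless personas, comes from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonaRow:
    """Hypothetical stand-in for the persona fields the guard inspects."""
    id: int
    default_persona: bool
    user_id: Optional[str]  # None models an ownerless (shared) persona


class Forbidden(Exception):
    """Stand-in for the HTTP 403 the real endpoint returns."""


def assert_can_delete(persona: PersonaRow, is_admin: bool) -> None:
    """Raise Forbidden when a non-admin targets a default or ownerless persona."""
    if is_admin:
        return
    if persona.default_persona or persona.user_id is None:
        raise Forbidden("only admins may delete default or shared assistants")
```

Putting the check on the server side matters because the frontend rule alone does not stop a direct API call.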

Slack bot

  • Configure LLM models per channel (SlackBotConfigCreationForm.tsx, server/manage/slack_bot.py, server/manage/models.py).

Tooling

  • Track .mcp.json (shared Playwright MCP server) so the browser-driven debugging setup is reproducible across the team.
  • .gitignore — ignore .playwright-mcp/ session output and the ad-hoc model-picker-open.png screenshot (they are local artifacts, not source).
  • Removed a stale Playwright MCP log file.

Configuration required for reviewers / operators

Before this can be deployed, the following must be set:

| Setting | Where | Notes |
| --- | --- | --- |
| `OPENID_CONFIG_URL` | env-configmap | Entra `.well-known/openid-configuration` URL for the tenant |
| `DEFAULT_ADMIN_EMAILS` | env-configmap | Comma-separated allowlist |
| `WEB_DOMAIN` / `DOMAIN` | env-configmap | External `https://` origin |
| `OAUTH_CLIENT_ID` / `OAUTH_CLIENT_SECRET` / `USER_AUTH_SECRET` | `danswer-secrets` | Applied out-of-band; placeholders in `secrets.yaml` |

Footguns

  • Don't bump bcrypt past 4.0.1 without also fixing passlib (1.7.4 is incompatible with 4.1+).
  • Don't re-introduce the 308 stream redirects in `next.config.js` — they stripped the session cookie.
  • Setting `WEB_DOMAIN` incorrectly will break the OIDC `redirect_uri`.
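
An operator could catch missing settings and the `WEB_DOMAIN` footgun before rollout with a small preflight check. The sketch below is not part of this PR: `check_oidc_settings` is a hypothetical helper, and it only verifies that the listed variables are non-empty and that `WEB_DOMAIN` parses as an `https://` origin.

```python
from urllib.parse import urlparse

# Setting names are the ones listed in this PR's configuration table.
REQUIRED = (
    "OPENID_CONFIG_URL",
    "DEFAULT_ADMIN_EMAILS",
    "WEB_DOMAIN",
    "OAUTH_CLIENT_ID",
    "OAUTH_CLIENT_SECRET",
    "USER_AUTH_SECRET",
)


def check_oidc_settings(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    web_domain = env.get("WEB_DOMAIN", "")
    if web_domain:
        parsed = urlparse(web_domain)
        # A non-https origin would yield a wrong redirect_uri and a
        # session cookie that cannot be marked Secure.
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append("WEB_DOMAIN must be an external https:// origin")
    return problems
```

It could be run as `check_oidc_settings(dict(os.environ))` in a startup hook or a CI step; the same origin check would apply to `DOMAIN` if that variable is set separately.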

Commits

| SHA | Subject |
| --- | --- |
| `3597c574` | Add option to configure models for channel |
| `21dd20a9` | Remove stale Playwright MCP log file |
| `d8ac3fb5` | Add Entra OIDC auth and chat/search UX improvements |
| `f04b2c74` | Fix chat doc-render crash, Slack code fences, drop persona scope chips |
| `a6461d51` | k8s: wire Entra OIDC auth config (placeholders only, no real secrets) |
| `df883fe2` | chat: toast when switching assistants starts a new session |
| `7e053e85` | settings: default fresh installs to the chat page |
| `92500872` | persona: block non-admins from deleting default/shared assistants |
| `864c513c` | tooling: add Playwright MCP config, ignore its session artifacts |

🤖 Generated with Claude Code

Sarath1018 and others added 10 commits May 25, 2026 15:40
Auth:
- Wire Microsoft/Entra OIDC directly into the OSS app (no ee/ dependency):
  OIDC router in main.py, AuthType.OIDC allowlisted, public auth routes
  registered, OPENID_CONFIG_URL + DEFAULT_ADMIN_EMAILS env vars.
- Auto-verify OIDC users in oauth_callback; env-driven admin allowlist.
- Pin bcrypt==4.0.1 (passlib 1.7.4 incompatible with bcrypt 4.1+).

Chat/search UX:
- Persona document_sets act as an outer fence (intersect with user filters).
- Search-mode framing on the default persona; assistant scope chip; starter
  prompts; sidebar timestamps; Cmd+K new chat; distinct assistant message
  styling; 3-step onboarding cards; larger chat input.
- Searchable/scrollable knowledge-set picker; removed tag filters.
- Hide /search from nav (still reachable by URL); remove redundant top-left
  assistant selector; highlight applied filters.
- next.config.js: drop 308 stream redirects that stripped the session cookie.

Docs: AGENTS.md + CONTRIBUTING.md updated for OIDC setup and the new footguns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- DocumentDisplay: fall back to blurb when match_highlights is a non-empty
  array of only falsy/whitespace strings. Previously sections stayed empty and
  sections[0][2] threw "Cannot read properties of undefined (reading '2')",
  crashing the chat doc sidebar (and search page) when retrieved docs had empty
  highlights — more likely with large/many-doc contexts.

- Slack blocks: strip the language token off opening code fences (```bash ->
  ```) in build_qa_response_blocks. Slack mrkdwn has no fenced-code info string,
  so the language rendered as a literal first line of the block. (Slack still
  cannot syntax-highlight; that's a platform limit.)

- SelectedFilterDisplay/ChatInputBar: remove the locked persona "Scope" chips
  from the chat input bar. Cosmetic only — the assistant still scopes search to
  its document sets server-side in search/preprocessing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- env-configmap: AUTH_TYPE=oidc, OPENID_CONFIG_URL (Entra discovery),
  DEFAULT_ADMIN_EMAILS, and set WEB_DOMAIN/DOMAIN to the external https origin
  (required for a correct OIDC redirect_uri and Secure session cookie).
- api_server deployment: inject OAUTH_CLIENT_ID/OAUTH_CLIENT_SECRET/
  USER_AUTH_SECRET from the danswer-secrets secret via secretKeyRef.
- secrets.yaml: replace stub values with documented placeholders and a
  "do not commit real secrets" header; real values applied out-of-band.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switching the assistant silently created and navigated to a new chat session with no feedback. Show an auto-expiring toast explaining each chat is bound to a single assistant (and to re-upload any attached files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settings.default_page now defaults to CHAT instead of SEARCH. Only affects deployments with no stored settings yet; existing deployments keep their persisted value (change via Admin -> Settings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
get_persona_by_id grants non-admins access to ownerless personas (user_id IS NULL), which includes the shared default assistants. Guard mark_persona_as_deleted so a basic user gets 403 for default/ownerless personas, mirroring the frontend's !default_persona rule. Closes a gap where a basic user could soft-delete a default assistant for everyone via a direct API call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Track .mcp.json (shared Playwright MCP server) so the browser-driven
debugging setup is reproducible. Gitignore .playwright-mcp/ and the
ad-hoc screenshot, which are local session output, not source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Out of scope for the OIDC / UX work; revisit separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rajivml rajivml merged commit 9f0c210 into feature/darwin May 28, 2026
5 of 6 checks passed
rajivml added a commit that referenced this pull request May 29, 2026
rajiv/add-claude was merged to feature/darwin upstream, so the doc's
"on top of rajiv/add-claude (PR #45)" reference is stale. The branch
is now rebased onto origin/feature/darwin directly — same diff, just
a fresher base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rajivml added a commit that referenced this pull request Jun 3, 2026
rajiv/add-claude was merged to feature/darwin upstream, so the doc's
"on top of rajiv/add-claude (PR #45)" reference is stale. The branch
is now rebased onto origin/feature/darwin directly — same diff, just
a fresher base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rajivml added a commit that referenced this pull request Jun 3, 2026
…, retention + connector reliability (#46)

* docs: add Redis caching & scaling plan

Plan for exposing chat to a few hundred users: P0 connection-pool/session
fix, P1 Redis foundation + DynamicConfigStore read-through cache, P2
per-user request rate limiting, P3 per-chat-turn config caches. Plan only,
no implementation yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* P1: Redis foundation + read-through KV cache

Foundation for caching/rate-limiting work (see REDIS_CACHING_PLAN.md).
This commit only ships the cache piece — no behavioural change unless
REDIS_KV_CACHE_ENABLED=true is set.

* requirements: pin redis==5.0.8.
* configs/app_configs.py: REDIS_HOST/PORT/PASSWORD/DB_NUMBER/SSL,
  REDIS_POOL_MAX_CONNECTIONS, REDIS_HEALTH_CHECK_INTERVAL,
  REDIS_SOCKET_TIMEOUT_SECONDS; cache toggle REDIS_KV_CACHE_ENABLED
  (default OFF) and REDIS_KV_CACHE_TTL_SECONDS (1 day).
* danswer/redis/redis_pool.py: lazy ConnectionPool singleton +
  get_redis_client() helper. Single-tenant — DANSWER_REDIS_KEY_PREFIX
  is the only namespace; upstream's TenantRedisClient is intentionally
  not ported.
* dynamic_configs/store.py: RedisCachedDynamicConfigStore wraps any
  inner DynamicConfigStore with read-through / write-through caching.
  Inner store stays the source of truth (writes inner first), encrypted
  values are NEVER cached plaintext (just invalidated), every Redis
  call is fail-open so an outage degrades latency, not availability.
* dynamic_configs/factory.py: when REDIS_KV_CACHE_ENABLED, transparently
  wraps the existing PostgresBackedDynamicConfigStore — call sites
  unchanged.
* Deployment: redis service in docker-compose.dev.yml (cache-only:
  no AOF, no RDB snapshots, allkeys-lru @ 256mb so a runaway producer
  can't OOM the node). darwin-kubernetes/redis-statefulset.yaml mirrors
  that posture. REDIS_HOST etc. in env-configmap; REDIS_PASSWORD wired
  via optional secretKeyRef so the deployments still boot when Redis
  is unauth'd. NOT the Celery broker — that stays on Postgres by design.
* backend/.gitignore: ignore stray pywikibot apicache/throttle.ctrl
  artifacts dropped by the existing mediawiki test.
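
The caching behaviour described above (read-through, inner store written first, encrypted values only invalidated, fail-open) can be sketched without a real Redis. Everything named here is hypothetical: `DictCache` stands in for a Redis client, `CachedStore` for the wrapper, and a plain dict for the inner store. Whether an unencrypted write refreshes the cache entry or merely invalidates it is an assumption.

```python
import json
from typing import Any, Optional


class DictCache:
    """In-memory stand-in for a Redis client (get/set/delete only)."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def set(self, key: str, value: str, ex: Optional[int] = None) -> None:
        self.data[key] = value  # a real client would honour the TTL in `ex`

    def delete(self, key: str) -> None:
        self.data.pop(key, None)


class CachedStore:
    """Read-through wrapper; the inner store stays the source of truth."""

    def __init__(self, inner: dict[str, Any], cache: Any, ttl: int = 86400) -> None:
        self.inner, self.cache, self.ttl = inner, cache, ttl

    def load(self, key: str) -> Any:
        try:
            raw = self.cache.get(key)
            if raw is not None:
                return json.loads(raw)  # a cached "null" is a hit, not a miss
        except Exception:
            pass  # fail-open: cache errors and corrupt entries count as a miss
        value = self.inner[key]  # KeyError propagates, so not-found is never cached
        try:
            self.cache.set(key, json.dumps(value), ex=self.ttl)
        except Exception:
            pass
        return value

    def store(self, key: str, value: Any, encrypt: bool = False) -> None:
        self.inner[key] = value  # write the source of truth first
        try:
            if encrypt:
                self.cache.delete(key)  # never mirror encrypted values in plaintext
            else:
                self.cache.set(key, json.dumps(value), ex=self.ttl)
        except Exception:
            pass  # a failed cache update must not fail the write
```

Because every cache call is wrapped, an outage only costs the extra trip to the inner store.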

Tests (unittest, no real Redis required — mocks at the get_redis_client
boundary):
  - tests/.../redis_layer/test_redis_pool.py: pool singleton, prefix
    constant, reset_pool_for_tests.
  - tests/.../dynamic_configs/test_redis_cached_store.py: read-through,
    write-through invalidation, TTL on SET, cached-None vs miss,
    not-found NOT cached, encrypted values not mirrored, corrupt entry
    treated as miss, fail-open on Redis errors. 13 cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* P2: per-user request rate limiter on chat/query endpoints

Layered on top of P1's Redis client. Complements the existing token-
budget limiter (token_limit.py) — that's a DB-backed COST cap, this is
a Redis-backed REQUEST-COUNT cap that's correct across api_server
replicas. Both run; this one runs first so a 429'd caller never even
touches the DB-backed usage query.

Default OFF. Enable per-environment via:
  REQUEST_RATE_LIMIT_ENABLED=true
  REQUEST_RATE_LIMIT_PER_MINUTE=<N>   # 0 = disable that window
  REQUEST_RATE_LIMIT_PER_HOUR=<N>

* server/middleware/request_rate_limit.py: fixed-window buckets keyed
  by floor(time/window). Atomic INCR + EXPIRE NX so the bucket
  boundary is fixed on first increment (without NX, every request
  would push expiry forward and the bucket would never reset — that
  bug is covered by an explicit test). Authenticated users keyed by
  uuid; anonymous keyed by the first X-Forwarded-For hop, falling back
  to the socket peer; if neither yields an IP we skip (better than
  bucketing every anonymous request under "").
* Fail-OPEN on any Redis error: a Redis blip lets requests through with
  a warning, never wedges the chat path.
* 429 response carries a Retry-After header with seconds-until-bucket-
  rollover so well-behaved clients back off precisely.
* Wired as a FastAPI Depends on:
    POST /chat/send-message
    POST /direct-qa/stream-answer-with-quote
  Both endpoints also keep the existing check_token_rate_limits.
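
The fixed-window scheme can be sketched as follows. This is an illustration under assumptions, not the code in request_rate_limit.py: `FakeRedis`, `check_rate_limit`, and the key layout are hypothetical, and the two calls run one after another here, whereas the commit describes an atomic INCR + EXPIRE NX.

```python
import math
import time
from typing import Optional


class FakeRedis:
    """Tiny in-memory stand-in exposing INCR and EXPIRE-with-NX semantics."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}
        self.expiry: dict[str, int] = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int, nx: bool = False) -> bool:
        if nx and key in self.expiry:
            return False  # NX: keep the expiry set by the first increment
        self.expiry[key] = seconds
        return True


def check_rate_limit(client, user_key: str, limit: int, window: int,
                     now: Optional[float] = None) -> Optional[int]:
    """Return None when allowed, else a Retry-After value in seconds."""
    if limit <= 0:
        return None  # 0 disables this window
    now = time.time() if now is None else now
    bucket = int(now // window)  # fixed window: floor(time / window)
    key = f"ratelimit:{user_key}:{window}:{bucket}"  # hypothetical key layout
    try:
        count = client.incr(key)
        client.expire(key, window, nx=True)
    except Exception:
        return None  # fail-open on any cache error
    if count > limit:
        # Seconds until the current bucket rolls over.
        return max(1, math.ceil((bucket + 1) * window - now))
    return None
```

A caller would turn a non-None result into a 429 response with that number in the Retry-After header.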

Tests (unittest, mocked Redis pipeline — no real Redis required):
  - default-OFF short-circuits before any Redis call (covers both
    REQUEST_RATE_LIMIT_ENABLED=false AND both windows = 0).
  - within-limit: N requests under cap all allowed.
  - over-limit raises 429 with Retry-After header.
  - per-user isolation: distinct users have independent counters.
  - bucket rollover resets count (time-mocked).
  - EXPIRE NX semantics — locks down the no-sliding-TTL invariant.
  - anonymous keyed by XFF first hop; no-IP skips silently.
  - fail-open: Redis error doesn't propagate. 9 cases total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Persona list cache with explicit write-through invalidation

GET /persona (Manage Assistants → "View available assistants") fires
get_personas(user_id, ...) — a multi-OR permission-filtered query joining
Persona, Persona__User, Persona__UserGroup, User__UserGroup. At hundreds
of concurrent users opening the chat UI around the same time, the burst
puts unnecessary pressure on the DB connection pool (which is the actual
scaling ceiling for streaming chat — see REDIS_CACHING_PLAN.md).

Design: global cache + per-user filter (not per-user response cache),
so the multi-user-burst pattern collapses 200 queries into ~1:

  danswer:personas:all:not_deleted   global, all PersonaSnapshot
                                     including is_public / users / groups
                                     (PersonaSnapshot already carries the
                                     permission inputs — no separate
                                     payload shape needed)
  danswer:personas:groups:{user_id}  per-user, list[int] of group ids

At request time the cached list is filtered in Python mirroring the SQL
OR-block exactly:
  is_public
  OR user.id IN persona.users.id
  OR (user_group_ids ∩ persona.groups)
The parity vs SQL is locked down by tests (one per branch + negative).
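
A minimal sketch of that in-Python filter is shown below. `CachedPersona` is a hypothetical subset of PersonaSnapshot carrying only the permission inputs, and `visible_personas` is an illustrative name; the three OR branches are the ones listed above.

```python
from dataclasses import dataclass, field


@dataclass
class CachedPersona:
    """Hypothetical subset of the cached snapshot: only the permission inputs."""
    id: int
    is_public: bool
    user_ids: set[str] = field(default_factory=set)
    group_ids: set[int] = field(default_factory=set)


def visible_personas(
    personas: list[CachedPersona],
    user_id: str,
    user_group_ids: set[int],
) -> list[CachedPersona]:
    """Apply the same OR-block as the SQL query to an already-cached list."""
    return [
        p for p in personas
        if p.is_public                          # branch 1: public persona
        or user_id in p.user_ids                # branch 2: shared with this user
        or bool(user_group_ids & p.group_ids)   # branch 3: shared with a group
    ]
```

Keeping the filter this close to the SQL shape is what makes one-test-per-branch parity checks practical.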

Invalidation is explicit + write-through:
  - 9 mutation paths in db/persona.py call invalidate_personas_all()
    AFTER db_session.commit() (after-commit ordering avoids stale-fill
    race during open transactions).
  - 3 paths in ee/danswer/db/user_group.py (insert/update/prepare-delete)
    call invalidate_user_groups(uid) for each affected user.
  - 24h TTL is ONLY a safety net for missed busts; primary mechanism is
    explicit so persona/group edits are visible immediately.
  - Default OFF (PERSONA_CACHE_ENABLED=false); enable per environment.
  - Fail-OPEN on every Redis op: a Redis outage degrades latency, not
    availability, and a failed bust doesn't roll back the DB write.
  - include_deleted=True falls through to direct DB (uncommon shape;
    we deliberately don't cache it).

Encrypted values: N/A — PersonaSnapshot has no encryption-at-rest
guarantee to bypass (unlike the KV store layer from P1).

Tests (17, mocked Redis + db boundary, no real services):
  - 6 filter-parity cases (one per SQL branch + mixed + zero-groups edge)
  - 2 user-group cache cases (miss/hit, TTL propagation)
  - 3 routing cases (disabled fallthrough, include_deleted bypass, admin
    user_id=None path skips group lookup)
  - 4 invalidation cases (right key for each side, disabled short-circuit,
    Redis-error-during-bust swallowed)
  - 2 fail-open-on-read cases (GET error → miss, corrupt entry → miss)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Manage Assistants page UX overhaul

Replaces the prior "move up / move down inside a 3-dot popover" flow on
/assistants/mine with eight coordinated changes. Backend unchanged — the
existing PATCH /api/user/assistant-list endpoint already accepts the full
chosen_assistants array, so every interaction lands as one optimistic
local update + one PATCH + (on failure) a rollback.

  1. Drag-and-drop reorder via @dnd-kit (already in package.json) with a
     grab handle on each visible row. Pointer activation distance of 6px
     so clicks on the handle don't accidentally start a drag. Keyboard
     reordering comes for free via dnd-kit's default activator focus.

  2. Explicit "set as default" — pin/star icon on each visible row;
     filled when the row is the user's default (position 0 of
     chosen_assistants), with an accent border + "Default" chip on that
     row. Ordering and default are now orthogonal — reorder freely
     without accidentally changing your default.

  3. Visibility as a row-level switch instead of a buried "Hide / Remove"
     popover item. One unified list with a "Hidden (N)" divider; hidden
     rows render at reduced opacity and have no drag handle (no position
     to drag to). The prior separate "Active Assistants" / "Your Hidden
     Assistants" sections collapse into this single list. Refuses to
     hide the last visible row (can't ship the user a broken picker).

  4. Client-side search filter — matches name, description, or tool name.
     Applies to both visible and hidden sections so search-then-toggle
     for "where did I put X" is one motion.

  5. Information density rebalanced. Description is now the primary
     signal (was the smallest text). Tools/sources collapse into compact
     "{n} tools" / "{n} sources" chips so the row scans for "should I
     pick this?" not "what are its internals?". Full tool list reveals
     on hover via title attribute.

  6. Bulk select column + sticky action bar. Checkbox appears on hover
     or focus and stays visible when selected. Action bar shows
     Show / Hide / Remove + Clear when anything is selected. Refuses
     bulk-hide that would empty the visible list.

  7. Header cleanup: title + 1-line subtitle + Create button top-right,
     "Browse all available" as a text link. The prior two giant nav
     tiles + paragraph of explanatory copy are gone — recovers vertical
     space on a page whose real content is the list.

  8. Undo on every state-mutating toast (reorder / set-default / hide /
     show / bulk ops). PopupSpec gains an optional `undo: { onClick }`
     field; the popup stays on screen 6s instead of 4s when undoable so
     the user has time to react. Undo restores the prior chosen_assistants
     array via another PATCH — symmetric round-trip, no special endpoint.

New helpers in lib/assistants/updateAssistantPreferences.ts:
  reorderAssistantList(newOrder)      — full-array PATCH (drag, undo)
  setDefaultAssistant(id, list)       — move id to position 0
  bulkRemoveFromList(ids, list)       — set difference
  bulkAddToList(ids, list)            — set union, appended at end

The pre-existing moveAssistantUp/Down helpers are kept (other callers
may still import them) but no longer used on this page.
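
The list semantics of those helpers are easy to state in isolation. The sketch below is written in Python for illustration only (the real helpers are TypeScript and also issue the PATCH); the function names are hypothetical translations, and the choice to insert an id that is not yet in the list when setting the default is an assumption.

```python
def set_default_assistant(assistant_id: int, chosen: list[int]) -> list[int]:
    """Move assistant_id to position 0, keeping the others in order."""
    return [assistant_id] + [a for a in chosen if a != assistant_id]


def bulk_remove_from_list(ids: list[int], chosen: list[int]) -> list[int]:
    """Order-preserving set difference."""
    drop = set(ids)
    return [a for a in chosen if a not in drop]


def bulk_add_to_list(ids: list[int], chosen: list[int]) -> list[int]:
    """Order-preserving set union; new ids are appended at the end."""
    seen = set(chosen)
    added = []
    for a in ids:
        if a not in seen:
            seen.add(a)
            added.append(a)
    return chosen + added
```

Each function returns a new list rather than mutating its input, which is what makes the undo path a simple matter of sending the previous array back.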

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Assistant Gallery page UX overhaul

Sister rework to the Manage Assistants page. With 50+ accessible
assistants and growing, the old flat 2-column grid had no hierarchy and
no status signal — every card looked identical regardless of whether it
was yours, shared, public, or already in your picker. Same conceptual
fix as Manage: give the page structure so scanning answers "what's
mine?", "what's new?", "what does this one do?".

Backend unchanged — every interaction PATCHes chosen_assistants via the
existing /api/user/assistant-list endpoint (same path the Manage page
uses). All mutations are optimistic + undoable.

Changes (numbers map to the design proposal):
  1. Per-card "In your picker" badge + muted card style when added.
     The eye now finds the un-added ones at a glance.
  2. Three implicit sections: Yours / Shared with you / Featured & Built-in.
     Empty sections hide; section headers carry counts.
  3. Filter chip rows: availability (All / Available to add / Already
     added) with live counts, plus auto-generated per-tool chips for
     tools that appear in ≥2 assistants (avoids chip-bloat as the
     dataset grows). Tool filters use OR semantics.
  4. Owner display: best-effort name from the email local-part
     (split on '@', dots/underscores→spaces) with a "Built-in" badge
     for default_persona assistants. Kills the fork-specific
     "Author: Darwin" magic string.
  6. Responsive grid: 1 / 2 / 3 / 4 cols by breakpoint.
  7. Header matches the Manage rebuild — title + subtitle + Create
     button top-right, "Back to my assistants" as a text link. Cut the
     giant centered nav button and the explanatory paragraph.
  8. Sort dropdown: Featured (API order, respects admin display_priority)
     / A → Z / Newly added (id desc proxy for recency).
  9. Search broadened to name + description + tool names + document-set
     names. Empty-result state with a real "Clear all filters" button.
  10. Compact "{n} tools" / "{n} sources" chips with hover-reveal of
      the full tool list. Flat Add/Remove buttons replace Tremor's
      color="green/red" which was visually shoutier than the action.
  11. Design tokens fixed — border-border / focus-ring-accent in place
      of hardcoded gray-300 / blue-500. Consistent with the rest of the
      app.
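
The owner-name derivation in item 4 could look roughly like this. `owner_display_name` is a hypothetical name, collapsing runs of separators is an assumption, and no capitalisation is applied because the commit does not mention any.

```python
import re


def owner_display_name(email: str) -> str:
    """Best-effort name: local part of the address with '.' and '_' as spaces."""
    local_part = email.split("@", 1)[0]
    # Assumption: consecutive separators collapse into a single space.
    return re.sub(r"[._]+", " ", local_part).strip()
```

It is only a display heuristic: addresses such as shared mailboxes will still produce odd-looking names.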

Skipped (per proposal):
  - #5 detail drawer / modal — revisit after observing how users use
    the new grid; bigger feature.
  - Bulk select — adding 5 assistants at once isn't a real use case
    here (bulk hide on Manage was).

The pre-existing addAssistantToList / removeAssistantFromList helpers
are kept and used at the call sites. The shared reorderAssistantList
helper added in the prior commit is reused for the undo paths.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add backend/scripts/seed_assistants.py for local UX testing

Dev-only seed script that creates N varied personas in the local DB
for exercising the redesigned gallery / manage pages. Refuses to run
when POSTGRES_HOST looks like a managed/prod database (azure.com,
amazonaws.com, .cloud., "prod", etc.) — guard against pointing this
at the wrong env by accident.
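
A guard of this kind might be as simple as a substring check. The sketch below is an assumption about the shape of the check, not the script's actual code: `looks_like_managed_db` is a hypothetical name, and the marker tuple contains only the four examples named above (the "etc." is left unfilled).

```python
# Only the markers named in the commit message; the real list may be longer.
MANAGED_HOST_MARKERS = ("azure.com", "amazonaws.com", ".cloud.", "prod")


def looks_like_managed_db(host: str) -> bool:
    """True when the host name contains any marker that suggests non-local use."""
    lowered = host.lower()
    return any(marker in lowered for marker in MANAGED_HOST_MARKERS)
```

A substring match errs on the side of refusing: a local host whose name happens to contain "prod" would also be rejected, which is the safer failure mode for a seed script.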

Mix is designed to populate each section of the new gallery:
  ~30% "Yours"           — owned by target user, private
  ~20% "Shared with you" — owned by another user, target user in users[]
  ~50% "Featured"        — public, no specific owner

Per row randomly attaches 0–3 tools and 0–2 document sets so the {n}
tools / {n} sources chips render with variety. Half of "Yours" auto-
land in chosen_assistants (and all "Shared with you" do), so the
"Already added" vs "Available to add" filter chips have content on
both sides without manual setup.

60 distinct names + 30 description templates so 50 rows feel populated
and varied. Uses a fixed RNG seed by default (deterministic across runs).
Name prefix "[seed] " makes rows easy to spot and to wipe via --clear.

Usage:
  cd backend && source ../.venv/bin/activate
  python -m scripts.seed_assistants --email you@example.com
  python -m scripts.seed_assistants --clear     # wipe and re-seed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Assistants UX polish: toggle highlight + gallery declutter

Two follow-ups on the assistants UX rework, both from user feedback.

Manage (/assistants/mine):
  * Toggle now accepts a `highlight` prop that draws a transient ring +
    slight scale-up on the switch. Used by hidden rows so a click
    anywhere on the (faded) row body flashes the toggle for ~1.2s,
    pointing the eye at the action that brings the assistant back.
    Doesn't auto-enable on body click — surprising a user mid-read into
    enabling would be a worse outcome than the discoverability gap.
  * Restructured opacity: only the content column (icon + name +
    description + chips) fades when a row is hidden. The action zone
    (checkbox, drag-slot, pin, toggle, share, edit) stays at full
    opacity so the toggle is the bright, clickable target on a dim row.
    Previously the parent opacity-50 cascaded to every child, making
    the toggle the dimmest thing on the dimmest row.
  * stopPropagation on the action zone so clicks on buttons inside it
    don't trigger the row-body flash handler.

Gallery (/assistants/gallery):
  * Removed all tool-related UI per user request — the page is for
    browsing assistants, and tool filter chips + per-card "{n} tools"
    pulled focus from the assistant itself. Gone: the auto-generated
    tool filter chip row, the per-card tools chip, the toolDisplayName
    / toolIcon helpers, the FiTool / FiImage / FiCheck imports, and
    the toolFilters state + commonTools memo. The search haystack is now
    name + description + document-set names (no tool names).
  * Dropped the absolute top-right "In your picker" badge. The muted
    card style (border + opacity-75) plus the "Remove" button in the
    footer already signal "added"; the badge ate horizontal space
    (pr-24 on the header reserved 96px) and crowded the title at
    narrower widths. Removed the pr-24 reservation now that nothing
    overlays the header.
  * Grid capped at `1 / 2 / 3` cols — 1 on mobile, 2 on most laptops
    and standard desktops, 3 only at `2xl` (≥1536px). Previously
    1/2/3/4 with the 4-col breakpoint making cards cramped and hard
    to read once descriptions hit their 3-line clamp.
  * Bumped card padding p-4 → p-5 and description line-height to
    leading-relaxed for breathing room.
  * Updated clearAllFilters / hasAnyFilter to drop the toolFilters
    references (now dead).

Verified: npx tsc --noEmit clean across web/ (0 errors), zero stray
references to the removed helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove tools chip from Manage Assistants page

Mirrors the gallery treatment from the previous commit. The user
reported tool execution isn't reliable yet, and surfacing "{n} tools"
on assistant rows misleads users into picking an assistant for a
capability that may not work in practice.

Dropped: the {n} tools Bubble in the row's chip block, the toolCount
derivation, and the FiTool import. The {n} sources chip stays — it's
about the assistant's knowledge scope, which works fine.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Parameterize gallery grid column count (default 3)

AssistantsGallery now accepts an optional `columns` prop (default 3,
supported 1-5). Responsive scaling below the widest breakpoint is
fixed per row of GRID_CLASSES — each row is a complete static
Tailwind class string so the purge step actually emits the classes
(dynamic `md:grid-cols-${n}` would silently disappear at build).

Unsupported values fall back to the default rather than rendering
broken — a bad prop here shouldn't break the page.

The single existing caller (page.tsx) doesn't pass columns, so it
gets the default 3 — same layout as before. To switch to 4 columns
on a wide-monitor deployment: `<AssistantsGallery columns={4} ... />`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Show document-set names on assistant cards (was: count only)

A "{n} sources" chip told users how MUCH knowledge an assistant had
access to but not WHICH knowledge — defeating the point of the chip
for someone deciding "which assistant should I pick for this task?".

Both the Manage page row and the Gallery card now render one Bubble
per document-set name, capped at MAX_VISIBLE_DOC_SETS (3). When an
assistant points at more than that, a "+N more" pill collects the
overflow with the rest of the names exposed via the title tooltip,
so we don't blow the card width or row layout at narrower column
counts.

Each name chip caps at a max-width with truncate + a hover title,
so a single absurdly long document-set name can't push the actions
off the row.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Gallery: user-controllable column count (segmented control, persists)

Adds a small "Columns 2 | 3 | 4" segmented control at the right end of
the filter row (next to Sort). The pick persists to localStorage
under "danswer:assistants-gallery:columns" so it survives reloads on
the same device.

State precedence:
  user choice (localStorage)   wins
  ↓
  prop `columns` from caller   (default for new users / new device)
  ↓
  DEFAULT_COLUMNS = 3          (final fallback)
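
A minimal sketch of that precedence as a pure function, written in Python for illustration only (the real logic lives in the React component). `resolve_columns` is a hypothetical name, and validating the stored value against the picker options is an assumption, not something the commit states.

```python
DEFAULT_COLUMNS = 3
PICKER_OPTIONS = (2, 3, 4)       # values the control exposes
SUPPORTED_COLUMNS = range(1, 6)  # 1-5, the range GRID_CLASSES supports


def resolve_columns(stored, prop):
    """Pick the column count: stored user choice, then prop, then default."""
    # Stored values arrive as strings; unparseable or unsupported ones are ignored.
    try:
        stored_value = int(stored) if stored is not None else None
    except (TypeError, ValueError):
        stored_value = None
    if stored_value in PICKER_OPTIONS:
        return stored_value
    if prop in SUPPORTED_COLUMNS:
        return prop
    return DEFAULT_COLUMNS
```

For example, `resolve_columns("4", 3)` returns 4, while `resolve_columns(None, 5)` returns 5 because a deployment may still set 5 via the prop.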

The localStorage read happens in a useEffect so SSR + first paint use
the prop value, which avoids a hydration mismatch whenever the stored
value disagrees with the prop. localStorage writes are wrapped in try/catch
because some sandboxed contexts (private modes, restrictive iframes)
throw on access — the control still works for the session, just
doesn't persist there.

Picker is hidden below md (768px) because the layout falls back to
1 column at that width regardless of the chosen value. Exposed
options are 2/3/4 — 1 is mobile-only via responsive, 5 is too cramped
for typical screens (GRID_CLASSES still supports 5 if a deployment
wants to set it via prop).

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Gallery: column picker as dropdown to match Sort

Reverts the segmented "2 | 3 | 4" button group to a single <select>
that mirrors the existing Sort dropdown for visual consistency on the
"view controls" cluster at the right end of the filter row.

Behavior unchanged: pure client-side state + localStorage persistence,
no fetch and no router.refresh() in the column path — the user's
column choice never triggers a backend call.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Scale indexing via remote Dask scheduler topology

Replace the monolithic supervisord background pod with separate
deployments for celery-worker, celery-beat, indexer-scheduler,
dask-scheduler, and dask-worker. The indexer-scheduler now reads
DASK_SCHEDULER_ADDRESS to dispatch run_indexing_entrypoint to a
remote Dask cluster instead of an in-process LocalCluster, so
indexing throughput scales horizontally with dask-worker replicas
instead of being capped by one pod's RAM.

Local dev keeps the LocalCluster path (no env var); a new
scripts/dev_run_dask_distributed.py and docker-compose overlay
reproduce the prod-shape topology without K8s.
scripts/test_dask_distributed_e2e.py exercises the topology
(parallelism, worker death, scheduler death) end-to-end.
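
The mode switch described above can be sketched as follows. This is an illustration under assumptions, not the code in the indexer-scheduler: `resolve_dask_target` is a hypothetical helper, and treating a whitespace-only value as unset is an assumption. Only the DASK_SCHEDULER_ADDRESS name and the remote/LocalCluster split come from the commit.

```python
import os


def resolve_dask_target(environ=None):
    """Return ("remote", address) if DASK_SCHEDULER_ADDRESS is set, else ("local", None)."""
    environ = os.environ if environ is None else environ
    address = (environ.get("DASK_SCHEDULER_ADDRESS") or "").strip()
    if address:
        # Prod shape: dispatch to the remote scheduler at this address.
        return "remote", address
    # Local dev: no env var, keep the in-process LocalCluster path.
    return "local", None
```

With `dask.distributed`, the remote branch would typically map to `Client(address)` and the local branch to a `Client` wrapping a `LocalCluster`; the example address used in testing (`tcp://dask-scheduler:8786`) is a placeholder, not a value from this repo.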

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add MIGRATION.md covering Redis / bg-scaling / UX

Single migration doc covering all three slices on this branch:
  1. Background indexing scaling (Dask topology)
  2. Redis caching + rate limiting
  3. Assistants UX rework

Organized for an operator: TL;DR up top ("everything default OFF"),
new deps/env/secrets summarized, deployment order, verification
checklist BEFORE flipping any flags, per-feature enable steps, and
the known footguns (k8s manifests missing REDIS_PASSWORD env wire-up
in the bg-scaling path, seed script bypassing persona cache, CLAUDE.md
update.py gate). Plus the recommended manual test list and the
branch's commit map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* darwin-kubernetes: port split-background manifests + lock convention in AGENTS.md

The bg-scaling commit (03d1649f) added 5 new k8s manifests under
`deployment/kubernetes/` that split the combined background pod into
beat / celery / indexer-scheduler / dask-scheduler / dask-worker.
But Darwin doesn't apply from `deployment/kubernetes/` — its prod
manifests live under `darwin-kubernetes/`, and the two trees aren't
kept in sync.

Porting all five into `darwin-kubernetes/` with Darwin conventions:
  - Image registry sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend
  - configMapRef env-configmap, secretKeyRef danswer-secrets
  - POSTGRES_USER / POSTGRES_PASSWORD wired everywhere that talks to PG
  - REDIS_PASSWORD wired as optional secretKeyRef (the latent footgun
    flagged in MIGRATION.md §10a is now closed for the Darwin path)
  - indexcpu nodeAffinity + darwin/indexing toleration on every
    indexing-side pod (celery, indexer-scheduler, dask-scheduler,
    dask-worker); beat stays on the default pool (lightweight)
  - dynamic-pvc + file-connector-pvc volume mounts where any task may
    stage files

The existing `darwin-kubernetes/background-deployment.yaml` (combined
beat+celery+indexer via supervisord) is intentionally LEFT IN PLACE —
the split is an opt-in rollout, not a forced cutover. To switch:
apply the new five, verify the new pods are healthy, scale the
combined deployment to 0.

Also lock the convention in AGENTS.md so this doesn't recur:
  - New divergence-table row noting darwin-kubernetes/ is source of
    truth for prod.
  - New "Critical facts that bite" §9 documenting the two-tree split,
    when to touch which, and the per-pod adaptation checklist (image
    registry, configmap, secrets, REDIS_PASSWORD, affinity, PVCs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(MIGRATION.md): reflect the darwin-kubernetes port

§5b — Dask topology section now points at the actual ported
darwin-kubernetes/*.yaml manifests with a concrete cutover script,
not just "you'll need to port these later" boilerplate.

§10a — Footgun is RESOLVED for the Darwin path (the 5 new Darwin
manifests all wire REDIS_PASSWORD via optional secretKeyRef).
Marks the entry as such rather than removing it, so the history of
"why was this previously a concern" stays readable.

§12 — Commit count, file count, and totals updated for the two new
commits (MIGRATION.md itself + the darwin-kubernetes port).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(MIGRATION.md): update base reference after add-claude merge

rajiv/add-claude was merged to feature/darwin upstream, so the doc's
"on top of rajiv/add-claude (PR #45)" reference is stale. The branch
is now rebased onto origin/feature/darwin directly — same diff, just
a fresher base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Introduce k8s/ — kustomize-based manifests replacing darwin-kubernetes/

Single source of truth for both production (Darwin AKS) and local dev,
with image tags + env values + secrets externalised so a deploy is
"edit one file, kubectl apply -k". No Helm. Replaces the flat
darwin-kubernetes/ tree (which the operator will delete once they've
verified the new structure against the live cluster).

Layout:
  k8s/base/                  Cleaned env-neutral manifests (one file per
                             logical service; deployment+service merged
                             where natural). Image refs use logical names
                             (e.g. `danswer-backend`) which overlays
                             rewrite to concrete registry+tag.
  k8s/overlays/prod/         Darwin AKS production:
                               kustomization.yaml  → images, replicas
                               env.properties      → non-secret config
                               secrets.env.example → template (committed)
                               secrets.env         → real values (gitignored)
  k8s/overlays/local/        Same shape, local-dev defaults
                             (host.docker.internal, latest tags,
                              AUTH_TYPE=disabled, smaller replicas).
  k8s/optional/              Opt-in deployments not part of base:
                               redis.yaml
                               background-{beat,celery,indexer-scheduler}.yaml
                               dask-{scheduler,worker}.yaml
                             Apply with `kubectl apply -f <file>` when
                             rolling out the corresponding feature.
  k8s/README.md              Layout explanation + common workflows
                             (image bump, env change, secret rotation,
                             Redis rollout, migrating off darwin-kubernetes/).

Built from the live-cluster dump in darwin-kubernetes/temp/ (gitignored,
never committed). The cleaner script (intentionally not committed)
strips status, uid, resourceVersion, generation, creationTimestamp,
managedFields, last-applied-configuration annotation, restartedAt,
progressDeadlineSeconds, revisionHistoryLimit, and the auto-assigned
clusterIP/ipFamilies/sessionAffinity on Services. Image references in
base/ are normalised to logical names so kustomize can rewrite them.
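
Since the cleaner script is not committed, here is a rough sketch of what such a cleanup could look like for a manifest already parsed into a dict. It is not the actual script: `clean_manifest` and `_drop` are hypothetical names, and the two annotation keys are the standard kubectl ones, assumed to be what the script matched.

```python
import copy

# Field names follow the list above; annotation keys are an assumption.
METADATA_FIELDS = ("uid", "resourceVersion", "generation",
                   "creationTimestamp", "managedFields")
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"
RESTARTED_AT = "kubectl.kubernetes.io/restartedAt"
SERVICE_FIELDS = ("clusterIP", "ipFamilies", "sessionAffinity")
WORKLOAD_FIELDS = ("progressDeadlineSeconds", "revisionHistoryLimit")


def _drop(mapping, keys):
    """Remove each key from mapping if present."""
    for key in keys:
        mapping.pop(key, None)


def clean_manifest(manifest):
    """Return a copy of a dumped manifest without server-populated fields."""
    cleaned = copy.deepcopy(manifest)
    cleaned.pop("status", None)

    metadata = cleaned.get("metadata", {})
    _drop(metadata, METADATA_FIELDS)
    _drop(metadata.get("annotations", {}), (LAST_APPLIED,))

    spec = cleaned.get("spec", {})
    if cleaned.get("kind") == "Service":
        _drop(spec, SERVICE_FIELDS)
    else:
        _drop(spec, WORKLOAD_FIELDS)
        # restartedAt is stamped on the pod template by `kubectl rollout restart`.
        pod_meta = spec.get("template", {}).get("metadata", {})
        _drop(pod_meta.get("annotations", {}), (RESTARTED_AT,))
    return cleaned
```

The input is deep-copied so the original dump stays untouched for comparison against the live cluster.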

SECURITY: the live env-configmap was discovered to contain real plaintext
secrets — Slack tokens, GEN_AI client secret, Jira token, Opsgenie key.
The new structure moves all of those to k8s/overlays/*/secrets.env
(gitignored) which renders into a kustomize-generated Secret. api-server
and background deployments gain `envFrom: secretRef: danswer-secrets` so
the moved values continue to reach the app as env vars. Rotation of the
leaked credentials is a separate operator task — every "REPLACE_ME" in
secrets.env.example marked LEAKED is one of them.

Validation:
  kubectl kustomize k8s/overlays/prod   → 26 resources, clean render
  kubectl kustomize k8s/overlays/local  → 26 resources, clean render
  Image substitution verified in both.

.gitignore additions:
  darwin-kubernetes/temp/          Live cluster dumps
  k8s/overlays/*/secrets.env       Real secret values per environment
  k8s/overlays/*/*.secrets.env     Defensive (any *.secrets.env variant)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: opt-in Redis via kustomize component (local includes, prod doesn't)

In-cluster Redis is now an opt-in kustomize Component at
k8s/optional/redis/, included by the local overlay via
`components: [../../optional/redis]` and NOT by prod (which uses
managed Redis instead).

Why Component instead of `resources: ../../optional/redis.yaml`:
kustomize's load restrictor rejects file-resource refs that escape
the overlay's directory tree. Components are explicitly designed for
opt-in cross-tree refs and pass the security check; they also let us
add patches later that only apply when the component is opted in.

Layout change:
  before:  k8s/optional/redis.yaml
  after:   k8s/optional/redis/
             kustomization.yaml    (kind: Component)
             redis.yaml

The plain `kubectl apply -f k8s/optional/redis/redis.yaml` or
`kubectl apply -k k8s/optional/redis` workflows still work — the file
just moved one level deeper.

env.properties updates:
  local:  REDIS_HOST=redis  (the in-cluster Service name, matching the
                              component's deployment)
  prod:   REDIS_HOST=<your-managed-redis>.redis.cache.windows.net
          (placeholder for Azure Cache for Redis; rename + drop the
           access key into secrets.env as `redis_password` when you
           adopt managed Redis)

Validated:
  kubectl kustomize k8s/overlays/prod   → 26 resources (no Redis)
  kubectl kustomize k8s/overlays/local  → 28 resources (+Service +StatefulSet)

README updated with the components pattern and how to add more opt-in
features the same way (split-background, Dask, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: prod now uses in-cluster Redis (was: managed Redis); bump memory to 512MB

Reversal of the earlier "prod will use managed Redis" decision. Prod
overlay now opts into the same in-cluster Redis component as local:

  k8s/overlays/prod/kustomization.yaml — adds:
    components:
      - ../../optional/redis
  k8s/overlays/prod/env.properties     — REDIS_HOST back to `redis`
                                         (the in-cluster Service name)

Redis StatefulSet maxmemory bumped from 256MB to 512MB:
  --maxmemory               256mb  →  512mb
  resources.requests.memory 128Mi  →  256Mi
  resources.limits.memory   384Mi  →  1Gi

Limit set to ~2x maxmemory rather than 1.5x because the single-replica
StatefulSet has no failover — OOM = cache outage. Redis uses extra RSS
beyond --maxmemory for client output buffers, COW pages during BGSAVE
(if we ever turn on persistence), and fragmentation; safer to over-
provision the cgroup limit and let `maxmemory-policy: allkeys-lru` do
its job inside Redis's own accounting.
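
The 1.5x and 2x figures can be checked with a small calculation. This is a worked-arithmetic sketch with hypothetical helper names (`to_mib`, `headroom_ratio`); it relies on Redis treating `mb` as 1024*1024 bytes and Kubernetes treating `Mi` as 2^20 bytes, so the two units are directly comparable.

```python
# MiB per unit; Redis "mb"/"gb" and Kubernetes "Mi"/"Gi" are both binary units.
UNITS_MIB = {"mi": 1, "gi": 1024, "mb": 1, "gb": 1024}


def to_mib(quantity):
    """Convert a quantity such as '384Mi', '1Gi' or '512mb' to MiB."""
    text = quantity.strip().lower()
    for suffix, factor in UNITS_MIB.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


def headroom_ratio(limit, maxmemory):
    """Ratio of the container memory limit to Redis --maxmemory."""
    return to_mib(limit) / to_mib(maxmemory)


old_ratio = headroom_ratio("384Mi", "256mb")  # previous settings
new_ratio = headroom_ratio("1Gi", "512mb")    # settings in this commit
```

Here `old_ratio` evaluates to 1.5 and `new_ratio` to 2.0, matching the headroom described above.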

Validated:
  kubectl kustomize k8s/overlays/prod   → 28 resources (now includes Redis)
  kubectl kustomize k8s/overlays/local  → 28 resources (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: move redis from optional component to base

Both prod and local overlays opted into the in-cluster Redis component,
so it's no longer optional — promoted to base/redis.yaml and added to
base/kustomization.yaml. Removed the now-redundant `components:` blocks
from both overlays and the optional/redis/ component dir.

Net effect is identical (prod + local still render 28 resources each,
both including Redis) — just less indirection now that Redis is
universal rather than opt-in.

README updated: optional/ table drops the redis row with a note that it
moved to base; the components: "flag" explanation now points at the
split-background deployments as the example opt-in.

Validated:
  kubectl kustomize k8s/overlays/prod   → 28 resources (redis in base)
  kubectl kustomize k8s/overlays/local  → 28 resources (redis in base)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(k8s/README): drop migration narrative + darwin-kubernetes references

The README is the doc for the k8s/ layout as it stands, not a record of
how it came to be. Removed:
  - "Replaces the older darwin-kubernetes/ tree" subtitle
  - the whole "Migration plan (deleting darwin-kubernetes/)" section
  - the "darwin-kubernetes/ is being retired" + temp/ convention bullets

Also fixed two bits left stale by moving Redis into base:
  - structure diagram listed Redis under optional/ → now correctly
    omits it (it's in base)
  - "Roll out Redis" workflow told you to `kubectl apply -f
    k8s/optional/redis.yaml` → rewritten as "Redis ships in base; flip
    the env flags to enable the cache/rate-limiter features"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(k8s/README): add instructions for applying optional/ manifests

optional/ holds plain manifests (no kustomization), so they need
`kubectl apply -f` and aren't picked up by `apply -k overlays/...`.
Added a workflow covering:
  - single-file and whole-folder apply
  - the dependency on the overlay being applied first (optional pods
    reference the overlay-generated env-configmap / danswer-secrets)
  - the full split-background + Dask cutover in dependency order
    (scheduler/workers → split bg pods → scale down combined), plus
    rollback and the dual-beat warning

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: make optional manifests parameterized like base (component + logical images)

The optional/ manifests hardcoded the image tag
(sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-138) while base
uses the logical name `danswer-backend` that the overlay's images: block
rewrites. That inconsistency meant a tag bump had to be made in two
places and the optional pods could drift from the rest of the cluster.

Fix: grouped the five split-background + Dask manifests into a single
kustomize Component at k8s/optional/background-scaling/, and changed
their image refs to the logical `danswer-backend`. When an overlay opts
in via `components: [../../optional/background-scaling]`, the overlay's
existing `images:` entry for danswer-backend parameterizes them — same
tag as api-server / background, set in one place.

Verified: temporarily opting the component into the prod overlay renders
all five bg-scaling pods with sfbrdevhelmweacr.azurecr.io/danswer/
danswer-backend:vha-138 (34 resources total), then reverted. Neither
overlay opts in by default (prod/local still 28 resources each).

Layout change:
  before:  k8s/optional/{background-beat,background-celery,
           background-indexer-scheduler,dask-scheduler,dask-worker}.yaml
           (plain manifests, hardcoded image, applied via kubectl apply -f)
  after:   k8s/optional/background-scaling/
             kustomization.yaml   (kind: Component)
             <same five manifests, logical image name>
           (opted into via the overlay's components: block)

README updated: optional/ is now described as opt-in components with
logical-image parameterization; the apply workflow switched from
`kubectl apply -f` to the components: + replicas:0 overlay edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: fully parameterize background-scaling component (replicas, env, env-neutral)

Follow-up to the component conversion — three remaining inconsistencies
vs base that were pointed out:

1. Replicas were hardcoded per-manifest. Removed them from the manifests
   and moved the counts into the component's kustomization.yaml
   `replicas:` block (one place; mirrors how the overlay parameterizes
   base replicas). dask-worker=3 is the indexing-throughput knob.

2. Secret/config loading differed: the component had an extra explicit
   REDIS_PASSWORD secretKeyRef that base doesn't. Dropped it so every
   pod's env block is byte-identical to base/background.yaml — explicit
   POSTGRES_USER/POSTGRES_PASSWORD via secretKeyRef + envFrom
   [configMapRef env-configmap, secretRef danswer-secrets]. (redis_password
   still reaches the app via the envFrom secretRef like every other
   secrets.env key; the explicit entry was redundant and base-divergent.)

3. Manifests carried Darwin-specific node affinity + darwin/indexing
   tolerations, which base does NOT (base is env-neutral; the live
   cluster runs without pool affinity). Stripped them so the component
   is environment-neutral and won't fail to schedule on a local cluster
   that lacks the indexcpu pool. The prod overlay re-adds indexcpu
   affinity + toleration via a patch when it opts in — documented in the
   README opt-in steps with a ready-to-paste patch block.

Verified end-to-end: opting the component into prod renders 34 resources,
all five bg-scaling pods get sfbrdevhelmweacr.azurecr.io/danswer/
danswer-backend:vha-138, replicas come from the component kustomization
(beat=1, celery=2, worker=3), background-deployment scaled to 0. Default
(not opted in): prod/local both 28 resources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(k8s/README): explicit apply/preview/verify commands for background-scaling

The apply command was present but buried under the overlay-edit YAML
block and read as the generic overlay apply. Made the deploy commands
explicit and labeled: preview the rendered bg/dask pods, kubectl diff
vs live, apply, and rollout-status watches. Also stated plainly why
there's no standalone `kubectl apply -f` for the component (logical
image name only resolves through the overlay's images: block).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: right-size background-scaling for a few-hundred-user deployment

dask-worker 3 → 2, background-celery 2 → 1. The originals were inherited
defaults from the cherry-picked feature/backgroundscaling commit, not
sized to Darwin's load.

- dask-worker=2: each pod runs one connector at a time (--nworkers=1
  --nthreads=1), so this caps concurrent indexing at 2. Enough unless
  many connectors backlog in NOT_STARTED; raise then. Halves the
  worst-case indexcpu footprint (was 3×4Gi, now 2×4Gi).
- background-celery=1: Celery here only runs maintenance tasks (prune,
  sync, deletion, analytics rollup) — NOT indexing. One pod already
  autoscales 3-10 threads (--autoscale=3,10), which easily covers the
  bursty maintenance queue at this scale. The 2nd replica was redundancy
  we don't need.

Added inline comments noting which counts are singletons that must stay
at 1 (beat, indexer-scheduler, dask-scheduler) vs the throughput knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add slack-listener to background-scaling component

The combined `background` supervisord pod ran 5 programs; the split
component covered 4 (indexer-scheduler, celery, beat — via dask too) but
NOT slack_bot_listener. Migrating to the split topology + scaling the
combined pod to 0 would have killed the Slack bot, which Darwin uses.

Adds slack-listener-deployment running `python
danswer/danswerbot/slack/listener.py`, modeled on the celery manifest:
logical danswer-backend image, env-neutral, env-configmap + danswer-secrets
(the DANSWER_BOT_SLACK_* tokens arrive via the envFrom secretRef).

SINGLETON (count: 1 in the component kustomization) — the listener holds a
Slack Socket Mode websocket; a second replica would double-process every
event. Added to the prod affinity patch's labelSelector in the README so
it lands on the indexcpu pool with the other app pods.

Validated: opting the component into prod now renders 35 resources
(was 34), slack-listener gets the prod image tag + replicas=1; default
(not opted in) unchanged at 28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: consolidate background-scaling 6 deployments → 4

The split topology had three separate deployments for low-traffic
singletons (background-celery, background-beat, slack-listener) that
don't scale with indexing load — only dask-worker does. Collapsed them
into one `background-lite` deployment running the three as separate
containers in a single pod. This trims pod count + per-deployment
resource reservations while keeping each container independently
restartable.

Now four deployments (was six):
  - background-lite             celery-worker + celery-beat + slack-listener
                                (3 containers, 1 pod, replicas:1 — contains
                                 beat + the Slack websocket, both singletons)
  - background-indexer-scheduler  update.py polling loop (singleton)
  - dask-scheduler              Dask scheduler Service + Deployment
  - dask-worker                 indexing executors (the actual scaling knob)

Chose a multi-container pod over a supervisord-with-custom-conf approach:
no ConfigMap to mount, no risk of the custom conf drifting from the
image's baked-in supervisord.conf, and each container runs the exact
command its former standalone deployment used. strategy: Recreate on the
pod so celery-beat never overlaps during a rollout (dup beats double-fire).

Validated: opting into prod renders 33 resources (was 35), background-lite
shows containers [celery-worker, celery-beat, slack-listener] at
replicas 1, dask-worker at 2, base background scaled to 0. Default
(not opted in) unchanged at 28.

README updated: component table, affinity-patch labelSelector
(background-lite replaces the three), rollout-status command, and the
dual-beat warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: disable MULTILINGUAL_QUERY_EXPANSION in prod overlay

The query-expansion path (secondary_llm_flows/query_expansion.py) builds
its fast_llm with a HARDCODED 5s timeout. With no distinct fast model
(FAST_GEN_AI_MODEL_VERSION empty), it routes to full gpt-4o via the
UiPath gateway, which routinely takes >5s — causing repeated
ReadTimeouts on Slack queries (observed in prod logs).

The committed darwin-kubernetes/env-configmap.yaml already had this
empty; the LIVE cluster had drifted to "English,Japanese" (set
out-of-band, never committed), which is what triggered the timeouts.
Setting it empty in the new prod overlay keeps the go-forward source of
truth correct. App reads `os.environ.get(...) or None`, so empty =
feature off.
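
The "empty = feature off" behaviour follows from how `or None` treats an empty string. A minimal sketch, assuming the setting is read roughly like this (`read_multilingual_query_expansion` is a hypothetical wrapper, not the app's function):

```python
import os


def read_multilingual_query_expansion(environ=None):
    """Return the configured language list, or None when unset or empty."""
    environ = os.environ if environ is None else environ
    # "" is falsy, so an empty ConfigMap value collapses to None (feature off).
    return environ.get("MULTILINGUAL_QUERY_EXPANSION") or None
```

An unset variable and an empty one are therefore indistinguishable to the app, which is why committing an empty value in the overlay is enough to disable the feature.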

To re-enable later: wire FAST_GEN_AI_MODEL_VERSION to a genuinely fast
model (gpt-4o-mini / gpt-4.1-mini) so the 5s budget is realistic, or
make that timeout env-configurable first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: pin Vespa to 8.600.35 (was :latest → caused prod outage)

INCIDENT: applying the kustomize overlay rendered vespaengine/vespa:latest
against a cluster running 8.600.35. The bare→:latest image change rolled
all Vespa StatefulSets onto 8.696.20 — a >30-release jump. Vespa's config
server refuses an auto-upgrade that large (incompatible-upgrade guard,
VersionState.verifyVersionIntervalForUpgrade) and crash-looped on
ConfigServerBootstrap, taking the whole cluster down (config tier → no
quorum → cluster-wide connection-refused 503s on every search + the
api-server's ensure_indices_exist).

FIX: pin both overlays to 8.600.35 — the version the content nodes' on-disk
index is written in, so there is no upgrade and the version check passes.

Recovery performed on the live cluster: set all 5 vespa StatefulSets back
to 8.600.35, cleared the (now-irrelevant) wedged config-server ZK state,
restarted. Content data on the 100Gi content PVCs was never touched.

NEVER use :latest for Vespa. Upgrades must be STEPWISE (≤30 releases per
hop) and done as a deliberate, ordered operation — not a bare tag bump.
busybox also pinned (1.36.1) for the same drift hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add readiness probes to Service-backed Vespa nodes

Vespa nodes had NO probes, so a k8s Service added a pod to its endpoints
the instant the container started — before Vespa was actually serving.
That's why every Vespa restart re-opened a window of "upstream connect
error / connection refused" 503s (the incident).

Adds readinessProbe (httpGet /state/v1/health) to the three nodes that
sit behind a query/deploy Service:
  - vespa-configserver  (:19071) — gates the deploy + inter-node config path
  - vespa-query         (:8080)  — gates the search path the app hits
  - vespa-feed          (:8080)  — gates the feed/index path

Deliberately NOT added to vespa-content / vespa-admin: they aren't behind
a query-serving k8s Service (content is reached internally via Vespa's own
distributor, admin runs cluster control), and a mis-tuned probe there
could pull a healthy node from rotation and make things worse.

Readiness ONLY — no liveness probe anywhere. An aggressive liveness check
could kill a slow-but-healthy Vespa node mid-bootstrap and cause a restart
loop, which is the failure mode we just spent the incident fighting.
Generous initialDelay (45s configserver / 30s others) + failureThreshold 6
so normal slow startup doesn't flap nodes out of rotation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(AGENTS.md): add Critical fact §10 — never :latest for Vespa, pin the version

Captures the prod-outage learning in the repo's shared operating notes
(the section CLAUDE.md routes every agent to read first). Covers: why a
big Vespa version jump takes the cluster down (config-server refuses
>30-release auto-upgrade), the rule (pin to the running version, 8.600.35;
upgrades stepwise; don't force SKIP_UPGRADE_CHECK on prod), and the
recovery runbook (re-pin image, restart config servers, redeploy schema;
ZK clear only if genuinely corrupt; readiness-only probes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add guarded-apply.sh — block unsafe Vespa version jumps at apply time

Turns the "never jump Vespa >30 minor releases" rule into an enforced
guardrail instead of a thing to remember. guarded-apply.sh wraps
kubectl apply -k and, before applying:
  - reads the LIVE running Vespa version (kubectl) + the version the
    overlay would deploy (kubectl kustomize)
  - REFUSES a >30-minor upgrade (Vespa's auto-upgrade limit — the thing
    that caused the outage)
  - REFUSES a major-version change (needs dedicated migration)
  - REFUSES a floating/unparseable tag (:latest)
  - WARNS + requires FORCE=1 on a large downgrade (legit only when
    recovering to the on-disk version — which is why downgrade isn't a
    hard block; our recovery was a 96-minor downgrade)
  - otherwise runs kubectl diff, then apply

Checks against LIVE, not the repo's previous pin — config drifts out of
git (we saw it), so the running version is the only truth that matters
at apply time. FORCE=1 overrides with an explicit "I accept the risk".
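
The decision rules above can be sketched in Python. This is not guarded-apply.sh itself (that is a bash wrapper around kubectl); `check_vespa_change` and its return strings are hypothetical, the threshold for a "large" downgrade is assumed to be the same 30 minors, and letting `force` override every verdict is one reading of the FORCE=1 description.

```python
import re

MAX_MINOR_JUMP = 30  # Vespa's auto-upgrade limit cited above
TAG_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag):
    """Return (major, minor, patch) or None for floating/unparseable tags."""
    match = TAG_RE.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def check_vespa_change(live_tag, target_tag, force=False):
    """Compare the live version with the one the overlay would deploy."""
    live, target = parse_version(live_tag), parse_version(target_tag)
    if live is None or target is None:
        verdict = "refuse: floating or unparseable tag"
    elif target[0] != live[0]:
        verdict = "refuse: major-version change"
    elif target[1] - live[1] > MAX_MINOR_JUMP:
        verdict = "refuse: upgrade exceeds 30 minor releases"
    elif live[1] - target[1] > MAX_MINOR_JUMP:
        verdict = "warn: large downgrade, requires FORCE=1"
    else:
        return "ok"
    return "forced" if force else verdict
```

With the versions from the incident, `check_vespa_change("8.600.35", "8.696.20")` is refused (a 96-minor jump), while the recovery direction `check_vespa_change("8.696.20", "8.600.35", force=True)` goes through.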

Wired into k8s/README.md (Quick start now uses guarded-apply.sh; new
"Bump the Vespa version" + "Vespa version guard" sections) and
referenced from AGENTS.md §10.

Verified: parses live=8.600.35 vs overlay=8.600.35 (gap 0 → OK); bash -n
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add KEDA indexing-autoscale component (opt-in) for dask-worker

Autoscales dask-worker-deployment on real indexing demand instead of a
fixed replica count, via a KEDA PostgreSQL scaler.

Metric (validated against the live DB): the number of index attempts that
can run CONCURRENTLY right now, respecting INDEXING_PER_SOURCE_CAP —
  SUM over source of LEAST(cap, pending_count_for_source)
not a raw pending count (10 same-source attempts still run 1 at a time
under cap=1, so we must not spin up 10 workers). Counting IN_PROGRESS in
the metric keeps replicas >= running jobs, so KEDA never scales a busy
worker away; scale-to-0 only when there's genuinely no work.
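
The metric can be illustrated outside SQL. The sketch below is a Python restatement of `SUM over source of LEAST(cap, pending_count_for_source)`, not the query the ScaledObject runs; `desired_workers` and the (source, status) tuple shape are hypothetical, and the maxReplica clamp is left to KEDA.

```python
from collections import Counter

# Statuses that count toward demand; IN_PROGRESS is included so a busy
# worker is never scaled away.
ACTIVE_STATUSES = {"NOT_STARTED", "IN_PROGRESS"}


def desired_workers(attempts, per_source_cap=1):
    """attempts: iterable of (source, status) pairs, one per index attempt."""
    active = Counter(source for source, status in attempts
                     if status in ACTIVE_STATUSES)
    # Each source contributes at most per_source_cap concurrent attempts.
    return sum(min(per_source_cap, count) for count in active.values())
```

Ten queued attempts for one source plus one running attempt for another yield 2 under the default cap of 1, rather than 11.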

Grounded in code + live DB, not guesses:
  - remote-Dask mode does NOT cap dispatch by NUM_INDEXING_WORKERS, so
    adding worker pods truly adds parallelism (the in-process LocalCluster
    path is the only one bounded by that env)
  - PER_SOURCE_CAP (default 1) is the real concurrency ceiling
  - index_attempt links directly to connector.connector_id (this fork),
    not connector_credential_pair_id
  - status is stored UPPERCASE (Enum native_enum=False) — confirmed live:
    NOT_STARTED / IN_PROGRESS / SUCCESS / FAILED

Shipped as opt-in component k8s/optional/keda-indexing-autoscale/
(ScaledObject + TriggerAuthentication; password from danswer-secrets,
no duplication). minReplica 0 (scale to zero when idle), maxReplica 4,
cooldown 300s. README documents prerequisites (KEDA operator install;
opt in after background-scaling; REMOVE the static dask-worker replicas
entry so it doesn't fight the HPA), the scale-down safety reasoning, and
a recommended dask-worker graceful-shutdown companion change.

Validated: component renders standalone (2 resources) and opted into prod
(namespaced to darwin). Not enabled by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add KEDA operator install manifest (optional/keda, own namespace)

Companion to the keda-indexing-autoscale component: the KEDA operator
itself. No Helm — a standalone kustomization referencing KEDA's official
release bundle PINNED to v2.14.0 (GitHub release assets are immutable, so
the URL is a content pin; never a moving ref — same lesson as Vespa
:latest, AGENTS.md §10).

It's cluster-scoped infra (CRDs + RBAC + operator), installed ONCE per
cluster independent of the danswer overlays — hence kind: Kustomization
(apply on its own), NOT a Component layered into prod/local. KEDA's bundle
creates and installs into its own `keda` namespace internally, so no
`namespace:` override here (that would wrongly re-namespace the
cluster-scoped CRDs).

Install:  kubectl apply --server-side -k k8s/optional/keda
  (--server-side required — KEDA CRDs exceed the client-side
   last-applied-configuration annotation limit)

README updated: optional/ table now distinguishes Components (opt into an
overlay) from Standalone installs (apply on their own), and the KEDA
autoscale prereq points at `kubectl apply -k k8s/optional/keda`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(k8s/README): document mandatory pod restarts after a config change

Fixes a misleading instruction and adds the missing operational step the
user hit: after `kubectl apply -k`, ConfigMap changes do NOT auto-roll
pods, because disableNameSuffixHash:true keeps the name stable (the
hash-suffix is exactly what would trigger a rollout). envFrom reads env
only at pod start, so running pods keep stale values until restarted.

Changes:
- "Add a new env var": corrected step that implied auto-pickup; now states
  you must restart consumers.
- "Enable the Redis cache + per-user rate limiter": adds the explicit
  `kubectl rollout restart deploy/api-server-deployment
  deploy/background-deployment` after apply, clarifies the rate limit is
  PER-USER, and includes PERSONA_CACHE_ENABLED.
- New "Which workloads to restart after a config change" table mapping
  changed vars → workloads (Redis flags → api-server + background; model
  vars → model servers; etc.), plus the split-background variant
  (background-lite / indexer-scheduler / dask-worker, no background-deployment).
- disableNameSuffixHash footgun now spells out the manual-restart consequence.
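
The restart table can be read as a lookup from changed keys to the workloads that consume them. A minimal sketch, assuming only the Redis-flag row described above; the dictionary and both helper names are hypothetical, and any other key should be looked up in the README table rather than guessed:

```python
# Hypothetical lookup: changed ConfigMap key -> Deployments that read it via envFrom.
# Only the Redis-flag row from the commit message is encoded here.
REDIS_CONSUMERS = ["api-server-deployment", "background-deployment"]
RESTART_MAP = {
    "REDIS_KV_CACHE_ENABLED": REDIS_CONSUMERS,
    "REQUEST_RATE_LIMIT_ENABLED": REDIS_CONSUMERS,
    "PERSONA_CACHE_ENABLED": REDIS_CONSUMERS,
}


def workloads_to_restart(changed_keys):
    """Return the de-duplicated, sorted Deployments affected by the changed keys."""
    unknown = [key for key in changed_keys if key not in RESTART_MAP]
    if unknown:
        # Refuse to guess: the README table is the source of truth.
        raise ValueError(f"no restart rule for {unknown}; check the README table")
    return sorted({name for key in changed_keys for name in RESTART_MAP[key]})


def restart_command(workloads):
    """Build the kubectl command that rolls the given Deployments."""
    targets = " ".join(f"deploy/{name}" for name in workloads)
    return f"kubectl rollout restart {targets}"


if __name__ == "__main__":
    changed = ["REDIS_KV_CACHE_ENABLED", "PERSONA_CACHE_ENABLED"]
    print(restart_command(workloads_to_restart(changed)))
```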

Also commits the prod env.properties with the Redis features enabled
(REDIS_KV_CACHE_ENABLED / REQUEST_RATE_LIMIT_ENABLED 20-per-min,300-per-hr /
PERSONA_CACHE_ENABLED) — the user turned these on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add startup + readiness probes to api-server (own /health, not deps)

api-server had no probes. Adds:
- startupProbe on /health:8080 — the container runs `alembic upgrade
  heads` before uvicorn, so /health isn't up until migrations finish;
  this allows ~5min (30×10s) for migrations+boot before readiness/liveness
  count. Transitively gates on Postgres (no migrations → never Ready).
- readinessProbe on /health:8080 — gates the api-server Service so it
  doesn't route to a still-booting pod.

Deliberately:
- checks the app's OWN /health, NOT Vespa/Redis. Those are partial/optional
  deps (Vespa retried-then-proceeds, Redis fail-open); coupling API
  availability to them would turn a partial outage into a total one — the
  Vespa incident is the proof (api-server stayed up serving auth/settings
  while search was down).
- NO liveness probe — an aggressive liveness on a slow-migrating api-server
  could kill it mid-migration (same lesson as the Vespa probes).

/health is in the auth_check public-endpoint allowlist, so the probe isn't
401'd. Postgres remains gated by the alembic step in the start command;
Redis/Vespa intentionally NOT gated.
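
The startup budget is simple arithmetic over the probe settings. A small sketch, assuming the 30×10s figure maps onto the standard `failureThreshold` and `periodSeconds` probe fields (the path and port come from the message above; the dict itself is illustrative, not the committed manifest):

```python
# Probe settings mirrored as plain data. Path, port and the 30 x 10s budget are
# from the commit message; the mapping onto these two fields is an assumption.
startup_probe = {
    "httpGet": {"path": "/health", "port": 8080},
    "periodSeconds": 10,
    "failureThreshold": 30,
}


def startup_budget_seconds(probe):
    """Longest time the container may take to answer before it is restarted."""
    return probe["failureThreshold"] * probe["periodSeconds"]


def fits_budget(probe, migration_seconds, boot_seconds):
    """True if migrations plus app boot finish inside the startup window."""
    return migration_seconds + boot_seconds <= startup_budget_seconds(probe)


if __name__ == "__main__":
    print(startup_budget_seconds(startup_probe))
```

If migrations ever approach that window, raising the threshold is safer than adding a liveness probe, for the reason given above.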

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(k8s/README): add "Verify Redis caching is working" runbook

After enabling the Redis flags, document how to confirm the cache is
actually populated + hit (vs silently failing open to Postgres):
  - key-namespace table (kv / personas / groups / ratelimit)
  - --scan for presence (the "is it on" check)
  - INFO stats keyspace hit/miss ratio
  - TTL/STRLEN on a specific entry
  - MONITOR to watch a live request hit the cache
  - DEL + reload to prove the read-through refill
  - rename-an-assistant to prove write-through invalidation (TTL -> -2)
Plus the gotcha: the cache is silent on success (only logs on Redis
error), so api-server logs won't show hits — Redis-side inspection is
the only way to observe it; a "Redis GET/SET failed" warning means it's
failing open to Postgres.
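
For the hit/miss check, the ratio can be computed from the `keyspace_hits` and `keyspace_misses` fields of `INFO stats`. A minimal sketch that parses the text output; the function name is hypothetical, and note that these counters cover the whole Redis instance, not one key namespace:

```python
def keyspace_hit_ratio(info_stats):
    """Parse `redis-cli INFO stats` text; return hits / (hits + misses), or None if no lookups yet."""
    fields = {}
    for line in info_stats.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    sample = "# Stats\r\nkeyspace_hits:75\r\nkeyspace_misses:25\r\n"
    print(keyspace_hit_ratio(sample))
```

A ratio that stays near zero while the chat page is being loaded suggests reads are falling through to Postgres.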

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf: fix N+1 in basic indexing-status (chat page load + folder create)

get_basic_connector_indexing_status reads cc_pair.connector.source for
every cc-pair, but get_connector_credential_pairs didn't eager-load the
connector relationship → one lazy query per cc-pair. At ~404 cc-pairs
against a remote Azure Postgres that's 404 sequential round-trips, which
dominated chat page load — and re-ran on every folder create (the chat
page's router.refresh() re-fetches the whole fetchChatData bundle, and
this is the slowest endpoint in it).

Fix: add eager_load_connector to get_connector_credential_pairs (opt-in,
joinedload on ConnectorCredentialPair.connector) and use it in the basic
indexing-status endpoint. 405 queries -> 2. No API/contract change, no
frontend change; speeds every chat page load, not just folder create.

Verified the doc-count GROUP BY itself was already fast (7ms over 689k
rows on the live DB) — the cost was the N+1, not the aggregation.
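
The query-count difference can be reproduced in miniature. The sketch below uses the standard-library `sqlite3` module with a hand-written JOIN instead of SQLAlchemy's `joinedload`, and a toy two-table schema that is an assumption, not the real models; it only illustrates the N+1 shape versus a single eager query:

```python
import sqlite3


def build_db(n_pairs):
    """Create an in-memory toy schema with one connector per cc-pair."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE connector (id INTEGER PRIMARY KEY, source TEXT);
        CREATE TABLE cc_pair (id INTEGER PRIMARY KEY, connector_id INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO connector VALUES (?, ?)",
        [(i, f"source-{i % 3}") for i in range(n_pairs)],
    )
    conn.executemany("INSERT INTO cc_pair VALUES (?, ?)", [(i, i) for i in range(n_pairs)])
    return conn


def sources_lazy(conn):
    """One query for the pairs, then one more per pair: the N+1 shape."""
    queries = 1
    pairs = conn.execute("SELECT id, connector_id FROM cc_pair ORDER BY id").fetchall()
    sources = []
    for _, connector_id in pairs:
        row = conn.execute("SELECT source FROM connector WHERE id = ?", (connector_id,)).fetchone()
        queries += 1
        sources.append(row[0])
    return sources, queries


def sources_joined(conn):
    """Fetch every pair's source in a single joined query."""
    rows = conn.execute(
        "SELECT c.source FROM cc_pair p JOIN connector c ON c.id = p.connector_id ORDER BY p.id"
    ).fetchall()
    return [row[0] for row in rows], 1


if __name__ == "__main__":
    db = build_db(404)
    print(sources_lazy(db)[1], sources_joined(db)[1])
```

Against a remote database each of those extra queries is a network round-trip, which is why the count matters more than the per-query cost.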

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(chat): optimistic folder create — no full refetch

Creating a folder called router.refresh(), which re-runs the entire
fetchChatData server bundle (chat sessions + doc sets + assistants +
tags + llm providers + indexing-status + folders, uncached via noStore)
just to show one new empty folder in the sidebar. The create POST itself
is a single fast INSERT.

Now: mirror the server `folders` prop into local state (re-synced when
the prop changes) and, on create, append the returned folder to that
state instead of refreshing. The folder appears as soon as the INSERT
returns — no fan-out, no SSR re-render.

Paired with the indexing-status N+1 fix, this removes both the trigger
(the refetch) and the worst cost within it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(chat): Redis-cache the connector indexing-status read

The chat page calls GET /manage/indexing-status on every load to derive
available source types. Its cost is a per-cc-pair document-count
aggregation (~300ms on the live DB at a few hundred cc-pairs) — the
dominant fan-out cost after the two preceding fixes (the N+1 and the optimistic folder create). The result is identical for
all users and changes only when a connector is added/removed or an
indexing run completes, so front it with a short-TTL global Redis cache.

- Split the DB build into `_build_basic_cc_pair_info` and wrap the
  endpoint with an inline fail-open cache (global key
  `danswer:cc_pair_basic_info`, default 60s TTL).
- Pure TTL, no explicit invalidation: staleness is bounded by the TTL
  and harmless. Any Redis error falls straight through to a direct DB
  build — never an outage.
- Default OFF via CC_PAIR_INFO_CACHE_ENABLED; prod overlay enables it,
  local leaves it opt-in. Documented in k8s/README.md key table.
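
The fail-open pattern described above can be sketched as follows. The key and TTL are the ones named in this message; `FakeRedis`, `get_basic_info`, and the `build_from_db` callback are hypothetical stand-ins so the example runs without a Redis server (the fake mirrors the `get` / `setex` calls of a Redis client):

```python
import json
import time

CACHE_KEY = "danswer:cc_pair_basic_info"  # global key named in the commit message
TTL_SECONDS = 60  # default TTL named in the commit message


class FakeRedis:
    """In-memory stand-in for a Redis client (hypothetical, for illustration only)."""

    def __init__(self, broken=False):
        self.store = {}
        self.broken = broken

    def get(self, key):
        if self.broken:
            raise ConnectionError("redis unavailable")
        value, expires_at = self.store.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None

    def setex(self, key, ttl, value):
        if self.broken:
            raise ConnectionError("redis unavailable")
        self.store[key] = (value, time.monotonic() + ttl)


def get_basic_info(client, build_from_db, enabled=True):
    """Serve from cache when possible; on any cache error, build straight from the DB."""
    if not enabled:
        return build_from_db()
    try:
        cached = client.get(CACHE_KEY)
        if cached is not None:
            return json.loads(cached)
    except Exception:
        return build_from_db()  # fail open on read errors
    result = build_from_db()
    try:
        client.setex(CACHE_KEY, TTL_SECONDS, json.dumps(result))
    except Exception:
        pass  # fail open on write errors: the caller still gets a fresh result
    return result


if __name__ == "__main__":
    print(get_basic_info(FakeRedis(), lambda: [{"source": "web"}]))
```

Because there is no invalidation path, the only correctness knob is the TTL: a shorter value trades cache hits for fresher source lists.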

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* k8s(vespa): group into base/vespa/ + ordered, health-gated upgrade script

Vespa is a version-stateful subsystem with a lifecycle unlike the rest of
base (pinned version, the :latest outage history, multi-StatefulSet
ordered upgrades). Group it and give the upgrade ordering a real home —
which is a script, not the manifests, since kustomize is declarative and
cannot sequence a health-gated multi-StatefulSet rollout.

Structure:
- Move the 7 vespa-*.yaml into k8s/base/vespa/ with its own
  kustomization.yaml (referenced from base as `- vespa/`). Rendered
  output is unchanged.
- Split the single `vespa` logical image into per-role names
  (vespa-configserver/-admin/-content/-feed/-query); both overlays map
  all five to vespaengine/vespa:8.600.35. This lets the upgrade script
  move one role's version at a time.

Safety prereqs (these change the content/admin pod template, so the next
apply rolls those StatefulSets — safe, one pod at a time, data on
retained PVCs):
- Add readiness probes to content + admin on :19092 /state/v1/health
  (verified serving 200 live; node-agnostic, unlike the containers' 8080).
- Set publishNotReadyAddresses: true on vespa-internal so peer discovery
  is never gated by readiness (a slow/booting node must stay resolvable).

Upgrade tooling:
- k8s/scripts/vespa-upgrade.sh <target> [ns]: ordered (configserver →
  admin → content one-ordinal-at-a-time via partition stepping → feed →
  query), health-gated between each (kubectl exec → localhost, Istio-aware),
  single hop, refuses >30-minor/major/downgrade (FORCE to override),
  DRY_RUN/YES flags. bash-3.2 compatible. Dry-run verified against live.
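
The single-hop guard can be illustrated in a few lines. This is a sketch of the rule as stated above, not the script's actual logic: the comparison details (for example, how the patch component is treated) are assumptions, and `check_hop` is a hypothetical name:

```python
# Role order from the commit message; the guard below is an illustrative re-statement.
ROLE_ORDER = ["configserver", "admin", "content", "feed", "query"]
MAX_MINOR_JUMP = 30


def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def check_hop(current, target, force=False):
    """Return None if the hop is allowed, otherwise the reason it is refused."""
    cur, tgt = parse_version(current), parse_version(target)
    if force:
        return None  # mirrors the FORCE override
    if tgt[0] != cur[0]:
        return "major version change"
    if tgt < cur:
        return "downgrade"
    if tgt[1] - cur[1] > MAX_MINOR_JUMP:
        return "more than 30 minor versions in one hop"
    return None


if __name__ == "__main__":
    print(check_hop("8.600.35", "8.620.1"))
    print(check_hop("8.600.35", "8.640.0"))
```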

Docs: README "Upgrade Vespa" rewritten around the script; base/ section
describes the folder; guarded-apply clarified as the everyday-apply net,
not the upgrade tool; AGENTS.md §10 updated with the script + per-role
structure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(chat): createFolder returns the new folder id (was undefined)

POST /folder returns the new id as a bare integer, but createFolder read
`data.folder_id` — always undefined. This was harmless while the create
handler just called router.refresh() and ignored the return, but the
optimistic-folder insertion (a84600b3) uses the id, so new folders
rendered with folder_id=undefined and rename/delete PATCHed
/api/folder/undefined → no-op. Parse the bare integer instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(chat): tidy chat-row drag (compact ghost + no browser split-view)

Two native-DnD annoyances when dragging a chat row to a folder:
- The row is a <Link> (<a href>), so the browser auto-attaches the URL to
  the drag and some browsers (Arc/Edge/Safari) offer "open in split view"
  when dragging toward the edge. Clear the auto-added link payload and set
  effectAllowed=move so only the folder DnD remains.
- The default drag image is a translucent clone of the full-width row that
  trails awkwardly across the sidebar. Replace it via setDragImage with a
  compact chip (chat name, ellipsized) built off-screen and removed on the
  next tick.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(chat): pending spinner on "Manage Assistants" navigation

Navigating to /assistants/mine awaits the heavy fetchChatData bundle, with
no feedback — it felt frozen. (A route-level loading.tsx was wrong here:
the app renders the sidebar inside each page, not a shared layout, so the
fallback blanked the whole shell and read as a full reload.)

Instead drive the navigation with useTransition + router.push: the current
page (and sidebar) stays mounted and visible until the new page's server
fetch completes, and isPending swaps the button's brain icon for a
spinner + "Loading…" so the click clearly registers. Feels like an in-app
transition, not a reload.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(db): Celery broker on Redis + env-driven Postgres pool sizing

Reduce and contain Postgres connection pressure (the real ceiling for chat
at scale — sessions are held through the LLM stream):

- Celery broker + result backend optionally on Redis via
  CELERY_BROKER_REDIS_ENABLED (default off; prod on). Uses a separate
  logical DB (CELERY_REDIS_DB_NUMBER, default 1) so Celery keys never
  collide with the cache/rate-limit DB 0. Removes Celery's queue
  polling/writes from Postgres. Task status is unaffected — this fork
  tracks it in its own task_queue_jobs table, not the Celery backend.
  Indexing stays on Dask. Falls back to the Postgres broker when off, so
  local dev without Redis still boots. Note: the broker (unlike the
  fail-open cache) is a hard dependency when enabled.
- Postgres pool size/overflow are now env-driven (POSTGRES_POOL_SIZE /
  POSTGRES_POOL_OVERFLOW, defaults preserve the previous 40+10) so each
  deployment can size its pool to its replica count and stay under Azure
  Postgres max_connections. Applied to both the sync and async engines.
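
A sketch of how these env reads and the sizing check might look. The four variable names and the 40+10 and DB-1 defaults come from this message; `REDIS_HOST` / `REDIS_PORT`, the `"true"` string convention, and all function names are assumptions. The connection bound also assumes the sync and async engines each keep their own pool:

```python
def celery_broker_url(env):
    """Return a Redis broker URL when enabled, else None (caller keeps the Postgres broker)."""
    if env.get("CELERY_BROKER_REDIS_ENABLED", "").lower() != "true":
        return None
    host = env.get("REDIS_HOST") or "localhost"  # assumed variable name
    port = env.get("REDIS_PORT") or "6379"  # assumed variable name
    db_number = env.get("CELERY_REDIS_DB_NUMBER") or "1"  # separate from cache DB 0
    return f"redis://{host}:{port}/{db_number}"


def pool_settings(env):
    """Read pool size/overflow; empty or missing values fall back to the previous 40+10."""
    size = int(env.get("POSTGRES_POOL_SIZE") or 40)
    overflow = int(env.get("POSTGRES_POOL_OVERFLOW") or 10)
    return size, overflow


def worst_case_connections(replicas, size, overflow, engines_per_replica=2):
    """Upper bound on connections if every pool in every replica is fully used."""
    return replicas * engines_per_replica * (size + overflow)


if __name__ == "__main__":
    env = {"CELERY_BROKER_REDIS_ENABLED": "true", "POSTGRES_POOL_SIZE": "20"}
    print(celery_broker_url(env), pool_settings(env))
    print(worst_case_connections(3, *pool_settings(env)))
```

Comparing that bound against the server's `max_connections` gives a quick check on whether a replica bump also needs a smaller pool.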

Overlays: prod enables the Redis broker and sets explicit pool values
(documented to lower as api-server replicas grow); local leaves both
opt-in/empty. README gains a "Celery on Redis + pool sizing" section, a
verify command, and restart-matrix rows; AGENTS.md divergence table notes
the broker is now configurable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(config): black-format app_configs.py (line-length 130)

Collapse the multi-line env-read statements added during the Redis/cache
work to single lines, per the repo's pinned black==23.3.0 + pyproject
line-length=130. Cosmetic only — no values or logic change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: black-format Redis persona cache + rate-limit middleware

Collapse wrapped signatures/calls to single lines per the repo's
black==23.3.0 + pyproject line-length=130. Cosmetic only — no logic
change. Completes black compliance for the Redis-feature files on this
branch (the remaining black-flagged files are unrelated connector/llm
code, left for a separate cleanup).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(chat): per-user Redis cache for the document-set list

/document-set is on the chat-page bundle (fired every load); the read is
a multi-join plus, in EE, a per-user permission query. Cache it.

Per-user (not global+filter like the persona cache) on purpose: the
doc-set permission filter is edition-dependent — EE filters by
is_public/users/groups, MIT base returns all — so memoizing the exact
versioned result per user avoids replicating that branchy logic, where a
parity bug would leak doc-set visibility. Trade-off: a cold burst of N
distinct users still costs N first-loads, but a user's repeat loads
(new-chat / nav / router.refresh) collapse to one DB build per TTL.
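
The trade-off can be seen with a toy per-user TTL cache. This sketch uses an in-process dict rather than Redis, and the key format, class name, and `build` callback are all hypothetical; it only shows that N distinct users cost N builds while repeat loads within the TTL cost none:

```python
import time


class PerUserTTLCache:
    """Memoize one result per user for a fixed TTL (in-process stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}
        self.builds = 0  # how many times the expensive build actually ran

    def get(self, user_id, build):
        key = f"doc_sets:{user_id}"  # hypothetical key format
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        value = build(user_id)
        self.builds += 1
        self._entries[key] = (value, now + self._ttl)
        return value


if __name__ == "__main__":
    cache = PerUserTTLCache(ttl_seconds=60)
    for uid in ("a", "b", "a"):
        cache.get(uid, lambda user: [f"set-for-{user}"])
    print(cache.builds)
```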

- New db/d…