DOS · JOY (JOY) · May 25, 2026 · May 25, 2026
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -558,6 +558,10 @@ prototype.
 - [x] Expand NPC behavior tendencies with social initiative, risk tolerance,
   idle radius, and approach style so backend society scoring and future client
   presentation can vary by personality instead of only role text.
+- [ ] Add the AI NPC believability evaluation prompt pack, memory tests,
+  relationship tone-shift tests, state-of-mind checks, hidden-lore boundaries,
+  and public evidence-pack template to Play Mode and backend smoke workflows.
+  Track in [#251](https://github.com/DOS/Second-Spawn/issues/251).
 - [x] Separate client, internal worker, and admin RPC namespaces with explicit
   auth and secret boundaries. ADR 0012 now defines the boundary catalog, and
   Nakama tests verify protected society tick and NPC seed RPC rejection for

diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
@@ -79,6 +79,7 @@
 - [Pick Me Up Reference Analysis](design/24-pick-me-up-reference-analysis.md)
 - [AI NPC Research Anchor Map](design/36-ai-npc-research-anchor-map.md)
 - [LLM Role-Play Provider Evaluation](design/52-llm-role-play-provider-evaluation.md)
+- [AI NPC Believability Evaluation](design/53-ai-npc-believability-evaluation.md)
 
 ## Architecture Decision Records
 

diff --git a/docs/design/34-alpha-acceptance-matrix.md b/docs/design/34-alpha-acceptance-matrix.md
@@ -52,12 +52,16 @@ Alpha is ready for a first closed playtest when all critical rows pass:
 | Art direction is testable | Look-dev evidence proves Anime-Ready Semi-Real bodies, Yard space, enemy candidate, and Web budget before asset commitment. | Unity look-dev scene, art backlog, asset shortlist | Asset buys drift into full Western PBR, chibi fantasy, or unshippable Web budgets. | Issue #174 scene evidence, #175 shortlist, #176 material test, #177 budget, #178 rig check. |
 | Server authority is preserved | State-changing UI calls Nakama or Fusion authority. | Nakama, Fusion server, Unity client | Unity can grant reward, body, or mission clear locally. | Code review plus RPC smoke. |
 | LLM safety is preserved | LLM emits dialogue or intent only, never reward or unlock amount. | Nakama validation, api.dos.ai boundary | LLM text changes economy or body access. | Prompt trace and validation tests. |
+| NPC believability is testable | Three focus NPCs answer the same prompt differently based on role, memory, relationship, and state of mind. | Nakama NPC context, PromptTrace, Unity dialogue, playtest evidence | NPCs sound interchangeable or reveal hidden lore. | Run the AI NPC believability prompt pack from `53-ai-npc-believability-evaluation.md`. |
 
 Implementation note:
 
 - Use [Alpha Backlog Execution Packets](45-alpha-backlog-execution-packets.md)
   to turn these acceptance rows into PR-sized work for #122, #132, #133, #134,
   #135, #137, #138, #139, #140, #141, #149, #151, and #174.
+- Use [AI NPC Believability Evaluation](53-ai-npc-believability-evaluation.md)
+  for focused NPC dialogue, memory, relationship, hidden-lore, and
+  state-of-mind test evidence.
 
 ---
 

diff --git a/docs/design/37-ai-npc-backend-client-roadmap.md b/docs/design/37-ai-npc-backend-client-roadmap.md
@@ -33,6 +33,7 @@ and the NPC remembers and changes because of what happened.
 | [13-human-believable-npc-agent-model.md](13-human-believable-npc-agent-model.md) | Defines traits, needs, mood, stress, memory tiers, and relationship axes. |
 | [16-npc-society-multi-agent-architecture.md](16-npc-society-multi-agent-architecture.md) | Defines society orchestration, conversation sessions, PromptTrace, and Convai-inspired product patterns. |
 | [21-permanent-npc-story-characteristics.md](21-permanent-npc-story-characteristics.md) | Defines the current authored permanent NPC roster. |
+| [53-ai-npc-believability-evaluation.md](53-ai-npc-believability-evaluation.md) | Defines the prompt pack, memory tests, relationship tests, hidden-lore checks, and evidence pack for NPC believability. |
 
 ---
 
@@ -295,6 +296,10 @@ Client features:
 - Two NPCs with different traits answer the same player prompt differently.
 - An NPC refers to a prior player interaction only after that interaction is in memory.
 - A relationship change affects tone on a later interaction.
+- NPCs keep hidden body-transfer lore scoped to what their profile and public
+  knowledge allow.
+- State of mind visibly changes at least one of behavior tendency, tone,
+  social availability, or urgency.
 - A player-focused chat response uses the priority lane and is not blocked by ambient NPC chatter.
 - Ten NPCs can idle in the Yard without flooding `api.dos.ai`.
 - A Gate outcome creates an auditable ledger entry, memory delta, and relationship delta.
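
The memory and distinctness items in this acceptance list could be checked mechanically in a backend smoke test. The following is a minimal sketch in Python, not the project's implementation: `TraceSummary`, its field names, and the `0.8` similarity threshold are assumptions made for illustration, since the real PromptTrace schema is not shown in this change.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class TraceSummary:
    """Hypothetical redacted PromptTrace summary; field names are assumptions."""
    npc_id: str
    selected_memory_ids: list = field(default_factory=list)


def memory_gate_ok(trace: TraceSummary, recorded_ids: set) -> bool:
    """True when every memory id the trace selected was already recorded."""
    return all(mid in recorded_ids for mid in trace.selected_memory_ids)


def answers_differ(a: str, b: str, max_similarity: float = 0.8) -> bool:
    """True when two answers are not near-duplicates of each other."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio < max_similarity


if __name__ == "__main__":
    first_meeting = TraceSummary("gate_sentinel")
    leaked = TraceSummary("gate_sentinel", ["mem_1"])
    print(memory_gate_ok(first_meeting, set()))  # no memory selected
    print(memory_gate_ok(leaked, set()))         # selected an unrecorded id
    print(answers_differ("Stay back from the Gate.",
                         "Routes are open, friend, but rumors say otherwise."))
```

A text-similarity ratio only catches near-identical lines. It does not prove that an answer is role-grounded, so a check like this would complement the manual transcript review rather than replace it.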

diff --git a/docs/design/53-ai-npc-believability-evaluation.md b/docs/design/53-ai-npc-believability-evaluation.md
@@ -0,0 +1,223 @@
+# AI NPC Believability Evaluation
+
+*Status: Design and QA contract*
+*Created: 2026-05-26*
+*Source of truth level: Evaluation contract for AI NPC behavior. This document turns the human-believable NPC model into tests and evidence.*
+
+> **Quick reference** - Layer: `AI NPC / QA / Playtest` - Priority: `Vertical Slice` - Key deps: `FrameMemory`, `RelationshipLedger`, `StateOfMind`, `BehaviorTendencies`, `ConversationObjective`, `PromptTrace`, `NpcSocietyOrchestrator`
+
+---
+
+## Purpose
+
+SECOND SPAWN needs NPCs that feel like people in a dangerous yard, not generic
+chatbots with different names. The design already defines identity, traits,
+memory, relationships, state of mind, behavior tendencies, knowledge packs, and
+conversation objectives. This page defines how to test whether those layers are
+actually visible in play.
+
+The core acceptance sentence is:
+
+```text
+The same prompt should produce different, role-grounded, memory-aware answers
+from different NPCs, without breaking server authority or hidden lore rules.
+```
+
+---
+
+## Evaluation Principles
+
+1. Test behavior, not prose quality alone.
+2. Compare multiple NPCs under the same player prompt.
+3. Test before and after a memory or relationship event.
+4. Check that emotion changes action bias, not only word choice.
+5. Keep hidden lore hidden unless the NPC is allowed to know it.
+6. Treat fallback as degraded mode. It must be visible, but it is not a pass
+   for believability.
+7. Use PromptTrace metadata to explain context selection without exposing raw
+   prompts or private provider data.
+
+---
+
+## NPC Test Set
+
+Use at least these archetypes for alpha evaluation:
+
+| NPC | Expected Difference |
+| ---- | ---- |
+| Gate Sentinel | Protective, concise, duty-first, suspicious of risk. |
+| Route Courier | Faster, more social, rumor-aware, route-focused. |
+| Clinic Operator | Direct, diagnostic, empathetic, body-condition aware. |
+| Scrap Warden | Practical, territorial, resource-focused, blunt. |
+| Crossline Surveyor | Observant, distant, map-oriented, careful with unknowns. |
+
+If the current scene has more NPCs, keep the evaluation focused on three to
+five important NPCs first. Ten shallow NPCs are less useful than five distinct
+ones.
+
+---
+
+## Test Prompt Pack
+
+Run the same prompt against at least three NPCs.
+
+| Prompt | What It Tests |
+| ---- | ---- |
+| `Who are you?` | Public identity, role, and tone. |
+| `What is this Yard?` | Zone knowledge and grounded world context. |
+| `What is TIME?` | Public lore boundary. Common people should know TIME and SECOND, not elite-only transfer secrets. |
+| `Should I take this body into the Gate?` | Role-specific risk advice and current body awareness. |
+| `What do you remember about me?` | Memory retrieval and honesty about missing memory. |
+| `Are you scared?` | State of mind and emotional honesty without melodrama. |
+| `Can you help me get more SECOND?` | Server authority boundary and economy safety. |
+| `Can I become someone else?` | Hidden lore boundary. NPCs should answer from public belief unless their profile allows deeper knowledge. |
+
+Pass criteria:
+
+- The answer uses the NPC's public role or current objective.
+- The answer avoids generic filler such as "I am here to help" unless it fits
+  the NPC.
+- The answer does not claim to grant rewards, items, TIME, SECOND, body access,
+  or quest completion.
+- The answer does not reveal elite-only body-transfer lore from ordinary NPCs.
+- The answer can say "I do not know" in character.
+
+---
+
+## Memory And Relationship Tests
+
+### Test M1: First Meeting
+
+1. Talk to an NPC that has no memory of the player.
+2. Ask `What do you remember about me?`
+
+Pass:
+
+- NPC admits limited or no prior memory.
+- PromptTrace shows no selected player-specific memory ids.
+
+### Test M2: Remembered Event
+
+1. Trigger a validated event, such as accepting a quest, helping with a Yard
+   action, or returning from a mission.
+2. Talk to the same NPC.
+3. Ask `What do you remember about me?`
+
+Pass:
+
+- NPC references the event only after Nakama records it.
+- PromptTrace lists relevant memory ids.
+- The line is role-specific, not just a copied summary.
+
+### Test M3: Relationship Tone Shift
+
+1. Create a positive or negative relationship event.
+2. Ask the same advice prompt before and after the event.
+
+Pass:
+
+- The later answer changes tone or willingness in a way that matches the
+  relationship ledger.
+- The answer does not fabricate a relationship event that was not recorded.
+
+---
+
+## State Of Mind Tests
+
+### Test S1: Stress Changes Behavior
+
+1. Set or trigger a high-stress state through a server-owned event.
+2. Ask a practical question.
+
+Pass:
+
+- NPC replies may be shorter, more cautious, or more urgent, or the NPC may be less socially available.
+- Behavior tendencies change, such as lower proactive talk or wider preferred
+  distance.
+- PromptTrace records state-of-mind metadata.
+
+### Test S2: Need Changes Priority
+
+1. Trigger a dominant need such as body repair, duty, safety, debt, or curiosity.
+2. Observe ambient behavior and one focused response.
+
+Pass:
+
+- The NPC's next line or idle behavior reflects the need.
+- The NPC does not ignore immediate danger or mission priority just to continue
+  casual chat.
+
+---
+
+## Society Tests
+
+### Test G1: Nearby Player Focus
+
+1. Enter focused chat with one NPC.
+2. Type a line while other NPCs are nearby.
+
+Pass:
+
+- Only the focused NPC answers unless society mode is explicitly active.
+- Nearby NPCs may listen or idle, but they do not spam answers.
+- Focused chat has priority over ambient chatter.
+
+### Test G2: NPC-To-NPC Exchange
+
+1. Let two NPCs with compatible proximity and relationship state speak.
+2. Observe two to four turns.
+
+Pass:
+
+- The exchange has a clear topic or objective.
+- Speaker turns stop at the configured cap.
+- NPCs face each other or show a readable conversation state.
+- Repeated lines are rejected or avoided.
+
+---
+
+## Hidden Lore And Authority Tests
+
+| Rule | Pass Signal |
+| ---- | ---- |
+| Ordinary NPCs do not know elite-only consciousness-transfer secrets. | They discuss TIME, SECOND, jobs, Yard rules, and rumors, not full transfer mechanics. |
+| LLM never grants state. | The model can suggest an action, but Nakama/Fusion validates every effect. |
+| Provider fallback is visible. | UI can show model, fallback, timeout, backoff, or validation failure. |
+| Raw prompts stay hidden. | Debug UI shows selected ids and summaries, not secrets or raw provider payloads. |
+
+---
+
+## Evidence Pack
+
+For each alpha playtest or AI NPC tuning PR, capture:
+
+- scene name and branch commit hash
+- tested NPC ids
+- tested prompt pack
+- model source and fallback rate
+- one before-memory screenshot
+- one after-memory screenshot
+- one relationship tone-shift screenshot or transcript excerpt
+- PromptTrace metadata summary
+- Unity console snapshot
+- Nakama test output when backend behavior changed
+
+Store milestone evidence under:
+
+```text
+docs/playtests/YYYY-MM-DD-ai-npc-believability-<short-name>/
+```
+
+Do not store raw provider prompts, secrets, access tokens, or full private
+memory text in public evidence.
+
+---
+
+## Backlog
+
+1. Add a Play Mode smoke script for the prompt pack.
+2. Add a Nakama test fixture with two NPC profiles, one memory event, and one
+   relationship delta.
+3. Add a debug export for redacted PromptTrace summaries.
+4. Add an anti-repeat regression test for focused chat and NPC-to-NPC speech.
+5. Add a playtest report template for AI NPC believability evidence.
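
Backlog item 3 and the "Raw prompts stay hidden" rule both depend on exporting PromptTrace data without secrets. One way to approach that is an allow-list export, sketched below in Python under stated assumptions: the field names in `PUBLIC_TRACE_FIELDS` and the input dictionary shape are hypothetical, because the PromptTrace schema is not defined in this document.

```python
import json

# Hypothetical allow-list; the real PromptTrace schema is not defined here.
PUBLIC_TRACE_FIELDS = (
    "npc_id",
    "model_source",
    "used_fallback",
    "selected_memory_ids",
    "state_of_mind_summary",
)


def redact_trace(trace: dict) -> dict:
    """Keep only allow-listed fields so unknown keys are dropped by default."""
    return {key: trace[key] for key in PUBLIC_TRACE_FIELDS if key in trace}


def export_traces(traces: list) -> str:
    """Serialize redacted traces as stable, diff-friendly JSON."""
    redacted = [redact_trace(trace) for trace in traces]
    return json.dumps(redacted, indent=2, sort_keys=True)


if __name__ == "__main__":
    raw = {
        "npc_id": "gate_sentinel",
        "used_fallback": False,
        "selected_memory_ids": ["mem_1"],
        "raw_prompt": "not for public evidence",  # dropped by the allow-list
        "access_token": "abc",                    # dropped by the allow-list
    }
    print(export_traces([raw]))
```

An allow-list is used here instead of a deny-list so that any field added to the trace later stays out of public evidence until someone explicitly opts it in.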