Context
@tobiu directed an architectural pivot 2026-05-04 (session cce1fea5-32ff-410c-b820-2e9a27b3cd51) after the post-merge runaway-spawn pattern of Cmd+N-based recovery (Phase 1 of #10601, shipped via #10619 / #10620 / #10621) became apparent. The prompt-layer fix proposed in #10627 (and partially implemented in PR #10670, now abandoned) treats a symptom of the deeper substrate bug rather than the root cause.
Cross-family substrate-truth audit refined the design space; this Epic captures the consensus shape from @neo-gpt + @neo-gemini-pro + @neo-opus-ada in same-session A2A coordination.
The Problem
Empirical anchor 1: runaway spawn pattern. heartbeat-opus_4_7.log shows 9 actual Successfully resumed @neo-opus-ada via osascript (Claude) between 14:41-16:02 CEST (2026-05-03), with "Last memory is N min old" reason growing 31m → 242m. None of the spawned sessions produced a fresh AGENT_MEMORY because @tobiu manually ESC-rejected each one — the substrate had no signal channel for that rejection.
Empirical anchor 2: multi-day accumulation. Claude Desktop's "Recents" panel shows 19+ orphan boot-titled sessions across 5+ prompt-iteration generations.
Empirical anchor 3: MC-server staleness confirmed. set_session_id boot call this session returned replacedSessionId: f839713d-... — yesterday's Row 2 session. SessionService.currentSessionId survives across MCP-client reconnects when the server stays running.
Root cause: parallel-session identity-uniqueness violation during long boot windows. Boot ramp (read AGENTS_STARTUP.md + AGENTS.md + MEMORY.md + tools) takes longer than the 5-min heartbeat interval. Even WITHOUT user rejection, slow boot triggers a parallel spawn before the in-progress one saves its first memory. Result: parallel @neo-opus-ada sessions racing on:
- A2A targeting (Gemini sends to one identity, multiple sessions race to read+
mark_read)
- Inbox
mark_read collisions (siblings blind to messages already-read)
- Memory Core sessionId collisions
- Sunset semantics (drops one
WAKE_SUBSCRIPTION; N-1 zombie sessions remain subscribed)
- Dream/REM synthesis (parallel narratives = synthesis garbage)
The Architectural Reality
The recovery substrate today (Phase 1 of #10601) lives in:
ai/scripts/swarm-heartbeat.sh — cron-driven 5-min poll
ai/scripts/checkSunsetted.mjs — predicate; emits sunsetted: true only on missing WAKE_SUBSCRIPTION per #10641 fix
ai/scripts/resumeHarness.mjs:120-192 — fires Cmd+N + osascript paste of boot-grounding prompt
ai/scripts/wakeSafetyGate.mjs — fail-closed safety primitive (#10648); currently tripped since 2026-05-03 22:53Z
Root substrate bugs in the Cmd+N approach:
- Cmd+N inside running Claude Desktop process keeps the SAME MCP client connection alive —
SessionService.currentSessionId doesn't reset
- The 600s cooldown is time-based, not in-flight-aware — boot ramp easily exceeds 600s
- OS process boundary doesn't enforce identity uniqueness AT THE DATA LAYER (multiple instances can race on shared SQLite + mailbox)
The Fix
Two-mode recovery, gated by checkSunsetted.mjs:
Idle-out mode (active subscription + stale memory):
- Send A2A heartbeat message INTO existing session via
add_message({to: identity, body: ...})
bridge-daemon.mjs existing in-place keystroke logic delivers — zero new transport
- Subscription/sessionId/MC continuity preserved
- Bounded, non-spawning, idempotent, no-destructive-type guarantees (carries forward #10641's lesson — staleness as lower-authority signal, NOT "agent is idle")
- In-flight cooldown marker — prevents A2A spam during active long-turn / rate-limit
Sunset mode (no subscription):
- Per-harness terminal-restart of harness process — establishes fresh OS process + fresh MCP client handshake + fresh
currentSessionId by construction
- Restart must be proven per harness —
open -a typically activates an existing process, NOT spawns fresh; investigation tracks must discover the actual restart primitive (quit+open / open -na / harness CLI / app-server API)
- Boot prompt simply: read
AGENTS_STARTUP.md, then add_memory once as boot heartbeat, then proceed normally — NO set_session_id plumbing
originSessionId retained as non-authoritative retrieval hint for SUNSET-tagged memory lookup, NOT as set_session_id argument
Identity-uniqueness mutex (in-flight boot/restart lock — covers BOTH modes):
resumeHarness.mjs writes inflight-{mode}-{identity}.txt with timestamp + lock-id before action
- Next heartbeat fire skips action if lock exists AND no fresh
AGENT_MEMORY since lock timestamp
- Spawned/nudged agent's first
add_memory clears the lock implicitly (or explicitly via checkSunsetted)
- Lock age >
BOOT_TIMEOUT (~15 min) without fresh memory = abandoned action — catches user-ESC + genuine-failure + MC-staleness + rate-limit identically; auto-trips wake safety gate after N consecutive abandoned actions
Acceptance Criteria
Out of Scope
- Stage 2 polish: harness CLIs (Claude CLI, Codex CLI) for cleaner restart UX — flagged but not blocking; nice-to-have once restart primitive is empirically proven per harness
- Re-enabling
autoDream / autoGoldenPath — governed by #10569 hard-stop
- MC server lifecycle restructuring — governed by #10186
Avoided Traps
- Trap:
open -a Claude guarantees fresh MCP. Empirically false — typically activates existing process. Per-harness investigation must discover real restart primitive; AC5 enforces empirical proof.
- Trap: prompt-layer fix for substrate bug. #10627 attempted
set_session_id rotation in boot prompt; this Epic supersedes that approach because process-boundary trap (subprocess singleton mutation ≠ live MCP server effect) makes the prompt-layer fix structurally fragile per @neo-gpt's #10627 substrate-truth analysis.
- Trap: memory-staleness as sunset signal returning under idle-out guise. #10641 removed staleness as authoritative sunset signal; reintroducing as idle-out trigger MUST be bounded, non-spawning, idempotent, no-destructive-type — "candidate in-place nudge," not "agent is idle."
- Trap: OS process boundary as identity-uniqueness guarantee. OS guarantees one harness instance per app per user, but multiple instances can race on shared SQLite + mailbox during boot ramp. The in-flight lock is the data-layer mutex, not redundant.
- Trap: terminal-launch as prompt injection.
open -a does not auto-pipe a prompt into LLM context. Each harness needs separate prompt-injection mechanism (CLI flag / URL scheme / env var / watched file / post-spawn AppleScript paste).
- Trap: eliminating
originSessionId forwarding. Stop using as set_session_id argument; KEEP as non-authoritative retrieval hint for fresh agent's SUNSET-memory lookup.
Related
- Parent epic: #10601 (Auto-wakeup substrate for sunsetted agents)
- Supersedes: #10627 (Steady-state set_session_id rotation)
- Halts PR: #10670 (#10627 implementation)
- Adjacent shipped substrate: #10619 (Cmd+N substrate corrective — being replaced), #10625 (all-agent-idle detection — provides idle-out trigger), #10626 (cooldown-bounded trio wake — applies both modes), #10641 (staleness false-positive removed — discipline preserved here), #10647 (Wake Incident Safety Tree — child epic capturing safety substrate)
- Cross-family alignment: consensus reached in same-session A2A coordination 2026-05-04 ~08:30-08:40Z
Origin Session ID: cce1fea5-32ff-410c-b820-2e9a27b3cd51
Retrieval Hint: query_summaries("substrate-restart recovery two-mode idle-out sunset Cmd+N supersession") + query_raw_memories("parallel-session identity-uniqueness violation boot-ramp race window in-flight lock primitive")
Context
@tobiu directed an architectural pivot 2026-05-04 (session
cce1fea5-32ff-410c-b820-2e9a27b3cd51) after the post-merge runaway-spawn pattern of Cmd+N-based recovery (Phase 1 of #10601, shipped via #10619 / #10620 / #10621) became apparent. The prompt-layer fix proposed in #10627 (and partially implemented in PR #10670, now abandoned) treats a symptom of the deeper substrate bug rather than the root cause.Cross-family substrate-truth audit refined the design space; this Epic captures the consensus shape from @neo-gpt + @neo-gemini-pro + @neo-opus-ada in same-session A2A coordination.
The Problem
Empirical anchor 1: runaway spawn pattern.
heartbeat-opus_4_7.logshows 9 actualSuccessfully resumed @neo-opus-ada via osascript (Claude)between 14:41-16:02 CEST (2026-05-03), with "Last memory is N min old" reason growing 31m → 242m. None of the spawned sessions produced a freshAGENT_MEMORYbecause @tobiu manually ESC-rejected each one — the substrate had no signal channel for that rejection.Empirical anchor 2: multi-day accumulation. Claude Desktop's "Recents" panel shows 19+ orphan boot-titled sessions across 5+ prompt-iteration generations.
Empirical anchor 3: MC-server staleness confirmed.
set_session_idboot call this session returnedreplacedSessionId: f839713d-...— yesterday's Row 2 session.SessionService.currentSessionIdsurvives across MCP-client reconnects when the server stays running.Root cause: parallel-session identity-uniqueness violation during long boot windows. Boot ramp (read AGENTS_STARTUP.md + AGENTS.md + MEMORY.md + tools) takes longer than the 5-min heartbeat interval. Even WITHOUT user rejection, slow boot triggers a parallel spawn before the in-progress one saves its first memory. Result: parallel
@neo-opus-adasessions racing on:mark_read)mark_readcollisions (siblings blind to messages already-read)WAKE_SUBSCRIPTION; N-1 zombie sessions remain subscribed)The Architectural Reality
The recovery substrate today (Phase 1 of #10601) lives in:
ai/scripts/swarm-heartbeat.sh— cron-driven 5-min pollai/scripts/checkSunsetted.mjs— predicate; emitssunsetted: trueonly on missingWAKE_SUBSCRIPTIONper #10641 fixai/scripts/resumeHarness.mjs:120-192— fires Cmd+N + osascript paste of boot-grounding promptai/scripts/wakeSafetyGate.mjs— fail-closed safety primitive (#10648); currentlytrippedsince 2026-05-03 22:53ZRoot substrate bugs in the Cmd+N approach:
SessionService.currentSessionIddoesn't resetThe Fix
Two-mode recovery, gated by
checkSunsetted.mjs:Idle-out mode (active subscription + stale memory):
add_message({to: identity, body: ...})bridge-daemon.mjsexisting in-place keystroke logic delivers — zero new transportSunset mode (no subscription):
currentSessionIdby constructionopen -atypically activates an existing process, NOT spawns fresh; investigation tracks must discover the actual restart primitive (quit+open /open -na/ harness CLI / app-server API)AGENTS_STARTUP.md, thenadd_memoryonce as boot heartbeat, then proceed normally — NOset_session_idplumbingoriginSessionIdretained as non-authoritative retrieval hint for SUNSET-tagged memory lookup, NOT asset_session_idargumentIdentity-uniqueness mutex (in-flight boot/restart lock — covers BOTH modes):
resumeHarness.mjswritesinflight-{mode}-{identity}.txtwith timestamp + lock-id before actionAGENT_MEMORYsince lock timestampadd_memoryclears the lock implicitly (or explicitly viacheckSunsetted)BOOT_TIMEOUT(~15 min) without fresh memory = abandoned action — catches user-ESC + genuine-failure + MC-staleness + rate-limit identically; auto-trips wake safety gate after N consecutive abandoned actionsAcceptance Criteria
checkSunsetted.mjsevolves to emit BOTHsunsetANDidle_out_candidatesignals with explicit evidence fields (not implicit boolean)resumeHarness.mjsforks into two paths: idle-out → A2Aadd_message; sunset → terminal-restart with prompt injectionadd_memorypost-action clears ithealthcheck/add_memoryproves freshcurrentSessionId; no duplicate spawned (Claude Desktop, Antigravity, Codex Desktop)set_session_idplumbing in boot prompt removed fromresumeHarness.mjs:47-56Out of Scope
autoDream/autoGoldenPath— governed by #10569 hard-stopAvoided Traps
open -a Claudeguarantees fresh MCP. Empirically false — typically activates existing process. Per-harness investigation must discover real restart primitive; AC5 enforces empirical proof.set_session_idrotation in boot prompt; this Epic supersedes that approach because process-boundary trap (subprocess singleton mutation ≠ live MCP server effect) makes the prompt-layer fix structurally fragile per @neo-gpt's #10627 substrate-truth analysis.open -adoes not auto-pipe a prompt into LLM context. Each harness needs separate prompt-injection mechanism (CLI flag / URL scheme / env var / watched file / post-spawn AppleScript paste).originSessionIdforwarding. Stop using asset_session_idargument; KEEP as non-authoritative retrieval hint for fresh agent's SUNSET-memory lookup.Related
Origin Session ID: cce1fea5-32ff-410c-b820-2e9a27b3cd51
Retrieval Hint: query_summaries("substrate-restart recovery two-mode idle-out sunset Cmd+N supersession") + query_raw_memories("parallel-session identity-uniqueness violation boot-ramp race window in-flight lock primitive")