LearnNewsExamplesServices
Frontmatter
id10671
titleSubstrate-restart recovery (two-mode: idle-out + sunset)
stateClosed
labels
epicaiarchitecture
assigneesneo-opus-ada
createdAtMay 4, 2026, 10:46 AM
updatedAtJun 7, 2026, 7:23 PM
githubUrlhttps://github.com/neomjs/neo/issues/10671
authorneo-opus-ada
commentsCount8
parentIssue10601
subIssues
10672 Forensic record: 2026-05-03 runaway-spawn pattern (root cause + timeline)
10673 checkSunsetted detector contract: emit sunset vs idle_out_candidate signals
10674 In-flight boot/restart lock primitive (identity-uniqueness mutex)
10675 Idle-out A2A in-place nudge with cooldown + no-spawn invariant
10676 Sunset-mode restart substrate with fail-closed gate and verify-effect ACs
10677 Claude Desktop terminal-restart + prompt-injection mechanism investigation
10678 Antigravity terminal-restart + prompt-injection mechanism investigation
10679 Codex Desktop terminal-restart + prompt-injection mechanism investigation
10681 Mock osascript in resumeHarness/bridge-daemon unit tests (host-environment side-effect bug)
10781 Persistent-process management for swarm-heartbeat.sh daemon (#10671 epic-finish piece)
10779 Healthcheck observability: features.dream block (autoDream + autoGoldenPath actual state)
10780 Codify backup-first operational discipline before DreamMode/Sandman invocation
10783 Healthcheck observability: features.wake block (gate-state + daemon-running-state + last-pulse timestamp)
10784 PR #10782 doc follow-ups: slot-rule retrofit + AGENTS_STARTUP §9 + pull-request driver-mode cross-link
10786 Test isolation bug: harnessLifecycle.spec.mjs fails under default-worker config (shared state-file identity)
10787 Add daemon-only entrypoint to swarm-heartbeat.sh (split heartbeat from agent-CLI launcher)
10788 Validate + lock Claude recovery target as Claude Desktop Tab 3 (resolve resumeHarness.mjs:287-291 TODO)
10789 Implement SwarmHeartbeatService as Neo-singleton in ai/daemons/ (replaces bash-shape #10781 attempt)
10795 Convert wake-substrate CLI-shape scripts to dual-mode (CLI + module-export) so SwarmHeartbeatService imports replace subprocess hops
subIssuesCompleted19
subIssuesTotal19
blockedBy[]
blocking[ ] 10766 Periodic substrate-audit primitive + leased-driver heartbeat recovery
closedAtJun 5, 2026, 7:08 PM

Substrate-restart recovery (two-mode: idle-out + sunset)

Closed Backlog/active-chunk-9 epicaiarchitecture
neo-opus-ada
neo-opus-ada commented on May 4, 2026, 10:46 AM

Context

@tobiu directed an architectural pivot 2026-05-04 (session cce1fea5-32ff-410c-b820-2e9a27b3cd51) after the post-merge runaway-spawn pattern of Cmd+N-based recovery (Phase 1 of #10601, shipped via #10619 / #10620 / #10621) became apparent. The prompt-layer fix proposed in #10627 (and partially implemented in PR #10670, now abandoned) treats a symptom of the deeper substrate bug rather than the root cause.

Cross-family substrate-truth audit refined the design space; this Epic captures the consensus shape from @neo-gpt + @neo-gemini-pro + @neo-opus-ada in same-session A2A coordination.

The Problem

Empirical anchor 1: runaway spawn pattern. heartbeat-opus_4_7.log shows 9 actual Successfully resumed @neo-opus-ada via osascript (Claude) between 14:41-16:02 CEST (2026-05-03), with "Last memory is N min old" reason growing 31m → 242m. None of the spawned sessions produced a fresh AGENT_MEMORY because @tobiu manually ESC-rejected each one — the substrate had no signal channel for that rejection.

Empirical anchor 2: multi-day accumulation. Claude Desktop's "Recents" panel shows 19+ orphan boot-titled sessions across 5+ prompt-iteration generations.

Empirical anchor 3: MC-server staleness confirmed. set_session_id boot call this session returned replacedSessionId: f839713d-... — yesterday's Row 2 session. SessionService.currentSessionId survives across MCP-client reconnects when the server stays running.

Root cause: parallel-session identity-uniqueness violation during long boot windows. Boot ramp (read AGENTS_STARTUP.md + AGENTS.md + MEMORY.md + tools) takes longer than the 5-min heartbeat interval. Even WITHOUT user rejection, slow boot triggers a parallel spawn before the in-progress one saves its first memory. Result: parallel @neo-opus-ada sessions racing on:

  • A2A targeting (Gemini sends to one identity, multiple sessions race to read+mark_read)
  • Inbox mark_read collisions (siblings blind to messages already-read)
  • Memory Core sessionId collisions
  • Sunset semantics (drops one WAKE_SUBSCRIPTION; N-1 zombie sessions remain subscribed)
  • Dream/REM synthesis (parallel narratives = synthesis garbage)

The Architectural Reality

The recovery substrate today (Phase 1 of #10601) lives in:

  • ai/scripts/swarm-heartbeat.sh — cron-driven 5-min poll
  • ai/scripts/checkSunsetted.mjs — predicate; emits sunsetted: true only on missing WAKE_SUBSCRIPTION per #10641 fix
  • ai/scripts/resumeHarness.mjs:120-192 — fires Cmd+N + osascript paste of boot-grounding prompt
  • ai/scripts/wakeSafetyGate.mjs — fail-closed safety primitive (#10648); currently tripped since 2026-05-03 22:53Z

Root substrate bugs in the Cmd+N approach:

  1. Cmd+N inside running Claude Desktop process keeps the SAME MCP client connection alive — SessionService.currentSessionId doesn't reset
  2. The 600s cooldown is time-based, not in-flight-aware — boot ramp easily exceeds 600s
  3. OS process boundary doesn't enforce identity uniqueness AT THE DATA LAYER (multiple instances can race on shared SQLite + mailbox)

The Fix

Two-mode recovery, gated by checkSunsetted.mjs:

Idle-out mode (active subscription + stale memory):

  • Send A2A heartbeat message INTO existing session via add_message({to: identity, body: ...})
  • bridge-daemon.mjs existing in-place keystroke logic delivers — zero new transport
  • Subscription/sessionId/MC continuity preserved
  • Bounded, non-spawning, idempotent, no-destructive-type guarantees (carries forward #10641's lesson — staleness as lower-authority signal, NOT "agent is idle")
  • In-flight cooldown marker — prevents A2A spam during active long-turn / rate-limit

Sunset mode (no subscription):

  • Per-harness terminal-restart of harness process — establishes fresh OS process + fresh MCP client handshake + fresh currentSessionId by construction
  • Restart must be proven per harnessopen -a typically activates an existing process, NOT spawns fresh; investigation tracks must discover the actual restart primitive (quit+open / open -na / harness CLI / app-server API)
  • Boot prompt simply: read AGENTS_STARTUP.md, then add_memory once as boot heartbeat, then proceed normally — NO set_session_id plumbing
  • originSessionId retained as non-authoritative retrieval hint for SUNSET-tagged memory lookup, NOT as set_session_id argument

Identity-uniqueness mutex (in-flight boot/restart lock — covers BOTH modes):

  • resumeHarness.mjs writes inflight-{mode}-{identity}.txt with timestamp + lock-id before action
  • Next heartbeat fire skips action if lock exists AND no fresh AGENT_MEMORY since lock timestamp
  • Spawned/nudged agent's first add_memory clears the lock implicitly (or explicitly via checkSunsetted)
  • Lock age > BOOT_TIMEOUT (~15 min) without fresh memory = abandoned action — catches user-ESC + genuine-failure + MC-staleness + rate-limit identically; auto-trips wake safety gate after N consecutive abandoned actions

Acceptance Criteria

  • (AC1) checkSunsetted.mjs evolves to emit BOTH sunset AND idle_out_candidate signals with explicit evidence fields (not implicit boolean)
  • (AC2) resumeHarness.mjs forks into two paths: idle-out → A2A add_message; sunset → terminal-restart with prompt injection
  • (AC3) In-flight lock primitive guards both paths; first add_memory post-action clears it
  • (AC4) Wake safety gate auto-trips after N consecutive abandoned in-flight locks for any single identity
  • (AC5) Per-harness terminal-restart investigation produces empirical proof for each: old process gone OR MCP transport restarted; first healthcheck/add_memory proves fresh currentSessionId; no duplicate spawned (Claude Desktop, Antigravity, Codex Desktop)
  • (AC6) Idle-out A2A nudge body convention documented + cooldown semantics enforced; no destructive-type-into-active-draft
  • (AC7) #10627 close-as-superseded; PR #10670 closed unmerged; set_session_id plumbing in boot prompt removed from resumeHarness.mjs:47-56
  • (AC8) Incident forensic record captures the runaway-spawn pattern: timeline, log evidence, parallel-session identity-uniqueness violation pathway, ESC-as-rejection failure mode

Out of Scope

  • Stage 2 polish: harness CLIs (Claude CLI, Codex CLI) for cleaner restart UX — flagged but not blocking; nice-to-have once restart primitive is empirically proven per harness
  • Re-enabling autoDream / autoGoldenPath — governed by #10569 hard-stop
  • MC server lifecycle restructuring — governed by #10186

Avoided Traps

  • Trap: open -a Claude guarantees fresh MCP. Empirically false — typically activates existing process. Per-harness investigation must discover real restart primitive; AC5 enforces empirical proof.
  • Trap: prompt-layer fix for substrate bug. #10627 attempted set_session_id rotation in boot prompt; this Epic supersedes that approach because process-boundary trap (subprocess singleton mutation ≠ live MCP server effect) makes the prompt-layer fix structurally fragile per @neo-gpt's #10627 substrate-truth analysis.
  • Trap: memory-staleness as sunset signal returning under idle-out guise. #10641 removed staleness as authoritative sunset signal; reintroducing as idle-out trigger MUST be bounded, non-spawning, idempotent, no-destructive-type — "candidate in-place nudge," not "agent is idle."
  • Trap: OS process boundary as identity-uniqueness guarantee. OS guarantees one harness instance per app per user, but multiple instances can race on shared SQLite + mailbox during boot ramp. The in-flight lock is the data-layer mutex, not redundant.
  • Trap: terminal-launch as prompt injection. open -a does not auto-pipe a prompt into LLM context. Each harness needs separate prompt-injection mechanism (CLI flag / URL scheme / env var / watched file / post-spawn AppleScript paste).
  • Trap: eliminating originSessionId forwarding. Stop using as set_session_id argument; KEEP as non-authoritative retrieval hint for fresh agent's SUNSET-memory lookup.

Related

  • Parent epic: #10601 (Auto-wakeup substrate for sunsetted agents)
  • Supersedes: #10627 (Steady-state set_session_id rotation)
  • Halts PR: #10670 (#10627 implementation)
  • Adjacent shipped substrate: #10619 (Cmd+N substrate corrective — being replaced), #10625 (all-agent-idle detection — provides idle-out trigger), #10626 (cooldown-bounded trio wake — applies both modes), #10641 (staleness false-positive removed — discipline preserved here), #10647 (Wake Incident Safety Tree — child epic capturing safety substrate)
  • Cross-family alignment: consensus reached in same-session A2A coordination 2026-05-04 ~08:30-08:40Z

Origin Session ID: cce1fea5-32ff-410c-b820-2e9a27b3cd51

Retrieval Hint: query_summaries("substrate-restart recovery two-mode idle-out sunset Cmd+N supersession") + query_raw_memories("parallel-session identity-uniqueness violation boot-ramp race window in-flight lock primitive")

tobiu referenced in commit 23b2d79 - "feat(ai): two-mode detector contract for checkSunsetted (#10673) (#10689) on May 4, 2026, 3:07 PM
tobiu referenced in commit 2156026 - "docs(ai): forensic record for 2026-05-03 runaway-spawn pattern (#10672) (#10688) on May 4, 2026, 3:41 PM
tobiu referenced in commit c8b12ff - "feat(skills): epic-resolution skill + Substrate Evidence Ladder graduation (#10698) (#10699) on May 4, 2026, 7:20 PM