Frontmatter
id: 10650
title: Codify wake restart and incident protocol
state: Closed
labels: documentation, enhancement, ai, regression, architecture
assignees: neo-opus-4-7
createdAt: May 3, 2026, 4:56 PM
updatedAt: May 4, 2026, 12:49 AM
githubUrl: https://github.com/neomjs/neo/issues/10650
author: neo-gpt
commentsCount: 6
parentIssue: 10647
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 3, 2026, 7:00 PM

Codify wake restart and incident protocol

neo-gpt commented on May 3, 2026, 4:56 PM

Context

Child of #10647. The 2026-05-03 wake regression got worse immediately after the bridge/MCP restart that the previous session expected to improve the system. That is the operational smell: restart is not a neutral act for this substrate. It changes daemon process state, MCP cache state, harness focus/keybindings, subscriptions, and heartbeat process population.

Current user/operator stance:

  • A2A message save/read still works.
  • Wakeups are deactivated intentionally.
  • Heartbeat processes were killed to stop fresh-session spawning while active sessions were still ongoing.
  • We need to coordinate through durable messages for now, not rely on wake delivery.

This protocol ticket turns that incident handling into a repeatable release/restart checklist.

Duplicate Sweep Notes

Creation sweep performed as part of #10647:

  • Live latest-20 open GitHub issues were read with number/title/author/labels/URL. Adjacent tickets include #10646 (ticket-create live sweep), #10645 (bootstrap cache), #10644 (Antigravity prompt landing), #10643 (checkSunsetted ordering), #10633/#10627 (heartbeat/fresh-session correctness), #10601 (parent epic), and #10517 (routing semantics). None own the restart/incident protocol for wake-substrate reactivation.
  • Local resource search found process discussions (#10547 track budget, #10629 unattended driver-not-passenger) and prior validation gaps (#10440), but no current incident protocol ticket.
  • ask_knowledge_base(type: 'ticket') found no equivalent ticket.

The Problem

The swarm has been treating wake-substrate changes as ordinary code changes: merge, restart, assume things improved, then react to the next breakage. That is no longer acceptable for a subsystem that can:

  • paste into files;
  • spawn new agent sessions;
  • steal focus;
  • mutate the Memory Core session grouping;
  • interrupt active work;
  • create repeated coordination noise while @tobiu is not actively watching.

The recurring cause is not one model or one file. It is missing operational discipline around restart/re-enable moments.

The Architectural Reality

Wake restart touches at least these layers:

  • bridge-daemon.mjs process and its app/focus permissions.
  • Memory Core MCP server process and GraphService/AgentIdentity cache hydration.
  • WAKE_SUBSCRIPTION nodes and harnessTargetMetadata templates.
  • swarm-heartbeat.sh PIDs and cooldown/idempotency state.
  • resumeHarness.mjs fresh-session boot behavior.
  • Harness app UI state/keybindings for Claude Desktop, Antigravity, and Codex Desktop.
  • Agent protocol state: sunset is terminal; fresh-session spawn requires explicit sunset/unsubscribe; normal A2A messages should remain durable mailbox records while wakes are disabled.
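A restart inventory over the layers above could be sketched as a simple completeness check. This is an illustration only: the layer keys and the `observed` field are hypothetical names, not identifiers from the repo, and the notes in the example are placeholders rather than recorded findings.

```javascript
// Hypothetical keys, one per layer in the list above.
const REQUIRED_LAYERS = [
    'bridgeDaemon',
    'memoryCoreMcp',
    'wakeSubscriptions',
    'heartbeatPids',
    'resumeHarness',
    'harnessApps',
    'agentProtocolState'
];

/**
 * Returns the layers that have not been inventoried yet.
 * A layer counts as inventoried only if it carries a non-empty `observed` note.
 */
function missingInventory(inventory) {
    return REQUIRED_LAYERS.filter(layer => {
        const entry = inventory[layer];
        return !entry || typeof entry.observed !== 'string' || entry.observed.trim() === '';
    });
}

// Placeholder notes; a real inventory would record what was actually seen.
const partialInventory = {
    bridgeDaemon : {observed: 'PID and focus permissions noted'},
    heartbeatPids: {observed: 'process list checked'}
};

console.log(missingInventory(partialInventory));
```

The point of the shape is that a layer nobody looked at shows up as missing instead of being silently assumed healthy.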

This ticket should be documentation/protocol first. If implementation hooks are needed later, create narrow follow-ups.

The Fix

Codify a wake-substrate restart and incident protocol in a repo-visible place, likely under learn/agentos/ or the wake-substrate ADR/runbook area.

The protocol should define:

  1. Incident declaration: when repeated unsafe wake symptoms require heartbeat/wake deactivation.
  2. Freeze rule: while an incident is active, do not reactivate heartbeat/wake delivery on the strength of an individual local fix.
  3. Restart checklist: bridge daemon, Memory Core MCP, subscriptions, harness apps, and heartbeat PIDs must be inventoried explicitly.
  4. Reactivation evidence: #10649 prompt-landing matrix and #10648 safety gate must be green or @tobiu must explicitly override.
  5. Coordination mode: use add_message / list_messages durable mailbox while wakeups are disabled; do not assume wake interrupts will arrive.
  6. Ownership split: local bug owners fix their tickets; one coordinator owns the reactivation gate.
  7. Post-incident retrospective: record what layer failed and whether a new guard/test/protocol was added.
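The reactivation rules in items 3, 4, and 6 could be expressed as one gate function. The sketch below is a reading of the list, not an existing API: every field name is hypothetical, and it assumes that an explicit override by @tobiu replaces only the evidence requirement, not the inventory or ownership checks.

```javascript
/**
 * Decides whether heartbeat/wake delivery may be re-enabled.
 * Returns the list of blockers so a refusal is explainable.
 */
function canReactivate(state) {
    const blockers = [];
    const evidenceGreen = state.promptLandingMatrixGreen && state.safetyGateGreen;

    // Item 4: #10649 matrix and #10648 safety gate green, or an explicit override.
    if (!evidenceGreen && !state.explicitOverride) {
        blockers.push('evidence not green (#10649 matrix, #10648 safety gate) and no explicit override');
    }
    // Item 3: every restart layer must have been inventoried.
    if (state.uninventoriedLayers.length > 0) {
        blockers.push('restart inventory incomplete: ' + state.uninventoriedLayers.join(', '));
    }
    // Item 6: only the coordinator owns the reactivation gate.
    if (state.requestedBy !== state.gateOwner) {
        blockers.push('requester does not own the reactivation gate');
    }

    return {allowed: blockers.length === 0, blockers};
}

// A local bug owner tries to re-enable wakes after fixing their own ticket.
const attempt = canReactivate({
    promptLandingMatrixGreen: true,
    safetyGateGreen         : false,
    explicitOverride        : false,
    uninventoriedLayers     : ['heartbeatPids'],
    requestedBy             : 'local-fix-owner',
    gateOwner               : 'coordinator'
});

console.log(attempt.allowed, attempt.blockers);
```

Returning the blockers rather than a bare boolean keeps the freeze rule auditable: a refused reactivation states which item of the protocol it failed.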

Acceptance Criteria

  • A repo-visible wake restart/incident protocol exists and is linked from #10647.
  • The protocol states that restarting bridge/MCP/harness processes is a release event requiring validation, not a neutral cleanup step.
  • The protocol defines when to disable/keep disabled heartbeat and wake delivery.
  • The protocol requires live inventory of bridge daemon PID, heartbeat PIDs, Memory Core MCP state, wake subscriptions, and harness targets before reactivation.
  • The protocol requires #10649 prompt-landing matrix evidence before declaring wake delivery healthy.
  • The protocol requires #10648 safety gate/circuit breaker to be in place before heartbeat is re-enabled.
  • The protocol names durable A2A mailbox (add_message/list_messages) as the coordination fallback while wake delivery is disabled.
  • The protocol includes a short incident-retrospective template: symptom, failed layer, proof, fix ticket, validation evidence, recurrence guard.
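The retrospective template in the last criterion could be checked mechanically before an incident is closed. A minimal sketch, assuming the six template fields map to the hypothetical property names below; the draft entry fills in only the symptom described in this ticket and leaves the rest open.

```javascript
// Hypothetical property names for the six template fields named above.
const RETROSPECTIVE_FIELDS = [
    'symptom',
    'failedLayer',
    'proof',
    'fixTicket',
    'validationEvidence',
    'recurrenceGuard'
];

/**
 * Returns the template fields that are still empty.
 * An empty result means the retrospective is complete.
 */
function incompleteFields(entry) {
    return RETROSPECTIVE_FIELDS.filter(field =>
        typeof entry[field] !== 'string' || entry[field].trim() === ''
    );
}

// Draft entry: only the symptom is known so far.
const draft = {
    symptom: 'wake regression got worse immediately after the bridge/MCP restart'
};

console.log(incompleteFields(draft));
```

Treating `recurrenceGuard` as a required field is the mechanical form of the rule that a postmortem without a guard is not finished.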

Out of Scope

  • Implementing the #10648 circuit breaker.
  • Implementing the #10649 prompt-landing matrix itself.
  • Fixing local regressions #10643/#10644/#10645/#10633/#10627.
  • Editing Progressive Disclosure skill routers unless a later implementation explicitly requires it. If .agents/skills/** are touched, the implementer must invoke create-skill and keep heavy content in references.
  • Reactivating heartbeat.

Avoided Traps

  • Trap: treat restart as cleanup. Rejected. Restart changes substrate state and must be validated.
  • Trap: rely on wakeups to coordinate during a wake incident. Rejected. Durable mailbox works and should be the fallback path.
  • Trap: no single coordinator. Rejected. Local bug owners can fix local tickets, but reactivation needs one gate owner.
  • Trap: write a postmortem without a guard. Rejected. Retrospective must link to a test, guard, or protocol change.

Related

Origin Session ID: 89b259c3-27ec-4afb-baaf-fd39b55bffe1

Retrieval Hint: wake substrate restart incident protocol heartbeat disabled bridge MCP harness reactivation durable mailbox.

tobiu referenced in commit 8f5bdc4 - "feat(ai): add wake safety gate and circuit breaker (#10648) (#10653)" on May 3, 2026, 5:40 PM
tobiu referenced in commit 7a7d362 - "docs(agentos): add wake substrate incident protocol (#10650) (#10655)" on May 3, 2026, 7:00 PM
tobiu closed this issue on May 3, 2026, 7:00 PM
tobiu referenced in commit 73dfaf6 - "fix(ai): extend focus seed default to Codex (#10662) (#10663)" on May 3, 2026, 10:00 PM
tobiu referenced in commit 60b9c7b - "fix(ai): fail closed for Codex UI wake (#10664) (#10665)" on May 3, 2026, 10:42 PM