Context
Child of #10647. The 2026-05-03 wake regression got worse immediately after the bridge/MCP restart that the previous session expected to improve the system. That is the operational smell: restart is not a neutral act for this substrate. It changes daemon process state, MCP cache state, harness focus/keybindings, subscriptions, and heartbeat process population.
Current user/operator stance:
A2A message save/read still works.
Wakeups are deactivated intentionally.
Heartbeat processes were killed to stop fresh-session spawning while active sessions were still ongoing.
We need to coordinate through durable messages for now, not rely on wake delivery.
This protocol ticket turns that incident handling into a repeatable release/restart checklist.
Duplicate Sweep Notes
Creation sweep performed as part of #10647:
The 20 most recent open GitHub issues were read live with number/title/author/labels/URL. Adjacent tickets include #10646 (ticket-create live sweep), #10645 (bootstrap cache), #10644 (Antigravity prompt landing), #10643 (checkSunsetted ordering), #10633/#10627 (heartbeat/fresh-session correctness), #10601 (parent epic), and #10517 (routing semantics). None of them owns the restart/incident protocol for wake-substrate reactivation.
Local resource search found process discussions (#10547 track budget, #10629 unattended driver-not-passenger) and prior validation gaps (#10440), but no current incident protocol ticket.
ask_knowledge_base(type: 'ticket') found no equivalent ticket.
The Problem
The swarm has been treating wake-substrate changes as ordinary code changes: merge, restart, assume better, then react to the next breakage. That is no longer acceptable for a subsystem that can:
paste into files;
spawn new agent sessions;
steal focus;
mutate the Memory Core session grouping;
interrupt active work;
create repeated coordination noise while @tobiu is not actively watching.
The recurring cause is not one model or one file. It is missing operational discipline around restart/re-enable moments.
The Architectural Reality
Wake restart touches at least these layers:
bridge-daemon.mjs process and its app/focus permissions.
Memory Core MCP server process and GraphService/AgentIdentity cache hydration.
WAKE_SUBSCRIPTION nodes and harnessTargetMetadata templates.
swarm-heartbeat.sh PIDs and cooldown/idempotency state.
resumeHarness.mjs fresh-session boot behavior.
Harness app UI state/keybindings for Claude Desktop, Antigravity, and Codex Desktop.
Agent protocol state: sunset is terminal; fresh-session spawn requires explicit sunset/unsubscribe; normal A2A messages should remain durable mailbox records while wakes are disabled.
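The layers above lend themselves to an explicit checklist that is filled in from live observation before any reactivation decision. The following is a minimal Python sketch, not part of the existing tooling: the layer keys are shorthand for the items listed above, and `RestartInventory` with its methods is a hypothetical helper introduced only for illustration.

```python
from dataclasses import dataclass, field

# Shorthand keys for the layers listed above. The keys are illustrative
# and are not identifiers taken from the repository.
LAYERS = (
    "bridge_daemon",
    "memory_core_mcp",
    "wake_subscriptions",
    "heartbeat_pids",
    "resume_harness",
    "harness_apps",
    "agent_protocol_state",
)


@dataclass
class RestartInventory:
    """Records what was observed live for each layer (hypothetical helper)."""

    observations: dict[str, str] = field(default_factory=dict)

    def record(self, layer: str, observation: str) -> None:
        """Store a non-empty observation for a known layer."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        if not observation.strip():
            raise ValueError("an observation must not be empty")
        self.observations[layer] = observation

    def missing(self) -> list[str]:
        """Layers that have not been inventoried yet, in checklist order."""
        return [layer for layer in LAYERS if layer not in self.observations]

    def is_complete(self) -> bool:
        """True once every layer has a recorded observation."""
        return not self.missing()


# Usage: two layers recorded, the rest are still reported as missing.
inventory = RestartInventory()
inventory.record("bridge_daemon", "process observed, focus permissions checked")
inventory.record("heartbeat_pids", "no heartbeat processes running")
print(inventory.missing())
```

Keeping the layer list in one ordered tuple means the checklist cannot silently skip a layer: anything not recorded shows up in `missing()`.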
This ticket should be documentation/protocol first. If implementation hooks are needed later, create narrow follow-ups.
The Fix
Codify a wake-substrate restart and incident protocol in a repo-visible place, likely under learn/agentos/ or the wake-substrate ADR/runbook area.
The protocol should define:
Incident declaration: when repeated unsafe wake symptoms require heartbeat/wake deactivation.
Freeze rule: while an incident is active, do not reactivate heartbeat/wake delivery on the strength of individual local fixes.
Restart checklist: bridge daemon, Memory Core MCP, subscriptions, harness apps, and heartbeat PIDs must be inventoried explicitly.
Reactivation evidence: the #10649 prompt-landing matrix and the #10648 safety gate must both be green, or @tobiu must explicitly override.
Coordination mode: use the durable add_message / list_messages mailbox while wakeups are disabled; do not assume wake interrupts will arrive.
Ownership split: local bug owners fix their tickets; one coordinator owns the reactivation gate.
Post-incident retrospective: record what layer failed and whether a new guard/test/protocol was added.
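The reactivation evidence rule above reduces to a small decision function. The sketch below is one hedged way to express it in Python; `ReactivationEvidence`, its field names, and `may_reactivate` are hypothetical, and how the #10649 matrix and the #10648 gate actually report "green" is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactivationEvidence:
    """Inputs to the reactivation decision (hypothetical field names)."""

    prompt_landing_matrix_green: bool  # evidence from #10649
    safety_gate_green: bool            # evidence from #10648
    operator_override: bool = False    # explicit override by @tobiu


def may_reactivate(evidence: ReactivationEvidence) -> bool:
    """Return True only when heartbeat/wake delivery may be re-enabled."""
    if evidence.operator_override:
        return True
    # A single green signal is not enough: both the matrix and the
    # safety gate must be green.
    return evidence.prompt_landing_matrix_green and evidence.safety_gate_green


# Usage: a green matrix without a green safety gate keeps the freeze in place.
print(may_reactivate(ReactivationEvidence(True, False)))
```

A local fix landing on its own changes none of these inputs, which is how the freeze rule holds without a separate check.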
Acceptance Criteria
A repo-visible wake restart/incident protocol exists and is linked from #10647.
The protocol states that restarting bridge/MCP/harness processes is a release event requiring validation, not a neutral cleanup step.
The protocol defines when to disable/keep disabled heartbeat and wake delivery.
The protocol requires live inventory of bridge daemon PID, heartbeat PIDs, Memory Core MCP state, wake subscriptions, and harness targets before reactivation.
The protocol requires #10649 prompt-landing matrix evidence before declaring wake delivery healthy.
The protocol requires #10648 safety gate/circuit breaker to be in place before heartbeat is re-enabled.
The protocol names durable A2A mailbox (add_message/list_messages) as the coordination fallback while wake delivery is disabled.
The protocol includes a short incident-retrospective template: symptom, failed layer, proof, fix ticket, validation evidence, recurrence guard.
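The retrospective template named in the last criterion could be kept as a small structured record so that an unfilled field is easy to spot. This is a minimal Python sketch under that assumption; `IncidentRetrospective` and its methods are hypothetical, and only the six field names come from the criterion above.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentRetrospective:
    """One retrospective entry; every field is free text."""

    symptom: str
    failed_layer: str
    proof: str
    fix_ticket: str
    validation_evidence: str
    recurrence_guard: str

    def empty_fields(self) -> list[str]:
        """Names of fields that are still blank, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the entry as 'Label: value' lines, one per field."""
        lines = []
        for f in fields(self):
            label = f.name.replace("_", " ").capitalize()
            lines.append(f"{label}: {getattr(self, f.name)}")
        return "\n".join(lines)


# Usage: only the symptom is known so far; the remaining fields are flagged.
retro = IncidentRetrospective(
    symptom="Wake regression worsened after bridge/MCP restart",
    failed_layer="",
    proof="",
    fix_ticket="",
    validation_evidence="",
    recurrence_guard="",
)
print(retro.empty_fields())
```

Treating a blank `recurrence_guard` as incomplete is a design choice of this sketch; it keeps a retrospective from being closed without a linked guard.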
Out of Scope
Implementing the #10649 prompt-landing matrix itself.
Fixing local regressions #10643/#10644/#10645/#10633/#10627.
Editing Progressive Disclosure skill routers unless a later implementation explicitly requires it. If .agents/skills/** are touched, the implementer must invoke create-skill and keep heavy content in references.
Reactivating heartbeat.
Avoided Traps
Trap: treat restart as cleanup. Rejected. Restart changes substrate state and must be validated.
Trap: rely on wakeups to coordinate during a wake incident. Rejected. Durable mailbox works and should be the fallback path.
Trap: no single coordinator. Rejected. Local bug owners can fix local tickets, but reactivation needs one gate owner.
Trap: write a postmortem without a guard. Rejected. Retrospective must link to a test, guard, or protocol change.
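One way to read the second trap is that each session pulls from the durable mailbox at its own checkpoints instead of waiting to be woken. The Python sketch below illustrates that pull pattern only. The `fetch` callable stands in for a call to list_messages, whose real call shape is not described here, and the `"id"` key on each message is an assumption.

```python
from typing import Callable, Iterable


def unread_messages(
    fetch: Callable[[], Iterable[dict]],
    seen_ids: set[str],
) -> list[dict]:
    """Return messages not handled yet and mark them as seen.

    `fetch` is a stand-in for list_messages; the message shape with an
    "id" key is assumed for illustration.
    """
    fresh = []
    for message in fetch():
        if message["id"] not in seen_ids:
            seen_ids.add(message["id"])
            fresh.append(message)
    return fresh


# Usage with an in-memory stand-in for the mailbox.
mailbox = [{"id": "m1", "body": "status?"}, {"id": "m2", "body": "hold"}]
seen: set[str] = set()
first = unread_messages(lambda: mailbox, seen)
second = unread_messages(lambda: mailbox, seen)
print(len(first), len(second))
```

Because the set of seen IDs is owned by the caller, repeated polling is idempotent: a message is surfaced once even if no wake interrupt ever arrives.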
Related
Origin Session ID: 89b259c3-27ec-4afb-baaf-fd39b55bffe1
Retrieval Hint: wake substrate restart incident protocol heartbeat disabled bridge MCP harness reactivation durable mailbox.