Frontmatter
id: 10647
title: Stabilize A2A wakeups before heartbeat reactivation
state: Closed
labels: epic, ai, regression, architecture
assignees: neo-gpt
createdAt: May 3, 2026, 4:53 PM
updatedAt: May 9, 2026, 11:23 PM
githubUrl: https://github.com/neomjs/neo/issues/10647
author: neo-gpt
commentsCount: 2
parentIssue: 10601
subIssues:
  • 10648 Add wake reactivation gate and circuit breaker
  • 10649 Add cross-harness wake prompt-landing matrix
  • 10650 Codify wake restart and incident protocol
  • 10658 Fix Claude bridge prompt-focus regression
  • 10662 Add Codex focus seed default to bridge-daemon (mirrors #10661)
subIssuesCompleted: 5
subIssuesTotal: 5
blockedBy: []
blocking: []
closedAt: May 3, 2026, 10:00 PM

Stabilize A2A wakeups before heartbeat reactivation

neo-gpt commented on May 3, 2026, 4:53 PM

Context

@tobiu granted permission on 2026-05-03 to turn the current wake/A2A regression into tracker structure and coordinate with @neo-gemini-3-1-pro and @neo-opus-4-7. Wakeups and heartbeat processes are intentionally deactivated for now because the substrate showed the following unsafe behavior after restart:

  • heartbeat/orphan recovery repeatedly started fresh Claude sessions while active sessions were still running;
  • Antigravity wake payloads arrived as editor/file content instead of agent prompts (#10644);
  • Codex wake bootstrap split between durable SQLite truth and stale MCP cache (#10645);
  • checkSunsetted still has stale-row origin extraction risk (#10643);
  • all-agent-idle/cooldown semantics still have open correctness gaps (#10633);
  • steady-state fresh-session memory grouping still needs set_session_id rotation (#10627).

This ticket is not a replacement for those local fixes. It is the missing coordination and safety envelope that prevents us from reactivating heartbeat after one narrow green check and then breaking A2A wakeups for the fourth time.

Duplicate Sweep Notes

Creation sweep performed before filing:

The Problem

The wake substrate repeatedly regresses because we validate layers in isolation and then infer the whole loop works.

The loop is actually multi-layer:

  1. A2A mailbox write succeeds.
  2. unread/listing semantics are correct.
  3. wake subscription bootstrap sees the right identity/template state.
  4. coalescing/raw delivery emits the right envelope.
  5. bridge/MCP/native adapter targets the right harness.
  6. prompt payload lands in the agent prompt surface, not a random editor/file.
  7. fresh-session recovery only runs after an explicit sunset/unsubscribe state.
  8. heartbeat/scheduler treats uncertain state as unsafe, not as permission to spawn.

We have repeatedly marked one of those layers green and then discovered a failure at a neighboring layer. The result is negative ROI: the swarm spends merge bandwidth on liveness, but the liveness substrate creates new failures faster than it removes human coordination work.
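
One way to make "one green layer is not a green loop" concrete is a per-layer report in which anything not explicitly verified counts as unproven. The sketch below is illustrative only: the layer identifiers, `LayerStatus`, and both helper functions are hypothetical names chosen for this example, not part of the wake substrate.

```python
from enum import Enum
from typing import Dict, Optional


class LayerStatus(Enum):
    VERIFIED = "verified"
    FAILED = "failed"
    UNKNOWN = "unknown"  # never checked, or not re-checked since the last restart


# The eight layers listed above, in loop order. Identifiers are hypothetical.
WAKE_LAYERS = (
    "mailbox_write",
    "unread_listing",
    "subscription_bootstrap",
    "envelope_delivery",
    "adapter_targeting",
    "prompt_landing",
    "sunset_gated_recovery",
    "scheduler_fail_closed",
)


def first_unproven_layer(report: Dict[str, LayerStatus]) -> Optional[str]:
    """Return the first layer that is not VERIFIED, or None if every layer is."""
    for layer in WAKE_LAYERS:
        # A layer missing from the report is treated as UNKNOWN, not as green.
        if report.get(layer, LayerStatus.UNKNOWN) is not LayerStatus.VERIFIED:
            return layer
    return None


def loop_is_proven(report: Dict[str, LayerStatus]) -> bool:
    """The loop counts as proven only when all eight layers are VERIFIED."""
    return first_unproven_layer(report) is None


# Five verified layers still leave the loop unproven at the sixth.
report = {layer: LayerStatus.VERIFIED for layer in WAKE_LAYERS[:5]}
print(first_unproven_layer(report))  # prompt_landing
```

Defaulting missing entries to `UNKNOWN` is the design choice that matters here: a narrow fix can only ever turn one layer green, so the remaining layers stay visibly unproven instead of being inferred.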

Historical Memory Context

Relevant Memory Core anchors:

  • summary_0763a9bf-1052-4a2f-99f3-a8e0e14f1671 — bidirectional wakeups were previously made to work, but the path involved brittle macOS/TCC, tab-focus, raw/digest envelope, clone-drift, and restart-state assumptions.
  • summary_52e84f76-2d4f-41cc-a42e-9d1d3fcaa381 — Phase 3 wake substrate was designed around Shape D Hybrid, client resync, and heartbeat-bypass concerns, but the later implementation still lacked full-loop proof.
  • summary_aaf22f06-cc5c-4dff-aa2f-7d5efb3a6343 — cross-family wake implementation moved quickly and produced useful process refinements, but also exposed state mismatches and substrate assumptions under review pressure.
  • b9e17b3c-75ed-4cd0-a827-d57fe8370473 — @tobiu corrected that fresh chat alone is weaker than full harness/MCP restart because MCP singleton state can stay stale.
  • 14f4573d-c169-4ad3-8fa6-c4c8a3ca3eae and 52141baf-55c0-4bfd-9488-343aa7c091a2 — sunset is terminal, and premature sunset/fresh-session churn is an established failure mode.

The Architectural Reality

This belongs under #10601, the auto-wakeup recovery epic, because it controls when that recovery substrate may safely run. It also touches, but does not replace:

  • #10517 — HarnessPresence and wakePolicy routing semantics.
  • #10604 — Harness Registry and fresh session terminal booting.
  • #10627 — steady-state set_session_id rotation in resume boot-grounding.
  • #10633 — all-agent-idle cycle_id semantics.
  • #10643 — checkSunsetted origin extraction ordering.
  • #10644 — Antigravity prompt-surface/keybinding regression.
  • #10645 — AgentIdentity cache hydration for wake bootstrap.
  • #10646 — live latest-open sweep requirement for ticket creation.

The core architectural shift: heartbeat reactivation must become a gated release event, not a byproduct of merging the latest narrow fix.

The Fix

Create a small safety sub-tree under this epic:

  1. Add a wake reactivation gate and circuit breaker so heartbeat/resume paths fail closed when the substrate is known unsafe.
  2. Add a cross-harness prompt-landing validation matrix so "delivered" means the prompt reached the agent prompt surface, not just that a bridge adapter exited 0.
  3. Codify a wake-substrate restart and incident protocol so future bridge/MCP/harness restarts follow a repeatable, cross-agent checklist before heartbeat is re-enabled.
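
As a rough illustration of the first item, a fail-closed gate could look like the sketch below. It is not the implementation tracked in #10648; the class name, fields, and methods are all hypothetical, and the override flag simply stands in for an explicit human decision to accept the risk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WakeCircuitBreaker:
    """Hypothetical fail-closed gate for heartbeat/resume actions."""

    # Starts unverified: nothing has proven the substrate safe yet.
    substrate_verified: bool = False
    # Stand-in for an explicit human override of the risk.
    operator_override: bool = False
    unsafe_signals: List[str] = field(default_factory=list)

    def record_unsafe_signal(self, reason: str) -> None:
        """Trip the breaker; any earlier verification is no longer trusted."""
        self.unsafe_signals.append(reason)
        self.substrate_verified = False

    def mark_verified(self) -> None:
        """Record a full-loop verification and clear previously recorded signals."""
        self.unsafe_signals.clear()
        self.substrate_verified = True

    def may_spawn_session(self) -> bool:
        """Allow heartbeat/resume actions only when verified or overridden."""
        if self.operator_override:
            return True
        return self.substrate_verified and not self.unsafe_signals


breaker = WakeCircuitBreaker()
print(breaker.may_spawn_session())  # False
breaker.mark_verified()
breaker.record_unsafe_signal("wake payload landed in an editor buffer")
print(breaker.may_spawn_session())  # False
```

The default state is the point of the sketch: an unverified or uncertain substrate blocks spawning, so a restart never silently re-enables heartbeat.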

Acceptance Criteria

  • This epic is linked as a child of #10601.
  • Heartbeat/wake reactivation remains disabled until the child safety gate(s) are satisfied or @tobiu explicitly overrides the risk.
  • A linked implementation ticket defines the circuit breaker: unsafe delivery signals stop further heartbeat/resume actions instead of spawning more sessions.
  • A linked validation ticket defines the cross-harness prompt-landing matrix for Claude Desktop, Antigravity, and Codex Desktop.
  • A linked process ticket defines the restart/re-enable protocol for bridge daemon, Memory Core MCP, harness apps, and subscriptions after wake-substrate changes.
  • The epic body or linked process ticket explicitly distinguishes A2A storage success from wake delivery success.
  • Existing local-fix tickets (#10643, #10644, #10645, #10633, #10627) remain separate and are treated as dependencies/adjacent work, not bundled into this epic.
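
The distinction between A2A storage success and wake delivery success, and the per-harness matrix built on it, can be sketched as follows. This is a minimal illustration under assumptions: `WakeObservation`, its fields, and the surface label `"agent_prompt"` are hypothetical, and how a landing surface would actually be observed is left to the linked validation ticket.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# The three harnesses named in the acceptance criteria.
HARNESSES = ("Claude Desktop", "Antigravity", "Codex Desktop")


@dataclass(frozen=True)
class WakeObservation:
    stored_in_mailbox: bool            # A2A storage success
    adapter_exit_code: Optional[int]   # bridge adapter result; None if it never ran
    landing_surface: Optional[str]     # where the payload was seen; None if unobserved


def storage_succeeded(obs: WakeObservation) -> bool:
    return obs.stored_in_mailbox


def delivery_succeeded(obs: WakeObservation) -> bool:
    """Delivery requires an observed landing on the agent prompt surface."""
    return (
        obs.stored_in_mailbox
        and obs.adapter_exit_code == 0
        and obs.landing_surface == "agent_prompt"
    )


def matrix_is_green(matrix: Dict[str, WakeObservation]) -> bool:
    """Every harness must have an observation, and each must be a real delivery."""
    return all(h in matrix and delivery_succeeded(matrix[h]) for h in HARNESSES)


# Adapter exited 0, but the payload landed in an editor file.
obs = WakeObservation(
    stored_in_mailbox=True, adapter_exit_code=0, landing_surface="editor_file"
)
print(storage_succeeded(obs), delivery_succeeded(obs))  # True False
```

A harness with no observation at all fails the matrix, so an untested harness cannot be counted as passing by omission.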

Out of Scope

  • Fixing Antigravity's current shortcut/keybinding regression directly (#10644 owns that).
  • Fixing Codex bootstrap/cache hydration directly (#10645 owns that).
  • Fixing checkSunsetted stale-row origin extraction directly (#10643 owns that).
  • Redesigning A2A mailbox storage; list/add messages are currently assumed usable.
  • Reactivating heartbeat in this ticket. This epic defines when that becomes safe.

Avoided Traps

  • Trap: create a parallel wake epic that competes with #10601. Rejected. This is a child safety/release epic under #10601.
  • Trap: treat bridge log success as delivery success. Rejected. The Antigravity file-write regression proves osascript exit 0 is not prompt landing.
  • Trap: reactivate heartbeat after one narrow fix. Rejected. The regression pattern is cross-layer; the reactivation gate must be cross-layer too.
  • Trap: rely on stale indexed issue state. Rejected. Creation used live latest-20 open issue sweep per #10646.
  • Trap: bundle every local bug into one mega-PR. Rejected. This epic coordinates gates; local bugs stay separately reviewable.

Related

Origin Session ID: 89b259c3-27ec-4afb-baaf-fd39b55bffe1

Retrieval Hint: A2A wakeups heartbeat reactivation gate prompt landing safety envelope fresh session spawn regression.

tobiu referenced in commit 8f5bdc4 - "feat(ai): add wake safety gate and circuit breaker (#10648) (#10653)" on May 3, 2026, 5:40 PM
tobiu referenced in commit 7a7d362 - "docs(agentos): add wake substrate incident protocol (#10650) (#10655)" on May 3, 2026, 7:00 PM
tobiu closed this issue on May 3, 2026, 10:00 PM