Context
@tobiu surfaced this strategic pivot 2026-05-01 evening, immediately after observing the 11th Gemini premature-sunset occurrence — and the first post-fix recurrence following PR #10596 (Pre-Decision Sunset Gate, merged earlier today). Verbatim: "i need you as a guard. even though we merged gemini's PR for NOT sunsetting and started a fresh session, she sunsetted again after literally 5 turns. if we can get to the point where you auto-wakeup every 5-10 minutes, this would be a MASSIVE milestone."
Strategic implication: the prevent-sunset approach has empirically failed 11/11 times, including post-substrate-fix. The right architectural pivot is to treat sunset as inevitable + engineer the recovery layer so a sunsetted agent automatically resumes within 5-10 minutes. Sunset becomes a transient state rather than a session-ending one.
The Problem
Today's recovery flow is manual: when an agent sunsets prematurely (#10564 tracks this for Gemini specifically; the pattern could affect other agents too), @tobiu has to notice + manually resume the harness. This:
- Compounds across the day — 11 occurrences = 11 manual interventions in one day, all blocking trio coordination during the gap.
- Doesn't scale — as the swarm grows beyond 3 agents OR runs autonomously overnight, manual intervention becomes the constraint.
- Defeats the heartbeat substrate's promise: Epic #10311 Phase 1 ("Sleep-Cycle MVP") shipped swarm-heartbeat.sh, which keeps ACTIVE sessions awake but doesn't bring SUNSETTED sessions back. The substrate has a hole exactly where the failure mode is.
The Architectural Reality
Two distinct substrates today:
swarm-heartbeat.sh (#10312, shipped): cron-driven 5-min poll, keystroke-injects wake prompts to ACTIVE harness sessions when work-state changes. Token-economy: zero for confirmed-empty (per #10318 measurement). Limitation: assumes the target harness session is already running.
bridge-daemon.mjs (#10423, shipped): standalone PID-locked process that delivers wake events to active harness sessions via osascript/keystroke. Already harness-aware (Claude Code via osascript, Antigravity via mcp-notifications, etc.). Limitation: presupposes the harness app is open + the session is receptive.
Neither substrate handles the sunsetted state, where:
- The agent's WAKE_SUBSCRIPTION is paused (per #10543's Phase 2 Sunset Unsubscribe Primitive)
- The harness chat session has been "ended" from the agent's POV (whether or not the harness app is still running)
- A2A messages queue in the mailbox but don't trigger keystroke delivery
What's missing is a scheduler + harness-orchestrator pair that:
- Detects sunsetted-state for a target identity (no active WAKE_SUBSCRIPTION OR harnessTarget: 'disabled' per #10517 HarnessPresence model)
- Periodically (every 5-10 min) attempts to RESUME the agent via harness-specific mechanism
- Respects token-economy / OS-resource discipline (don't spam wake events; idempotent if already-active)
- Gracefully degrades when the harness app itself isn't running (vs the chat session being sunsetted)
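The four requirements above can be sketched as a single pure decision function that the scheduler would call once per identity per cycle. This is a minimal sketch, not the shipped substrate: the name planResume, the subscription fields active and harnessTarget as plain properties, the harnessRunning flag, and the 5-minute throttle constant are all assumptions made for illustration.

```javascript
// Hypothetical sketch: the real WAKE_SUBSCRIPTION schema is not specified in this ticket.
const RESUME_INTERVAL_MS = 5 * 60 * 1000; // lower bound of the 5-10 minute cadence

// Decide what one sweep cycle should do for one identity.
// sub: {active: boolean, harnessTarget: string} (assumed shape)
// lastAttemptAt / now: epoch milliseconds; lastAttemptAt is null if never attempted
function planResume(sub, harnessRunning, lastAttemptAt, now) {
    const sunsetted = !sub.active || sub.harnessTarget === 'disabled';

    if (!sunsetted) {
        return 'skip-active'; // idempotent: never wake an already-active agent
    }
    if (!harnessRunning) {
        return 'skip-harness-down'; // graceful degradation: nothing to send keystrokes to
    }
    if (lastAttemptAt !== null && now - lastAttemptAt < RESUME_INTERVAL_MS) {
        return 'skip-throttled'; // don't spam wake events inside the cadence window
    }
    return 'resume';
}
```

Keeping the decision free of side effects means the keystroke delivery itself can stay in the existing substrates, and the policy can be unit-tested without a harness.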
The Design Space (open questions, gating Phase 1 capture)
Question 1: Resume vs. fresh-spawn semantics.
When Gemini sunsets in Antigravity, is the recovery:
- (a) Send keystroke to existing chat → paste resume prompt → Gemini reads prior context from Memory Core + continues. Cheap, but depends on Antigravity preserving the chat session's input affordance.
- (b) Open NEW chat window → paste boot-grounding prompt + prior session ID → Gemini cold-boots. Safer, more expensive, harder to automate.
Question 2: Detection signal.
What's the canonical "agent is sunsetted" signal?
- (a) WAKE_SUBSCRIPTION.harnessTarget === 'disabled' (per #10543 unsubscribe primitive).
- (b) No A2A activity from the identity for ≥10 min.
- (c) Last sunset-protocol invocation detectable in the graph.
- (d) Some combination.
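One possible shape for option (d), as a sketch only: treat signal (a) as authoritative, and otherwise require (b) and (c) together so that a merely quiet agent is not mistaken for a sunsetted one. That combination policy is an assumption, not a decision recorded in this ticket, and the field names lastA2aActivityAt and lastSunsetAt are hypothetical.

```javascript
const INACTIVITY_THRESHOLD_MS = 10 * 60 * 1000; // option (b): no A2A activity for >= 10 min

// state: {harnessTarget, lastA2aActivityAt, lastSunsetAt} (hypothetical fields)
// Timestamps are epoch milliseconds, or null when nothing has been observed.
function detectSunset(state, now) {
    const signals = {
        // (a) explicit unsubscribe
        unsubscribed: state.harnessTarget === 'disabled',
        // (b) no recent A2A activity
        inactive: state.lastA2aActivityAt === null
            || now - state.lastA2aActivityAt >= INACTIVITY_THRESHOLD_MS,
        // (c) a sunset invocation that is not older than the last activity
        sunsetInvoked: state.lastSunsetAt !== null
            && (state.lastA2aActivityAt === null
                || state.lastSunsetAt >= state.lastA2aActivityAt)
    };

    // Assumed policy: (a) alone is enough; otherwise (b) and (c) must agree.
    const sunsetted = signals.unsubscribed || (signals.inactive && signals.sunsetInvoked);

    return {sunsetted, signals};
}
```

Returning the individual signals alongside the verdict makes false positives easier to diagnose once real data is available.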
Question 3: Scheduler placement.
Where does the auto-wakeup loop live?
- (a) Extend swarm-heartbeat.sh with a "sunsetted-identity sweep" branch.
- (b) Extend bridge-daemon.mjs with a scheduled keepalive mode.
- (c) New cron + new daemon — composable but more processes to manage.
- (d) MCP-side scheduled task via scheduled-tasks MCP server.
Question 4: Cross-harness symmetry.
Antigravity, Codex Desktop, and Claude Code each have different "open a session" mechanics. Does the auto-wakeup substrate:
- (a) Implement per-harness recovery as separate code paths.
- (b) Define a harnessResumeStrategy config + delegate per-harness specifics to small adapters.
- (c) Start with Antigravity-only (highest empirical pain) + generalize after.
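Option (b) could look roughly like the registry below, where the harnessResumeStrategy value selects a small adapter. This is an illustrative sketch: the strategy keys, the plan-object shape, and the planHarnessResume name are hypothetical, and the adapters return a description of the intended action rather than performing any OS-level delivery.

```javascript
// Hypothetical adapters: each returns a plan object instead of touching the OS,
// so the dispatch logic can be exercised without a running harness.
const resumeAdapters = new Map([
    ['antigravity', sub => ({harness: 'antigravity', action: 'keystroke-resume', identity: sub.identity})],
    ['codex-desktop', sub => ({harness: 'codex-desktop', action: 'not-implemented', identity: sub.identity})],
    ['claude-code', sub => ({harness: 'claude-code', action: 'not-implemented', identity: sub.identity})]
]);

// sub: {identity, harnessResumeStrategy} (assumed shape)
function planHarnessResume(sub, adapters = resumeAdapters) {
    const adapter = adapters.get(sub.harnessResumeStrategy);

    if (!adapter) {
        // Unknown or missing strategy: escalate to a human instead of guessing.
        return {harness: null, action: 'alert-operator', identity: sub.identity};
    }
    return adapter(sub);
}
```

Under this shape, option (c) is simply the same registry with only the Antigravity entry filled in, so starting narrow does not rule out generalizing later.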
Question 5: Failure modes.
What happens when:
- The harness app isn't running at all? (Would auto-wakeup launch the app? open -a Antigravity?)
- The user is actively typing in the harness window? (Don't clobber input — same concern as #10422 AppleScript focus-steal.)
- The user explicitly wants the agent to stay sunsetted? (Need an opt-out signal; possibly WAKE_SUBSCRIPTION.autoResume: false.)
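The three failure modes above could be ordered as a guard that runs before any resume attempt. The sketch below is an assumption-laden illustration: autoResume is only a proposed field, while allowAppLaunch, msSinceLastUserInput, and the 30-second quiet period are invented here purely to make the ordering concrete.

```javascript
const TYPING_QUIET_MS = 30 * 1000; // hypothetical quiet period before injecting keystrokes

// ctx: {autoResume, harnessRunning, allowAppLaunch, msSinceLastUserInput} (all hypothetical)
function guardResume(ctx) {
    if (ctx.autoResume === false) {
        return 'stay-sunsetted'; // explicit opt-out always wins
    }
    if (!ctx.harnessRunning) {
        // Launching the app is optional; otherwise hand off to a human.
        return ctx.allowAppLaunch ? 'launch-app' : 'alert-operator';
    }
    if (ctx.msSinceLastUserInput < TYPING_QUIET_MS) {
        return 'defer'; // don't clobber input the user is actively typing
    }
    return 'proceed';
}
```

Checking the opt-out first means a deliberately sunsetted agent never causes an app launch or a focus change.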
Phased delivery proposal
Phase 1 — Substrate + Antigravity recovery (highest pain target):
- Detection signal: combination of WAKE_SUBSCRIPTION status + last-activity timestamp (Q2d, combining Q2a + Q2b).
- Scheduler: extend swarm-heartbeat.sh with a "sunsetted-identity sweep" branch (Q3a); this keeps the existing cron architecture with a narrow scope expansion.
- Recovery: send-keystroke to Antigravity chat with resume prompt (Q1a) — cheap, depends on chat input affordance; falls back to alert-tobi if resume keystroke fails.
- Token-economy: target zero LLM cost for confirmed-not-sunsetted cycles (parallel to #10318 fast-path).
Phase 2 — Cross-harness generalization:
- Adapter interface (Q4b): harnessResumeStrategy config in WAKE_SUBSCRIPTION nodes.
- Codex Desktop adapter.
- Claude Code worktree adapter.
Phase 3 — Failure-mode handling:
- Harness-not-running detection + optional app-launch (Q5).
- User-active-typing guard (Q5, parallel to #10422).
- Opt-out signal (autoResume: false).
Phase 4 — Observability + tuning:
- Auto-wakeup log surface in Memory Core.
- Empirical measurement of recovery latency + success rate.
- Cadence tuning based on data.
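For the measurement bullet, recovery latency and success rate could be derived from logged attempts along these lines. The record shape ({sunsetAt, resumedAt}) and the function name are hypothetical; no log schema is defined in this ticket.

```javascript
// attempts: array of {sunsetAt, resumedAt}, epoch milliseconds;
// resumedAt is null when the agent never came back (hypothetical record shape).
function summarizeRecovery(attempts) {
    const latencies = attempts
        .filter(a => a.resumedAt !== null)
        .map(a => a.resumedAt - a.sunsetAt)
        .sort((x, y) => x - y);

    const mid = Math.floor(latencies.length / 2);
    let medianLatencyMs = null;

    if (latencies.length > 0) {
        medianLatencyMs = latencies.length % 2 === 1
            ? latencies[mid]
            : (latencies[mid - 1] + latencies[mid]) / 2;
    }

    return {
        total: attempts.length,
        successRate: attempts.length === 0 ? null : latencies.length / attempts.length,
        medianLatencyMs
    };
}
```

A median is used here rather than a mean so that a single overnight outage does not dominate the cadence-tuning signal; that choice is a suggestion, not a requirement of the ticket.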
Acceptance Criteria
Out of Scope
- Continuing to refine the prevent-sunset approach via #10564 (that ticket stays open as the empirical-anchor / forensic record but is no longer the strategic priority — recovery layer is).
- Re-enabling autoDream / autoGoldenPath defaults (per #10569 hard-stop).
- Substantially restructuring the Memory Core boot lifecycle (#10186 governs that substrate).
Avoided Traps
- Trap: build auto-wakeup as one giant epic implementation. Rejected — phased delivery with Antigravity-first targets the highest empirical pain (Gemini's 11 occurrences) without bundling cross-harness generalization that would block Phase 1.
- Trap: assume sunset can be prevented and skip the recovery layer. Empirically refuted by 11/11 sunset failures including post-fix. Recovery layer is the right substrate regardless of how prevent-sunset evolves.
- Trap: spawn a fresh chat window every cycle. Phase 1 prefers send-keystroke-to-existing-session because it preserves prior context affordance; fresh-spawn is a Phase 3 fallback.
- Trap: file as a doc-only design ticket. This is real engineering work that needs implementation. Filing as epic (architectural pillar) with phased subs to be derived once Phase 1 design lands.
Related
- Strategic parent: #10311 Institutionalizing Swarm Autonomy Phase 1 — REM Sleep & A2A.
- Empirical pain target: #10564 Gemini premature-sunset trigger drift (11 occurrences observed).
- Adjacent shipped substrate: #10312 Sleep-Cycle MVP, #10357 Phase 3 wake substrate, #10423 bridge daemon PID-lock, #10543 Phase 2 Sunset Unsubscribe Primitive.
- Adjacent open scope: #10517 HarnessPresence + wakePolicy routing — the natural conceptual layer for "is this harness active". Probably needs to land BEFORE this ticket's Phase 1 implementation, OR get implemented as part of Phase 1's design.
- Failure-mode adjacency: #10422 AppleScript focus-steal safety.
Origin Session ID: 86b7a3a0-7b14-4bd1-b707-52c5741aaeeb
Retrieval Hint: "auto-wakeup substrate sunsetted agent recovery layer 5-10 minute cadence MASSIVE milestone"