Context
Sub of Epic #10671 (Substrate-restart recovery; 9 prior subs CLOSED). Investigation 2026-05-05 per @tobiu's hint about DreamMode-inactivity surfaced the integration gap that prevents the heartbeat substrate from running autonomously: ai/scripts/swarm-heartbeat.sh is structured as a continuous-loop daemon (while true; sleep $POLL_INTERVAL at lines 118 + 256) but has no persistent-process-management mechanism — no launchd plist, no systemd unit, no npm script that keeps it alive between operator sessions or system reboots.
@tobiu's framing 2026-05-05 19:48Z: "our velocity is gated by me, since as a human i need sleep. e.g., if i am online for 12h a day, we miss 12h of progress. this is a strong push for nightshift mode."
The full night-shift readiness path is: persistent-process-management (THIS sub) + wakeSafetyGate untrip (operator-territory CLI) + end-to-end validation (operator-territory test run). This sub addresses the first piece.
The Problem
Without persistent-process-management:
- swarm-heartbeat.sh only runs when an operator manually invokes bash ai/scripts/swarm-heartbeat.sh and keeps the terminal open
- Operator-sleep / system-reboot / terminal-close all kill the daemon → no sweep activity → no autonomous heartbeat-driven recovery
- Even if wakeSafetyGate were untripped (operator-action), the daemon needs to be RUNNING for the gate to be consulted at all
- Empirical state today: zero sweep-error log activity (.neo-ai-data/wake-daemon/sweep-errors.log doesn't exist); checkSunsetted.mjs autonomous runs never fire
The mutual-idle-stall pattern (#10399) cannot be auto-resolved while the daemon isn't running. This was empirically demonstrated multiple times in session 2026-05-05 (3-of-3 agents idling simultaneously; @tobiu prompted recovery each time).
The Architectural Reality
- ai/scripts/swarm-heartbeat.sh: continuous-loop bash daemon; reads POLL_INTERVAL=300 (5 min) env var; concurrency-locked via .neo-ai-data/heartbeat-concurrency.lock (TTL 600s); persistent sweep-error log at .neo-ai-data/wake-daemon/sweep-errors.log
- ai/scripts/checkSunsetted.mjs: predicate consumed by the daemon; emits sunset vs idle_out_candidate signals (#10673)
- ai/scripts/resumeHarness.mjs: recovery dispatcher; consumes daemon output (#10675/#10676)
- ai/scripts/wakeSafetyGate.mjs: fail-closed gate; daemon consults via node wakeSafetyGate.mjs check before any high-authority action (#10648)
- Repo precedent: zero existing .plist or .service files committed; this sub authors the FIRST persistent-process-management substrate
- Operator environment: tobiasuhlig@wheel ownership in ls output suggests primary deployment is macOS (launchd is the native primitive); Linux deployments would need a sibling systemd-unit shape
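To make the loop, lock, and gate relationship concrete, here is a minimal sketch of one sweep iteration. It is not the contents of ai/scripts/swarm-heartbeat.sh: the function names (file_mtime, lock_is_fresh, sweep_once) are hypothetical, the sketch does not acquire the lock itself, and treating exit code 0 from the gate check as "allowed" is an assumption the ticket does not confirm.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the contents of ai/scripts/swarm-heartbeat.sh.

POLL_INTERVAL="${POLL_INTERVAL:-300}"   # seconds between sweeps (5 min per the ticket)
LOCK_TTL=600                            # seconds before a lock counts as stale
LOCK_FILE=".neo-ai-data/heartbeat-concurrency.lock"

# Print a file's mtime as epoch seconds (GNU stat first, BSD/macOS stat as fallback).
file_mtime() {
  stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# Succeed (exit 0) when the lock exists and is younger than the TTL.
lock_is_fresh() {
  local lock="$1" ttl="$2" now age
  [ -e "$lock" ] || return 1
  now=$(date +%s)
  age=$(( now - $(file_mtime "$lock") ))
  [ "$age" -lt "$ttl" ]
}

# One iteration: skip if another sweep holds a fresh lock, then ask the gate.
# The gate command is passed in as arguments; "exit 0 means allowed" is an assumption.
# Any non-zero exit (including a missing node binary) is treated as closed.
sweep_once() {
  if lock_is_fresh "$LOCK_FILE" "$LOCK_TTL"; then
    echo "skip: lock held"
    return 0
  fi
  if "$@"; then
    echo "gate open: high-authority actions permitted"
  else
    echo "gate closed: observe only"
  fi
}

# The daemon shape described above would then be roughly:
#   while true; do
#     sweep_once node ai/scripts/wakeSafetyGate.mjs check
#     sleep "$POLL_INTERVAL"
#   done
```

Nothing in this shape restarts the loop after the shell that launched it exits, which is the gap this sub addresses.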
The Fix
Two-deliverable shape — agent-side authoring (read-only / committed-template work) + operator-side installation (destructive write to ~/Library/LaunchAgents/).
Agent-side deliverables (this ticket scope)
- learn/agentos/wake-substrate/com.neomjs.swarm-heartbeat.plist.template, a committed template file with:
  - Standard plist boilerplate (XML + DOCTYPE)
  - Label = com.neomjs.swarm-heartbeat
  - ProgramArguments = [/bin/bash, /full/path/to/swarm-heartbeat.sh] (operator substitutes the full repo path; template uses placeholder)
  - RunAtLoad = true (start daemon on login / reboot)
  - KeepAlive = true (auto-restart if daemon crashes)
  - StandardOutPath + StandardErrorPath pointing into .neo-ai-data/wake-daemon/
  - EnvironmentVariables block with NEO_AGENT_IDENTITY placeholder
  - Inline comments naming the operator-substitution points + safety considerations
- learn/agentos/wake-substrate/PersistentProcessManagement.md, an operator-doc covering:
  - macOS launchd installation procedure (launchctl bootstrap gui/$(id -u))
  - macOS launchd uninstall procedure (launchctl bootout)
  - Linux systemd-unit sibling sketch (template only; no committed .service file unless someone validates against a Linux deployment)
  - Verification: launchctl list | grep com.neomjs + sweep-log activity check
  - Troubleshooting common gotchas (relative paths, missing WorkingDirectory, environment isolation)
- ai/scripts/installWakeSubstrateLaunchd.mjs (optional polish, scope-extension): Node.js script that customizes the plist template with the local repo path + writes to ~/Library/LaunchAgents/. Operator runs once. Out-of-scope for v1 if the manual procedure is short enough.
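As an illustration of the template keys listed above, the following sketch prints a candidate plist to stdout with the repo path and agent identity substituted. It is a draft under assumptions, not the committed template: render_plist is a hypothetical helper, the log file names launchd-stdout.log and launchd-stderr.log are placeholders, WorkingDirectory is added because the troubleshooting list names it as a gotcha, and the values are not XML-escaped. It has not been validated on a macOS install; that check stays with the operator.

```shell
#!/usr/bin/env bash
# Sketch only: prints a candidate LaunchAgent plist to stdout; installs nothing.

# Usage: render_plist <absolute-repo-root> <agent-identity>
render_plist() {
  local repo_root="$1" identity="$2"
  # launchd does not resolve relative paths, so reject them up front.
  case "$repo_root" in
    /*) ;;
    *) echo "repo root must be an absolute path: $repo_root" >&2; return 1 ;;
  esac
  local log_dir="$repo_root/.neo-ai-data/wake-daemon"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.neomjs.swarm-heartbeat</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>$repo_root/ai/scripts/swarm-heartbeat.sh</string>
  </array>
  <key>WorkingDirectory</key>
  <string>$repo_root</string>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>$log_dir/launchd-stdout.log</string>
  <key>StandardErrorPath</key>
  <string>$log_dir/launchd-stderr.log</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>NEO_AGENT_IDENTITY</key>
    <string>$identity</string>
  </dict>
</dict>
</plist>
EOF
}
```

A possible use is to render into a scratch location first, for example render_plist "$(pwd)" my-agent > /tmp/com.neomjs.swarm-heartbeat.plist, and run plutil -lint on the result on macOS before anything is copied into ~/Library/LaunchAgents/.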
Operator-side actions (out of scope for THIS ticket; tracked in #10671 epic-finish)
- Verify the plist on operator's actual macOS install (empirical syntax verification)
- Run launchctl bootstrap to install
- Untrip wakeSafetyGate after end-to-end validation
- Establish backup-first discipline before re-enabling DreamMode/Sandman (per #10780)
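The install, uninstall, and verification commands the operator doc is meant to cover can be sketched as a small wrapper. This is an illustration under assumptions: the function names are hypothetical, the plist is assumed to live at ~/Library/LaunchAgents/com.neomjs.swarm-heartbeat.plist, and DRY_RUN defaults to 1 so the commands are only printed, in keeping with the explicit-operator-GO rule for the actual install.

```shell
#!/usr/bin/env bash
# Sketch only. DRY_RUN defaults to 1, so launchctl commands are printed, not executed.

LABEL="com.neomjs.swarm-heartbeat"
PLIST="$HOME/Library/LaunchAgents/$LABEL.plist"   # assumed install location
DOMAIN="gui/$(id -u)"                             # per-user GUI domain, as in the ticket
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

install_agent()   { run launchctl bootstrap "$DOMAIN" "$PLIST"; }
uninstall_agent() { run launchctl bootout "$DOMAIN/$LABEL"; }

# Verification step 1: is the job known to launchd?
verify_agent() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: launchctl list | grep com.neomjs"
  else
    launchctl list | grep com.neomjs
  fi
}

# Verification step 2: has the sweep log been written since an earlier observation?
# Takes the log path and a previously recorded mtime (epoch seconds).
# Succeeds only if the file exists and its mtime is later than that value.
log_advanced() {
  local log="$1" previous="$2" current
  current=$(stat -c %Y "$log" 2>/dev/null || stat -f %m "$log" 2>/dev/null) || return 1
  [ "$current" -gt "$previous" ]
}
```

With POLL_INTERVAL at 300 seconds, recording the log's mtime once and calling log_advanced again after a little more than five minutes would be one way to confirm sweeps are happening, assuming each sweep touches the log.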
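Acceptance Criteria
- learn/agentos/wake-substrate/com.neomjs.swarm-heartbeat.plist.template committed with all required keys + inline operator-substitution comments
- learn/agentos/wake-substrate/PersistentProcessManagement.md committed with macOS install/uninstall/verify procedures
- […] learn/agentos/DreamPipeline.md (and tooling/WakeSubstrateIncidentProtocol.md) to the new doc so operators discovering DreamMode/Sandman issues find the persistent-process-management substrate
- […] launchctl list | grep com.neomjs; verify sweep activity via .neo-ai-data/wake-daemon/sweep-errors.log mtime advancing"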
Out of Scope
- Actually installing the plist on operator's macOS (operator-territory)
- Untripping wakeSafetyGate (operator-territory; separate #10671 epic-finish step)
- End-to-end mutual-idle simulation + recovery validation (operator-territory)
- Linux systemd .service file with empirical validation (sibling work; mentioned in AC3 as out-of-scope-for-v1)
- Optional installWakeSubstrateLaunchd.mjs automation script (scope-extension if helpful)
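For the Linux sibling, a systemd user unit would carry roughly the same information as the plist. The sketch below prints one to stdout; it is untested against any Linux deployment, render_unit is a hypothetical helper, and the Description text and RestartSec value are illustrative choices, not requirements from the ticket. Paths containing spaces would need quoting that this sketch does not attempt.

```shell
#!/usr/bin/env bash
# Sketch only: prints a candidate systemd user unit to stdout; installs nothing.

# Usage: render_unit <absolute-repo-root> <agent-identity>
render_unit() {
  local repo_root="$1" identity="$2"
  # ExecStart needs an absolute path, mirroring the launchd gotcha.
  case "$repo_root" in
    /*) ;;
    *) echo "repo root must be an absolute path: $repo_root" >&2; return 1 ;;
  esac
  cat <<EOF
[Unit]
Description=Swarm heartbeat daemon (sketch, unvalidated)

[Service]
ExecStart=/bin/bash $repo_root/ai/scripts/swarm-heartbeat.sh
WorkingDirectory=$repo_root
Environment=NEO_AGENT_IDENTITY=$identity
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
EOF
}
```

Restart=always plays the role of KeepAlive, and WantedBy=default.target approximates RunAtLoad for a user session. Saved as a user unit (for example under ~/.config/systemd/user/), it would typically be started with systemctl --user enable --now followed by the unit name.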
Avoided Traps
- Drafting plist text from memory without empirical verification: macOS launchd plist syntax has multiple gotchas (path-must-be-absolute, KeepAlive vs OnDemand, environment isolation). Verify-before-assert says I author the TEMPLATE + the GOTCHAS, but the operator empirically validates the plist on an actual macOS install. Per feedback_verify_before_assert.md discipline.
- Bundling with operator-territory steps: the plist install + gate-untrip + validation run are operator-territory; bundling them into AC of an agent-authored sub-ticket is wrong-scope. This sub authors the deliverables; the operator runs them.
- Auto-installing without operator GO: writing to ~/Library/LaunchAgents/ is destructive-write on shared state. Per feedback_silence_is_not_consent.md, explicit operator GO is required for the actual install.
- Skipping the Linux sibling entirely: punted to AC3 (out-of-scope-for-v1 with named follow-up); avoiding the trap of macOS-only assumption while not over-extending the v1 scope.
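Two of the gotchas named above, absolute paths and environment isolation, can be checked before install with a small preflight. This is a sketch under assumptions: the function names are hypothetical, and the PATH value is the minimal default that launchd jobs are commonly given, which should be confirmed on the target machine. The practical consequence is that a node binary installed under a Homebrew or nvm directory would not be found by the daemon unless PATH is set in EnvironmentVariables or node is called by absolute path.

```shell
#!/usr/bin/env bash
# Sketch only: preflight checks for two launchd gotchas.

# Assumed default PATH for launchd jobs; confirm on the target machine.
LAUNCHD_PATH="/usr/bin:/bin:/usr/sbin:/sbin"

# Succeed if the named command resolves when only LAUNCHD_PATH is available.
# env -i clears the environment to approximate launchd's isolation.
resolves_under_launchd() {
  env -i PATH="$LAUNCHD_PATH" /bin/sh -c 'command -v "$1" >/dev/null 2>&1' _ "$1"
}

# Succeed only for absolute paths.
is_absolute() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Report every problem found; exit status is non-zero if any check failed.
# Usage: preflight <path-to-swarm-heartbeat.sh>
preflight() {
  local script="$1" status=0
  is_absolute "$script" || { echo "not an absolute path: $script"; status=1; }
  resolves_under_launchd node || {
    echo "node does not resolve under PATH=$LAUNCHD_PATH"
    status=1
  }
  return "$status"
}
```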
Related
- Parent epic: #10671 (substrate-restart recovery; 9 subs CLOSED, integration outstanding)
- Related sibling tickets: #10396 (heartbeat wakeup), #10399 (mutual-idle stall mitigation), #10633 (AllAgentIdle cycle_id derivation) — all RESOLVED_BY_RELATED-#10671 per my recent comments; their gap is what THIS sub addresses
- Adjacent observability: #10779 (features.dream healthcheck) + sibling proposal for features.wake healthcheck (gate-state + daemon-running-state + last-pulse timestamp); could be sibling-filed once this lands
- Operational discipline: #10780 (backup-first before DreamMode/Sandman invocation) — applies before any controlled DreamMode test
- Empirical anchor: session 2026-05-05 — multiple agent-mutual-idle-stall events; @tobiu's "12h sleep gates velocity" framing 19:48Z makes night-shift readiness load-bearing for swarm-velocity, not polish
Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7
Retrieval Hint: query_raw_memories(query="persistent-process management launchd plist swarm-heartbeat daemon night-shift 10671 epic-finish")