Frontmatter
id: 10781
title: Persistent-process management for swarm-heartbeat.sh daemon (#10671 epic-finish piece)
state: Closed
labels: documentation, enhancement, ai, architecture
assignees: neo-opus-4-7
createdAt: May 5, 2026, 9:52 PM
updatedAt: May 5, 2026, 11:08 PM
githubUrl: https://github.com/neomjs/neo/issues/10781
author: neo-opus-4-7
commentsCount: 1
parentIssue: 10671
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 5, 2026, 11:08 PM

Persistent-process management for swarm-heartbeat.sh daemon (#10671 epic-finish piece)

neo-opus-4-7 commented on May 5, 2026, 9:52 PM

Context

Sub-issue of Epic #10671 (substrate-restart recovery; 9 prior subs CLOSED). An investigation on 2026-05-05, prompted by @tobiu's hint about DreamMode inactivity, surfaced the integration gap that prevents the heartbeat substrate from running autonomously: ai/scripts/swarm-heartbeat.sh is structured as a continuous-loop daemon (while true with sleep $POLL_INTERVAL, at lines 118 and 256) but has no persistent-process-management mechanism. There is no launchd plist, no systemd unit, and no npm script that keeps it alive between operator sessions or across system reboots.

@tobiu's framing 2026-05-05 19:48Z: "our velocity is gated by me, since as a human i need sleep. e.g., if i am online for 12h a day, we miss 12h of progress. this is a strong push for nightshift mode."

The full night-shift readiness path is: persistent-process-management (THIS sub) + wakeSafetyGate untrip (operator-territory CLI) + end-to-end validation (operator-territory test run). This sub addresses the first piece.

The Problem

Without persistent-process-management:

  • swarm-heartbeat.sh only runs when an operator manually invokes bash ai/scripts/swarm-heartbeat.sh and keeps the terminal open
  • Operator-sleep / system-reboot / terminal-close all kill the daemon → no sweep activity → no autonomous heartbeat-driven recovery
  • Even if wakeSafetyGate were untripped (operator-action), the daemon needs to be RUNNING for the gate to be consulted at all
  • Empirical state today: zero sweep-error log activity (.neo-ai-data/wake-daemon/sweep-errors.log doesn't exist); checkSunsetted.mjs autonomous runs never fire

The mutual-idle-stall pattern (#10399) cannot be auto-resolved while the daemon isn't running. This was empirically demonstrated multiple times in session 2026-05-05 (3-of-3 agents idling simultaneously; @tobiu prompted recovery each time).

The Architectural Reality

  • ai/scripts/swarm-heartbeat.sh — continuous-loop bash daemon; reads POLL_INTERVAL=300 (5 min) env var; concurrency-locked via .neo-ai-data/heartbeat-concurrency.lock (TTL 600s); persistent sweep-error log at .neo-ai-data/wake-daemon/sweep-errors.log
  • ai/scripts/checkSunsetted.mjs — predicate consumed by the daemon; emits sunset vs idle_out_candidate signals (#10673)
  • ai/scripts/resumeHarness.mjs — recovery dispatcher; consumes daemon output (#10675/#10676)
  • ai/scripts/wakeSafetyGate.mjs — fail-closed gate; daemon consults via node wakeSafetyGate.mjs check before any high-authority action (#10648)
  • Repo precedent: zero existing .plist or .service files committed; this sub authors the FIRST persistent-process-management substrate
  • Operator environment: tobiasuhlig@wheel ownership in ls output suggests primary deployment is macOS (launchd is the native primitive); Linux deployments would need a sibling systemd-unit shape
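
The loop-plus-lock shape described in the list above can be sketched roughly as follows. This is an illustration, not the contents of swarm-heartbeat.sh: the helper names file_mtime, lock_is_stale, and heartbeat_loop, and the sweep command passed in, are hypothetical. Only the lock path, the 600s TTL, and the POLL_INTERVAL default of 300 come from this ticket.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; NOT the actual swarm-heartbeat.sh.

POLL_INTERVAL="${POLL_INTERVAL:-300}"   # seconds between sweeps (ticket default)
LOCK_FILE="${LOCK_FILE:-.neo-ai-data/heartbeat-concurrency.lock}"
LOCK_TTL="${LOCK_TTL:-600}"             # seconds before a lock counts as stale

# Print a file's modification time as epoch seconds.
# Tries GNU stat first, then falls back to BSD/macOS stat.
file_mtime() {
    stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# Succeed when the lock is missing or older than the TTL (hypothetical helper).
lock_is_stale() {
    local lock="$1" ttl="$2" now mtime
    [ -e "$lock" ] || return 0
    now=$(date +%s)
    mtime=$(file_mtime "$lock") || return 1
    [ $(( now - mtime )) -gt "$ttl" ]
}

# Hypothetical loop: take the lock, run one sweep, release the lock, sleep.
heartbeat_loop() {
    local sweep_cmd="$1"
    while true; do
        if lock_is_stale "$LOCK_FILE" "$LOCK_TTL"; then
            mkdir -p "$(dirname "$LOCK_FILE")"
            touch "$LOCK_FILE"
            "$sweep_cmd" || echo "sweep failed" >&2
            rm -f "$LOCK_FILE"
        fi
        sleep "$POLL_INTERVAL"
    done
}
```

Nothing runs when the file is sourced; heartbeat_loop only starts when it is called with a sweep command. A process manager such as launchd would supervise the outer process, while the lock guards against two sweeps overlapping.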

The Fix

Two-deliverable shape — agent-side authoring (read-only / committed-template work) + operator-side installation (destructive write to ~/Library/LaunchAgents/).

Agent-side deliverables (this ticket scope)

  1. learn/agentos/wake-substrate/com.neomjs.swarm-heartbeat.plist.template — committed template file with:

    • Standard plist boilerplate (XML + DOCTYPE)
    • Label = com.neomjs.swarm-heartbeat
    • ProgramArguments = [/bin/bash, /full/path/to/swarm-heartbeat.sh] (operator substitutes the full repo path; template uses placeholder)
    • RunAtLoad = true (start daemon on login / reboot)
    • KeepAlive = true (auto-restart if daemon crashes)
    • StandardOutPath + StandardErrorPath pointing into .neo-ai-data/wake-daemon/
    • EnvironmentVariables block with NEO_AGENT_IDENTITY placeholder
    • Inline comments naming the operator-substitution points + safety considerations
  2. learn/agentos/wake-substrate/PersistentProcessManagement.md — operator-doc covering:

    • macOS launchd installation procedure (launchctl bootstrap gui/$(id -u))
    • macOS launchd uninstall procedure (launchctl bootout)
    • Linux systemd-unit sibling sketch (template only; no committed .service file unless someone validates against a Linux deployment)
    • Verification: launchctl list | grep com.neomjs + sweep-log activity check
    • Troubleshooting common gotchas (relative paths, missing WorkingDirectory, environment isolation)
  3. ai/scripts/installWakeSubstrateLaunchd.mjs (optional polish, scope-extension) — Node.js script that customizes the plist template with the local repo path + writes to ~/Library/LaunchAgents/. Operator runs once. Out-of-scope for v1 if the manual procedure is short enough.
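
The template in deliverable 1 could look roughly like the output of the helper below, which fills in the operator-substitution points for one checkout. This is an unverified draft in the spirit of AC6: it has not been validated on a real macOS install. The helper name render_plist, the two log file names, and the WorkingDirectory key (added because the troubleshooting list names a missing WorkingDirectory as a gotcha) are assumptions, not part of the ticket.

```shell
#!/usr/bin/env bash
# Sketch of the plist template rendered for one checkout.
# Unverified draft: not validated against an actual macOS install.

# render_plist REPO_PATH AGENT_IDENTITY   (hypothetical helper name)
render_plist() {
    local repo="$1" identity="$2"
    case "$repo" in
        /*) ;;  # the ticket lists path-must-be-absolute as a launchd gotcha
        *)  echo "repo path must be absolute: $repo" >&2; return 1 ;;
    esac
    # Note: values are inserted as-is; XML-special characters are not escaped.
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.neomjs.swarm-heartbeat</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/bash</string>
        <string>${repo}/ai/scripts/swarm-heartbeat.sh</string>
    </array>
    <key>WorkingDirectory</key>
    <string>${repo}</string>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>${repo}/.neo-ai-data/wake-daemon/launchd-stdout.log</string>
    <key>StandardErrorPath</key>
    <string>${repo}/.neo-ai-data/wake-daemon/launchd-stderr.log</string>
    <key>EnvironmentVariables</key>
    <dict>
        <key>NEO_AGENT_IDENTITY</key>
        <string>${identity}</string>
    </dict>
</dict>
</plist>
EOF
}
```

An operator could render it with something like render_plist "$(pwd)" my-agent > /tmp/com.neomjs.swarm-heartbeat.plist and, on macOS, check the syntax with plutil -lint before installing anything.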

Operator-side actions (out of scope for THIS ticket; tracked in #10671 epic-finish)

  • Verify the plist on operator's actual macOS install (empirical syntax verification)
  • Run launchctl bootstrap to install
  • Untrip wakeSafetyGate after end-to-end validation
  • Establish backup-first discipline before re-enabling DreamMode/Sandman (per #10780)
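
For orientation, the operator-side install and uninstall steps might be wrapped as shown below. This is a sketch under assumptions: the function names and the APPLY=1 switch are hypothetical, and the plist is assumed to have been copied to ~/Library/LaunchAgents/ under its label name. It prints the commands by default so that nothing is written or loaded without an explicit operator GO.

```shell
#!/usr/bin/env bash
# Sketch of the operator-side install/uninstall commands.
# Dry-run by default: commands are printed, not executed,
# unless the operator sets APPLY=1 (hypothetical switch).

LABEL="com.neomjs.swarm-heartbeat"
PLIST="${HOME}/Library/LaunchAgents/${LABEL}.plist"

# Print the command; execute it only when APPLY=1.
run_or_print() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Load the agent into the current user's GUI domain.
install_agent() {
    run_or_print launchctl bootstrap "gui/$(id -u)" "$PLIST"
}

# Unload the agent again by its service target.
uninstall_agent() {
    run_or_print launchctl bootout "gui/$(id -u)/${LABEL}"
}
```

Calling install_agent without APPLY=1 only echoes the launchctl line, which keeps the destructive step a deliberate operator action.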

Acceptance Criteria

  • (AC1) learn/agentos/wake-substrate/com.neomjs.swarm-heartbeat.plist.template committed with all required keys + inline operator-substitution comments
  • (AC2) learn/agentos/wake-substrate/PersistentProcessManagement.md committed with macOS install/uninstall/verify procedures
  • (AC3) Linux systemd sibling: either committed template OR explicitly named as out-of-scope-for-v1 with cross-link to a future ticket
  • (AC4) Cross-reference from learn/agentos/DreamPipeline.md (and tooling/WakeSubstrateIncidentProtocol.md) to the new doc so operators discovering DreamMode/Sandman issues find the persistent-process-management substrate
  • (AC5) Empirical-validation procedure documented but NOT executed in this PR (operator-territory): "after install, verify daemon-running via launchctl list | grep com.neomjs; verify sweep activity via .neo-ai-data/wake-daemon/sweep-errors.log mtime advancing"
  • (AC6) Verify-before-assert discipline: do NOT claim plist correctness without empirical operator-side install verification. The plist template in this sub is author-side draft + documented gotchas; correctness of the actual plist text is operator-side L3 verification.
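
The AC5 check ("mtime advancing") can be sketched as two samples of the log's modification time. The helper names log_mtime and mtime_advanced and the 600-second wait in the usage comment are assumptions; the log path and the launchctl list check come from the criteria above.

```shell
#!/usr/bin/env bash
# Sketch of the AC5 check: did the sweep log's mtime advance between two samples?

SWEEP_LOG="${SWEEP_LOG:-.neo-ai-data/wake-daemon/sweep-errors.log}"

# Epoch-seconds mtime, or 0 when the file does not exist yet.
# Tries GNU stat first, then falls back to BSD/macOS stat.
log_mtime() {
    [ -e "$1" ] || { echo 0; return 0; }
    stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# mtime_advanced FILE PREVIOUS_MTIME
# Succeeds when FILE is newer than the earlier sample.
mtime_advanced() {
    [ "$(log_mtime "$1")" -gt "$2" ]
}

# Operator-side usage (macOS, not executed here):
#   launchctl list | grep com.neomjs
#   before=$(log_mtime "$SWEEP_LOG"); sleep 600
#   mtime_advanced "$SWEEP_LOG" "$before" && echo "sweep activity observed"
```

Treating a missing log as mtime 0 means the first write after install counts as an advance, which matches the current state in which the log does not exist yet.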

Out of Scope

  • Actually installing the plist on operator's macOS (operator-territory)
  • Untripping wakeSafetyGate (operator-territory; separate #10671 epic-finish step)
  • End-to-end mutual-idle simulation + recovery validation (operator-territory)
  • Linux systemd .service file with empirical validation (sibling work; mentioned in AC3 as out-of-scope-for-v1)
  • Optional installWakeSubstrateLaunchd.mjs automation script (scope-extension if helpful)

Avoided Traps

  • Drafting plist text from memory without empirical verification: macOS launchd plist syntax has multiple gotchas (path-must-be-absolute, KeepAlive vs OnDemand, environment isolation). Verify-before-assert says I author the TEMPLATE + the GOTCHAS but operator empirically validates the plist on an actual macOS install. Per feedback_verify_before_assert.md discipline.
  • Bundling with operator-territory steps: the plist install + gate-untrip + validation run are operator-territory; bundling them into AC of an agent-authored sub-ticket is wrong-scope. This sub authors the deliverables; the operator runs them.
  • Auto-installing without operator GO: writing to ~/Library/LaunchAgents/ is destructive-write on shared state. Per feedback_silence_is_not_consent.md — explicit operator GO required for the actual install.
  • Skipping the Linux sibling entirely: punted to AC3 (out-of-scope-for-v1 with named follow-up); avoiding the trap of macOS-only assumption while not over-extending the v1 scope.
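
For the Linux sibling, a systemd user unit mirroring the launchd keys might look like the output of the helper below. This is a sketch only, with no claim of validation on a Linux deployment: the helper name render_systemd_unit, the Description text, and the RestartSec value are assumptions. Restart=always plays the role of KeepAlive, and WantedBy=default.target approximates RunAtLoad for a user session.

```shell
#!/usr/bin/env bash
# Sketch of a systemd user unit for the daemon. Unvalidated on Linux.

# render_systemd_unit REPO_PATH AGENT_IDENTITY   (hypothetical helper name)
render_systemd_unit() {
    local repo="$1" identity="$2"
    # Note: a repo path containing spaces would need quoting in ExecStart.
    cat <<EOF
[Unit]
Description=Neo swarm-heartbeat daemon

[Service]
Type=simple
WorkingDirectory=${repo}
ExecStart=/bin/bash ${repo}/ai/scripts/swarm-heartbeat.sh
Environment=NEO_AGENT_IDENTITY=${identity}
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
EOF
}
```

An operator could write the output to ~/.config/systemd/user/swarm-heartbeat.service and enable it with systemctl --user enable --now swarm-heartbeat.service; the unit file name here is illustrative.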

Related

  • Parent epic: #10671 (substrate-restart recovery; 9 subs CLOSED, integration outstanding)
  • Related sibling tickets: #10396 (heartbeat wakeup), #10399 (mutual-idle stall mitigation), #10633 (AllAgentIdle cycle_id derivation) — all RESOLVED_BY_RELATED-#10671 per my recent comments; their gap is what THIS sub addresses
  • Adjacent observability: #10779 (features.dream healthcheck) + sibling proposal for features.wake healthcheck (gate-state + daemon-running-state + last-pulse timestamp) — could be sibling-filed once this lands
  • Operational discipline: #10780 (backup-first before DreamMode/Sandman invocation) — applies before any controlled DreamMode test
  • Empirical anchor: session 2026-05-05 — multiple agent-mutual-idle-stall events; @tobiu's "12h sleep gates velocity" framing 19:48Z makes night-shift readiness load-bearing for swarm-velocity, not polish

Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7

Retrieval Hint: query_raw_memories(query="persistent-process management launchd plist swarm-heartbeat daemon night-shift 10671 epic-finish")

tobiu referenced in commit 21d5cdd - "feat(memory-core): healthcheck features.wake observability block (#10783) (#10930)" on May 7, 2026, 11:39 PM