Frontmatter
id: 11797
title: Validate orchestrator swarm-heartbeat wake delivery
state: Closed
labels: bug, ai, testing, regression, architecture
assignees: neo-gpt
createdAt: May 23, 2026, 1:20 AM
updatedAt: May 23, 2026, 3:35 AM
githubUrl: https://github.com/neomjs/neo/issues/11797
author: neo-gpt
commentsCount: 3
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 23, 2026, 3:35 AM

Validate orchestrator swarm-heartbeat wake delivery

neo-gpt commented on May 23, 2026, 1:20 AM

Context

Operator observation on 2026-05-23 after PR #11772 merged: dev was pulled and the Orchestrator was restarted, but no wakeup messages have been observed for @neo-opus-4-7 or @neo-gpt.

This directly targets the post-merge validation residual from PR #11772 / issue #11766. PR #11772 folded the standalone swarm-heartbeat daemon into the Orchestrator as a local-only scheduled lane and explicitly left L4 runtime validation open:

  • local maintainer checkout: confirm a running Orchestrator drives heartbeat pulses (.neo-ai-data/wake-daemon/heartbeat.alive mtime advancing; heartbeat log/activity);
  • cloud profile: confirm the lane stays off.

The operator has now supplied negative runtime evidence for the local profile: since the restart, no wake delivery has been observed.

The Problem

The merge replaced the old standalone heartbeat process with an Orchestrator-owned lane. Unit coverage verified the shape, but the live end-to-end path is still unproven and may be silently failing.

The dangerous failure mode is not just "no wake". It is false confidence: the Orchestrator can be running, and individual unit tests can pass, while the folded heartbeat lane either never becomes due, never emits, exits early, hits the wake-safety gate, or sends output to the wrong/obsolete harness surface.

Architectural Reality

Verified surfaces:

  • PR #11772 merged at 2026-05-22T17:28:26Z and resolves #11766.
  • ai/daemons/Orchestrator.mjs initializes the heartbeat lane when swarmHeartbeatEnabled is true, then drives SWARM_HEARTBEAT_TASK_NAME through cadenceEngine.runIfDue and records task outcomes.
  • ai/scripts/orchestrator-daemon.mjs resolves NEO_ORCHESTRATOR_SWARM_HEARTBEAT_INTERVAL_MS and NEO_ORCHESTRATOR_SWARM_HEARTBEAT_ENABLED from env/config.
  • ai/daemons/SwarmHeartbeatService.mjs#pulse() touches heartbeat.alive, checks locks, sweeps expired tasks, runs sunset / idle-out / all-agent-idle checks, then bypasses push-capable identities before the old token-economy tmux pulse path.

Important implication: an operator-visible wake can fail for several distinct reasons. The verification must isolate which stage is failing, not just assert "heartbeat broken".
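
One way to make the failing stage explicit is to classify a snapshot of the lane's state against an ordered list of named stages and report the first one that blocks. The sketch below is illustrative only: `PULSE_STAGES`, `firstBlockingStage`, and every snapshot field name are hypothetical, and the stage order and the assumption that each listed condition stops a pulse are guesses based on the description above, not the behavior of `SwarmHeartbeatService.mjs`.

```javascript
// Hypothetical diagnostic sketch. Field names, stage order, and which
// conditions actually stop a pulse are assumptions, not the real service.
const PULSE_STAGES = [
    {name: 'lane-enabled',        blocked: s => !s.laneEnabled},
    {name: 'cadence-due',         blocked: s => !s.due},
    {name: 'wake-safety-gate',    blocked: s => Boolean(s.wakeSafetyGateClosed)},
    {name: 'concurrency-lock',    blocked: s => Boolean(s.lockHeld)},
    {name: 'sunset',              blocked: s => Boolean(s.sunsetReached)},
    {name: 'all-agent-idle',      blocked: s => Boolean(s.allAgentsIdle)},
    {name: 'push-capable-bypass', blocked: s => Boolean(s.identityPushCapable)},
    {name: 'token-economy-gate',  blocked: s => Boolean(s.tokenEconomyGateClosed)}
];

// Returns the first stage that blocks delivery plus the stages that passed.
// stoppedAt === null means no modeled stage blocked the pulse.
function firstBlockingStage(snapshot) {
    const passed = [];

    for (const stage of PULSE_STAGES) {
        if (stage.blocked(snapshot)) {
            return {stoppedAt: stage.name, passed};
        }
        passed.push(stage.name);
    }

    return {stoppedAt: null, passed};
}

// Example: the lane is enabled and due, but the identity is push-capable.
const report = firstBlockingStage({
    laneEnabled        : true,
    due                : true,
    identityPushCapable: true
});

console.log(report.stoppedAt); // 'push-capable-bypass'
```

A report shaped like this names one stage instead of a generic "heartbeat broken", which is the distinction the verification needs.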

Duplicate Sweep

  • KB ticket search surfaced #10396 / #10399 / #10312 as older generic heartbeat / swarm-stall substrate tickets, and #10931 as a stale-threshold observability bug. None is the post-#11772 folded-Orchestrator live validation gap.
  • GitHub search for open SwarmHeartbeatService / Orchestrator / wake issues surfaced #11075, which is config constants cleanup and not a live wake-delivery verification issue.
  • #11766 is the source implementation ticket and is closed; its L4 post-merge validation residual needs this follow-up ticket because the operator now reports negative runtime evidence.

The Fix

Build a focused verification + diagnosis pass for the folded heartbeat lane:

  1. Confirm the Orchestrator actually schedules swarm-heartbeat after restart.
  2. Confirm .neo-ai-data/wake-daemon/heartbeat.alive advances on each due pulse.
  3. Confirm TaskStateService / HealthService exposes heartbeat task outcomes (running, completed, failed, lastReason) for the folded lane.
  4. Force a short-interval local run using NEO_ORCHESTRATOR_SWARM_HEARTBEAT_INTERVAL_MS and explicit NEO_ORCHESTRATOR_SWARM_HEARTBEAT_ENABLED=true, then capture whether the lane reaches each pulse() branch.
  5. Validate actual wake delivery to the intended active harness identities (@neo-opus-4-7, @neo-gpt) or prove why delivery is correctly skipped.
  6. Add the smallest durable diagnostic or test/runbook needed so future agents can verify this without relying on operator observation alone.
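
Steps 2 and 4 can be checked without operator observation by sampling the mtime of `heartbeat.alive` at a spacing slightly longer than the configured interval. The following is a minimal sketch under stated assumptions: `countMtimeAdvances` and `sampleHeartbeatMtime` are hypothetical helper names, not part of the repository, and the caller chooses the sample count and spacing.

```javascript
// Hypothetical helpers for measuring heartbeat.alive mtime advancement.

// Counts how many consecutive sample pairs show a strictly newer mtime.
function countMtimeAdvances(samplesMs) {
    let advances = 0;

    for (let i = 1; i < samplesMs.length; i++) {
        if (samplesMs[i] > samplesMs[i - 1]) {
            advances++;
        }
    }

    return advances;
}

// Reads the file's mtime `sampleCount` times, waiting `waitMs` between reads.
// Rejects (for example with ENOENT) if the file does not exist.
async function sampleHeartbeatMtime(filePath, sampleCount, waitMs) {
    // Dynamic import works from both ES modules and CommonJS.
    const {stat} = await import('node:fs/promises');
    const samples = [];

    for (let i = 0; i < sampleCount; i++) {
        if (i > 0) {
            await new Promise(resolve => setTimeout(resolve, waitMs));
        }
        samples.push((await stat(filePath)).mtimeMs);
    }

    return samples;
}
```

With three samples spaced a little over one interval apart, `countMtimeAdvances(samples) >= 2` would correspond to mtime advancing across at least two due pulses. A missing file surfaces as a rejected promise, which is itself a useful signal that the lane never pulsed.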

Acceptance Criteria

  • AC1 — Local-profile Orchestrator run proves swarm-heartbeat is enabled, due, and invoked after restart.
  • AC2 — heartbeat.alive mtime advancement is measured across at least two due pulses.
  • AC3 — Orchestrator task outcome state for swarm-heartbeat is inspected and records success/failure reason rather than staying silent.
  • AC4 — Wake-safety gate state, concurrency-lock state, push-capable bypass, all-agent-idle result, sunset result, and token-economy gate are each checked or logged for the failing run.
  • AC5 — At least one controlled wake-delivery path is proven end-to-end for an intended active harness identity, OR a concrete blocker is identified and converted into a narrow follow-up bug.
  • AC6 — Cloud-profile negative assertion remains true: NEO_AI_DEPLOYMENT_MODE=cloud leaves the local-only heartbeat lane off by default.
  • AC7 — Add or update focused regression coverage / diagnostic runbook so the folded-lane post-merge validation can be repeated without guesswork.
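
AC1 and AC6 both depend on how the enabled flag is resolved. The sketch below shows one plausible resolution order so the cloud-profile negative assertion can be written as a plain test. It is hypothetical throughout: the real logic lives in `ai/scripts/orchestrator-daemon.mjs` and also consults config, which is not modeled here. The string values `'true'` / `'false'`, the precedence of the explicit flag over deployment mode, and the local default of "on" are all assumptions.

```javascript
// Hypothetical resolution sketch; not the orchestrator-daemon.mjs logic.
// Assumes an explicit env flag wins and that non-cloud profiles default to on.
function resolveSwarmHeartbeatEnabled(env) {
    const explicit = env.NEO_ORCHESTRATOR_SWARM_HEARTBEAT_ENABLED;

    if (explicit === 'true') {
        return {enabled: true, source: 'env'};
    }

    if (explicit === 'false') {
        return {enabled: false, source: 'env'};
    }

    const isCloud = env.NEO_AI_DEPLOYMENT_MODE === 'cloud';

    return {
        enabled: !isCloud,
        source : isCloud ? 'cloud-default' : 'local-default'
    };
}

// Cloud profile with no explicit flag: the lane stays off.
console.log(resolveSwarmHeartbeatEnabled({NEO_AI_DEPLOYMENT_MODE: 'cloud'}));
```

Returning the `source` alongside the boolean lets a diagnostic log state why the lane is on or off, rather than only that it is.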

Out of Scope

  • Rewriting the heartbeat architecture again before the live failure stage is isolated.
  • Reopening the retired standalone swarm-heartbeat.sh / launchd path as the primary fix.
  • Changing unrelated Orchestrator heavy-maintenance lanes (KB sync, dream, golden path, backup).
  • Solving #11075 config-constant cleanup.

Avoided Traps

| Trap | Why rejected |
| --- | --- |
| Treating "Orchestrator process is running" as proof | The folded heartbeat lane can be disabled, not due, early-returning, or failing after Orchestrator start. |
| Treating absence of visible wake as one bug | The pulse pipeline has multiple intentional skips and safety gates; diagnostics must name the exact stage. |
| Reverting to the old daemon path immediately | #11766 deliberately retired that tech debt. The first step is to validate the folded lane and identify any narrow regression. |
| Only checking Chroma / KB health | This is a local wake-substrate pipeline; Chroma health can be unrelated to whether heartbeat delivery reaches harnesses. |

Related

  • #11766 — source issue, closed by PR #11772.
  • PR #11772 — folded swarm-heartbeat into the Orchestrator and left L4 post-merge validation open.
  • #11730 — v13 post-MVP residual workstreams parent.
  • #10601 / #10399 — older auto-wakeup / swarm-stall substrate history.
  • #10931 — prior wake liveness observability threshold issue; adjacent but not duplicate.

Origin Session ID: d60db68f-8ff0-48a6-b168-237ca9dca2a0

Handoff Retrieval Hint: query_raw_memories("orchestrator swarm-heartbeat folded lane no wake after restart #11772 #11766")

tobiu referenced in commit f566c0a - "fix(ai): normalize swarm heartbeat identity (#11797) (#11804)" on May 23, 2026, 2:35 AM