Context
Surfaced 2026-05-07 during PR #10930 Cycle 1 review by @neo-gemini-3-1-pro (commentId IC_kwDODSospM8AAAABBlU19Q). PR #10930 added the features.wake healthcheck observability block per #10783 — it hardcodes the daemon-liveness stale-threshold to 10 * 60 * 1000 (2× the default POLL_INTERVAL=300s from swarm-heartbeat.sh).
swarm-heartbeat.sh reads its POLL_INTERVAL from the environment (POLL_INTERVAL=${POLL_INTERVAL:-300}). If an operator overrides POLL_INTERVAL=900 (15 min cadence), the hardcoded 10-minute observability threshold in HealthService.HEARTBEAT_LIVENESS_STALE_MS becomes incorrectly tight — the healthcheck would flag a properly-running daemon as daemonRunning: false simply because the next pulse hasn't fired yet.
The Problem
HealthService.HEARTBEAT_LIVENESS_STALE_MS is decoupled from the substrate's actual polling cadence. The contract is "daemon is alive if last pulse < 2× POLL_INTERVAL ago" — but the implementation pins POLL_INTERVAL = 300s without consulting the env. Operator-side override of POLL_INTERVAL silently breaks observability correctness.
This is a defensive-correctness gap, not a runtime regression — at default POLL_INTERVAL=300s the math works out exactly (10min stale threshold = 2 × 5min). The divergence only surfaces when an operator explicitly raises POLL_INTERVAL.
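To make the failure mode concrete, here is a small sketch of the arithmetic, assuming a healthy daemon whose last pulse landed 11 minutes ago. The constant mirrors the hardcoded value described in this ticket; isAliveHardcoded and isAliveCoupled are hypothetical names used only for illustration and do not exist in HealthService.mjs.

```javascript
// Hardcoded threshold as described in this ticket: 2 x the default 300 s cadence.
const HEARTBEAT_LIVENESS_STALE_MS = 10 * 60 * 1000;

// Hypothetical helper: liveness check against the hardcoded threshold.
function isAliveHardcoded(ageMs) {
    return ageMs < HEARTBEAT_LIVENESS_STALE_MS;
}

// Hypothetical helper: liveness check against 2 x the actual polling cadence.
function isAliveCoupled(ageMs, pollIntervalSeconds) {
    return ageMs < 2 * pollIntervalSeconds * 1000;
}

const elevenMinutesMs = 11 * 60 * 1000;

// Default cadence (300 s): both checks agree that an 11-minute-old pulse is stale.
console.log(isAliveHardcoded(elevenMinutesMs));    // false
console.log(isAliveCoupled(elevenMinutesMs, 300)); // false

// Operator override POLL_INTERVAL=900: the daemon is still inside its 15-minute cycle,
// so only the coupled check reports it as alive.
console.log(isAliveCoupled(elevenMinutesMs, 900)); // true
```

With the override in place, the hardcoded check still returns false for this pulse, which is the false negative the ticket describes.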
The Architectural Reality
- ai/scripts/swarm-heartbeat.sh:10 sets POLL_INTERVAL=${POLL_INTERVAL:-300} # 5 minutes default
- ai/mcp/server/memory-core/services/HealthService.mjs (post-PR-#10930) declares const HEARTBEAT_LIVENESS_STALE_MS = 10 * 60 * 1000;
- The coupling intent is documented in HealthService.mjs as "POLL_INTERVAL default × 2" but the math is hardcoded, not computed.
The Fix
Replace the module-scope HEARTBEAT_LIVENESS_STALE_MS constant with a function that reads POLL_INTERVAL from process.env at call time:
function heartbeatLivenessStaleMs() {
    const pollIntervalSeconds = Number(process.env.POLL_INTERVAL) || 300;
    return 2 * pollIntervalSeconds * 1000;
}
Function-call-time read (rather than module-load-time) preserves test-isolation behavior — specs that override POLL_INTERVAL for stalled-daemon scenarios still get the env value at the moment buildWakeFeaturesBlock runs.
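The difference between the two read strategies can be sketched as follows. This is an illustration under stated assumptions, not the HealthService implementation: STALE_MS_AT_LOAD is a hypothetical name standing in for a module-scope constant, and the env override is simulated inline rather than inside a real spec.

```javascript
// Module-load-time read: the value is captured once, when this line executes.
// STALE_MS_AT_LOAD is a hypothetical name used only for this comparison.
const STALE_MS_AT_LOAD = 2 * (Number(process.env.POLL_INTERVAL) || 300) * 1000;

// Call-time read, as proposed in the fix: the env is consulted on every call.
function heartbeatLivenessStaleMs() {
    const pollIntervalSeconds = Number(process.env.POLL_INTERVAL) || 300;
    return 2 * pollIntervalSeconds * 1000;
}

// Simulate a spec that overrides the cadence after the module has loaded.
const previous = process.env.POLL_INTERVAL;
process.env.POLL_INTERVAL = '900';

const staleAtLoad = STALE_MS_AT_LOAD;           // still reflects the pre-override env
const staleAtCall = heartbeatLivenessStaleMs(); // 1800000 (2 x 900 s)

// Restore the environment so later code sees the original value.
if (previous === undefined) {
    delete process.env.POLL_INTERVAL;
} else {
    process.env.POLL_INTERVAL = previous;
}

console.log({ staleAtLoad, staleAtCall });
```

One property of the `|| 300` fallback worth knowing: an unset, empty, non-numeric, or zero POLL_INTERVAL all resolve to the 300 s default, while a negative value passes through and yields a negative threshold. Whether that case deserves a guard is not something this ticket specifies.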
Then in buildWakeFeaturesBlock:
livenessBlock = {
    daemonRunning : ageMs < heartbeatLivenessStaleMs(),
    ...
};
Update the relevant test cases in the #10783 describe block of HealthService.spec.mjs so that each test which assumes a particular threshold also sets the POLL_INTERVAL env var explicitly.
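One way such a spec could pin the env per test is sketched below. It is framework-agnostic on purpose, since this ticket does not show how HealthService.spec.mjs is structured; withPollInterval is a hypothetical helper, and the liveness comparison is inlined instead of calling the real buildWakeFeaturesBlock.

```javascript
// Function from the fix above, repeated so this sketch is self-contained.
function heartbeatLivenessStaleMs() {
    const pollIntervalSeconds = Number(process.env.POLL_INTERVAL) || 300;
    return 2 * pollIntervalSeconds * 1000;
}

// Hypothetical helper: run fn with POLL_INTERVAL set (or unset when value is
// undefined), then restore whatever was there before, even if fn throws.
function withPollInterval(value, fn) {
    const previous = process.env.POLL_INTERVAL;
    if (value === undefined) {
        delete process.env.POLL_INTERVAL;
    } else {
        process.env.POLL_INTERVAL = String(value);
    }
    try {
        return fn();
    } finally {
        if (previous === undefined) {
            delete process.env.POLL_INTERVAL;
        } else {
            process.env.POLL_INTERVAL = previous;
        }
    }
}

const elevenMinutesMs = 11 * 60 * 1000;

// Default cadence: threshold is 10 min, so an 11-minute-old pulse is stale.
const aliveAtDefault = withPollInterval(undefined, () => elevenMinutesMs < heartbeatLivenessStaleMs());

// Overridden cadence: threshold is 30 min, so the same pulse counts as alive.
const aliveAt900 = withPollInterval(900, () => elevenMinutesMs < heartbeatLivenessStaleMs());

console.log({ aliveAtDefault, aliveAt900 }); // { aliveAtDefault: false, aliveAt900: true }
```

Restoring the previous value in a finally block keeps one test's override from leaking into the next, which matters once the threshold depends on process.env.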
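Acceptance Criteria
- HEARTBEAT_LIVENESS_STALE_MS const replaced with a heartbeatLivenessStaleMs() function that reads POLL_INTERVAL from process.env
- buildWakeFeaturesBlock calls the function at decision time (not cached at module load)
- #10783 spec cases continue to pass (default POLL_INTERVAL=300 → 10min threshold)
- POLL_INTERVAL=900 → 30min threshold; backdated liveness file at 11min still reports daemonRunning: true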
Out of Scope
- Changing POLL_INTERVAL default semantics (300s): substrate convention preserved
- Adding a separate NEO_HEARTBEAT_LIVENESS_STALE_MS operator-override env var: YAGNI; the substrate's POLL_INTERVAL is the single source of truth
- Changing the "2×" multiplier: a single missed-pulse buffer is the documented intent
Avoided Traps
- Hardcoding POLL_INTERVAL in HealthService as a duplicate source of truth: rejected because an operator-side override silently breaks observability. The substrate (swarm-heartbeat.sh) is the source of truth; observability must consult it.
- Module-load-time env read: rejected because test isolation requires per-call env override. Function-call-time read mirrors the existing wakeSafetyGate.gateFilePath() and heartbeatAlivePath() patterns.
- Adding NEO_HEARTBEAT_LIVENESS_STALE_MS env override: rejected as YAGNI. The substrate already has POLL_INTERVAL as the operator-override surface; introducing a parallel observability-only override fragments configuration.
Related
- Triggering review: PR #10930 Cycle 1 commentId IC_kwDODSospM8AAAABBlU19Q — @neo-gemini-3-1-pro Approve+Follow-Up
- Predecessor: #10783 (features.wake healthcheck block); this ticket polishes one edge case from the originating implementation
- Substrate source: ai/scripts/swarm-heartbeat.sh POLL_INTERVAL env contract
- Implementation site: ai/mcp/server/memory-core/services/HealthService.mjs HEARTBEAT_LIVENESS_STALE_MS const
- Adjacent context: parent epic #10671 (substrate-restart recovery)
Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571
Retrieval Hint: query_raw_memories(query="HEARTBEAT_LIVENESS_STALE_MS POLL_INTERVAL coupling features wake healthcheck #10783 #10930")