Frontmatter
id: 10931
title: Track POLL_INTERVAL env var in features.wake stale-threshold
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-4-7
createdAt: May 7, 2026, 11:20 PM
updatedAt: May 9, 2026, 11:15 PM
githubUrl: https://github.com/neomjs/neo/issues/10931
author: neo-opus-4-7
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 7, 2026, 11:39 PM

Track POLL_INTERVAL env var in features.wake stale-threshold

neo-opus-4-7 commented on May 7, 2026, 11:20 PM

Context

Surfaced 2026-05-07 during PR #10930 Cycle 1 review by @neo-gemini-3-1-pro (commentId IC_kwDODSospM8AAAABBlU19Q). PR #10930 added the features.wake healthcheck observability block per #10783 — it hardcodes the daemon-liveness stale-threshold to 10 * 60 * 1000 (2× the default POLL_INTERVAL=300s from swarm-heartbeat.sh).

swarm-heartbeat.sh reads its POLL_INTERVAL from the environment (POLL_INTERVAL=${POLL_INTERVAL:-300}). If an operator overrides POLL_INTERVAL=900 (15 min cadence), the hardcoded 10-minute observability threshold in HealthService.HEARTBEAT_LIVENESS_STALE_MS becomes incorrectly tight — the healthcheck would flag a properly-running daemon as daemonRunning: false simply because the next pulse hasn't fired yet.

The Problem

HealthService.HEARTBEAT_LIVENESS_STALE_MS is decoupled from the substrate's actual polling cadence. The contract is "daemon is alive if last pulse < 2× POLL_INTERVAL ago" — but the implementation pins POLL_INTERVAL = 300s without consulting the env. Operator-side override of POLL_INTERVAL silently breaks observability correctness.

This is a defensive-correctness gap, not a runtime regression — at default POLL_INTERVAL=300s the math works out exactly (10min stale threshold = 2 × 5min). The divergence only surfaces when an operator explicitly raises POLL_INTERVAL.
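
The divergence can be made concrete with a small sketch. Only the 10-minute constant, the 2× rule, and the `ageMs < threshold` comparison come from this ticket; the helper names `coupledStaleMs`, `isDaemonRunning`, and `minutes` are hypothetical and exist only for illustration.

```javascript
// The threshold as hardcoded in PR #10930 (10 minutes).
const HARDCODED_STALE_MS = 10 * 60 * 1000;

// Hypothetical helper: the documented contract, 2x POLL_INTERVAL in milliseconds.
function coupledStaleMs(pollIntervalSeconds) {
    return 2 * pollIntervalSeconds * 1000;
}

// Hypothetical helper mirroring the daemonRunning comparison.
function isDaemonRunning(ageMs, staleMs) {
    return ageMs < staleMs;
}

const minutes = n => n * 60 * 1000;

// Operator runs the heartbeat every 15 minutes; the last pulse was 11 minutes ago.
const pollIntervalSeconds = 900;
const ageMs = minutes(11);

// Hardcoded threshold: 11 min >= 10 min, so a healthy daemon is reported as down.
const hardcodedVerdict = isDaemonRunning(ageMs, HARDCODED_STALE_MS); // false

// Coupled threshold: 11 min < 30 min, so the daemon is reported as running.
const coupledVerdict = isDaemonRunning(ageMs, coupledStaleMs(pollIntervalSeconds)); // true

console.log({hardcodedVerdict, coupledVerdict});
```

At the default cadence both thresholds agree, since `coupledStaleMs(300)` equals the hardcoded 600000 ms.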

The Architectural Reality

  • ai/scripts/swarm-heartbeat.sh:10 sets POLL_INTERVAL=${POLL_INTERVAL:-300} # 5 minutes default
  • ai/mcp/server/memory-core/services/HealthService.mjs (post-PR-#10930) — const HEARTBEAT_LIVENESS_STALE_MS = 10 * 60 * 1000;
  • The coupling intent is documented in HealthService.mjs as "POLL_INTERVAL default × 2" but the math is hardcoded, not computed.

The Fix

Replace the module-scope HEARTBEAT_LIVENESS_STALE_MS constant with a function that reads POLL_INTERVAL from process.env at call time:

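function heartbeatLivenessStaleMs() {
    // Read at call time so specs can override POLL_INTERVAL per test.
    // Unset, empty, non-numeric, or zero values fall back to the 300s default.
    const pollIntervalSeconds = Number(process.env.POLL_INTERVAL) || 300;
    // Stale threshold = 2× POLL_INTERVAL, converted from seconds to milliseconds.
    return 2 * pollIntervalSeconds * 1000;
}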

Function-call-time read (rather than module-load-time) preserves test-isolation behavior — specs that override POLL_INTERVAL for stalled-daemon scenarios still get the env value at the moment buildWakeFeaturesBlock runs.
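
A minimal sketch of why the read timing matters, assuming a spec sets POLL_INTERVAL after the module has already been loaded. The `staleMsAtLoad` constant is a hypothetical stand-in for a module-scope read; the function body is the one proposed above.

```javascript
// Simulate the default environment at module load.
delete process.env.POLL_INTERVAL;

// Hypothetical module-load-time read: the value is frozen when this line runs.
const staleMsAtLoad = 2 * (Number(process.env.POLL_INTERVAL) || 300) * 1000; // 600000

// Call-time read, as proposed in this ticket.
function heartbeatLivenessStaleMs() {
    const pollIntervalSeconds = Number(process.env.POLL_INTERVAL) || 300;
    return 2 * pollIntervalSeconds * 1000;
}

// A spec overrides the env after load (process.env stores values as strings).
process.env.POLL_INTERVAL = '900';

// The function observes the override; the frozen constant does not.
const afterOverride = heartbeatLivenessStaleMs(); // 1800000

console.log({staleMsAtLoad, afterOverride});

// Restore the environment so later code sees the default again.
delete process.env.POLL_INTERVAL;
```

With a module-scope constant, the spec would have to set the env before importing HealthService.mjs, which is what the call-time read avoids.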

Then in buildWakeFeaturesBlock:

livenessBlock = {
    daemonRunning        : ageMs < heartbeatLivenessStaleMs(),
    ...
};

Update the relevant test cases in the #10783 describe block of HealthService.spec.mjs to set the POLL_INTERVAL env var explicitly wherever a test assumes a particular threshold.

Acceptance Criteria

  • (AC1) HEARTBEAT_LIVENESS_STALE_MS const replaced with heartbeatLivenessStaleMs() function that reads POLL_INTERVAL from process.env
  • (AC2) buildWakeFeaturesBlock calls the function at decision time (not cached at module load)
  • (AC3) Existing #10783 spec cases continue to pass (default POLL_INTERVAL=300 → 10min threshold)
  • (AC4) New spec case: POLL_INTERVAL=900 → 30min threshold; backdated liveness file at 11min still reports daemonRunning: true
  • (AC5) JSDoc on the function documents the coupling contract: "stale threshold = 2× POLL_INTERVAL (substrate convention)"
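
AC3 and AC4 could be exercised along the following lines. This is a framework-agnostic sketch, not the actual HealthService.spec.mjs code: `withPollInterval` is a hypothetical helper, the function is copied from the proposal above, and the backdated liveness file is reduced to a plain `ageMs` value.

```javascript
// Function under test, copied from the proposed fix.
function heartbeatLivenessStaleMs() {
    const pollIntervalSeconds = Number(process.env.POLL_INTERVAL) || 300;
    return 2 * pollIntervalSeconds * 1000;
}

// Hypothetical spec helper: run fn with POLL_INTERVAL set (or unset when value
// is undefined), then restore whatever was there before.
function withPollInterval(value, fn) {
    const previous = process.env.POLL_INTERVAL;
    if (value === undefined) delete process.env.POLL_INTERVAL;
    else process.env.POLL_INTERVAL = String(value);
    try {
        return fn();
    } finally {
        if (previous === undefined) delete process.env.POLL_INTERVAL;
        else process.env.POLL_INTERVAL = previous;
    }
}

const minutes = n => n * 60 * 1000;

// AC3: with no override, the threshold stays at 10 minutes.
const defaultThreshold = withPollInterval(undefined, heartbeatLivenessStaleMs);

// AC4: POLL_INTERVAL=900 gives a 30-minute threshold, so a pulse backdated
// by 11 minutes still counts as a running daemon.
const raisedThreshold = withPollInterval(900, heartbeatLivenessStaleMs);
const daemonRunning = withPollInterval(900, () => minutes(11) < heartbeatLivenessStaleMs());

console.log({defaultThreshold, raisedThreshold, daemonRunning});
```

Restoring the previous value in `finally` keeps one case's override from leaking into the next, which matters because `process.env` is shared across the whole test process.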

Out of Scope

  • Changing POLL_INTERVAL default semantics (300s) — substrate convention preserved
  • Adding a separate NEO_HEARTBEAT_LIVENESS_STALE_MS operator-override env var — YAGNI; the substrate's POLL_INTERVAL is the single source of truth
  • Changing the "2×" multiplier — single missed-pulse buffer is the documented intent

Avoided Traps

  • Hardcoding POLL_INTERVAL in HealthService as a duplicate source of truth: rejected because an operator-side override silently breaks observability. The substrate (swarm-heartbeat.sh) is the source of truth; observability must consult it.
  • Module-load-time env read: rejected because test isolation requires per-call env override. Function-call-time read mirrors the existing wakeSafetyGate.gateFilePath() and heartbeatAlivePath() patterns.
  • Adding NEO_HEARTBEAT_LIVENESS_STALE_MS env override: rejected as YAGNI. The substrate already has POLL_INTERVAL as the operator-override surface; introducing a parallel observability-only override fragments configuration.

Related

  • Triggering review: PR #10930 Cycle 1 commentId IC_kwDODSospM8AAAABBlU19Q — @neo-gemini-3-1-pro Approve+Follow-Up
  • Predecessor: #10783 (features.wake healthcheck block) — this ticket polishes one edge case from the originating implementation
  • Substrate source: ai/scripts/swarm-heartbeat.sh POLL_INTERVAL env contract
  • Implementation site: ai/mcp/server/memory-core/services/HealthService.mjs HEARTBEAT_LIVENESS_STALE_MS const
  • Adjacent context: parent epic #10671 (substrate-restart recovery)

Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571

Retrieval Hint: query_raw_memories(query="HEARTBEAT_LIVENESS_STALE_MS POLL_INTERVAL coupling features wake healthcheck #10783 #10930")

tobiu closed this issue on May 7, 2026, 11:39 PM
tobiu referenced in commit 21d5cdd - "feat(memory-core): healthcheck features.wake observability block (#10783) (#10930)" on May 7, 2026, 11:39 PM