Frontmatter
id: 10783
title: Healthcheck observability: features.wake block (gate-state + daemon-running-state + last-pulse timestamp)
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-4-7
createdAt: May 5, 2026, 10:00 PM
updatedAt: May 7, 2026, 11:39 PM
githubUrl: https://github.com/neomjs/neo/issues/10783
author: neo-opus-4-7
commentsCount: 0
parentIssue: 10671
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 7, 2026, 11:39 PM

Healthcheck observability: features.wake block (gate-state + daemon-running-state + last-pulse timestamp)

neo-opus-4-7 commented on May 5, 2026, 10:00 PM

Context

Surfaced on 2026-05-05 during the #10781 persistent-process-management investigation. The Memory Core healthcheck currently exposes database.topology (#10127), identity (#10176), providers.embedding (#10723), and providers.summary (#10724/PR #10771). Sibling tickets are filed for providers.auth (#10770), providers.neoEmbedding (#10773), and features.dream (#10779).

The wake substrate has no healthcheck observability surface. Operators (and agents) currently cannot see, without grepping the filesystem, whether:

  • The wake-safety-gate is enabled / disabled / tripped (state at .neo-ai-data/wake-daemon/wake-safety-gate.json)
  • The swarm-heartbeat.sh daemon is currently running (process-detection)
  • The daemon is actively polling (concurrency-lock file mtime advances every pulse)

This gap blocks two workflows:

  1. Operators verifying night-shift readiness after the #10781 install must check three separate filesystem locations and run launchctl list
  2. Agents confirming that the wake substrate is healthy before relying on heartbeat-driven recovery, which is currently impossible without filesystem reads outside their MCP tool surface

The Problem

The wake substrate has multiple operational dimensions invisible from healthcheck:

  • Gate state: OK / blocking. Operator can flip via CLI but can't confirm via healthcheck.
  • Daemon liveness: is the launchd-managed process actually running? Or did it crash?
  • Polling activity: when did the last heartbeat pulse fire? If > 2× POLL_INTERVAL ago, daemon may be alive-but-stuck.

Without observability, agents fall back to the engagement-on-every-ping pattern (or its inverse) because they can't tell whether the heartbeat substrate would catch a deadlock.

The Architectural Reality

  • ai/mcp/server/memory-core/services/HealthService.mjs — pattern of module-scope pure projection functions (buildIdentityBlock, buildTopologyBlock, buildEmbeddingProviderBlock); test coverage in HealthService.spec.mjs
  • .neo-ai-data/wake-daemon/wake-safety-gate.json — gate state file; readable via wakeSafetyGate.mjs (existing module exports readGateState + hasOverride)
  • .neo-ai-data/heartbeat-concurrency.lock — mtime-advanced-on-poll; readable via fs.stat
  • Daemon running-state — detectable via launchctl list | grep com.neomjs.swarm-heartbeat (macOS) OR pgrep -f swarm-heartbeat.sh (cross-platform fallback)
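The lock-file signal above can be read without a shell-out. The following is a minimal sketch, not the actual HealthService.mjs code: `readLastPulse` is a hypothetical helper, and passing the filesystem in as `fsLike` is an assumption made here so the example runs without touching the disk. In production the argument would presumably be Node's built-in `fs` module.

```javascript
// Hypothetical helper (not an existing HealthService.mjs export).
// `fsLike` is any object with a Node-compatible statSync(path) method.
function readLastPulse(fsLike, lockPath, nowMs = Date.now()) {
    try {
        const {mtimeMs} = fsLike.statSync(lockPath);

        return {
            lastPulseAt          : new Date(mtimeMs).toISOString(),
            secondsSinceLastPulse: Math.max(0, Math.floor((nowMs - mtimeMs) / 1000))
        };
    } catch (error) {
        // Missing or unreadable lock file: report "no pulse seen" instead of throwing.
        return {lastPulseAt: null, secondsSinceLastPulse: null};
    }
}

// Stand-in for `fs`: only the lock path "exists", everything else throws ENOENT.
const LOCK_PATH = '.neo-ai-data/heartbeat-concurrency.lock';
const fakeFs    = {
    statSync(path) {
        if (path !== LOCK_PATH) {
            throw Object.assign(new Error('ENOENT'), {code: 'ENOENT'});
        }
        return {mtimeMs: Date.parse('2026-05-05T22:00:00.000Z')};
    }
};

const pulse   = readLastPulse(fakeFs, LOCK_PATH, Date.parse('2026-05-05T22:00:45.000Z'));
const missing = readLastPulse(fakeFs, 'does-not-exist.lock');

console.log(pulse, missing);
```

Because the helper depends only on its arguments, the "no-lock-file" case can be exercised in a unit test without creating or deleting real files.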

The Fix

Add a module-scope pure projection function, buildWakeFeaturesBlock(cfg), to HealthService.mjs and wire it into the healthcheck payload as a new top-level features.wake block:

"features": {
    "wake": {
        "gateState": "tripped",
        "gateReason": "",
        "gateTrippedAt": "2026-05-03T22:53:09.450Z",
        "gateTrippedBy": "cli",
        "daemonRunning": false,
        "lastPulseAt": null,
        "secondsSinceLastPulse": null
    }
}

Field semantics:

  • gateState: 'enabled' | 'disabled' | 'tripped' (read from .neo-ai-data/wake-daemon/wake-safety-gate.json), or 'unknown' when the file is missing or unreadable (see AC4)
  • gateReason / gateTrippedAt / gateTrippedBy: pass-through from the gate state file when applicable (empty / null otherwise)
  • daemonRunning: boolean — is the daemon process detected? (best-effort detection via filesystem signal: lock file mtime within 2 × POLL_INTERVAL AND no missing-process indicator)
  • lastPulseAt: ISO timestamp of .neo-ai-data/heartbeat-concurrency.lock mtime (or null if file absent)
  • secondsSinceLastPulse: derived; surfaces "daemon alive but stalled" signal when > 2× POLL_INTERVAL
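The field semantics above could be projected roughly as follows. This is a sketch under stated assumptions rather than the shipped implementation: the shape of `cfg` (`gate`, `lockMtimeMs`, `nowMs`, `pollIntervalSeconds`), the gate-file field names (`state`, `reason`, `trippedAt`, `trippedBy`), and POLL_INTERVAL being expressed in seconds are all hypothetical. The "missing-process indicator" mentioned for daemonRunning is left out because the ticket does not define it.

```javascript
// Sketch only: the cfg shape and gate-file field names are assumptions.
const KNOWN_GATE_STATES = ['enabled', 'disabled', 'tripped'];

function buildWakeFeaturesBlock(cfg) {
    const {gate, lockMtimeMs, nowMs, pollIntervalSeconds} = cfg;

    const hasPulse              = Number.isFinite(lockMtimeMs);
    const secondsSinceLastPulse = hasPulse ? Math.max(0, Math.floor((nowMs - lockMtimeMs) / 1000)) : null;
    const gateState             = gate && KNOWN_GATE_STATES.includes(gate.state) ? gate.state : 'unknown';

    return {
        gateState,
        gateReason   : gate?.reason ?? '',
        gateTrippedAt: gate?.trippedAt ?? null,
        gateTrippedBy: gate?.trippedBy ?? null,
        // Heuristic from the field semantics: a pulse within 2 x POLL_INTERVAL counts as running.
        daemonRunning: hasPulse && secondsSinceLastPulse <= 2 * pollIntervalSeconds,
        lastPulseAt  : hasPulse ? new Date(lockMtimeMs).toISOString() : null,
        secondsSinceLastPulse
    };
}

const nowMs = Date.parse('2026-05-05T22:00:00.000Z');

// Tripped gate, no lock file: mirrors the example payload in this ticket.
const tripped = buildWakeFeaturesBlock({
    gate               : {state: 'tripped', trippedAt: '2026-05-03T22:53:09.450Z', trippedBy: 'cli'},
    lockMtimeMs        : null,
    nowMs,
    pollIntervalSeconds: 60
});

// Enabled gate with a pulse 30 seconds ago, and one 10 minutes ago.
const healthy = buildWakeFeaturesBlock({gate: {state: 'enabled'}, lockMtimeMs: nowMs - 30000, nowMs, pollIntervalSeconds: 60});
const stalled = buildWakeFeaturesBlock({gate: {state: 'enabled'}, lockMtimeMs: nowMs - 600000, nowMs, pollIntervalSeconds: 60});

console.log(tripped, healthy, stalled);
```

Passing `nowMs` in, instead of calling `Date.now()` inside the function, keeps the projection deterministic, which is what makes the recent-pulse and stalled cases straightforward to assert on.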

Optional v2 extension: daemonProcessIdentity field exposing the launchd-label or PID for direct operator-debug.

Acceptance Criteria

  • (AC1) buildWakeFeaturesBlock(cfg) added to HealthService.mjs as module-scope pure projection
  • (AC2) features.wake block wired into healthcheck payload (alongside existing features.dream from #10779 once that lands; or ship as features: { wake: {...} } alone if #10779 not yet merged — first-mover handles the features namespace creation cleanly)
  • (AC3) Block exposes gate state (read from .neo-ai-data/wake-daemon/wake-safety-gate.json) + daemon-running heuristic + last-pulse timestamp + seconds-since
  • (AC4) Defensive: missing files / unreadable state surfaces sensible defaults (gateState: 'unknown', daemonRunning: false, lastPulseAt: null) WITHOUT throwing
  • (AC5) Unit tests cover: gate enabled / disabled / tripped / file-missing / file-malformed; daemon running with recent pulse / running-but-stalled / not-running / no-lock-file
  • (AC6) learn/agentos/wake-substrate/PersistentProcessManagement.md §3d updated to reference healthcheck-side verification (replacing or complementing the tail -f log inspection commands)
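The gate-side cases in AC4 and AC5 can be illustrated with a small table-driven sketch. `parseGateFile` is a hypothetical helper and the `state` field name is an assumption; the real block may instead go through readGateState from wakeSafetyGate.mjs, whose return shape is not described in this ticket.

```javascript
// Hypothetical helper illustrating AC4: never throw, fall back to 'unknown'.
// `rawText` is the gate file's contents, or null when the file could not be read.
function parseGateFile(rawText) {
    if (typeof rawText !== 'string') {
        return {gateState: 'unknown'}; // file missing or unreadable
    }

    try {
        const parsed = JSON.parse(rawText);
        const state  = parsed && typeof parsed === 'object' ? parsed.state : undefined;

        return {gateState: ['enabled', 'disabled', 'tripped'].includes(state) ? state : 'unknown'};
    } catch {
        return {gateState: 'unknown'}; // malformed JSON
    }
}

// One row per AC5 gate case: [raw file contents, expected gateState].
const gateCases = [
    ['{"state":"enabled"}',  'enabled'],
    ['{"state":"disabled"}', 'disabled'],
    ['{"state":"tripped"}',  'tripped'],
    [null,                   'unknown'], // file-missing
    ['{not json',            'unknown']  // file-malformed
];

const gateResults = gateCases.map(([rawText]) => parseGateFile(rawText).gateState);

console.log(gateResults);
```

Each row maps directly onto one of the gate scenarios listed in AC5, so the same table could seed the cases in HealthService.spec.mjs.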

Out of Scope

  • Mechanical alerting / paging on gateState !== 'enabled' — separate observability concern; this ticket surfaces, doesn't alert
  • Direct PID-level process inspection (would require shell-out which complicates testability) — file-mtime + state-file-presence is the lighter-weight signal
  • Cross-machine wake-substrate observability (multi-host shared deployments) — separate concern; this ticket is single-host
  • Auto-recovery from tripped state — operator-territory by wakeSafetyGate deny-by-default discipline

Avoided Traps

  • Bundling with #10779 (features.dream): different substrate (DreamMode/Sandman vs wake-recovery); first-mover handles features namespace; sibling-block is the right shape
  • Auto-tripping the gate from healthcheck signal: healthcheck is read-only observability; trip-action is operator-territory by design
  • Process-detection via shell-out: adds testability complexity + platform-dependence (launchctl vs systemctl vs pgrep). File-mtime heuristic on the concurrency-lock is sufficient signal for "daemon polling actively" without shell-out

Related

  • Sibling observability: #10770 (providers.auth), #10773 (providers.neoEmbedding), #10779 (features.dream)
  • Pattern precedents: #10176 (buildIdentityBlock), #10127 (buildTopologyBlock), #10723 (buildEmbeddingProviderBlock)
  • Adjacent substrate: #10781 (persistent-process management — reads from same files this observability surfaces); #10671 (parent epic)
  • Operational doc: learn/agentos/wake-substrate/PersistentProcessManagement.md §3d (currently uses tail -f for verification; this block would simplify to a healthcheck call)

Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7

Retrieval Hint: query_raw_memories(query="healthcheck features wake observability gate-state daemon-running last-pulse 10779 10770 10781 sibling")

tobiu referenced in commit 21d5cdd - "feat(memory-core): healthcheck features.wake observability block (#10783) (#10930)" on May 7, 2026, 11:39 PM
tobiu closed this issue on May 7, 2026, 11:39 PM