Context
Surfaced 2026-05-05 during #10781 persistent-process-management investigation. The Memory Core healthcheck currently exposes database.topology (#10127), identity (#10176), providers.embedding (#10723), providers.summary (#10724/PR #10771). Sibling providers.auth ticket filed under #10770. Sibling providers.neoEmbedding ticket filed under #10773. Sibling features.dream ticket filed under #10779.
The wake substrate has no healthcheck observability surface. Operators (and agents) currently can't see — without filesystem-grepping — whether:
- The wake-safety-gate is enabled / disabled / tripped (state at .neo-ai-data/wake-daemon/wake-safety-gate.json)
- The swarm-heartbeat.sh daemon is currently running (process-detection)
- The daemon is actively polling (concurrency-lock file mtime advances every pulse)
This gap blocks two workflows:
- Operators verifying night-shift readiness post-#10781 install would have to check 3 separate filesystem locations + run launchctl list
- Agents detecting that the wake substrate is healthy before relying on heartbeat-driven recovery: currently impossible without filesystem reads outside their MCP tool surface
The Problem
The wake substrate has multiple operational dimensions invisible from healthcheck:
- Gate state: OK / blocking. Operator can flip via CLI but can't confirm via healthcheck.
- Daemon liveness: is the launchd-managed process actually running? Or did it crash?
- Polling activity: when did the last heartbeat pulse fire? If > 2× POLL_INTERVAL ago, daemon may be alive-but-stuck.
Without observability, agents fall back to the engagement-on-every-ping pattern (or its inverse) because they can't tell whether the heartbeat substrate would catch a deadlock.
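The polling-activity rule above can be written as a small classifier. This is a sketch only: the function name classifyPulseAge and its three labels are hypothetical, and POLL_INTERVAL is passed in as a parameter because its value is not stated in this ticket.

```javascript
// Hypothetical helper: classifies the age of the last heartbeat pulse.
// ageSeconds is null when no pulse has been recorded (e.g. lock file absent).
function classifyPulseAge(ageSeconds, pollIntervalSeconds) {
    if (ageSeconds === null) {
        return 'no-pulse';
    }
    // More than 2x POLL_INTERVAL without a pulse: the daemon may be alive-but-stuck.
    return ageSeconds > 2 * pollIntervalSeconds ? 'stalled' : 'active';
}

// A 60-second interval is used purely for illustration.
console.log(classifyPulseAge(45, 60));   // active
console.log(classifyPulseAge(121, 60));  // stalled
console.log(classifyPulseAge(null, 60)); // no-pulse
```

Note that a pulse-age signal on its own cannot tell a stuck daemon from one that has exited, since the lock file stops advancing in both cases.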
The Architectural Reality
- ai/mcp/server/memory-core/services/HealthService.mjs: pattern of module-scope pure projection functions (buildIdentityBlock, buildTopologyBlock, buildEmbeddingProviderBlock); test coverage in HealthService.spec.mjs
- .neo-ai-data/wake-daemon/wake-safety-gate.json: gate state file; readable via wakeSafetyGate.mjs (existing module exports readGateState + hasOverride)
- .neo-ai-data/heartbeat-concurrency.lock: mtime-advanced-on-poll; readable via fs.stat
- Daemon running-state: detectable via launchctl list | grep com.neomjs.swarm-heartbeat (macOS) OR pgrep -f swarm-heartbeat.sh (cross-platform fallback)
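Following the pure-projection pattern listed above, the lock-file read can stay at the I/O edge while the projection itself takes a plain object. A minimal sketch, assuming the caller passes the result of fs.stat (or null when the file is absent); projectLockStat is a hypothetical name, and only the standard mtimeMs property of fs.Stats is read.

```javascript
// Hypothetical pure projection over an fs.Stats-like object.
// `stat` may be null when .neo-ai-data/heartbeat-concurrency.lock is absent.
function projectLockStat(stat, nowMs) {
    if (!stat || !Number.isFinite(stat.mtimeMs)) {
        return {lastPulseAt: null, secondsSinceLastPulse: null};
    }

    return {
        lastPulseAt          : new Date(stat.mtimeMs).toISOString(),
        // Clamp at 0 so a slightly-future mtime (clock skew) never goes negative.
        secondsSinceLastPulse: Math.max(0, Math.floor((nowMs - stat.mtimeMs) / 1000))
    };
}

// Fixed clock and a fake stat object: no filesystem access needed.
const sampleNowMs = Date.parse('2026-05-05T12:00:00.000Z');
const samplePulse = projectLockStat({mtimeMs: sampleNowMs - 90000}, sampleNowMs);
console.log(samplePulse.lastPulseAt, samplePulse.secondsSinceLastPulse);
```

Because the function never touches the filesystem, a spec can exercise it with hand-built objects and a fixed timestamp.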
The Fix
Add buildWakeFeaturesBlock(cfg) as a module-scope pure projection function in HealthService.mjs. Wire it into the healthcheck payload as a new top-level features.wake block:
"features": {
    "wake": {
        "gateState": "tripped",
        "gateReason": "",
        "gateTrippedAt": "2026-05-03T22:53:09.450Z",
        "gateTrippedBy": "cli",
        "daemonRunning": false,
        "lastPulseAt": null,
        "secondsSinceLastPulse": null
    }
}
Field semantics:
- gateState: 'enabled' | 'disabled' | 'tripped' (read from .neo-ai-data/wake-daemon/wake-safety-gate.json)
- gateReason / gateTrippedAt / gateTrippedBy: pass-through from the gate state file when applicable (empty / null otherwise)
- daemonRunning: boolean; is the daemon process detected? (best-effort detection via filesystem signal: lock file mtime within 2 × POLL_INTERVAL AND no missing-process indicator)
- lastPulseAt: ISO timestamp of .neo-ai-data/heartbeat-concurrency.lock mtime (or null if file absent)
- secondsSinceLastPulse: derived; surfaces "daemon alive but stalled" signal when > 2× POLL_INTERVAL
Optional v2 extension: daemonProcessIdentity field exposing the launchd-label or PID for direct operator-debug.
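The field semantics above could translate into a projection along these lines. This is a sketch under stated assumptions, not the actual HealthService.mjs code: the ticket only names the parameter cfg, so the input shape (gate, lockMtimeMs, nowMs, pollIntervalSec) and the gate object's keys (state, reason, trippedAt, trippedBy) are hypothetical. The "no missing-process indicator" half of the daemonRunning heuristic is left out because it is not specified here.

```javascript
// Sketch only. Input field names and the gate object's keys are assumptions,
// not the real schema of wake-safety-gate.json.
function buildWakeFeaturesBlock({gate = null, lockMtimeMs = null, nowMs, pollIntervalSec}) {
    const hasPulse = Number.isFinite(lockMtimeMs);
    const secondsSinceLastPulse = hasPulse
        ? Math.max(0, Math.floor((nowMs - lockMtimeMs) / 1000))
        : null;

    return {
        // Assumed fallback: 'unknown' when no gate state could be read.
        gateState    : gate?.state ?? 'unknown',
        gateReason   : gate?.reason ?? '',
        gateTrippedAt: gate?.trippedAt ?? null,
        gateTrippedBy: gate?.trippedBy ?? null,
        // File-mtime heuristic only: a pulse within 2x POLL_INTERVAL counts as running.
        daemonRunning: hasPulse && secondsSinceLastPulse <= 2 * pollIntervalSec,
        lastPulseAt  : hasPulse ? new Date(lockMtimeMs).toISOString() : null,
        secondsSinceLastPulse
    };
}

// Example with a tripped gate and no lock file (interval value is illustrative).
const exampleBlock = buildWakeFeaturesBlock({
    gate           : {state: 'tripped', reason: '', trippedAt: '2026-05-03T22:53:09.450Z', trippedBy: 'cli'},
    nowMs          : Date.parse('2026-05-05T00:00:00.000Z'),
    pollIntervalSec: 60
});
console.log(JSON.stringify({features: {wake: exampleBlock}}, null, 4));
```

Passing nowMs in, rather than calling Date.now() inside the function, keeps the projection deterministic, which is what makes it straightforward to cover in a spec without touching the filesystem or the clock.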
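Acceptance Criteria
- buildWakeFeaturesBlock(cfg) added to HealthService.mjs as module-scope pure projection
- features.wake block wired into healthcheck payload (alongside existing features.dream from #10779 once that lands; or ship as features: { wake: {...} } alone if #10779 not yet merged; first-mover handles the features namespace creation cleanly)
- Block surfaces gate state (read from .neo-ai-data/wake-daemon/wake-safety-gate.json) + daemon-running heuristic + last-pulse timestamp + seconds-since-last-pulse
- Absent inputs resolve to fallback values (gateState: 'unknown', daemonRunning: false, lastPulseAt: null) WITHOUT throwing
- learn/agentos/wake-substrate/PersistentProcessManagement.md §3d updated to reference healthcheck-side verification (replacing or complementing the tail -f log inspection commands)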
Out of Scope
- Mechanical alerting / paging on gateState !== 'enabled': separate observability concern; this ticket surfaces, doesn't alert
- Direct PID-level process inspection (would require shell-out, which complicates testability): file-mtime + state-file-presence is the lighter-weight signal
- Cross-machine wake-substrate observability (multi-host shared deployments): separate concern; this ticket is single-host
- Auto-recovery from tripped state: operator-territory by wakeSafetyGate deny-by-default discipline
Avoided Traps
- Bundling with #10779 (features.dream): different substrate (DreamMode/Sandman vs wake-recovery); first-mover handles features namespace; sibling-block is the right shape
- Auto-tripping the gate from healthcheck signal: healthcheck is read-only observability; trip-action is operator-territory by design
- Process-detection via shell-out: adds testability complexity + platform-dependence (launchctl vs systemctl vs pgrep). File-mtime heuristic on the concurrency-lock is sufficient signal for "daemon polling actively" without shell-out
Related
- Sibling observability: #10770 (providers.auth), #10773 (providers.neoEmbedding), #10779 (features.dream)
- Pattern precedents: #10176 (buildIdentityBlock), #10127 (buildTopologyBlock), #10723 (buildEmbeddingProviderBlock)
- Adjacent substrate: #10781 (persistent-process management; reads from same files this observability surfaces); #10671 (parent epic)
- Operational doc: learn/agentos/wake-substrate/PersistentProcessManagement.md §3d (currently uses tail -f for verification; this block would simplify to a healthcheck call)
Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7
Retrieval Hint: query_raw_memories(query="healthcheck features wake observability gate-state daemon-running last-pulse 10779 10770 10781 sibling")