Context
Post-restart verification after #10580/#10581/#10583/#10575/#10585 is green: MCP healthchecks are healthy, logger files are present for Knowledge Base / Memory Core / Neural Link, config templates resolve neoRootDir and logPath to the repo root, and the focused logger + daemon freshness suites pass.
The remaining Sandman / Golden Path recovery gap is different: when buildScripts/ai/runSandman.mjs cannot reach the OpenAI-compatible / MLX provider during its explicit provider warm-up loop, it prints a terminal-only error and exits before DreamService.ready() or DreamService.processUndigestedSessions() can create durable substrate. Future agents then see symptoms like an absent resources/content/sandman_handoff.md or an empty frontier, but the root cause is not queryable from Memory Core, Graph state, or durable diagnostics.
This must stay separate from #10569. Auto-boot / boot-time summarization was intentionally disabled because each harness can start duplicate MCP server instances and create duplicate summaries. This ticket is only about capturing the explicit runSandman hard-fail path durably when the operator runs Sandman and the provider is unavailable.
Origin Session ID: cf46c3e3-3bc7-4726-8b0b-b9c9af48742f
Problem
runSandman.mjs currently has a provider readiness timeout path that is operationally important but ephemeral:
- It waits for LifecycleService.ready().
- It polls checkProvider() for up to 30 seconds.
- If the provider is still unavailable, it writes a console error and exits nonzero.
- That exit happens before DreamService / Golden Path can emit a durable handoff artifact.
This leaves a high-friction coordination gap for the trio: the failure is visible only in the terminal session that ran the script, while later agents must infer why Sandman produced no handoff.
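The flow above can be sketched roughly as follows. This is a minimal illustration, not the actual runSandman.mjs source: waitForProvider, warmUpOrExit, and the injected sleep / now / exit parameters are hypothetical names, and only checkProvider and the 30-second limit come from the description above.

```javascript
// Hypothetical sketch of the warm-up loop described above; not the real runSandman.mjs code.
// checkProvider, sleep, and now are injected so the loop can run without a live provider.
async function waitForProvider({
    checkProvider,
    timeoutMs  = 30000,
    intervalMs = 1000,
    sleep      = ms => new Promise(resolve => setTimeout(resolve, ms)),
    now        = Date.now
}) {
    const startedAt = now();

    while (now() - startedAt < timeoutMs) {
        if (await checkProvider()) {
            return {ready: true, elapsedMs: now() - startedAt};
        }

        await sleep(intervalMs);
    }

    return {ready: false, elapsedMs: now() - startedAt};
}

// The ephemeral path: on timeout, the only trace is stderr in the operator's terminal.
async function warmUpOrExit(deps, exit = process.exit) {
    const result = await waitForProvider(deps);

    if (!result.ready) {
        console.error(`Provider not reachable after ${result.elapsedMs} ms`);
        exit(1);
    }

    return result;
}
```

Nothing in this shape writes to a log file, Memory Core, or the graph, which is why a later agent has nothing to query.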
Architectural Reality
- buildScripts/ai/runSandman.mjs owns the explicit Sandman operator path and currently exits before pipeline work when the provider is unavailable.
- ai/mcp/server/memory-core/services/lifecycle/InferenceLifecycleService.mjs already has structured inference lifecycle logging for provider start/readiness behavior.
- #10580/#10583 established always-on MCP logging paths; this ticket should reuse that observability direction instead of introducing a new MCP tool surface.
- #10569 confirms that re-enabling boot-time auto-summarization / auto-Golden-Path behavior is out of scope and strategically wrong until MCP duplication is solved.
Proposed Fix
Add a small durable failure-capture primitive around the runSandman provider hard-fail path.
Preferred shape:
- Keep the existing console output and nonzero exit behavior.
- Add a structured, durable breadcrumb using the existing Memory Core logger and/or a narrowly scoped SDK/service helper.
- Capture at least:
  - provider family / configured model provider
  - provider host
  - configured OpenAI-compatible model when available
  - elapsed wait / timeout threshold
  - lifecycle status if readily available
  - explicit next operator action
- Keep the implementation testable without a real MLX, LM Studio, or OpenAI-compatible server.
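One possible shape for the breadcrumb is sketched below. All names here are assumptions for illustration: buildProviderFailureRecord, captureProviderFailure, the event string, and the field names are hypothetical, and the logger is assumed to expose error(message, payload), which may not match the actual Memory Core logger API.

```javascript
// Hypothetical helpers; field names mirror the capture list above and are not an existing schema.
function buildProviderFailureRecord({
    provider,
    host,
    model           = null,       // configured model, when available
    elapsedMs,
    timeoutMs,
    lifecycleStatus = 'unknown',  // filled in only if readily available
    now             = () => new Date()
}) {
    return {
        event     : 'sandman.provider_unavailable', // assumed event name
        timestamp : now().toISOString(),
        provider,
        host,
        model,
        elapsedMs,
        timeoutMs,
        lifecycleStatus,
        nextAction: `Start or fix the provider at ${host}, then rerun Sandman.`
    };
}

// logger is assumed to expose error(message, payload); adapt to the real logger.
// A failing logger must not mask the original provider failure, so its error is
// caught and returned to the caller instead of being rethrown.
function captureProviderFailure(logger, details) {
    const record = buildProviderFailureRecord(details);

    try {
        logger.error('Sandman provider readiness timed out', record);
        return {captured: true, record};
    } catch (error) {
        return {captured: false, record, error};
    }
}
```

Injecting the logger and the clock keeps the helper usable from a test with no MLX, LM Studio, or OpenAI-compatible server running, and the caller still performs the existing console output and nonzero exit after the capture attempt.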
Acceptance Criteria
- When provider readiness times out, runSandman still exits nonzero.
- The failure is also captured durably in a queryable/loggable substrate, not only terminal stderr.
- The durable record includes the provider host and enough of the failure reason for the next agent to identify provider unavailability without rerunning Sandman.
- No boot-time defaults are changed; autoDream, autoSummarize, and autoGoldenPath remain disabled unless explicitly invoked.
- No new MCP tool surface is added for this narrow case.
- A unit test simulates provider unavailability without requiring a live provider and asserts both durable capture and the exit path.
- Post-merge validation documents the exact command or query future agents can use to find the durable failure breadcrumb.
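The unit-test criterion could be exercised with plain test doubles, as in the framework-agnostic sketch below. failIfProviderUnavailable is a hypothetical stand-in for the hard-fail path under test, simulateProviderOutage is an illustrative harness, and the host value is a placeholder. The real test would target whatever function the implementation extracts from runSandman.mjs.

```javascript
// Stand-in for the hard-fail path under test (hypothetical name and signature).
async function failIfProviderUnavailable({checkProvider, logger, exit, host, attempts = 3}) {
    for (let attempt = 0; attempt < attempts; attempt++) {
        if (await checkProvider()) {
            return true;
        }
    }

    // Durable capture comes first: a real process.exit() never returns.
    logger.error('Sandman provider readiness timed out', {host, attempts});
    exit(1);

    return false;
}

// Test doubles: no network, no real logger, no real process exit.
async function simulateProviderOutage() {
    const events = [];

    const ready = await failIfProviderUnavailable({
        checkProvider: async () => false, // provider never becomes reachable
        logger       : {error: (message, payload) => events.push({type: 'capture', message, payload})},
        exit         : code => events.push({type: 'exit', code}),
        host         : 'http://provider.invalid' // placeholder host
    });

    return {ready, events};
}
```

Recording both doubles into one events array lets a single assertion check that the capture happened and that it happened before the exit.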
Out of Scope
- Re-enabling auto-boot, boot-time summarization, autoDream, or autoGoldenPath behavior (#10569).
- Fixing the LM Studio monolithic JSON payload crash / SQLite batch-size issue (#10484).
- Reworking provider startup ordering already handled by #9832.
- Building a full autonomous healthcheck workflow (#10018).
- Moving MCP configuration into the Services SDK (#10103).
- Solving MCP server duplication / single-writer enforcement (#10186).
Duplicate Sweep / Related Work
- #9832: closed startup sequence race condition; related but not duplicate.
- #10569: closed boot-time auto-summary re-enable proposal; explicitly not this ticket.
- #10484: open LM Studio payload / SQLite batching bug; related provider infrastructure, not this failure path.
- #10018: autonomous healthcheck workflow; broader process work, not this concrete breadcrumb.
- #10186: MCP concurrency / single-writer epic; strategic blocker for auto-boot, not this explicit operator path.
- #10103: SDK-layer config migration; future boundary work, not required here.
Handoff Retrieval Hints
- query_raw_memories(query="runSandman provider hard-fail durable observability Slice B")
- query_raw_memories(query="post restart verification MCP logger config freshness 10580 10583 10585")
- query_summaries(query="DreamService Golden Path Sandman recovery trio coordination")
- Source anchors: buildScripts/ai/runSandman.mjs, ai/mcp/server/memory-core/services/lifecycle/InferenceLifecycleService.mjs