Frontmatter
id: 10587
title: Capture runSandman inference hard-failures durably
state: Closed
labels: enhancement, ai, testing, architecture
assignees: neo-gpt
createdAt: May 1, 2026, 6:03 PM
updatedAt: May 1, 2026, 7:29 PM
githubUrl: https://github.com/neomjs/neo/issues/10587
author: neo-gpt
commentsCount: 0
parentIssue: 9999
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 1, 2026, 7:29 PM

Capture runSandman inference hard-failures durably

neo-gpt commented on May 1, 2026, 6:03 PM

Context

Post-restart verification after #10580/#10581/#10583/#10575/#10585 is green: MCP healthchecks are healthy, logger files are present for Knowledge Base / Memory Core / Neural Link, config templates resolve neoRootDir and logPath to the repo root, and the focused logger + daemon freshness suites pass.

The remaining Sandman / Golden Path recovery gap is different: when buildScripts/ai/runSandman.mjs cannot reach the OpenAI-compatible / MLX provider during its explicit provider warm-up loop, it prints a terminal-only error and exits before DreamService.ready() or DreamService.processUndigestedSessions() can create durable substrate. Future agents then see symptoms like an absent resources/content/sandman_handoff.md or an empty frontier, but the root cause is not queryable from Memory Core, Graph state, or durable diagnostics.

This must stay separate from #10569. Auto-boot / boot-time summarization was intentionally disabled because each harness can start duplicate MCP server instances and create duplicate summaries. This ticket is only about capturing the explicit runSandman hard-fail path durably when the operator runs Sandman and the provider is unavailable.

Origin Session ID: cf46c3e3-3bc7-4726-8b0b-b9c9af48742f

Problem

runSandman.mjs currently has a provider readiness timeout path that is operationally important but ephemeral:

  • It waits for LifecycleService.ready().
  • It polls checkProvider() for up to 30 seconds.
  • If the provider is still unavailable, it writes a console error and exits nonzero.
  • That exit happens before DreamService / Golden Path can emit a durable handoff artifact.
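
The readiness wait described above can be sketched as a bounded polling loop. This is a minimal illustration, not the code in runSandman.mjs: only checkProvider() and the 30-second ceiling come from this ticket. The waitForProvider name, the one-second interval, and the injected now / sleep hooks are assumptions added so the loop can run against a fake clock.

```javascript
// Hypothetical sketch of the provider warm-up wait. Only checkProvider() and the
// 30 s ceiling are taken from the ticket; every other name here is illustrative.
async function waitForProvider({
    checkProvider,                 // async () => boolean, true once the provider answers
    timeoutMs  = 30000,            // the 30 second ceiling named in the ticket
    intervalMs = 1000,             // assumed polling interval
    now        = Date.now,         // injectable clock, so tests need no real waiting
    sleep      = ms => new Promise(resolve => setTimeout(resolve, ms))
}) {
    const startedAt = now();
    let attempts  = 0;
    let lastError = null;

    while (now() - startedAt < timeoutMs) {
        attempts++;

        try {
            if (await checkProvider()) {
                return {ready: true, attempts, elapsedMs: now() - startedAt, lastError: null};
            }
        } catch (error) {
            // A refused connection counts as "not ready yet", not as a crash.
            lastError = error.message;
        }

        await sleep(intervalMs);
    }

    // The caller decides what to do with a timeout. Today that is a console
    // error plus a nonzero exit, which is the ephemeral path this ticket targets.
    return {ready: false, attempts, elapsedMs: now() - startedAt, lastError};
}
```

Returning a result object instead of exiting inside the loop keeps the timeout data (attempts, elapsed time, last error) available to whatever durable capture step runs before the exit.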

This leaves a high-friction coordination gap for the trio: the failure is visible only in the terminal session that ran the script, while later agents must infer why Sandman produced no handoff.

Architectural Reality

  • buildScripts/ai/runSandman.mjs owns the explicit Sandman operator path and currently exits before pipeline work when the provider is unavailable.
  • ai/mcp/server/memory-core/services/lifecycle/InferenceLifecycleService.mjs already has structured inference lifecycle logging for provider start/readiness behavior.
  • #10580/#10583 established always-on MCP logging paths; this ticket should reuse that observability direction instead of introducing a new MCP tool surface.
  • #10569 confirms that re-enabling boot-time auto-summarization / auto-Golden-Path behavior is out of scope and strategically wrong until MCP duplication is solved.

Proposed Fix

Add a small durable failure-capture primitive around the runSandman provider hard-fail path.

Preferred shape:

  • Keep the existing console output and nonzero exit behavior.
  • Add a structured, durable breadcrumb using the existing Memory Core logger and/or a narrowly scoped SDK/service helper.
  • Capture at least:
    • provider family / configured model provider
    • provider host
    • configured OpenAI-compatible model when available
    • elapsed wait / timeout threshold
    • lifecycle status if readily available
    • explicit next operator action
  • Keep the implementation testable without a real MLX, LM Studio, or OpenAI-compatible server.
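
One possible shape for the breadcrumb listed above, as a sketch only. The field names, the event string, and both function names are hypothetical; the ticket lists what to capture but does not define a schema. The writeLine sink stands in for the existing Memory Core logger or a narrowly scoped helper, whose real API is not shown in this ticket.

```javascript
// Hypothetical record builder. Field names and the event string are assumptions.
function buildProviderFailureBreadcrumb({
    provider,                       // provider family / configured model provider
    host,                           // provider host that could not be reached
    model           = null,         // configured OpenAI-compatible model, when available
    elapsedMs,                      // how long the script actually waited
    timeoutMs,                      // the configured threshold
    lifecycleStatus = null,         // lifecycle status, if readily available
    timestamp       = new Date().toISOString()
}) {
    return {
        event     : 'runSandman.providerUnavailable',
        level     : 'error',
        timestamp,
        provider,
        host,
        model,
        elapsedMs,
        timeoutMs,
        lifecycleStatus,
        nextAction: `Start or repair the provider at ${host}, then re-run buildScripts/ai/runSandman.mjs.`
    };
}

// Writes one JSON line through an injected sink and returns the record.
// A failing sink is reported but never replaces the original provider failure.
function captureProviderFailure(details, writeLine) {
    const record = buildProviderFailureBreadcrumb(details);

    try {
        writeLine(JSON.stringify(record));
    } catch (error) {
        console.error(`[runSandman] could not persist failure breadcrumb: ${error.message}`);
    }

    return record;
}
```

Injecting the sink keeps the primitive testable without a live provider or a real log file, and lets the console output and nonzero exit stay exactly where they are today.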

Acceptance Criteria

  • When provider readiness times out, runSandman still exits nonzero.
  • The failure is also captured durably in a queryable/loggable substrate, not only terminal stderr.
  • The durable record includes enough detail (at least the provider host and the failure reason) for the next agent to identify provider unavailability without rerunning Sandman.
  • No boot-time defaults are changed; autoDream, autoSummarize, and autoGoldenPath remain disabled unless explicitly invoked.
  • No new MCP tool surface is added for this narrow case.
  • A unit test simulates provider unavailability without requiring a live provider and asserts both durable capture and the exit path.
  • Post-merge validation documents the exact command or query future agents can use to find the durable failure breadcrumb.
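
A sketch of how the test criterion above could be exercised with injected fakes. Everything here is hypothetical: providerGate, findProviderFailures, the event string, and the placeholder host are illustrative names, the gate counts attempts instead of elapsed time to stay short, and an in-memory array of JSON lines stands in for the real durable substrate.

```javascript
// Hypothetical gate, not the real runSandman.mjs code. All collaborators are injected,
// so the failure path runs without MLX, LM Studio, or any OpenAI-compatible server.
async function providerGate({checkProvider, writeLine, exit, host, maxAttempts = 3}) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        if (await checkProvider()) {
            return true;
        }
    }

    const record = {
        event   : 'runSandman.providerUnavailable',
        host,
        reason  : 'provider readiness timed out',
        attempts: maxAttempts
    };

    writeLine(JSON.stringify(record));                                // durable capture first
    console.error(`[runSandman] provider at ${host} is unavailable`); // terminal output stays
    exit(1);                                                          // nonzero exit stays
    return false;
}

// Scans JSON-lines text for breadcrumbs. A stand-in for whatever command or query
// the post-merge validation ends up documenting.
function findProviderFailures(logText) {
    return logText
        .split('\n')
        .filter(Boolean)
        .map(line => { try { return JSON.parse(line); } catch { return null; } })
        .filter(record => record !== null && record.event === 'runSandman.providerUnavailable');
}

// Simulates an unreachable provider and reports what the gate did.
async function simulateUnavailableProvider() {
    const lines     = [];
    const exitCodes = [];

    const passed = await providerGate({
        checkProvider: async () => false,            // provider never becomes ready
        writeLine    : line => lines.push(line),     // in-memory stand-in for the log
        exit         : code => exitCodes.push(code), // records the code instead of exiting
        host         : 'http://provider.invalid'     // placeholder host
    });

    return {passed, exitCodes, failures: findProviderFailures(lines.join('\n'))};
}
```

A test would then assert that exitCodes contains a single nonzero code and that failures holds one record naming the host and reason, covering both halves of the criterion in a single run.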

Out of Scope

  • Re-enabling auto-boot, boot-time summarization, autoDream, or autoGoldenPath behavior (#10569).
  • Fixing the LM Studio monolithic JSON payload crash / SQLite batch-size issue (#10484).
  • Reworking provider startup ordering already handled by #9832.
  • Building a full autonomous healthcheck workflow (#10018).
  • Moving MCP configuration into the Services SDK (#10103).
  • Solving MCP server duplication / single-writer enforcement (#10186).

Duplicate Sweep / Related Work

  • #9832: closed startup sequence race condition; related but not duplicate.
  • #10569: closed boot-time auto-summary re-enable proposal; explicitly not this ticket.
  • #10484: open LM Studio payload / SQLite batching bug; related provider infrastructure, not this failure path.
  • #10018: autonomous healthcheck workflow; broader process work, not this concrete breadcrumb.
  • #10186: MCP concurrency / single-writer epic; strategic blocker for auto-boot, not this explicit operator path.
  • #10103: SDK-layer config migration; future boundary work, not required here.

Handoff Retrieval Hints

  • query_raw_memories(query="runSandman provider hard-fail durable observability Slice B")
  • query_raw_memories(query="post restart verification MCP logger config freshness 10580 10583 10585")
  • query_summaries(query="DreamService Golden Path Sandman recovery trio coordination")
  • Source anchors: buildScripts/ai/runSandman.mjs, ai/mcp/server/memory-core/services/lifecycle/InferenceLifecycleService.mjs
tobiu referenced in commit 3397976 - "feat(ai): capture runSandman provider hard-failures (#10587) (#10590)" on May 1, 2026, 7:29 PM
tobiu closed this issue on May 1, 2026, 7:29 PM