Frontmatter

id	10587
title	Capture runSandman inference hard-failures durably
state	Closed
labels	enhancement, ai, testing, architecture
assignees	neo-gpt
createdAt	May 1, 2026, 6:03 PM
updatedAt	May 1, 2026, 7:29 PM
githubUrl	https://github.com/neomjs/neo/issues/10587
author	neo-gpt
commentsCount	0
parentIssue	9999
subIssues	[]
subIssuesCompleted	0
subIssuesTotal	0
blockedBy	[]
blocking	[]
closedAt	May 1, 2026, 7:29 PM

Capture runSandman inference hard-failures durably

Closed v13.0.0/archive-v13-0-0-chunk-7 enhancement, ai, testing, architecture

neo-gpt commented on May 1, 2026, 6:03 PM

Context

Post-restart verification after #10580/#10581/#10583/#10575/#10585 is green: MCP healthchecks are healthy, logger files are present for Knowledge Base / Memory Core / Neural Link, config templates resolve neoRootDir and logPath to the repo root, and the focused logger + daemon freshness suites pass.

The remaining Sandman / Golden Path recovery gap is different: when buildScripts/ai/runSandman.mjs cannot reach the OpenAI-compatible / MLX provider during its explicit provider warm-up loop, it prints a terminal-only error and exits before DreamService.ready() or DreamService.processUndigestedSessions() can create durable substrate. Future agents then see symptoms like an absent resources/content/sandman_handoff.md or an empty frontier, but the root cause is not queryable from Memory Core, Graph state, or durable diagnostics.

This must stay separate from #10569. Auto-boot / boot-time summarization was intentionally disabled because each harness can start duplicate MCP server instances and create duplicate summaries. This ticket is only about capturing the explicit runSandman hard-fail path durably when the operator runs Sandman and the provider is unavailable.

Origin Session ID: cf46c3e3-3bc7-4726-8b0b-b9c9af48742f

Problem

runSandman.mjs currently has a provider readiness timeout path that is operationally important but ephemeral:

- It waits for LifecycleService.ready().
- It polls checkProvider() for up to 30 seconds.
- If the provider is still unavailable, it writes a console error and exits nonzero.
- That exit happens before DreamService / Golden Path can emit a durable handoff artifact.

This leaves a high-friction coordination gap for the trio: the failure is visible only in the terminal session that ran the script, while later agents must infer why Sandman produced no handoff.
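
The readiness path described above can be sketched as a bounded polling loop. This is an illustrative sketch, not the code in runSandman.mjs: the function name waitForProvider, its option names, and the 1-second interval are assumptions. Only checkProvider() and the 30-second limit come from this ticket. The clock and sleep function are injected so the loop can run without a live provider.

```javascript
// Hypothetical sketch: poll a provider check until it succeeds or the deadline passes.
// `checkProvider`, `sleep`, and `now` are injected; their names and shapes are assumptions.
async function waitForProvider({
    checkProvider,
    timeoutMs  = 30000,
    intervalMs = 1000,
    sleep      = ms => new Promise(resolve => setTimeout(resolve, ms)),
    now        = Date.now
}) {
    const startedAt = now();
    let attempts = 0;

    while (true) {
        attempts++;

        let ready = false;
        try {
            ready = Boolean(await checkProvider());
        } catch {
            ready = false; // a thrown error is treated as "not ready yet"
        }

        const elapsedMs = now() - startedAt;

        if (ready) {
            return {ready: true, attempts, elapsedMs};
        }
        if (elapsedMs >= timeoutMs) {
            // The caller decides what to do on timeout (log, capture, exit).
            return {ready: false, attempts, elapsedMs};
        }

        await sleep(intervalMs);
    }
}
```

Returning a result object instead of exiting inside the loop keeps the timeout decision in one place, so a durable capture step can run before the process exits.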

Architectural Reality

buildScripts/ai/runSandman.mjs owns the explicit Sandman operator path and currently exits before pipeline work when the provider is unavailable.
ai/mcp/server/memory-core/services/lifecycle/InferenceLifecycleService.mjs already has structured inference lifecycle logging for provider start/readiness behavior.
#10580/#10583 established always-on MCP logging paths; this ticket should reuse that observability direction instead of introducing a new MCP tool surface.
#10569 confirms that re-enabling boot-time auto-summarization / auto-Golden-Path behavior is out of scope and strategically wrong until MCP duplication is solved.

Proposed Fix

Add a small durable failure-capture primitive around the runSandman provider hard-fail path.

Preferred shape:

Keep the existing console output and nonzero exit behavior.
Add a structured, durable breadcrumb using the existing Memory Core logger and/or a narrowly scoped SDK/service helper.
Capture at least:
- provider family / configured model provider
- provider host
- configured OpenAI-compatible model when available
- elapsed wait / timeout threshold
- lifecycle status if readily available
- explicit next operator action
Keep the implementation testable without a real MLX, LM Studio, or OpenAI-compatible server.
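
The fields listed above could be gathered into a single structured record. The following is a minimal sketch under stated assumptions: the helper names, the event name, the field names, and the wording of the next action are all hypothetical, and the real Memory Core logger API is not shown in this ticket.

```javascript
// Hypothetical helper: build one structured record for a provider readiness timeout.
// Field names and the event name are illustrative assumptions, not an existing schema.
function buildProviderFailureBreadcrumb({
    provider,
    host,
    model           = null, // configured OpenAI-compatible model, when available
    elapsedMs,
    timeoutMs,
    lifecycleStatus = null, // lifecycle status, if readily available
    now             = () => new Date()
}) {
    return {
        event    : 'runSandman.providerUnavailable',
        timestamp: now().toISOString(),
        provider,
        host,
        model,
        elapsedMs,
        timeoutMs,
        lifecycleStatus,
        nextAction: `Start or restart the provider at ${host}, then rerun Sandman.`
    };
}

// Serialise the record as one JSON line so it can be appended to a log and searched later.
function toJsonLine(breadcrumb) {
    return JSON.stringify(breadcrumb) + '\n';
}
```

In Node.js the resulting line could be appended to a log file with fs.appendFileSync(path, line), or the record could be passed to an existing logger instead. Which sink is used is a design choice this sketch leaves open.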

Acceptance Criteria

When provider readiness times out, runSandman still exits nonzero.
The failure is also captured durably in a queryable/loggable substrate, not only terminal stderr.
The durable record includes the provider host and enough of the failure reason for the next agent to identify provider unavailability without rerunning Sandman.
No boot-time defaults are changed; autoDream, autoSummarize, and autoGoldenPath remain disabled unless explicitly invoked.
No new MCP tool surface is added for this narrow case.
A unit test simulates provider unavailability without requiring a live provider and asserts both durable capture and the exit path.
Post-merge validation documents the exact command or query future agents can use to find the durable failure breadcrumb.
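
One way to satisfy the unit-test criterion is to inject the logger and the exit function, so a test can observe both without a live provider or a real process exit. The sketch below is an assumption about shape only: handleProviderTimeout, the logger.error(record) signature, and the sample host are hypothetical and do not describe the merged implementation.

```javascript
// Hypothetical hard-fail handler: keeps the console output and the nonzero exit,
// and adds a durable record through an injected logger.
function handleProviderTimeout({details, logger, exit, stderr = console.error}) {
    const record = {event: 'runSandman.providerUnavailable', ...details};

    stderr(`[runSandman] provider unavailable at ${details.host} after ${details.elapsedMs}ms`);
    logger.error(record); // durable capture happens before the exit
    exit(1);              // existing nonzero exit behavior is preserved
}

// Test doubles: record what would have been logged and which exit code was requested.
const captured  = [];
const exitCodes = [];

handleProviderTimeout({
    details: {host: 'http://localhost:1234', elapsedMs: 30000, timeoutMs: 30000},
    logger : {error: record => captured.push(record)},
    exit   : code => exitCodes.push(code),
    stderr : () => {} // silence console output in the test
});
```

After the call, a test can assert that captured holds exactly one record with the expected host and that exitCodes contains a single 1. In production the same handler would receive the real logger and process.exit.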

Out of Scope

Re-enabling auto-boot, boot-time summarization, autoDream, or autoGoldenPath behavior (#10569).
Fixing the LM Studio monolithic JSON payload crash / SQLite batch-size issue (#10484).
Reworking provider startup ordering already handled by #9832.
Building a full autonomous healthcheck workflow (#10018).
Moving MCP configuration into the Services SDK (#10103).
Solving MCP server duplication / single-writer enforcement (#10186).

Duplicate Sweep / Related Work

#9832: closed startup sequence race condition; related but not duplicate.
#10569: closed boot-time auto-summary re-enable proposal; explicitly not this ticket.
#10484: open LM Studio payload / SQLite batching bug; related provider infrastructure, not this failure path.
#10018: autonomous healthcheck workflow; broader process work, not this concrete breadcrumb.
#10186: MCP concurrency / single-writer epic; strategic blocker for auto-boot, not this explicit operator path.
#10103: SDK-layer config migration; future boundary work, not required here.

Handoff Retrieval Hints

query_raw_memories(query="runSandman provider hard-fail durable observability Slice B")
query_raw_memories(query="post restart verification MCP logger config freshness 10580 10583 10585")
query_summaries(query="DreamService Golden Path Sandman recovery trio coordination")
Source anchors: buildScripts/ai/runSandman.mjs, ai/mcp/server/memory-core/services/lifecycle/InferenceLifecycleService.mjs

tobiu referenced in commit 3397976 - "feat(ai): capture runSandman provider hard-failures (#10587) (#10590)" on May 1, 2026, 7:29 PM

tobiu closed this issue on May 1, 2026, 7:29 PM