Frontmatter
id: 11384
title: ProcessSupervisorService discards child stderr — capture per-task to enable failure triage
state: Closed
labels: enhancement, ai, agent-task:pending, architecture, build, model-experience
assignees: []
createdAt: May 15, 2026, 1:52 AM
updatedAt: May 15, 2026, 1:54 AM
githubUrl: https://github.com/neomjs/neo/issues/11384
author: neo-opus-4-7
commentsCount: 1
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 15, 2026, 1:54 AM

ProcessSupervisorService discards child stderr — capture per-task to enable failure triage

neo-opus-4-7 commented on May 15, 2026, 1:52 AM

Context

ai/daemons/services/ProcessSupervisorService.mjs is the orchestrator's universal child-process launcher. Every supervisor-managed task (chroma, bridge daemon, mlx, summary, kbSync, backup) spawns through runTask() at line 207:

child = this.spawnFn(task.command, task.args, {stdio: 'ignore'});

stdio: 'ignore' discards stdout AND stderr from the child. The orchestrator's own log records only the exit code (exited with code 1); there is no recoverable signal about WHY a child failed.
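
The effect of that option can be reproduced in isolation. The following is a minimal sketch, not the supervisor's code: it assumes a throwaway child that prints a made-up message to stderr and exits with code 1, and it uses spawnSync (rather than the supervisor's spawnFn) and CommonJS require so it runs as a plain standalone script.

```javascript
// Standalone demo; the child script and its message are illustrative assumptions.
const { spawnSync } = require('node:child_process');

// A child that writes a diagnostic to stderr and then fails.
const childSource = "console.error('something went wrong'); process.exit(1);";

// Same stdio setting as the supervisor: stdin, stdout and stderr are all discarded.
const result = spawnSync(process.execPath, ['-e', childSource], { stdio: 'ignore' });

console.log(result.status); // 1: the exit code is still observable
console.log(result.stderr); // null: no stderr was captured
```

The exit code survives, but the message explaining it does not, which matches what the orchestrator log shows.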

Empirical anchor (2026-05-14T23:24Z, first npm run ai:orchestrator boot):

  • session summarization exited with code 1 — root cause unknown (likely MC Chroma absence; cascade)
  • mlx inference exited with code 1 — root cause unknown (turned out to be gemma4:31b Ollama-format model arg vs mlx_lm HF-format expectation; required manual python -m mlx_lm.server reproduction to discover)
  • memory core backup exited with code 1 — root cause unknown
  • knowledge base sync exited with code 1 — root cause unknown

Each of those failures required out-of-band manual reproduction to diagnose. Neither the supervisor's own log nor its state file contained any error text. mlx alone has been restart-looping roughly every 15 seconds since boot (about 40 restarts in about 10 minutes), with no stderr captured anywhere.

Problem

The Agent OS daemon substrate is observability-blind by design. An operator-facing failure such as "mlx inference exited with code 1" cannot be triaged from the orchestrator log alone, so every failure turns into a forensic exercise: manually rerun the child outside the supervisor to capture its stderr.

This violates the AGENTS.md §13 friction → gold core value: substrate that swallows failure signals cannot be improved via the MX loop because the friction is invisible to the model that should be ticketing it.

Fix (concrete prescription)

Replace {stdio: 'ignore'} at ProcessSupervisorService.mjs line 207 with per-task log-file capture:

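// Assumes `fs` (node:fs) and `path` (node:path) are imported at the top of the module.
const stderrLogPath = path.join(this.dataDir, `${taskName}.stderr.log`);
// 'a' opens in append mode and creates the file if it does not exist.
const stderrFd = fs.openSync(stderrLogPath, 'a');

child = this.spawnFn(task.command, task.args, {
    // stdin and stdout stay discarded; stderr goes to the per-task log.
    stdio: ['ignore', 'ignore', stderrFd]
});

// Close the supervisor's copy of the fd when the child exits to avoid an fd leak.
child.on('close', () => {
    try { fs.closeSync(stderrFd); } catch (e) {}
});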

The supervisor keeps stdout discarded (children should not be log-spamming via stdout under orchestration) but routes stderr to a per-task append-only log alongside the existing .pid file in .neo-ai-data/orchestrator-daemon/.
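
The append-only routing described above can be tried outside the supervisor. This is a hedged sketch under stated assumptions: runWithStderrLog, the 'demo-task' name, the temporary directory and the failing child are all hypothetical, and spawnSync stands in for the asynchronous spawnFn so that the descriptor can be closed in a finally block instead of a 'close' handler.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { spawnSync } = require('node:child_process');

// Hypothetical helper: runs a command with stdin/stdout discarded and stderr
// appended to `${dataDir}/${taskName}.stderr.log`.
function runWithStderrLog(dataDir, taskName, command, args) {
    const stderrLogPath = path.join(dataDir, `${taskName}.stderr.log`);
    const stderrFd = fs.openSync(stderrLogPath, 'a'); // append, create if missing

    try {
        const result = spawnSync(command, args, { stdio: ['ignore', 'ignore', stderrFd] });
        return { code: result.status, stderrLogPath };
    } finally {
        fs.closeSync(stderrFd); // the child has exited; release the parent's descriptor
    }
}

// Demo: a throwaway directory and a child that always fails.
const dataDir = fs.mkdtempSync(path.join(os.tmpdir(), 'supervisor-demo-'));
const failing = ['-e', "console.error('attempt failed'); process.exit(1);"];

const first = runWithStderrLog(dataDir, 'demo-task', process.execPath, failing);
const second = runWithStderrLog(dataDir, 'demo-task', process.execPath, failing);

// Both failed runs are present in the same file because of append mode.
const captured = fs.readFileSync(second.stderrLogPath, 'utf8');
console.log(first.code, second.code);
console.log(captured);
```

Running the helper twice against the same task name shows the property the ticket relies on: the second run adds to the log rather than truncating it, so restart-loop history stays visible.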

When a child exits non-zero, the existing writeLog('ERROR', ...) call adds a hint to the orchestrator log:

} else {
    this.taskStateService.markFailed(taskName, code);
    this.writeLog?.('ERROR', `[ProcessSupervisor] ${task.label} exited with code ${code}. stderr: ${stderrLogPath}`);
    this.recordTaskOutcome(taskName, 'failed', {reason, code, stderrLogPath, failedAt: new Date().toISOString()});
}

Acceptance Criteria

  • Per-task stderr log file written to ${dataDir}/${taskName}.stderr.log (append-mode).
  • Each non-zero exit logged with stderr: <path> hint in the orchestrator log.
  • HealthService recordTaskOutcome payload includes stderrLogPath.
  • Log path is greppable from operator shell without restarting any process.
  • File descriptor leak prevention: stderr fd is closed when child exits.
  • Append-mode preserves history across restarts (mlx restart-loop history visible).
  • Manual log-rotation strategy noted in JSDoc (filed as future-ticket if needed; out of scope here).
  • Existing recovery path (recoverTask) unaffected — adopted external PIDs don't get stderr redirection retroactively (that's fine; they had their own stdio at spawn time).
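
Once the per-task log exists, triage reduces to reading its tail. The helper below is only an illustration of that step and is not part of the proposed change: tailLines, the file name and the sample lines are assumptions, and it reads the whole file into memory, which is acceptable for a sketch but not for very large logs.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical triage helper: returns the last `maxLines` non-empty lines of a log.
function tailLines(logPath, maxLines = 20) {
    if (!fs.existsSync(logPath)) return []; // task never wrote to stderr
    const lines = fs.readFileSync(logPath, 'utf8')
        .split(/\r?\n/)
        .filter(line => line.length > 0);
    return lines.slice(-maxLines);
}

// Demo against a throwaway log file with made-up content.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'triage-demo-'));
const demoLog = path.join(demoDir, 'demo-task.stderr.log');
fs.writeFileSync(demoLog, 'line 1\nline 2\nline 3\nline 4\nline 5\n');

const lastTwo = tailLines(demoLog, 2);
console.log(lastTwo); // [ 'line 4', 'line 5' ]
```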

Out of Scope

  • Log rotation — append-mode logs may grow unbounded for fast-failing children (mlx empirical: ~40 restarts in 10min). File a follow-up ticket if size becomes operationally relevant; for now, manual operator cleanup is acceptable.
  • Stdout capture — children should not log via stdout under orchestration; if they do, that's their bug.
  • Restructuring the supervisor to use Neo-style logger — keep current writeLog callback shape; this is a stdio-routing fix, not a logging-architecture refactor.

Avoided Traps

  • Buffering child stderr in-memory: would inflate orchestrator-process memory on restart-loop children (mlx 40-restarts/10min). Disk-append is bounded by disk, not orchestrator RAM.
  • Sending stderr to a single shared file: would interleave output from N children; per-task isolation is the diagnosable shape.
  • Piping stderr back into the orchestrator process (stdio: ['ignore', 'ignore', 'pipe'] plus a manual child.stderr.on('data', ...) handler): adds memory pressure and buffer-management complexity for no gain over routing a file descriptor opened with fs.openSync.

Related

  • Substrate authority: ai/daemons/services/ProcessSupervisorService.mjs line 207
  • Orchestrator: ai/daemons/Orchestrator.mjs, ai/daemons/TaskDefinitions.mjs
  • #11009 — orchestrator class refactor establishing the Neo daemon shape this lives within
  • Related substrate gap: orchestrator should manage MC Chroma (separate ticket; would benefit from per-task stderr capture for diagnostics)

Origin Session

  • Origin Session ID: e095c569-beac-4743-998f-e07d4344492e

Retrieval Hint

Search for ProcessSupervisorService stdio ignore child stderr unrecoverable orchestrator daemon log.