Context
ai/daemons/services/ProcessSupervisorService.mjs is the orchestrator's universal child-process launcher. Every supervisor-managed task (chroma, bridge daemon, mlx, summary, kbSync, backup) spawns through runTask() at line 207:
child = this.spawnFn(task.command, task.args, {stdio: 'ignore'});
stdio: 'ignore' discards stdout AND stderr from the child. The orchestrator's own log records only the exit code (exited with code 1); there is no recoverable signal about WHY a child failed.
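The effect can be reproduced outside the supervisor. The following is a minimal sketch, not code from ProcessSupervisorService.mjs: it is written as a standalone CommonJS script, uses spawnSync instead of the supervisor's spawnFn so it stays synchronous, and the failing child is a stand-in.

```javascript
// Standalone sketch; the child command below is a stand-in, not a real task.
const { spawnSync } = require('node:child_process');

// A child that writes a diagnostic to stderr and then exits with code 1.
const failingArgs = ['-e', 'console.error("simulated failure"); process.exit(1)'];

// Same stdio setting as the supervisor's current spawn call.
const result = spawnSync(process.execPath, failingArgs, { stdio: 'ignore' });

// The exit code is all that reaches the parent; no stderr stream was captured.
console.log('exit code:', result.status);
console.log('captured stderr:', result.stderr);
```

With this setting the parent has nothing beyond the exit code to report, which matches what the orchestrator log showed.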
Empirical anchor (2026-05-14T23:24Z, first npm run ai:orchestrator boot):
- session summarization exited with code 1: root cause unknown (likely MC Chroma absence; cascade)
- mlx inference exited with code 1: root cause unknown (turned out to be gemma4:31b Ollama-format model arg vs mlx_lm HF-format expectation; required manual python -m mlx_lm.server reproduction to discover)
- memory core backup exited with code 1: root cause unknown
- knowledge base sync exited with code 1: root cause unknown
Each of those failures required out-of-band manual reproduction to diagnose. The supervisor's own log + state file had no error text. mlx alone has been auto-restart-looping every ~15 seconds since boot (~10 minutes = ~40 restarts) with zero stderr captured anywhere.
Problem
The Agent OS daemon substrate is observability-blind by design. Operator-facing failure modes ("the orchestrator exited code 1") cannot be triaged from the orchestrator log alone. Every failure becomes a forensic exercise of "manually rerun the child outside the supervisor to capture stderr."
This violates the AGENTS.md §13 friction → gold core value: substrate that swallows failure signals cannot be improved via the MX loop because the friction is invisible to the model that should be ticketing it.
Fix (concrete prescription)
Replace {stdio: 'ignore'} at ProcessSupervisorService.mjs line 207 with per-task log-file capture:
const stderrLogPath = path.join(this.dataDir, `${taskName}.stderr.log`);
const stderrFd = fs.openSync(stderrLogPath, 'a');
child = this.spawnFn(task.command, task.args, {
    stdio: ['ignore', 'ignore', stderrFd]
});
// Close fd when child exits to avoid fd leak
child.on('close', () => {
    try { fs.closeSync(stderrFd); } catch (e) {}
});
The supervisor keeps stdout discarded (children should not be log-spamming via stdout under orchestration) but routes stderr to a per-task append-only log alongside the existing .pid file in .neo-ai-data/orchestrator-daemon/.
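The fd routing described here can be tried in isolation. This is a minimal sketch under stated assumptions, not the supervisor's implementation: it is a standalone CommonJS script, the temporary directory stands in for this.dataDir, demoTask and runOnce are hypothetical names, and spawnSync replaces spawnFn so the parent can close its descriptor as soon as the child has finished.

```javascript
// Standalone sketch; dataDir, runOnce and 'demoTask' are hypothetical.
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { spawnSync } = require('node:child_process');

// Temporary stand-in for the supervisor's data directory.
const dataDir = fs.mkdtempSync(path.join(os.tmpdir(), 'supervisor-demo-'));

function runOnce(taskName, args) {
    const stderrLogPath = path.join(dataDir, `${taskName}.stderr.log`);
    const stderrFd = fs.openSync(stderrLogPath, 'a'); // append; created if missing
    try {
        // stdin and stdout stay discarded; the child's fd 2 is the log file.
        const { status } = spawnSync(process.execPath, args, {
            stdio: ['ignore', 'ignore', stderrFd]
        });
        return { status, stderrLogPath };
    } finally {
        fs.closeSync(stderrFd); // release the parent's copy of the descriptor
    }
}

// Two failing runs of the same task, as in a restart loop.
const failingArgs = ['-e', 'console.log("noise"); console.error("simulated failure"); process.exit(1)'];
const first = runOnce('demoTask', failingArgs);
const second = runOnce('demoTask', failingArgs);

const captured = fs.readFileSync(second.stderrLogPath, 'utf8');
console.log(first.status, second.status);
console.log(captured);
```

Because the file is opened in append mode, output from the second run lands after the first rather than replacing it, and the stdout line never reaches the log.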
When a child exits non-zero, the existing writeLog('ERROR', ...) call adds a hint to the orchestrator log:
} else {
    this.taskStateService.markFailed(taskName, code);
    this.writeLog?.('ERROR', `[ProcessSupervisor] ${task.label} exited with code ${code}. stderr: ${stderrLogPath}`);
    this.recordTaskOutcome(taskName, 'failed', {reason, code, stderrLogPath, failedAt: new Date().toISOString()});
}
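Once the path is in the orchestrator log, triage can start from the referenced file instead of a manual rerun. The helper below is a hypothetical sketch of that operator-side step (tailStderrLog is not part of the supervisor or of this ticket's scope). It reads only the last maxBytes of the file, on the assumption that an append-only log may have grown large.

```javascript
// Standalone sketch; tailStderrLog is a hypothetical operator-side helper.
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Return up to `maxLines` trailing non-empty lines, reading at most `maxBytes`.
function tailStderrLog(stderrLogPath, maxLines = 20, maxBytes = 64 * 1024) {
    const fd = fs.openSync(stderrLogPath, 'r');
    try {
        const { size } = fs.fstatSync(fd);
        if (size === 0) return [];
        const length = Math.min(size, maxBytes);
        const buffer = Buffer.alloc(length);
        // Read the final `length` bytes; the first line may be cut mid-way
        // when the file is larger than maxBytes.
        fs.readSync(fd, buffer, 0, length, size - length);
        const lines = buffer.toString('utf8').split(/\r?\n/).filter(line => line.length > 0);
        return lines.slice(-maxLines);
    } finally {
        fs.closeSync(fd);
    }
}

// Usage against a throwaway log file.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'tail-demo-'));
const demoLog = path.join(demoDir, 'demoTask.stderr.log');
fs.writeFileSync(demoLog, 'first\nsecond\nthird\n');
const lastTwo = tailStderrLog(demoLog, 2);
console.log(lastTwo);
```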
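Acceptance Criteria
- Each supervisor-spawned child's stderr is written to ${dataDir}/${taskName}.stderr.log (append-mode).
- A non-zero child exit adds a stderr: <path> hint in the orchestrator log.
- The recordTaskOutcome payload includes stderrLogPath.
- The recoverTask path is unaffected: adopted external PIDs don't get stderr redirection retroactively (that's fine; they had their own stdio at spawn time).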
Out of Scope
- Log rotation — append-mode logs may grow unbounded for fast-failing children (mlx empirical: ~40 restarts in 10min). File a follow-up ticket if size becomes operationally relevant; for now, manual operator cleanup is acceptable.
- Stdout capture — children should not log via stdout under orchestration; if they do, that's their bug.
- Restructuring the supervisor to use Neo-style logger: keep the current writeLog callback shape; this is a stdio-routing fix, not a logging-architecture refactor.
Avoided Traps
- Buffering child stderr in-memory: would inflate orchestrator-process memory on restart-loop children (mlx 40-restarts/10min). Disk-append is bounded by disk, not orchestrator RAM.
- Sending stderr to a single shared file: would interleave output from N children; per-task isolation is the diagnosable shape.
- Piping stderr through Node IPC (stdio: ['ignore', 'ignore', 'pipe'] + manual child.stderr.on('data', ...)): adds memory pressure + buffer-management complexity for zero gain over fs.openSync fd routing.
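For contrast, the pipe-based trap can be sketched as follows. This is an illustrative standalone CommonJS script, not code from the repository; runWithPipe and the 64 KiB cap are assumptions chosen to show the bookkeeping the fd route avoids.

```javascript
// Standalone sketch of the rejected alternative; runWithPipe is hypothetical.
const { spawn } = require('node:child_process');

function runWithPipe(args, maxBytes = 64 * 1024) {
    return new Promise((resolve, reject) => {
        const child = spawn(process.execPath, args, {
            stdio: ['ignore', 'ignore', 'pipe']
        });
        const chunks = [];
        let held = 0;
        // Every stderr chunk passes through the parent's event loop and sits
        // in parent memory; the cap below is manual buffer management.
        child.stderr.on('data', chunk => {
            if (held < maxBytes) {
                const kept = chunk.subarray(0, maxBytes - held);
                chunks.push(kept);
                held += kept.length;
            }
        });
        child.on('error', reject);
        child.on('close', code => {
            resolve({ code, stderr: Buffer.concat(chunks).toString('utf8') });
        });
    });
}

const pending = runWithPipe(['-e', 'console.error("simulated failure"); process.exit(1)']);
pending.then(({ code, stderr }) => console.log(code, stderr.trim()));
```

The parent has to own a listener, a byte counter, and a truncation policy per child, and the text still has to be written somewhere afterwards. Handing the child a file descriptor leaves all of that to the operating system.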
Related
- Substrate authority: ai/daemons/services/ProcessSupervisorService.mjs line 207
- Orchestrator: ai/daemons/Orchestrator.mjs, ai/daemons/TaskDefinitions.mjs
- #11009 — orchestrator class refactor establishing the Neo daemon shape this lives within
- Related substrate gap: orchestrator should manage MC Chroma (separate ticket; would benefit from per-task stderr capture for diagnostics)
Origin Session
- Origin Session ID: e095c569-beac-4743-998f-e07d4344492e
Retrieval Hint
Search for ProcessSupervisorService stdio ignore child stderr unrecoverable orchestrator daemon log.