Context
Sub-of M4 per-task coordinator decomposition (learn/agentos/v13-path.md:188-193, 195). Sibling-fileable from #11062 (BackupCoordinatorService) — same gap-fill pattern: M4 coordinator named in v13-path but never filed as ticket. Operator surfaced 2026-05-09 during the post-PR-#11021 architectural-correction audit.
Empirical anchor (v13-path.md:195):
"Exit gate: npm run ai:run-sandman becomes optional manual override; Orchestrator schedules it natively. Sandman + Golden Path back automatically — single source of truth. DreamService restoration thesis closed — sandman_handoff.md produced by daemon-driven mathematical-weighted-priorities replaces manual cross-family next-task-selection cognition (per D3.1)."
Empirical anchor (v13-path.md:312): operator quote:
"if DreamService was fully functional, gemma4-31b would parse the graph and give us sandman_handoff with mathematical weighted priorities — way less cognitive load"
This is the load-bearing v13 enabler per the DreamService restoration thesis. M3.5 keystone substrate (TaskStateService + ProcessSupervisorService + CadenceEngine, all merged 2026-05-09) makes the M4 per-task coordinators incremental; SandmanCoordinatorService is the highest cognitive-load-offload coordinator.
NEW architectural dimension (operator clarification 2026-05-09T22): "sandman => can disable (overwhelm) add_memory for several minutes. gemma4 has a huge backlog to parse. but we need to be careful that it does not freeze DB access for too long"
The Sandman cycle is heavy LLM-bound work (gemma4-31b Tri-Vector extraction across N sessions per cycle) + heavy graph writes (per-session SemanticGraphExtractor + TopologyInferenceEngine + GapInferenceEngine + GarbageCollection). Concurrent peer work (heavy unit tests, MCP write traffic) experiences add_memory saturation under this load — empirically observed in this very session as add_memory returning module-not-found timeouts during heavy testing.
This makes SandmanCoordinatorService fundamentally different from BackupCoordinatorService: backup is a fast filesystem snapshot with bounded duration; Sandman is multi-minute LLM-bound work that competes with peer agents for shared substrate. The coordinator must own contention-aware scheduling semantics, not just interval-based "is it 24h since last run."
Duplicate sweep before filing:
- gh issue list --search "SandmanCoordinatorService extraction": no matches
- gh issue list --label ai --search "Orchestrator schedules sandman": no open match (#10780 closed-not-planned 2026-05-09 was the manual-discipline approach)
- v13-path.md line 193 names "SandmanCoordinatorService", but it was never filed
The Problem
DreamService is currently disabled by deliberate operator-safety pause:
- ai/mcp/server/memory-core/config.template.mjs:34,49,55 sets autoSummarize: false, autoDream: false, autoGoldenPath: false
- npm run ai:run-sandman is the only invocation path (buildScripts/ai/runSandman.mjs)
- Manual operator invocation is the only way to refresh sandman_handoff.md
- Result: cross-family next-task selection cognition is manual instead of daemon-driven (per v13-path.md:312 operator quote)
Wrong-shape "fix" rejected: flipping autoDream: true would fire on EVERY MC server boot, with no contention awareness, no scheduling, and no backup precondition. It would add LLM saturation pressure unpredictably. Correct shape: orchestrator-driven scheduling via SandmanCoordinatorService.
The Architectural Reality
Sibling precedent: ai/daemons/services/SummarizationCoordinatorService.mjs (#11009) for the basic shape; ai/daemons/services/BackupCoordinatorService.mjs (#11062, in flight) for the M4 coordinator pattern. Both lift getDueTask({state, now, ...intervals}) returning trigger envelope or null.
SandmanCoordinatorService extends with contention-awareness:
- Backup precondition (per v13-path.md:90): getDueTask returns null if no recent successful backup exists (operator-safety: don't run heavy graph mutations without recovery point)
- LLM-provider-readiness check: probe gemma4 endpoint health before signaling due (avoid spawning a child that immediately fails on provider unreachability)
- Substrate-contention deferral: if add_memory p99 latency > threshold OR concurrent task spawn count > threshold, defer this cycle (back-off pattern; operator-clarification 2026-05-09T22 anchor)
- Off-peak window preference: prefer scheduling during operator-defined low-contention windows (env-var configurable; default 24h interval but with deferral allowed up to N hours before forced spawn)
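The substrate-contention deferral and the "deferral allowed up to N hours before forced spawn" rule could be sketched as one pure check. Everything here is an assumption for illustration: the function name, the threshold keys, the two extra deferral-reason strings, and the default values are hypothetical, and the add_memory p99 sample is assumed to be collected elsewhere.

```javascript
// Hypothetical sketch: none of these names or defaults exist in the codebase yet.
const HOUR_MS = 60 * 60 * 1000;

function checkContention({
    addMemoryP99Ms,   // observed add_memory p99 latency, sampled elsewhere (assumption)
    activeSpawnCount, // number of orchestrator children currently running
    overdueMs,        // how long Sandman has been due but deferred
    thresholds = {p99Ms: 2000, spawnCount: 1, maxDeferralMs: 6 * HOUR_MS}
}) {
    // Forced spawn: once the deferral budget is spent, contention no longer blocks the cycle.
    if (overdueMs >= thresholds.maxDeferralMs) {
        return {defer: false, forced: true};
    }
    if (addMemoryP99Ms > thresholds.p99Ms) {
        return {defer: true, deferralReason: 'add-memory-latency-high'};
    }
    if (activeSpawnCount > thresholds.spawnCount) {
        return {defer: true, deferralReason: 'spawn-count-high'};
    }
    return {defer: false, forced: false};
}
```

Keeping the forced-spawn check first means a permanently busy substrate cannot starve Sandman indefinitely; whether that ordering is wanted is an open design question for the ticket.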
Existing Sandman mechanics:
- buildScripts/ai/runSandman.mjs: operator-runnable; bypasses auto-flags + waits for LLM provider + runs DreamService.processUndigestedSessions()
- ai/daemons/DreamService.mjs#processUndigestedSessions: full REM pipeline (Phase 0 FS ingest → Phase 1 Tri-Vector → Phase 2 Topology → Phase 3 Gap → Phase 4 GC → Phase 5 Golden Path → sandman_handoff.md)
- Memory_Config.modelProvider === 'openAiCompatible' triggers MLX provider readiness probe inside DreamService
Required wiring:
- New: ai/daemons/services/SandmanCoordinatorService.mjs (class-only, post-#11049 invariant)
- Orchestrator.mjs: add sandman task definition pointing at buildScripts/ai/runSandman.mjs
- Orchestrator.mjs: add sandmanCoordinator_ config + sandmanIntervalMs_ + sandmanContentionThresholds_ (DI seam)
- Orchestrator.mjs: add runSandmanCycle(now) method matching summary/kbSync/backup pattern
- Cross-coordinator dependency: SandmanCoord queries BackupCoord state OR HealthService for backup-recency
The Fix
Step 1: Create ai/daemons/services/SandmanCoordinatorService.mjs
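// Class-only file (entry-point invariant per #11049)
import Base from '../../../src/core/Base.mjs';

/**
 * @summary Builds the task trigger for the Sandman REM-cycle lane with contention-awareness.
 *
 * A trigger is returned only when ALL of these hold:
 *   1. Interval is due (default 24h)
 *   2. Recent successful backup exists (default within last 36h)
 *   3. LLM provider is reachable
 *   4. No concurrent heavy task is running (summary, kbSync, backup)
 *   5. Optional: current time is within the configured off-peak window
 *
 * Each unmet precondition yields its own deferralReason so operators can see why Sandman did not fire.
 *
 * @param {Object} options
 * @returns {Object|null} Trigger envelope, a {deferralReason} envelope, or null when disabled.
 */
export function buildSandmanTrigger({
    now,
    state,
    intervalMs,
    backupRecencyMs = 36 * 60 * 60 * 1000,
    providerHealthy,
    activeTasks = [],
    offPeakWindow
}) {
    if (intervalMs <= 0) {
        return null; // disabled
    }
    if (now - (state.sandman?.lastRunAt || 0) < intervalMs) {
        return {deferralReason: 'interval-not-due'};
    }
    // A missing lastSuccessAt counts as stale; otherwise the NaN comparison would silently pass.
    if (!state.backup?.lastSuccessAt || (now - state.backup.lastSuccessAt > backupRecencyMs)) {
        return {deferralReason: 'backup-stale'};
    }
    if (!providerHealthy) {
        return {deferralReason: 'provider-unreachable'};
    }
    if (activeTasks.includes('summary') || activeTasks.includes('kbSync') || activeTasks.includes('backup')) {
        return {deferralReason: 'peer-task-running'};
    }
    // isInOffPeakWindow is not defined in this ticket yet; it must live in (or be imported into) this module.
    if (offPeakWindow && !isInOffPeakWindow(now, offPeakWindow)) {
        return {deferralReason: 'outside-off-peak'};
    }
    return {
        taskName: 'sandman',
        source  : 'periodic-sweep',
        reason  : `periodic-sweep:${intervalMs}`
    };
}

class SandmanCoordinatorService extends Base { /* ... */ }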
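Step 1 calls isInOffPeakWindow without defining it, and the ticket does not fix the shape of offPeakWindow. A minimal sketch, assuming a hypothetical {startHour, endHour} shape in UTC hours (0-23, end exclusive) and assuming `now` is epoch milliseconds:

```javascript
// Assumed window shape (not specified by the ticket): {startHour, endHour}, UTC, end exclusive.
function isInOffPeakWindow(now, {startHour, endHour}) {
    const hour = new Date(now).getUTCHours();

    if (startHour === endHour) {
        return true; // degenerate window is treated as "always off-peak" (assumption)
    }
    if (startHour < endHour) {
        return hour >= startHour && hour < endHour; // same-day window, e.g. 1 -> 5
    }
    // Window wraps past midnight, e.g. 22 -> 6
    return hour >= startHour || hour < endHour;
}
```

The wrap-around branch matters because the most likely operator window (late evening to early morning) crosses midnight. An operator-local timezone variant would need an explicit timezone setting, which this sketch leaves out.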
Step 2: Wire into Orchestrator (post-Sub-4 slimmed shape; rebase to lift runIfDue / DI patterns Sub-4 introduces)
[Wiring shape mirrors BackupCoordinatorService #11062 + summary/kbSync existing pattern.]
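Until the post-Sub-4 shape is known, the lane can only be sketched against stand-ins. In the sketch below, getDueTask and runSandmanCycle are the names used in this ticket; the `supervisor.spawn` and `logger` interfaces, and passing dependencies as an options object, are assumptions rather than the real Orchestrator API.

```javascript
// Sketch only: `coordinator`, `supervisor`, and `logger` stand in for whatever the
// post-Sub-4 Orchestrator injects; their method signatures here are assumptions.
function runSandmanCycle({now, coordinator, supervisor, logger}) {
    try {
        const result = coordinator.getDueTask({now});

        if (!result) {
            return null; // lane disabled
        }
        if (result.deferralReason) {
            // Never defer silently: the reason is what answers "why didn't Sandman fire?"
            logger.info(`[sandman] deferred: ${result.deferralReason}`);
            return result;
        }
        supervisor.spawn(result.taskName, result);
        logger.info(`[sandman] spawned (${result.reason})`);
        return result;
    } catch (error) {
        // Failure isolation: a Sandman error must not abort the other maintenance lanes.
        logger.error(`[sandman] cycle failed: ${error.message}`);
        return null;
    }
}
```

The try/catch is the failure-isolation piece; the summary, kbSync, and backup lanes would each need the same wrapper for a throwing coordinator to stay contained.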
Step 3: Backup-recency-state coupling
SandmanCoord needs to know if recent backup succeeded. Options:
- (A) Read TaskStateService directly via DI for state.backup.lastSuccessAt (coupled but simple)
- (B) Read from HealthService.getTaskOutcome (cleaner, asks the observability layer)
- (C) Cross-coordinator query method on BackupCoordinatorService (getLastSuccessAt())
Likely option B (HealthService), aligned with how SummarizationCoordinatorService queries state via DI.
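Option B might look roughly like this. HealthService.getTaskOutcome is named above, but its signature and return shape are not, so the `{status, finishedAt}` record used here is an assumption, as is the helper name isBackupRecent.

```javascript
// Assumption: getTaskOutcome(taskName) returns {status, finishedAt} (epoch ms) or undefined.
function isBackupRecent({healthService, now, backupRecencyMs = 36 * 60 * 60 * 1000}) {
    const outcome = healthService.getTaskOutcome('backup');

    // Fail closed: unknown, failed, or malformed outcomes all count as "no recovery point".
    if (!outcome || outcome.status !== 'success' || !Number.isFinite(outcome.finishedAt)) {
        return false;
    }
    return now - outcome.finishedAt <= backupRecencyMs;
}
```

Failing closed matches the operator-safety intent of the precondition: heavy graph mutations are skipped whenever backup recency cannot be proven.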
Step 4: Test substrate
- New: SandmanCoordinatorService.spec.mjs, modeled on BackupCoordinatorService.spec.mjs
- Boundary tests for each deferral reason (5+ cases)
- Test-spec-as-entry-point (Neo+core bootstrap at top per #11049)
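The boundary cases are easiest to enumerate as a table. The sketch below is framework-free and uses a stand-in (timeGate, a hypothetical name) that repeats only the two time-based comparisons from Step 1; the real spec would call buildSandmanTrigger directly and follow the repo's test conventions.

```javascript
// Stand-in for the interval and backup-recency comparisons in buildSandmanTrigger.
function timeGate({now, state, intervalMs, backupRecencyMs}) {
    if (now - (state.sandman?.lastRunAt || 0) < intervalMs) {
        return 'interval-not-due';
    }
    if (!state.backup || now - state.backup.lastSuccessAt > backupRecencyMs) {
        return 'backup-stale';
    }
    return 'due';
}

const DAY_MS     = 24 * 60 * 60 * 1000;
const RECENCY_MS = 36 * 60 * 60 * 1000;
const NOW        = 10 * DAY_MS; // arbitrary fixed clock for deterministic cases

const cases = [
    {name: 'one ms before interval', expected: 'interval-not-due',
        state: {sandman: {lastRunAt: NOW - DAY_MS + 1}, backup: {lastSuccessAt: NOW}}},
    {name: 'exactly at interval', expected: 'due',
        state: {sandman: {lastRunAt: NOW - DAY_MS}, backup: {lastSuccessAt: NOW}}},
    {name: 'backup exactly at recency edge', expected: 'due',
        state: {sandman: {lastRunAt: 0}, backup: {lastSuccessAt: NOW - RECENCY_MS}}},
    {name: 'backup one ms past recency edge', expected: 'backup-stale',
        state: {sandman: {lastRunAt: 0}, backup: {lastSuccessAt: NOW - RECENCY_MS - 1}}},
    {name: 'no backup state at all', expected: 'backup-stale',
        state: {sandman: {lastRunAt: 0}}}
];

// Names of the cases whose outcome differs from the expectation.
const failures = cases
    .filter(c => timeGate({now: NOW, state: c.state, intervalMs: DAY_MS, backupRecencyMs: RECENCY_MS}) !== c.expected)
    .map(c => c.name);
```

The "exactly at" rows pin down the inclusive/exclusive choice: the interval check uses `<`, so a run is due at exactly intervalMs, and the recency check uses `>`, so a backup aged exactly backupRecencyMs still counts.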
Step 5: Documentation
- Update learn/agentos/DreamPipeline.md "Running the Pipeline" section: reflect orchestrator-driven scheduling vs npm run ai:run-sandman manual override
- Document deferral-reason vocabulary for operator log-reading
- Cross-reference SandmanCoord substrate from runSandman.mjs JSDoc
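The deferral-reason vocabulary could be kept next to the code as a lookup that both the docs and the log formatter read from. The five reason strings are the ones returned in Step 1; the descriptions, the describeDeferral name, and the log-line format are illustrative assumptions.

```javascript
// Reason keys come from Step 1; the wording of each description is a placeholder.
const DEFERRAL_VOCABULARY = {
    'interval-not-due'    : 'Interval has not elapsed since the last Sandman run.',
    'backup-stale'        : 'No successful backup inside the recency window.',
    'provider-unreachable': 'LLM provider readiness probe failed.',
    'peer-task-running'   : 'summary, kbSync, or backup is currently running.',
    'outside-off-peak'    : 'Current time is outside the configured off-peak window.'
};

// Turns a buildSandmanTrigger-style result into one operator-readable log line.
function describeDeferral(result) {
    if (!result) {
        return 'sandman: disabled (intervalMs <= 0)';
    }
    if (!result.deferralReason) {
        return `sandman: due (${result.reason})`;
    }
    const text = DEFERRAL_VOCABULARY[result.deferralReason] || 'Unknown deferral reason.';

    return `sandman: deferred [${result.deferralReason}] ${text}`;
}
```

Keeping the bracketed reason key in the line makes orchestrator.log greppable by reason while the sentence stays readable.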
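Acceptance Criteria
- ai/daemons/services/SandmanCoordinatorService.mjs exists; class-only file (no Neo import); getDueTask returns trigger or null
- buildSandmanTrigger pure function exported; covers interval-due + backup-recency + provider-health + peer-task-contention + off-peak deferrals; returns deferralReason envelope (NOT just null) for observability
- sandman task in buildTaskDefinitions pointing at buildScripts/ai/runSandman.mjs
- sandmanCoordinator + sandmanIntervalMs (env override NEO_ORCHESTRATOR_SANDMAN_INTERVAL_MS, default 24h) + reasonable defaults for backup-recency-window + off-peak-window-config (env-driven, optional)
- runMaintenanceCycle adds Sandman lane (failure-isolated, matching summary + kbSync + backup pattern)
- SandmanCoordinatorService.spec.mjs covering all 5+ deferral-reason boundary cases
- learn/agentos/DreamPipeline.md "Running the Pipeline" section updated to reflect orchestrator-driven scheduling
- pull-request §6.1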
Dependencies
- #11062 (BackupCoordinatorService) — Sandman's backup-recency precondition needs HealthService task-outcome state. SandmanCoord MUST land AFTER BackupCoord OR a backup-recency-stub (reads HealthService) exists. Recommendation: file as blocked by #11062 → land after BackupCoord ships.
- M3.5 Sub-4 (Orchestrator slim-down by @neo-gemini-3-1-pro, in flight): SandmanCoord wires into the slimmed Orchestrator shape; rebase onto post-Sub-4 dev to lift the new DI pattern (CadenceEngine.runIfDue per Gemini's A2A 2026-05-09).
Out of Scope
- Backup-success precondition for ALL DreamMode/Sandman invocation paths (not just orchestrator): operator can still bypass via npm run ai:run-sandman manual; that path stays the manual-override route per v13-path.md:195. Mechanical-gate variant (runSandman halts unless backup exists) is filable separately if needed.
- Adaptive contention-threshold tuning (auto-adjust deferral thresholds based on observed add_memory p99): file as scope-extension once empirical data justifies adaptive logic
- Cross-coordinator scheduling synchronization (don't spawn TWO heavy tasks at once even when both are due) — orchestrator's existing per-task PID-file lock handles single-task-instance; cross-task gating is filable as scope-extension if observed contention shows need
- DreamCoordinatorService extraction — paired but separate; file once SandmanCoord lands so the pair-shape is clear (coordinator-pair lifts the shared LLM-provider-readiness primitive)
- GoldenPathCoordinatorService / GraphMaintenanceCoordinatorService — separate filings per the M4 5-coordinator landscape; sequencing TBD
Avoided Traps
- ❌ Naive interval-only scheduling: would freeze peer DB access during operator-active hours per the contention concern. Contention-awareness is load-bearing.
- ❌ Bundling backup-precondition gate into runSandman.mjs: would also gate the manual operator-override path, defeating its purpose. The gate belongs in the COORDINATOR (orchestrator-driven path only).
- ❌ Re-enabling autoDream: true startup flag as the "fix": wrong shape per v13-path.md:195 exit-gate framing. Orchestrator IS the scheduling substrate; startup-auto-fire is the discarded approach.
- ❌ Hardcoded contention thresholds: env-driven config (with sensible defaults) lets operators tune without code changes.
- ❌ Silent deferral (return null): loses operator visibility. Each deferral SHOULD log its reason — supports the "why didn't Sandman fire?" diagnostic.
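The env-driven config trap can be illustrated with a small reader. NEO_ORCHESTRATOR_SANDMAN_INTERVAL_MS and the 24h / 36h defaults come from this ticket; the other two variable names, the 6h deferral budget, and the helper names are hypothetical placeholders.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Parses a non-negative integer from env; falls back on missing or malformed values.
// Zero is allowed on purpose: intervalMs <= 0 is the "lane disabled" signal in Step 1.
function readNonNegativeInt(env, name, fallback) {
    const parsed = Number.parseInt(env[name] ?? '', 10);

    return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

function readSandmanConfig(env) {
    return {
        intervalMs     : readNonNegativeInt(env, 'NEO_ORCHESTRATOR_SANDMAN_INTERVAL_MS', 24 * HOUR_MS),
        // Hypothetical variable names below; the ticket only says these are env-driven.
        backupRecencyMs: readNonNegativeInt(env, 'NEO_ORCHESTRATOR_SANDMAN_BACKUP_RECENCY_MS', 36 * HOUR_MS),
        maxDeferralMs  : readNonNegativeInt(env, 'NEO_ORCHESTRATOR_SANDMAN_MAX_DEFERRAL_MS', 6 * HOUR_MS)
    };
}
```

In the orchestrator this would be called as readSandmanConfig(process.env); taking env as a parameter keeps the function testable without mutating the process environment.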
Provenance
- Operator architectural direction (2026-05-09): v13-path.md:195, "Exit gate: npm run ai:run-sandman becomes optional manual override; Orchestrator schedules it natively"
- Operator cognitive-load-offload framing (2026-05-09): v13-path.md:312, "if DreamService was fully functional, gemma4-31b would parse the graph and give us sandman_handoff with mathematical weighted priorities — way less cognitive load"
- Operator contention clarification (2026-05-09T22): "sandman => can disable (overwhelm) add_memory for several minutes. gemma4 has a huge backlog to parse. but we need to be careful that it does not freeze DB access for too long". This is the substrate-contention dimension that distinguishes Sandman from Backup
- Sibling precedents: SummarizationCoordinatorService.mjs (#11009), BackupCoordinatorService #11062 (in flight)
- D3.1 boundary anchor: v13-path.md:188,193, coordinator-vs-supervisor-vs-state separation; M3.5 keystone substrate makes M4 incremental
Related
- Parent epic: #11022 (Orchestrator decomposition M3.5; M4 follow-up)
- Predecessor (closed): #10780 (manual backup-first discipline — superseded by orchestrator-owned scheduling per this ticket + #11062)
- Sibling M4 coordinators: #11062 (BackupCoordinatorService); DreamCoordinatorService / GoldenPathCoordinatorService / GraphMaintenanceCoordinatorService not yet filed
- Sandman invocation: buildScripts/ai/runSandman.mjs (the manual-override path)
- DreamService: ai/daemons/DreamService.mjs (the actual REM pipeline that runs inside the spawn)
- Pre-disable PR: #10863 (autoX flip-defaults reversal that made operator manual-disable necessary)
- PR #10494 (DreamService regression — possible underlying cause of operator-disable)
- DreamPipeline architecture: learn/agentos/DreamPipeline.md
Self-Identification: @neo-opus-4-7 (Claude Opus 4.7, Claude Code) — chief-architect lane, post-Round-4 architectural follow-up filing. Closes the gap-pattern named in PR #11021 closure (M4 coordinators named in v13-path.md but never filed). Open lane for self-selection; preferred sequence: file → wait for #11062 + Sub-4 → claim post-merge.
Origin Session ID: c2912891-b459-4a03-b2af-154d5e264df1