Frontmatter
id: 11065
title: Extract SandmanCoordinatorService as M4 per-task coordinator (orchestrator-driven Sandman/DreamService scheduling)
state: Closed
labels: enhancement, ai, architecture, model-experience
assignees: neo-gpt
createdAt: May 9, 2026, 11:58 PM
updatedAt: May 12, 2026, 4:08 AM
githubUrl: https://github.com/neomjs/neo/issues/11065
author: neo-opus-4-7
commentsCount: 3
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 10, 2026, 5:45 PM

Extract SandmanCoordinatorService as M4 per-task coordinator (orchestrator-driven Sandman/DreamService scheduling)

neo-opus-4-7 commented on May 9, 2026, 11:58 PM

Context

Sub-of M4 per-task coordinator decomposition (learn/agentos/v13-path.md:188-193, 195). Sibling-fileable from #11062 (BackupCoordinatorService) — same gap-fill pattern: M4 coordinator named in v13-path but never filed as ticket. Operator surfaced 2026-05-09 during the post-PR-#11021 architectural-correction audit.

Empirical anchor (v13-path.md:195):

"Exit gate: npm run ai:run-sandman becomes optional manual override; Orchestrator schedules it natively. Sandman + Golden Path back automatically, single source of truth. DreamService restoration thesis closed: sandman_handoff.md produced by daemon-driven mathematical-weighted-priorities replaces manual cross-family next-task-selection cognition (per D3.1)."

Empirical anchor (v13-path.md:312): operator quote:

"if DreamService was fully functional, gemma4-31b would parse the graph and give us sandman_handoff with mathematical weighted priorities — way less cognitive load"

This is the load-bearing v13 enabler per the DreamService restoration thesis. M3.5 keystone substrate (TaskStateService + ProcessSupervisorService + CadenceEngine, all merged 2026-05-09) makes the M4 per-task coordinators incremental; SandmanCoordinatorService is the highest cognitive-load-offload coordinator.

NEW architectural dimension (operator clarification 2026-05-09T22): "sandman => can disable (overwhelm) add_memory for several minutes. gemma4 has a huge backlog to parse. but we need to be careful that it does not freeze DB access for too long"

The Sandman cycle is heavy LLM-bound work (gemma4-31b Tri-Vector extraction across N sessions per cycle) + heavy graph writes (per-session SemanticGraphExtractor + TopologyInferenceEngine + GapInferenceEngine + GarbageCollection). Concurrent peer work (heavy unit tests, MCP write traffic) experiences add_memory saturation under this load — empirically observed in this very session as add_memory returning module-not-found timeouts during heavy testing.

This makes SandmanCoordinatorService fundamentally different from BackupCoordinatorService: backup is a fast filesystem snapshot with bounded duration; Sandman is multi-minute LLM-bound work that competes with peer agents for shared substrate. The coordinator must own contention-aware scheduling semantics, not just interval-based "is it 24h since last run."

Duplicate sweep before filing:

  • gh issue list --search "SandmanCoordinatorService extraction" — no matches
  • gh issue list --label ai --search "Orchestrator schedules sandman" — no open match (#10780 closed-not-planned 2026-05-09 was the manual-discipline approach)
  • v13-path.md line 193 names "SandmanCoordinatorService" — never filed

The Problem

DreamService is currently disabled by deliberate operator-safety pause:

  • ai/mcp/server/memory-core/config.template.mjs:34,49,55 sets autoSummarize: false, autoDream: false, autoGoldenPath: false
  • npm run ai:run-sandman is the only invocation path (buildScripts/ai/runSandman.mjs)
  • Manual operator invocation is the only way to refresh sandman_handoff.md
  • Result: cross-family next-task selection cognition is manual instead of daemon-driven (per v13-path.md:312 operator quote)

Wrong-shape "fix" rejected: flipping autoDream: true would fire on EVERY MC server boot, with no contention awareness, no scheduling, no backup precondition. Would add LLM saturation pressure unpredictably. Correct shape: orchestrator-driven scheduling via SandmanCoordinatorService.

The Architectural Reality

Sibling precedent: ai/daemons/services/SummarizationCoordinatorService.mjs (#11009) for the basic shape; ai/daemons/services/BackupCoordinatorService.mjs (#11062, in flight) for the M4 coordinator pattern. Both lift getDueTask({state, now, ...intervals}) returning trigger envelope or null.

SandmanCoordinatorService extends with contention-awareness:

  1. Backup precondition (per v13-path.md:90): getDueTask returns null if no recent successful backup exists (operator-safety: don't run heavy graph mutations without recovery point)
  2. LLM-provider-readiness check: probe gemma4 endpoint health before signaling due (avoid spawning a child that immediately fails on provider unreachability)
  3. Substrate-contention deferral: if add_memory p99 latency > threshold OR concurrent task spawn count > threshold, defer this cycle (back-off pattern; operator-clarification 2026-05-09T22 anchor)
  4. Off-peak window preference: prefer scheduling during operator-defined low-contention windows (env-var configurable; default 24h interval but with deferral allowed up to N hours before forced spawn)
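
Items 3 and 4 above can be sketched as one pure check. This is a minimal illustration under stated assumptions, not the planned implementation: the names evaluateContention, addMemoryLatenciesMs, firstDeferredAt, and maxDeferralMs, as well as every default threshold, are hypothetical. The ticket only fixes the rule itself (defer when add_memory p99 latency or the concurrent spawn count exceeds a threshold, and force a spawn once deferral has lasted N hours).

```javascript
// Nearest-rank percentile over a window of latency samples (milliseconds).
function percentile(samples, p) {
    if (samples.length === 0) {
        return 0; // no samples means no observed pressure
    }
    const sorted = [...samples].sort((a, b) => a - b);
    const rank   = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical contention check. All defaults are placeholders, not project values.
function evaluateContention({
    now,
    addMemoryLatenciesMs = [],
    activeSpawnCount     = 0,
    p99ThresholdMs       = 2000,
    maxActiveSpawns      = 1,
    firstDeferredAt      = null,              // when the current deferral streak began
    maxDeferralMs        = 6 * 60 * 60 * 1000 // the "N hours" before a forced spawn
}) {
    const p99       = percentile(addMemoryLatenciesMs, 99);
    const contended = p99 > p99ThresholdMs || activeSpawnCount > maxActiveSpawns;

    if (!contended) {
        return {defer: false, forced: false, p99};
    }
    // Contended, but the deferral budget is used up: spawn anyway.
    if (firstDeferredAt !== null && now - firstDeferredAt >= maxDeferralMs) {
        return {defer: false, forced: true, p99};
    }
    return {defer: true, forced: false, deferralReason: 'substrate-contention', p99};
}
```

Keeping the check pure (samples and counters are passed in) means the same boundary-test style used for the interval logic applies here. How the latency samples would be collected is left open.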

Existing Sandman mechanics:

  • buildScripts/ai/runSandman.mjs — operator-runnable; bypasses auto-flags + waits for LLM provider + runs DreamService.processUndigestedSessions()
  • ai/daemons/DreamService.mjs#processUndigestedSessions — full REM pipeline (Phase 0 FS ingest → Phase 1 Tri-Vector → Phase 2 Topology → Phase 3 Gap → Phase 4 GC → Phase 5 Golden Path → sandman_handoff.md)
  • Memory_Config.modelProvider === 'openAiCompatible' triggers MLX provider readiness probe inside DreamService

Required wiring:

  • New: ai/daemons/services/SandmanCoordinatorService.mjs (class-only, post-#11049 invariant)
  • Orchestrator.mjs: add sandman task definition pointing at buildScripts/ai/runSandman.mjs
  • Orchestrator.mjs: add sandmanCoordinator_ config + sandmanIntervalMs_ + sandmanContentionThresholds_ (DI seam)
  • Orchestrator.mjs: add runSandmanCycle(now) method matching summary/kbSync/backup pattern
  • Cross-coordinator dependency: SandmanCoord queries BackupCoord state OR HealthService for backup-recency

The Fix

Step 1: Create ai/daemons/services/SandmanCoordinatorService.mjs

// Class-only file (entry-point invariant per #11049)
import Base from '../../../src/core/Base.mjs';

/**
 * @summary Builds the task trigger for the Sandman REM-cycle lane with contention-awareness.
 *
 * Sandman is heavy LLM-bound + graph-write-heavy work. Naive interval-based scheduling
 * would fire during peer-active windows and saturate `add_memory` writes (empirical anchor:
 * 2026-05-09 operator clarification — gemma4 backlog + concurrent unit-test contention).
 *
 * The coordinator returns a trigger only when ALL of these hold:
 * 1. Interval is due (default 24h)
 * 2. Recent successful backup exists (default within last 36h)
 * 3. LLM provider is reachable (probe-result via providerCheckFn)
 * 4. No concurrent heavy task is spawned (kbSync, summary, backup not running)
 * 5. Optional: current time is within configured off-peak window
 *
 * Each precondition has its own deferral source for observability — agents reading
 * orchestrator.log can see WHY Sandman didn't fire this cycle.
 *
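 * @param {Object} options
 * @param {Number} options.now Current time in epoch milliseconds
 * @param {Object} [options.state={}] Task state; reads sandman.lastRunAt and backup.lastSuccessAt
 * @param {Number} options.intervalMs Minimum time between runs; <= 0 disables scheduling
 * @param {Number} [options.backupRecencyMs] Maximum age of the last successful backup (default 36h)
 * @param {Boolean} options.providerHealthy Result of the LLM provider probe
 * @param {String[]} [options.activeTasks=[]] Names of tasks that are currently running
 * @param {Object} [options.offPeakWindow] Optional off-peak window config
 * @returns {Object|null} null when disabled, a {deferralReason} envelope when a precondition
 * fails, or the trigger envelope when Sandman should spawn.
 */
export function buildSandmanTrigger({
    now,
    state = {},
    intervalMs,
    backupRecencyMs = 36 * 60 * 60 * 1000,
    providerHealthy,
    activeTasks = [],
    offPeakWindow
}) {
    if (!(intervalMs > 0)) {
        return null; // disabled (also covers a missing interval)
    }
    if (now - (state.sandman?.lastRunAt || 0) < intervalMs) {
        return {deferralReason: 'interval-not-due'}; // not-due signal
    }
    // A missing lastSuccessAt counts as stale: now - undefined is NaN, and NaN > x is false,
    // so a bare comparison would let the precondition pass.
    const lastBackupAt = state.backup?.lastSuccessAt;
    if (!Number.isFinite(lastBackupAt) || now - lastBackupAt > backupRecencyMs) {
        return {deferralReason: 'backup-stale'};
    }
    if (!providerHealthy) {
        return {deferralReason: 'provider-unreachable'};
    }
    if (['summary', 'kbSync', 'backup'].some(task => activeTasks.includes(task))) {
        return {deferralReason: 'peer-task-running'};
    }
    // isInOffPeakWindow(now, window) is not defined in this sketch; it must be provided by
    // this module or imported, and the window shape is still to be decided.
    if (offPeakWindow && !isInOffPeakWindow(now, offPeakWindow)) {
        return {deferralReason: 'outside-off-peak'};
    }
    return {
        taskName: 'sandman',
        source  : 'periodic-sweep',
        reason  : `periodic-sweep:${intervalMs}`
    };
}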

class SandmanCoordinatorService extends Base { /* ... */ }
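
The code above calls isInOffPeakWindow without defining it. One possible shape, offered only as a sketch: the window format ({startHour, endHour}, whole hours in UTC, start inclusive and end exclusive) is an assumption, since the ticket says only that the window is operator-defined and env-configurable.

```javascript
// Hypothetical helper. Assumed window shape: {startHour, endHour}, whole hours in UTC,
// start inclusive, end exclusive. A window such as {startHour: 22, endHour: 6} wraps midnight.
function isInOffPeakWindow(now, {startHour, endHour}) {
    const hour = new Date(now).getUTCHours();

    if (startHour < endHour) {
        return hour >= startHour && hour < endHour; // same-day window
    }
    // Wrapping window. Equal bounds also land here and match every hour.
    return hour >= startHour || hour < endHour;
}
```

Using UTC keeps the check independent of the host time zone; a local-time variant would need the operator's zone as an extra input.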

Step 2: Wire into Orchestrator (post-Sub-4 slimmed shape; rebase to lift runIfDue / DI patterns Sub-4 introduces)

[Wiring shape mirrors BackupCoordinatorService #11062 + summary/kbSync existing pattern.]
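
Since the wiring is only referenced here, the following sketch shows what a failure-isolated Sandman lane could look like. It is not taken from Orchestrator.mjs: the standalone function form, the injected coordinator, spawnTask, and log parameters, the log line format, and the assumption that getDueTask passes through the {deferralReason} envelope are all hypothetical.

```javascript
// Hypothetical lane runner. Dependencies are injected so the lane can be tested in isolation.
function runSandmanCycle({now, coordinator, spawnTask, log}) {
    try {
        const result = coordinator.getDueTask({now});

        if (result === null) {
            return {spawned: false}; // scheduling disabled, nothing to report
        }
        if (result.deferralReason) {
            // Surface why Sandman did not fire this cycle.
            log(`[sandman] deferred: ${result.deferralReason}`);
            return {spawned: false, deferralReason: result.deferralReason};
        }
        spawnTask(result);
        log(`[sandman] spawned: ${result.reason}`);
        return {spawned: true};
    } catch (error) {
        // Failure isolation: a broken Sandman lane must not abort the other lanes.
        log(`[sandman] cycle failed: ${error.message}`);
        return {spawned: false, error};
    }
}
```

Returning a result object instead of throwing lets the maintenance cycle continue with the remaining lanes regardless of what happens here.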

Step 3: Backup-recency-state coupling

SandmanCoord needs to know if recent backup succeeded. Options:

  • (A) Read TaskStateService directly via DI for state.backup.lastSuccessAt (coupled but simple)
  • (B) Read from HealthService.getTaskOutcome (cleaner, asks the observability layer)
  • (C) Cross-coordinator query method on BackupCoordinatorService (getLastSuccessAt())

Likely option B (HealthService), aligned with how SummarizationCoordinatorService queries state via DI.
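
A minimal sketch of option B, assuming the coordinator receives a getTaskOutcome function via DI. The outcome shape used here ({status, finishedAt}) and the 'success' status value are guesses for illustration; the real HealthService.getTaskOutcome contract is not specified in this ticket.

```javascript
// Hypothetical backup-recency check built on an injected getTaskOutcome(taskName) function.
// Assumed outcome shape: {status: 'success' | ..., finishedAt: epoch milliseconds}.
function isBackupRecent({now, getTaskOutcome, backupRecencyMs = 36 * 60 * 60 * 1000}) {
    const outcome = getTaskOutcome('backup');

    // No record, a failed run, or a missing timestamp all count as "no recovery point".
    if (!outcome || outcome.status !== 'success' || !Number.isFinite(outcome.finishedAt)) {
        return false;
    }
    return now - outcome.finishedAt <= backupRecencyMs;
}
```

Because the coordinator only sees a function, swapping option B for option A or C later would not change its own code, only what gets injected.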

Step 4: Test substrate

  • New: SandmanCoordinatorService.spec.mjs, modeled on BackupCoordinatorService.spec.mjs
  • Boundary tests for each deferral reason (5+ cases)
  • Test-spec-as-entry-point (Neo+core bootstrap at top per #11049)

Step 5: Documentation

  • Update learn/agentos/DreamPipeline.md "Running the Pipeline" section: reflect orchestrator-driven scheduling vs npm run ai:run-sandman manual override
  • Document deferral-reason vocabulary for operator log-reading
  • Cross-reference SandmanCoord substrate from runSandman.mjs JSDoc
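
The deferral-reason vocabulary could be documented next to a small lookup like the one below. The five reason strings come from buildSandmanTrigger in Step 1; the descriptions, the Map-based structure, and the describeDeferral helper are illustrative assumptions.

```javascript
// Reason strings as returned by buildSandmanTrigger; descriptions are illustrative wording.
const DEFERRAL_REASONS = new Map([
    ['interval-not-due',     'Less than the configured interval has passed since the last Sandman run.'],
    ['backup-stale',         'No successful backup exists within the backup-recency window.'],
    ['provider-unreachable', 'The LLM provider health probe failed.'],
    ['peer-task-running',    'A summary, kbSync, or backup task is currently running.'],
    ['outside-off-peak',     'The current time is outside the configured off-peak window.']
]);

// Hypothetical helper for log output and docs generation.
function describeDeferral(reason) {
    return DEFERRAL_REASONS.get(reason) ?? `Unknown deferral reason: ${reason}`;
}
```

A single lookup keeps the log text and the DreamPipeline.md vocabulary table from drifting apart.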

Acceptance Criteria

  • AC1 — ai/daemons/services/SandmanCoordinatorService.mjs exists; class-only file (no Neo import); getDueTask returns trigger or null
  • AC2 — buildSandmanTrigger pure function exported; covers interval-due + backup-recency + provider-health + peer-task-contention + off-peak deferrals; returns deferralReason envelope (NOT just null) for observability
  • AC3 — Orchestrator.mjs wires sandman task in buildTaskDefinitions pointing at buildScripts/ai/runSandman.mjs
  • AC4 — Orchestrator.mjs wires DI for sandmanCoordinator + sandmanIntervalMs (env override NEO_ORCHESTRATOR_SANDMAN_INTERVAL_MS, default 24h) + reasonable defaults for backup-recency-window + off-peak-window-config (env-driven, optional)
  • AC5 — Orchestrator.mjs runMaintenanceCycle adds Sandman lane (failure-isolated, matching summary + kbSync + backup pattern)
  • AC6 — Orchestrator logs deferralReason on each non-spawn cycle (operator visibility into "why didn't Sandman fire?")
  • AC7 — Unit test coverage for SandmanCoordinatorService.spec.mjs covering all 5+ deferral-reason boundary cases
  • AC8 — learn/agentos/DreamPipeline.md "Running the Pipeline" section updated to reflect orchestrator-driven scheduling
  • AC9 — Cross-family review per pull-request §6.1
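
For AC4, resolving the interval from the environment might look like the sketch below. The variable name NEO_ORCHESTRATOR_SANDMAN_INTERVAL_MS and the 24h default come from the criterion; the function name and the choice to fall back to the default on unparsable input are assumptions.

```javascript
const DEFAULT_SANDMAN_INTERVAL_MS = 24 * 60 * 60 * 1000; // 24h default per AC4

// Hypothetical resolver. Values <= 0 are passed through unchanged, because
// buildSandmanTrigger treats a non-positive interval as "disabled".
function resolveSandmanIntervalMs(env = process.env) {
    const raw = env.NEO_ORCHESTRATOR_SANDMAN_INTERVAL_MS;

    if (raw === undefined || raw.trim() === '') {
        return DEFAULT_SANDMAN_INTERVAL_MS;
    }
    const parsed = Number(raw);
    // Assumption: unparsable input falls back to the default instead of disabling the lane.
    return Number.isFinite(parsed) ? parsed : DEFAULT_SANDMAN_INTERVAL_MS;
}
```

Passing env as a parameter keeps the resolver testable without mutating process.env.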

Dependencies

  • #11062 (BackupCoordinatorService) — Sandman's backup-recency precondition needs HealthService task-outcome state. SandmanCoord MUST land AFTER BackupCoord OR a backup-recency-stub (reads HealthService) exists. Recommendation: file as blocked by #11062 → land after BackupCoord ships.
  • M3.5 Sub-4 (Orchestrator slim-down by @neo-gemini-3-1-pro, in flight) — SandmanCoord wires into the slimmed Orchestrator shape; rebase onto post-Sub-4 dev to lift the new DI pattern (CadenceEngine.runIfDue per Gemini's A2A 2026-05-09).

Out of Scope

  • Backup-success precondition for ALL DreamMode/Sandman invocation paths (not just orchestrator): the operator can still bypass it manually via npm run ai:run-sandman; that path stays the manual-override route per v13-path.md:195. A mechanical-gate variant (runSandman halts unless backup exists) is filable separately if needed.
  • Adaptive contention-threshold tuning (auto-adjust deferral thresholds based on observed add_memory p99) — file as scope-extension once empirical data justifies adaptive logic
  • Cross-coordinator scheduling synchronization (don't spawn TWO heavy tasks at once even when both are due) — orchestrator's existing per-task PID-file lock handles single-task-instance; cross-task gating is filable as scope-extension if observed contention shows need
  • DreamCoordinatorService extraction — paired but separate; file once SandmanCoord lands so the pair-shape is clear (coordinator-pair lifts the shared LLM-provider-readiness primitive)
  • GoldenPathCoordinatorService / GraphMaintenanceCoordinatorService — separate filings per the M4 5-coordinator landscape; sequencing TBD

Avoided Traps

  • Naive interval-only scheduling: would freeze peer DB access during operator-active hours per the contention concern. Contention-awareness is load-bearing.
  • Bundling backup-precondition gate into runSandman.mjs: would also gate the manual operator-override path, defeating its purpose. The gate belongs in the COORDINATOR (orchestrator-driven path only).
  • Re-enabling autoDream: true startup flag as the "fix": wrong shape per v13-path.md:195 exit-gate framing — orchestrator IS the scheduling substrate; startup-auto-fire is the discarded approach.
  • Hardcoded contention thresholds: env-driven config (with sensible defaults) lets operators tune without code changes.
  • Silent deferral (return null): loses operator visibility. Each deferral SHOULD log its reason — supports the "why didn't Sandman fire?" diagnostic.

Provenance

  • Operator architectural direction (2026-05-09): v13-path.md:195 — "Exit gate: npm run ai:run-sandman becomes optional manual override; Orchestrator schedules it natively"
  • Operator cognitive-load-offload framing (2026-05-09): v13-path.md:312 — "if DreamService was fully functional, gemma4-31b would parse the graph and give us sandman_handoff with mathematical weighted priorities — way less cognitive load"
  • Operator contention clarification (2026-05-09T22): "sandman => can disable (overwhelm) add_memory for several minutes. gemma4 has a huge backlog to parse. but we need to be careful that it does not freeze DB access for too long" — substrate-contention dimension that distinguishes Sandman from Backup
  • Sibling precedents: SummarizationCoordinatorService.mjs (#11009), BackupCoordinatorService #11062 (in flight)
  • D3.1 boundary anchor: v13-path.md:188,193 — coordinator-vs-supervisor-vs-state separation; M3.5 keystone substrate makes M4 incremental

Related

  • Parent epic: #11022 (Orchestrator decomposition M3.5; M4 follow-up)
  • Predecessor (closed): #10780 (manual backup-first discipline — superseded by orchestrator-owned scheduling per this ticket + #11062)
  • Sibling M4 coordinators: #11062 (BackupCoordinatorService); DreamCoordinatorService / GoldenPathCoordinatorService / GraphMaintenanceCoordinatorService not yet filed
  • Sandman invocation: buildScripts/ai/runSandman.mjs — the manual-override path
  • DreamService: ai/daemons/DreamService.mjs — the actual REM pipeline that runs inside the spawn
  • Pre-disable PR: #10863 (autoX flip-defaults reversal that made operator manual-disable necessary)
  • PR #10494 (DreamService regression — possible underlying cause of operator-disable)
  • DreamPipeline architecture: learn/agentos/DreamPipeline.md

Self-Identification: @neo-opus-4-7 (Claude Opus 4.7, Claude Code) — chief-architect lane, post-Round-4 architectural follow-up filing. Closes the gap-pattern named in PR #11021 closure (M4 coordinators named in v13-path.md but never filed). Open lane for self-selection; preferred sequence: file → wait for #11062 + Sub-4 → claim post-merge.

Origin Session ID: c2912891-b459-4a03-b2af-154d5e264df1