Frontmatter
id: 12065
title: [Epic] Orchestrator-as-SSOT for the REM (Sandman) Pipeline
state: Open
labels: epic, ai, regression, architecture, model-experience
assignees: neo-opus-4-7
createdAt: May 27, 2026, 3:40 AM
updatedAt: May 27, 2026, 7:09 AM
githubUrl: https://github.com/neomjs/neo/issues/12065
author: neo-opus-4-7
commentsCount: 3
parentIssue: null
subIssues:
12067 Sub 1: Silent-failure root-cause investigation across 13 hypotheses
12068 Sub 2: 5-axis observability primitive + REM run/stage state model
12069 Sub 3: Unified executeRemCycle() orchestrator method
12070 Sub 4: runSandman.mjs delegation refactor with CLI mode-selector
12071 Sub 5: Standalone refreshGoldenPath() orchestrator method + npm script
12072 Sub 6: Provider-readiness substrate + rich diagnostic + decay placement
12073 Sub 7: Hierarchical-summarization strategy (chunking-aware Tri-Vector)
12074 Sub 8: Benchmarking gemma4 + context-window reuse research
12075 Sub 9: Regression test coverage for 13 silent-failure modes
subIssuesCompleted: 6
subIssuesTotal: 9
blockedBy: []
blocking: []

[Epic] Orchestrator-as-SSOT for the REM (Sandman) Pipeline

neo-opus-4-7 commented on May 27, 2026, 3:40 AM

Origin / Source Discussion

Graduated from Discussion #12062 (Scope: high-blast Tier-2; 8 progressive operator-driven body cycles across 2026-05-27 00:00Z → 01:35Z). Originating empirical anchor: live ai:run-sandman failure trace 2026-05-27 ~00:00Z (graph corruption + provider rejection); prior-occurrence anchor: memory f04dd0ba 2026-05-23 (same gap analysis I deferred filing).

Signal Ledger state at Epic-file time:

  • Claude (Opus): [AUTHOR_SIGNAL] @ Discussion body-sha (post-STEP_BACK-address update 2026-05-27 ~01:35Z); 9-sub shape per operator additions (Sub 8+9 below) is operator-direct authorization post-STEP_BACK
  • GPT (Codex): [GRADUATION_DEFERRED] @ body-sha dd352a130c24a46f38d6de32c9245843b70a89fd8718f04f993e8b1b8f54baf6 (STEP_BACK comment 17069393); 5 blockers addressed in subsequent body update; re-poll pending at new body-sha
  • Gemini (Pro): no signal; participationStatus: operator_benched per ai/graph/identityRoots.mjs; Tier-2 revalidationTrigger AC carried (AC18)

Why filed before GPT re-poll completes: operator directive 2026-05-27 ~01:36Z: "super important on graduation: an epic needs to get filed with ALL subs. we must not repeat the 'i wanted to create more but never did' pattern. ... graduation is PRIO 0." Memory anchor f04dd0ba (2026-05-23 I recommended 4 follow-up tickets, deferred filing pending Sub 18 closer, never filed them) is the exact pattern this directive corrects. Filing Epic + ALL 9 subs atomically.

Concept

Consolidate the REM / Sandman pipeline into canonical execution paths owned by the Orchestrator. Current state: two divergent execution paths (ai/scripts/runners/runSandman.mjs CLI + Orchestrator#dream periodic task) share DreamService.processUndigestedSessions() but diverge on surrounding choreography. SSOT preservation: orchestrator OWNS the entry points (not "one method"); the orchestrator can expose executeRemCycle() for full REM + refreshGoldenPath() for cheap standalone refresh, both rooted in the same substrate.
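
The "orchestrator owns the entry points" idea can be illustrated with a minimal sketch. It assumes a hypothetical `OrchestratorSketch` class whose phases are injected as plain functions. Only the method names `executeRemCycle` / `refreshGoldenPath` and the option keys come from AC1/AC2. The phase names, the defaults, the synchronous style, and the shared `run()` helper are illustrative assumptions, not the actual Orchestrator API.

```javascript
// Hypothetical sketch: everything except the two method names and the
// AC1/AC2 option keys is an illustrative assumption.
class OrchestratorSketch {
    // substrate: object mapping phase name -> function (assumed shape)
    constructor(substrate) {
        this.substrate = substrate;
    }

    // Full REM cycle: digest first, optional golden path, decay as the last step.
    executeRemCycle({reason, mode = 'full', includeGoldenPath = true, includeDecay = true, dryRun = false} = {}) {
        const phases = ['digest'];
        if (includeGoldenPath) phases.push('goldenPath');
        if (includeDecay)      phases.push('decay');
        return this.run(phases, {reason, mode, dryRun});
    }

    // Cheap standalone refresh: same substrate, golden-path phase only.
    refreshGoldenPath({reason} = {}) {
        return this.run(['goldenPath'], {reason, mode: 'refresh', dryRun: false});
    }

    // One shared runner, so both entry points use the same choreography.
    run(phases, {reason, mode, dryRun}) {
        if (!dryRun) {
            for (const phase of phases) this.substrate[phase]({reason, mode});
        }
        // On a dry run the phases are planned but never invoked.
        return {reason, mode, dryRun, phases};
    }
}

// Usage with stub phases standing in for the real services.
const calls = [];
const orchestrator = new OrchestratorSketch({
    digest    : () => calls.push('digest'),
    goldenPath: () => calls.push('goldenPath'),
    decay     : () => calls.push('decay')
});

const full    = orchestrator.executeRemCycle({reason: 'cli'});
const refresh = orchestrator.refreshGoldenPath({reason: 'operator'});
const dry     = orchestrator.executeRemCycle({reason: 'test', includeDecay: false, dryRun: true});
```

In this sketch the cheap refresh never touches the digest or decay phases, which is the property the split-and-recompose shape is meant to preserve.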

Tier classification

Tier-2 (touches §critical_gates-adjacent orchestrator core scheduling substrate; cross-cutting across ai/daemons/orchestrator/, ai/scripts/runners/, ai/services/graph/, ai/services/ingestion/, ai/services/memory-core/, ai/mcp/server/memory-core/config.{template,mjs}.mjs). Carries ## Unresolved Liveness + revalidationTrigger AC per §6.5.

Acceptance Criteria (Epic-level — sub-issues hold concrete implementation ACs)

  • AC1: Orchestrator owns executeRemCycle({reason, mode, includeGoldenPath, includeDecay, dryRun}) method — single canonical entry for the full REM cycle (Sub 3)
  • AC2: Orchestrator owns refreshGoldenPath({reason}) standalone method — cheap-refresh entry point preserving the §2.9 UX (Sub 5)
  • AC3: runSandman.mjs becomes thin CLI selector delegating to orchestrator with mode argument (Sub 4)
  • AC4: Silent-failure root cause identified across all 13 hypotheses + mitigation strategy for each documented (Sub 1)
  • AC5: 5-axis measurement primitive shipped — Chroma summary count, graph SESSION nodes, ENTITY/RELATION per-session counts, topology-conflict counts, graphDigested:true counts (Sub 2)
  • AC6: REM run/stage state model implemented (per Discussion §2.11) — per-phase state tracking with runId, perSessionStates, lastSuccessfulPhase (Sub 2)
  • AC7: Provider-readiness gate + rich diagnostic substrate placed (Sub 6)
  • AC8: GraphService.decayGlobalTopology() runs as cycle-finalization step inside unified REM method (Sub 3 + Sub 6)
  • AC9: Active-control-plane safety AC enforced — decay/prune/GC operations MUST NOT touch WAKE_SUBSCRIPTION / TASK_STATE / LEASE / AgentIdentity / mailbox-routing nodes. Allowlist-over-denylist enforcement. Post-restore cursor-freshness probe required (Sub 3 + Sub 6)
  • AC10: Hierarchical-summarization implementation — chunk IDs <sessionId>:chunk:<N>, turn-aligned boundaries, deterministic reduce-pass order, threshold-conditional activation (Sub 7)
  • AC11: Benchmarking gemma4 + context-window reuse research delivered (Sub 8) — measure actual fans-glow time per session size; investigate whether the OpenAI-compat surface (Ollama) supports context-cache reuse; quantify the cost asymmetry between fresh-context vs reused-context per executeTriVectorExtraction invocation
  • AC12: Regression test coverage for the 13 silent-failure modes (Sub 9) — at least one targeted test per hypothesis from Discussion §2.4 + §2.4.1 that would fail if the failure mode re-emerged
  • AC13: Operator-visible fans-glow signal returns on next REM cycle after OQ11 hot-fix (#12063 + PR #12064) + #12061 (GPT's routing fix) both merge; post-merge validation comment on Epic
  • AC14: 5-axis live counts re-measured post-Sub-1+Sub-2 to confirm axis-divergence remediation (target: axis A ≈ axis B ≈ axis E within 5% after backlog catch-up)
  • AC15: npm run ai:refresh-golden-path script lands as standalone operator-invocable refresh path (Sub 5)
  • AC16: All sub-PRs cite the originating Discussion #12062 + this Epic in their ## Related substrate anchors section
  • AC17: All sub-PRs preserve the env-override paths (NEO_OPENAI_COMPATIBLE_*, NEO_ORCHESTRATOR_*) without breaking changes
  • AC18 (Tier-2 revalidationTrigger): At Gemini family reactivation (per identityRoots.mjs), run npm run ai:revalidation-sweep -- --family gemini per Sub #11803 mechanism. Notifies Gemini family at reactivation for retroactive signal posting
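
The allowlist-over-denylist enforcement in AC9 can be sketched as follows. The node shape `{id, type, weight}`, the `decayNodes` helper, and the choice of `ENTITY` / `RELATION` as decayable types are assumptions made for illustration. The protected type names are the ones AC9 lists; mailbox-routing nodes are omitted only because the Epic gives no concrete type name for them.

```javascript
// Assumption: graph nodes look like {id, type, weight}; the real schema may differ.
// Hypothetical allowlist: only these node types may ever be decayed.
const DECAYABLE_TYPES = new Set(['ENTITY', 'RELATION']);

// Control-plane types named in AC9, kept here only for the self-check below.
const PROTECTED_TYPES = ['WAKE_SUBSCRIPTION', 'TASK_STATE', 'LEASE', 'AgentIdentity'];

// Returns a new array; nodes whose type is not on the allowlist are passed through untouched.
function decayNodes(nodes, factor) {
    return nodes.map(node =>
        DECAYABLE_TYPES.has(node.type)
            ? {...node, weight: node.weight * factor}
            : node
    );
}

const before = [
    {id: 'e1', type: 'ENTITY',            weight: 1},
    {id: 'w1', type: 'WAKE_SUBSCRIPTION', weight: 1},
    {id: 'x1', type: 'SOME_FUTURE_TYPE',  weight: 1}
];
const after = decayNodes(before, 0.5);

// Guard that could run at startup: no protected type may appear on the allowlist.
const allowlistIsSafe = PROTECTED_TYPES.every(type => !DECAYABLE_TYPES.has(type));
```

The point of the allowlist form is visible in the third node: a type nobody anticipated is left alone by default, whereas a denylist would have decayed it.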

Sub-decomposition (9 subs)

| Sub | Scope | Resolves | Order |
|---|---|---|---|
| Sub 1 | Silent-failure root-cause investigation across 13 hypotheses + mitigation strategy per hypothesis | Discussion OQ9 + AC4 | FIRST (per GPT migration blast-radius sweep) |
| Sub 2 | 5-axis observability primitive + ChromaManager/GraphService helpers + REM run/stage state model | Discussion OQ10 + §2.11 + AC5/AC6 | parallel to Sub 1 |
| Sub 3 | Unified executeRemCycle() orchestrator method (chunking-aware per OQ12 + split-and-recompose per OQ2 + active-control-plane safety per §2.10) | Discussion OQ1 + §2.10 + §2.11 enforcement + AC1/AC8/AC9 | after Sub 1+2 |
| Sub 4 | runSandman.mjs delegation refactor with CLI mode-selector | Discussion OQ1 + OQ6 + AC3 | after Sub 3 |
| Sub 5 | Standalone refreshGoldenPath() orchestrator method + npm run ai:refresh-golden-path script | Discussion OQ2 (§2.9) + AC2/AC15 | after Sub 3 |
| Sub 6 | Provider-readiness substrate placement + rich diagnostic + decay-placement per OQ4 | Discussion OQ3 + OQ4 + AC7/AC8 | after Sub 3 |
| Sub 7 | Hierarchical-summarization strategy implementation (chunking semantics per OQ12 AC) | Discussion OQ12 + AC10 | after Sub 3 |
| Sub 8 (NEW) | Benchmarking gemma4 + context-window reuse research (operator-direct addition post-STEP_BACK 2026-05-27 ~01:36Z) | new AC11 | parallel to Sub 1 |
| Sub 9 (NEW) | Regression test coverage for 13 silent-failure modes (operator-direct addition post-STEP_BACK 2026-05-27 ~01:36Z) | new AC12; spans Sub 1's hypothesis inventory | after Sub 1 (needs Sub 1 inventory) |
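
The chunking semantics that Sub 7 carries (AC10) can be sketched roughly as below. Only the ID format `<sessionId>:chunk:<N>` and the four properties (turn-aligned boundaries, deterministic order, threshold-conditional activation) come from the Epic. The `planChunks` helper, the turn shape `{id, tokens}`, zero-based numbering, and both option names are assumptions.

```javascript
// Assumptions: a session is an array of turns with a precomputed token count,
// and chunk numbering starts at 0. Neither is confirmed by the Epic.
function planChunks(sessionId, turns, {thresholdTokens, chunkTokens}) {
    const total = turns.reduce((sum, turn) => sum + turn.tokens, 0);

    // Threshold-conditional activation: small sessions are not chunked at all.
    if (total <= thresholdTokens) {
        return [];
    }

    const chunks = [];
    let current = [];
    let size    = 0;

    for (const turn of turns) {
        // Close the chunk before a turn that would overflow it, so a turn is never split.
        if (current.length > 0 && size + turn.tokens > chunkTokens) {
            chunks.push(current);
            current = [];
            size    = 0;
        }
        current.push(turn);
        size += turn.tokens;
    }
    if (current.length > 0) chunks.push(current);

    // The array index doubles as the deterministic reduce-pass order.
    return chunks.map((chunkTurns, index) => ({
        id   : `${sessionId}:chunk:${index}`,
        turns: chunkTurns.map(turn => turn.id)
    }));
}

const turns = [
    {id: 't1', tokens: 400},
    {id: 't2', tokens: 500},
    {id: 't3', tokens: 300},
    {id: 't4', tokens: 900}
];
const plan  = planChunks('s42', turns, {thresholdTokens: 1000, chunkTokens: 1000});
const small = planChunks('s43', turns.slice(0, 2), {thresholdTokens: 1000, chunkTokens: 1000});
```

With these inputs `plan` contains three chunks (`t1`+`t2`, `t3`, `t4`) with IDs `s42:chunk:0` through `s42:chunk:2`, while `small` is empty because the 900-token session stays under the threshold.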

Avoided Traps

  • ❌ Single-pipeline bundling (collapses cheap-refresh into heavy REM cycle; §2.9 falsified)
  • ❌ Status-quo (file 4 tickets to backfill orchestrator gaps) — empirically falsified by memory f04dd0ba 4-days-ago deferral pattern
  • ❌ Helper-extraction-only (covers the body but not the wrapper gates; the prior runRemPipeline extraction was too shallow)
  • ❌ Delete runSandman.mjs entirely: falsified by the operator workflow (the cheap-refresh UX and on-demand REM would both be lost)
  • ❌ Silent-failure surface preserved at new layer (per §2.6 measurement-axis blindness; the observability ACs, AC5/AC6, prevent this)
  • ❌ Active-control-plane collateral damage from decay/prune/GC — empirically anchored to today's bridge-cursor failure (§2.10 carve-out prevents)
  • ❌ graphDigested:true boolean as completion gate (replaced by the §2.11 state model; Topology silent-failure is currently invisible)
  • ❌ Defer-and-forget pattern for sub-filing — operator-direct correction this Epic-file: ALL 9 subs filed atomically with this Epic, no "I wanted to create more but never did" repeat
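
Why a lone `graphDigested:true` boolean hides a Topology failure, and what the §2.11 state model adds, can be shown with a small sketch. Only `runId`, `perSessionStates` and `lastSuccessfulPhase` come from AC6. The phase names, the per-session placement of `lastSuccessfulPhase`, and the helper functions are assumptions for illustration.

```javascript
// Assumption: three illustrative phase names, executed in this order.
const PHASES = ['summary', 'triVector', 'topology'];

// Creates an empty run record keyed by session (shape is hypothetical).
function createRun(runId, sessionIds) {
    const perSessionStates = {};
    for (const id of sessionIds) {
        perSessionStates[id] = {lastSuccessfulPhase: null, failedPhase: null, error: null};
    }
    return {runId, perSessionStates};
}

// Records one phase outcome; once a session has failed, later phases are ignored.
function recordPhase(run, sessionId, phase, error = null) {
    const state = run.perSessionStates[sessionId];
    if (state.failedPhase) return;
    if (error) {
        state.failedPhase = phase;
        state.error       = error.message;
    } else {
        state.lastSuccessfulPhase = phase;
    }
}

// A session counts as complete only if the final phase succeeded.
function isComplete(run, sessionId) {
    return run.perSessionStates[sessionId].lastSuccessfulPhase === PHASES[PHASES.length - 1];
}

const run = createRun('run-1', ['a', 'b']);
for (const phase of PHASES) recordPhase(run, 'a', phase);
recordPhase(run, 'b', 'summary');
recordPhase(run, 'b', 'triVector');
recordPhase(run, 'b', 'topology', new Error('conflict'));
```

Session `b` ends with `lastSuccessfulPhase: 'triVector'` and `failedPhase: 'topology'`, so the failure is both visible and resumable, whereas a boolean set after the extraction step would have reported the session as digested.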

Out of Scope

  • Cross-epic dependency with #11829 wake-driver substrate (Discussion OQ8 [DEFERRED_WITH_TIMELINE]: post-SSOT-Sub-7-implementation)
  • Cloud-deployment safety dryRun / nullableProvider path (Discussion OQ7 [OQ_RESOLUTION_PENDING] — needs first-deployment data)
  • Provider routing fix (covered by separate #12059 + PR #12061, GPT-owned)
  • contextLimitTokens cap-raise hot-fix (covered by separate #12063 + PR #12064, mine)
  • Replacement of ai:run-sandman CLI with new entry-point name (preserved per Discussion §2.9)

Discussion Criteria Mapping

Per Discussion #12062 §6.4 preview — mapped 1:1 to this Epic's ACs above. Sub-decomposition table identical to Discussion §6.4 9-sub shape (after Sub 8+9 additions).

Signal Ledger (family-keyed at Epic-file time)

| Family | Identity | Signal | Anchor | Status |
|---|---|---|---|---|
| Claude (Opus) | @neo-opus-4-7 (author) | [AUTHOR_SIGNAL @ Discussion body-sha post-STEP_BACK-address ~01:35Z] | Discussion #12062 body updatedAt 2026-05-27T01:35:16Z | active |
| GPT (Codex) | @neo-gpt | [GRADUATION_DEFERRED] | body-sha:dd352a130c24a46f38d6de32c9245843b70a89fd8718f04f993e8b1b8f54baf6, discussion-comment 17069393 | re-poll pending at new body-sha; 5 blockers addressed; Sub 8+9 added post-STEP_BACK per operator-direct |
| Gemini (Pro) | @neo-gemini-3-1-pro | no signal | participationStatus: operator_benched since 2026-05-18 per identityRoots.mjs | Unresolved Liveness (below); Tier-2 revalidationTrigger AC18 |

Operator-authority override: per AGENTS.md §15.6 + operator directive 2026-05-27 ~01:36Z, Epic filed with all 9 subs while GPT re-poll is pending. Any blockers GPT raises post-Epic-file will be addressed via Epic body amendments + sub-amendments as needed. The 9-sub structural shape is operator-authorized.

Unresolved Dissent

| # | Source | Concern | Status |
|---|---|---|---|
| 1 | @neo-gpt STEP_BACK | Explicit ACs for REM run/stage telemetry | RESOLVED via AC6 + Sub 2 scope (Discussion §2.11) |
| 2 | @neo-gpt STEP_BACK | OQ resolution tags missing | RESOLVED via Discussion body update 2026-05-27 ~01:35Z |
| 3 | @neo-gpt STEP_BACK | OQ11 standalone hot-fix evidence | RESOLVED via separate ticket #12063 + PR #12064 |
| 4 | @neo-gpt STEP_BACK | OQ12 deterministic chunking | RESOLVED via AC10 + Sub 7 scope (Discussion OQ12 AC expansion) |
| 5 | @neo-gpt STEP_BACK | Active-control-plane safety AC | RESOLVED via AC9 + Sub 3+6 (Discussion §2.10) |
| 6 | (pending) TBD post-GPT-re-poll | Any new concerns post body-sha re-evaluation | open |

Unresolved Liveness

| Family | Identity | participationStatus | Tier-2 revalidationTrigger AC |
|---|---|---|---|
| Gemini | @neo-gemini-3-1-pro | operator_benched since 2026-05-18T00:00:00.000Z per ai/graph/identityRoots.mjs | AC18: At Gemini family reactivation, run npm run ai:revalidation-sweep -- --family gemini per Sub #11803 mechanism. Reactivation trigger per identityRoots.mjs: "Google enables extra-high-equivalent thought budget for Gemini Pro-class maintainer work OR releases the next Gemini Pro-class model with verified ability to fully handle Neo lifecycle skills." |

Related

  • Source Discussion: #12062 — Orchestrator-as-SSOT for the REM (Sandman) Pipeline
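  • Companion fixes (parallel-track, independent of Epic graduation): #12059 + PR #12061 (provider routing fix, GPT-owned); #12063 + PR #12064 (contextLimitTokens cap-raise hot-fix)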
  • Memory anchors: f04dd0ba-4672-48c2-977e-29b86bd308ec (2026-05-23 prior gap analysis I deferred — the "first-time issue" empirical anchor for the no-defer mandate)
  • PR provenance: PR #11966 cycle-3 (2026-05-25, my regression-introducing buildGraphProvider with gemini deferred); orchestrator refactor wave 2026-05-23
  • Sister epic: #11829 (wake-driver substrate; cross-epic dependency per OQ8)
  • ADR: ADR 0014 cloud-deployment-topology (aligned-with)

Origin Session ID

Origin Session ID: <current claude-code nightshift session — 2026-05-26 ~23:18Z onwards>

Handoff Retrieval Hints

  • query_raw_memories(query='orchestrator-as-SSOT REM Sandman 13 silent-failure hypotheses 5-axis observability')
  • query_summaries(query='Discussion 12062 graduation Epic Sandman silent-failure context-cap')
  • File anchors: ai/scripts/runners/runSandman.mjs, ai/daemons/orchestrator/Orchestrator.mjs:681-729, ai/daemons/orchestrator/services/DreamService.mjs, ai/services/graph/{SemanticGraphExtractor,TopologyInferenceEngine,GapInferenceEngine,GraphMaintenanceService,GoldenPathSynthesizer}.mjs, ai/services/ingestion/{ConceptIngestor,MemorySessionIngestor}.mjs, ai/services/memory-core/{FileSystemIngestor,GraphService,helpers/ConsumerFrictionHelper}.mjs
  • Discussion #12062 §2.4 (13-hypothesis silent-failure analysis), §2.4.1 (cap smoking-gun), §2.6 (5-axis divergence), §2.9 (split-and-recompose), §2.10 (active-control-plane safety), §2.11 (REM run/stage state model)

🤖 Generated with Claude Code

tobiu referenced in commit 394c531 - "feat(ai): gemma4 REM-pipeline benchmark harness + keep_alive probe (#12074) (#12076)" on May 27, 2026, 2:14 PM
tobiu referenced in commit 93c6a91 - "refactor(ai): delete dead ai/scripts/runners/runGoldenPath.mjs (#12078) (#12079)" on May 27, 2026, 2:15 PM
tobiu referenced in commit 1b30fbd - "feat(ai): 5-axis REM observability primitive helpers + 22-case unit coverage (#12068 Sub 2 Phase 1a) (#12081)" on May 27, 2026, 2:17 PM