Frontmatter
id: 11022
title: Orchestrator decomposition (M3.5): extract ProcessSupervisorService, TaskStateService, CadenceEngine
state: Closed
labels: enhancement, epic, ai, refactoring, architecture, model-experience
assignees: neo-gemini-3-1-pro
createdAt: May 9, 2026, 5:48 PM
updatedAt: May 15, 2026, 2:47 PM
githubUrl: https://github.com/neomjs/neo/issues/11022
author: neo-opus-4-7
commentsCount: 3
parentIssue: null
subIssues:
  • 11039 Extract TaskStateService to isolate state-mutation API
  • 11051 [ai] Extract CadenceEngine from Orchestrator (M3.5 Sub-3)
subIssuesCompleted: 2
subIssuesTotal: 2
blockedBy: []
blocking: []
closedAt: May 11, 2026, 11:33 PM

Orchestrator decomposition (M3.5): extract ProcessSupervisorService, TaskStateService, CadenceEngine

neo-opus-4-7 commented on May 9, 2026, 5:48 PM

Context

ai/daemons/Orchestrator.mjs (682 LOC, shipped via PR #11016 under #11009) currently mixes coordination with concrete business logic across multiple responsibilities. This audit was triggered by the operator's architectural challenge on 2026-05-09: "i still challenge the orchestrator architecture and if it does too much."

Update 2026-05-09 (post-Discussion #11025 graduation): Body refreshed to reflect 3-voice cross-family convergence. Sub-extraction order INVERTED per @neo-gpt's leakage-prevention reasoning (TaskStateService first, not ProcessSupervisorService); CadenceEngine boundary tightened to pure-trigger-builder (NOT execute-runner) per OQ8 resolution. Original prescription preserved in Discussion #11025's archaeological record.

Empirical responsibility map (verified line-by-line read):

| Lines | Current responsibility | Verdict |
|---|---|---|
| 1-138 (helpers) + 138-682 (class) | Total: 682 LOC | Class doing too much |
| ~150 of those | Core: lifecycle (start/stop/poll), configure, sweep dispatcher (runMaintenanceCycle/runTaskCycle), static config + collaborator declarations | Stay |
| ~200 lines | runTask (spawn + child.on lifecycle + PID-file write + state-mutation hooks) + recoverTask/recoverTasks/watchRecoveredTask/clearRecoveredTask + getTaskPidFile + processCommand | Extract → ProcessSupervisorService |
| ~60 lines | readState/writeState/createInitialTaskState (on-disk task-state envelope + JSON persistence) | Extract → TaskStateService |
| ~30 lines + per-task wiring | shouldRunIntervalTask/parseInterval + the boilerplate runSummaryCycle/runKbSyncCycle pattern (would repeat for every new task) | Extract → CadenceEngine |
| ~15 lines | writeLog (file-append + console mirror) | Out of M3.5 scope per OQ4: keep inline / pass adapter |

Existing extractions that prove this is the right substrate pattern:

  • ai/services/memory-core/HealthService.mjs: outcome recording
  • ai/daemons/services/SummarizationCoordinatorService.mjs: getDueTask({...}) (sunset-handover + interval-sweep trigger logic for summary lane); already consumed via Orchestrator's summarizationCoordinator_ reactive config at line 233
  • ai/scripts/bridge-daemon-queries.mjs: DB init + handover queries

Architectural target (mirroring v13-path.md M2 "Common Base Server Class" pattern at the daemon tier):

Orchestrator shrinks to ~150 LOC of pure coordination:

  • Hold collaborators: processSupervisor_, taskState_, cadenceEngine_, healthService_, per-task coordinators
  • Run poll loop. For each lane: per-task coordinator returns trigger via getDueTask({...}) (or via cadenceEngine.getIntervalTrigger({...}) primitive composition); orchestrator passes trigger to processSupervisor.runTask(trigger)
  • That's it
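
As a rough illustration of the coordination-only shape described above, the sketch below uses plain JavaScript objects rather than Neo classes with reactive configs. The class name `OrchestratorSketch`, the `taskName` property on coordinators, the `{now, lastRunAt}` arguments passed to `getDueTask`, and the `{started, failed}` return value are all assumptions made for this example; only `getDueTask`, `runTask`, and `getLastRunAt` are names taken from this ticket.

```javascript
// Hypothetical stand-in for the slimmed-down Orchestrator: it holds collaborators
// and wires them together, but owns no spawn, persistence, or due-check logic.
class OrchestratorSketch {
    constructor({coordinators, processSupervisor, taskState}) {
        this.coordinators      = coordinators;      // per-task coordinators exposing getDueTask({...})
        this.processSupervisor = processSupervisor; // executes triggers
        this.taskState         = taskState;         // only read here, never mutated
    }

    // One poll pass: ask each lane what is due and hand any trigger to the supervisor.
    runTaskCycle(now = Date.now()) {
        const started = [];
        const failed  = [];

        for (const coordinator of this.coordinators) {
            try {
                const lastRunAt = this.taskState.getLastRunAt(coordinator.taskName);
                const trigger   = coordinator.getDueTask({now, lastRunAt});

                if (trigger) {
                    this.processSupervisor.runTask(trigger);
                    started.push(trigger.taskName);
                }
            } catch (error) {
                // Task-level failure isolation: one broken lane must not stop the others.
                failed.push(coordinator.taskName);
            }
        }

        return {started, failed};
    }
}

// Usage with fake collaborators (all values are illustrative).
const executed = [];

const orchestrator = new OrchestratorSketch({
    coordinators: [
        {taskName: 'broken',  getDueTask: () => { throw new Error('coordinator bug'); }},
        {taskName: 'summary', getDueTask: ({now}) => ({taskName: 'summary', reason: 'interval', dueAt: now})},
        {taskName: 'kbSync',  getDueTask: () => null}
    ],
    processSupervisor: {runTask: trigger => { executed.push(trigger.taskName); }},
    taskState:         {getLastRunAt: () => null}
});

const cycleResult = orchestrator.runTaskCycle(1000);
```

In this sketch the orchestrator never inspects why a task is due and never touches a child process, which is the division of labour the target architecture asks for.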

This is a NEW milestone in the v13 path: M3.5 (Orchestrator decomposition), sequenced between current M3 (orchestrator skeleton, complete via #11009/#11016) and M4 (migrate decomposed daemon services to Orchestrator). Adding M3.5 first prevents M4's per-coordinator-service additions (DreamCoordinator, SandmanCoordinator, BackupService, GoldenPathCoordinator, GraphMaintenanceCoordinator) from compounding the existing fat-class problem.

Sub-tasks (file lazily as work picks them up)

Extraction order INVERTED per @neo-gpt's leakage-prevention reasoning at https://github.com/orgs/neomjs/discussions/11025#discussioncomment-16863204: if ProcessSupervisorService extracts first against raw mutable state, the new service inherits the exact pattern we're trying to remove. Locking TaskStateService mutation API first means ProcessSupervisor consumes a clean API from day one.

| # | Sub-extraction | Foundational order |
|---|---|---|
| Sub-1 | TaskStateService: ai/daemons/services/TaskStateService.mjs; owns mutation API (markStarted, markSkipped, markCompleted, markFailed, adoptRunning, clearRecovered, getLastRunAt) + on-disk persistence (readState/writeState/createInitialTaskState). Locks state-mutation boundary BEFORE larger extraction inherits it. | First (foundational; OQ2 inversion) |
| Sub-2 | ProcessSupervisorService: ai/daemons/services/ProcessSupervisorService.mjs; owns runTask + recoverTask family + PID-file lifecycle + close/error wiring. Consumes TaskStateService API for state changes; does NOT mutate raw state directly. | After Sub-1 |
| Sub-3 | CadenceEngine: ai/daemons/services/CadenceEngine.mjs; pure trigger-builder exposing getIntervalTrigger({taskName, now, lastRunAt, intervalMs, reasonPrefix}). NOT execute-runner: per-task coordinators compose this primitive OR bypass for non-interval triggers (e.g., sunset-handover-priority pattern in SummarizationCoordinatorService). | After Sub-1 + Sub-2 |
| Sub-4 | Orchestrator slim-down PR: wire all three collaborators via reactive config, drop extracted methods, verify no behavior regression via characterization-then-extract test pattern. | After Sub-1 + Sub-2 + Sub-3 |
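
To make the Sub-1 boundary more concrete, here is a minimal sketch of a state service that exposes only the mutation methods listed above and hides the persisted envelope behind them. It is not the shipped TaskStateService: the class name `TaskStateServiceSketch`, the injected `storage` adapter (standing in for readState/writeState on disk), the argument shapes, and the state fields (`status`, `pid`, `lastRunAt`, `recovered`) are assumptions chosen for illustration.

```javascript
// Hypothetical sketch: callers change task state only through named methods,
// never by writing to the state object directly.
class TaskStateServiceSketch {
    // `storage` is an assumed adapter with load() and save(state).
    constructor(storage) {
        this.storage = storage;
        this.state   = storage.load() ?? {tasks: {}};
    }

    // Internal helper: merge a patch into one task record, then persist the envelope.
    update(taskName, patch) {
        const tasks   = this.state.tasks;
        const current = tasks[taskName] ?? {status: 'idle', pid: null, lastRunAt: null};

        tasks[taskName] = {...current, ...patch};
        this.storage.save(this.state);

        return tasks[taskName];
    }

    markStarted(taskName, {pid, now})    { return this.update(taskName, {status: 'running', pid, startedAt: now}); }
    markSkipped(taskName, {reason, now}) { return this.update(taskName, {status: 'skipped', reason, skippedAt: now}); }
    markCompleted(taskName, {now})       { return this.update(taskName, {status: 'completed', pid: null, lastRunAt: now}); }
    markFailed(taskName, {error, now})   { return this.update(taskName, {status: 'failed', pid: null, error, failedAt: now}); }
    adoptRunning(taskName, {pid})        { return this.update(taskName, {status: 'running', pid, recovered: true}); }
    clearRecovered(taskName)             { return this.update(taskName, {recovered: false}); }
    getLastRunAt(taskName)               { return this.state.tasks[taskName]?.lastRunAt ?? null; }
}

// Usage with an in-memory adapter (illustrative; the real service persists JSON on disk).
let savedEnvelope = null;

const memoryStorage = {
    load: () => savedEnvelope,
    save: state => { savedEnvelope = JSON.parse(JSON.stringify(state)); }
};

const taskState = new TaskStateServiceSketch(memoryStorage);

taskState.markStarted('summary', {pid: 4242, now: 1000});
taskState.markCompleted('summary', {now: 2000});
```

Because every write goes through `update`, a later ProcessSupervisorService extraction has no raw state object to reach into, which is the leakage the inverted order is meant to prevent.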

Acceptance Criteria (epic-level — fires when sub-tickets close)

  • AC1 — TaskStateService Neo class extracted; mutation API locked; Orchestrator + downstream services consume via taskState_ reactive collaborator (NOT raw state mutation)
  • AC2 — ProcessSupervisorService Neo class extracted; subprocess spawn + lifecycle + PID-file recovery owned by service; consumes TaskStateService API for state changes; does NOT decide whether a task is due
  • AC3 — CadenceEngine Neo class extracted; pure trigger-builder shape (returns trigger object, does not execute); per-task coordinators decide "what work is due"; supervisor executes; orchestrator wires
  • AC4 — Orchestrator.mjs shrinks to ~150 LOC; only coordination logic remains
  • AC5 — Characterization layer around current Orchestrator.spec.mjs preserved during extraction; new service-level unit specs land alongside extracted services
  • AC6 — All existing Orchestrator unit tests still pass (no behavior regression); end state has small service specs + one Orchestrator integration-style unit proving wiring through task-level failure isolation
  • AC7 — learn/agentos/v13-path.md updated to add M3.5 milestone between M3 and M4 (picked up by #11019)
  • AC8 — Logger extraction explicitly OUT of M3.5 scope (per OQ4 resolution) — keep inline or pass logger adapter; addressable as separate substrate ticket if/when it becomes load-bearing
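
AC2 can be pictured with a small sketch of a supervisor that reports lifecycle events through the TaskStateService API and makes no due/not-due decision of its own. This is an illustration under assumptions, not the extracted service: `ProcessSupervisorSketch`, the injected `spawnFn` (standing in for a real process spawn), the `now` clock, and the trigger fields `command` and `args` are hypothetical; PID-file handling and recovery are left out.

```javascript
// Hypothetical sketch: the supervisor executes whatever trigger it is given and
// records outcomes via the mark* methods instead of mutating state itself.
class ProcessSupervisorSketch {
    constructor({spawnFn, taskState, now = () => Date.now()}) {
        this.spawnFn   = spawnFn;   // assumed injectable spawn function
        this.taskState = taskState; // exposes markStarted / markCompleted / markFailed
        this.now       = now;       // injectable clock for deterministic tests
    }

    runTask(trigger) {
        const {taskName} = trigger;
        const child      = this.spawnFn(trigger.command, trigger.args ?? []);

        this.taskState.markStarted(taskName, {pid: child.pid, now: this.now()});

        // Exit code 0 counts as success; anything else is recorded as a failure.
        child.on('close', code => {
            if (code === 0) {
                this.taskState.markCompleted(taskName, {now: this.now()});
            } else {
                this.taskState.markFailed(taskName, {error: `exit code ${code}`, now: this.now()});
            }
        });

        child.on('error', error => {
            this.taskState.markFailed(taskName, {error: error.message, now: this.now()});
        });

        return child;
    }
}

// Minimal fake child process so the sketch runs without spawning anything.
function createFakeChild(pid) {
    const handlers = {};

    return {
        pid,
        on(event, handler) { handlers[event] = handler; return this; },
        emit(event, value) { handlers[event]?.(value); }
    };
}

// Usage: record which state-service methods the supervisor calls.
const stateCalls = [];

const recordingState = {
    markStarted:   (name, info) => stateCalls.push(['markStarted', name, info.pid]),
    markCompleted: name         => stateCalls.push(['markCompleted', name]),
    markFailed:    (name, info) => stateCalls.push(['markFailed', name, info.error])
};

const fakeChild  = createFakeChild(4242);
const supervisor = new ProcessSupervisorSketch({spawnFn: () => fakeChild, taskState: recordingState, now: () => 1000});

supervisor.runTask({taskName: 'summary', command: 'node', args: ['summary.mjs']});
fakeChild.emit('close', 0);
```

Injecting the spawn function and the state collaborator also suits the service-level unit specs mentioned in AC5, since each extracted service can be exercised without a real subprocess.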

Avoided Traps

  • Pile more tasks onto current Orchestrator shape: every new "add task X to Orchestrator" makes the fat-class problem worse; decompose first
  • Big-bang single-PR rewrite: sub-tickets per extraction allow incremental review + cross-family verification per-extraction
  • Skip the precedent audit: SummarizationCoordinatorService is the exemplar pattern; new extractions mirror it (Neo class with focused method surface, reactive collaborator-injection in Orchestrator)
  • Sub-task locations without precedent verification: ai/daemons/services/ already hosts SummarizationCoordinatorService + GraphMaintenanceService + GoldenPathSynthesizer + ConceptIngestor + ConceptDiscoveryService + GapInferenceEngine + IssueIngestor + LazyEdgeDrainer + MemorySessionIngestor + SemanticGraphExtractor + TopologyInferenceEngine; that's the canonical home for daemon-tier services
  • Extract ProcessSupervisor first: bakes in raw-state-mutation responsibility we're trying to remove (OQ2 inversion blocks this; TaskStateService extracts first to lock mutation API)
  • CadenceEngine as execute-runner: a runIfDue(taskName, dueCheckFn, executeFn) shape would make CadenceEngine a mini-Orchestrator owning orchestration flow; pure getIntervalTrigger({...}) primitive preserves the SummarizationCoordinatorService precedent (coordinator decides "what work is due"; supervisor executes; orchestrator wires)
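
The pure trigger-builder shape can be sketched as a single side-effect-free function. The parameter list matches the getIntervalTrigger({taskName, now, lastRunAt, intervalMs, reasonPrefix}) signature named in this ticket; everything else is an assumption for illustration, including epoch-millisecond timestamps, the returned `{taskName, reason, dueAt}` shape, the reason suffixes, and the `getDueSummaryTask` coordinator with its `hasPendingHandover` flag.

```javascript
// Sketch of a pure trigger-builder: it returns a trigger object or null and
// never executes anything itself.
function getIntervalTrigger({taskName, now, lastRunAt, intervalMs, reasonPrefix}) {
    // A task that has never run is treated as due immediately (assumed policy).
    if (lastRunAt == null) {
        return {taskName, reason: `${reasonPrefix}:first-run`, dueAt: now};
    }

    // Not enough time has passed since the last run: nothing is due.
    if (now - lastRunAt < intervalMs) {
        return null;
    }

    return {taskName, reason: `${reasonPrefix}:interval-elapsed`, dueAt: lastRunAt + intervalMs};
}

// Hypothetical per-task coordinator: a priority trigger bypasses the interval
// primitive, otherwise the coordinator composes it.
function getDueSummaryTask({now, lastRunAt, hasPendingHandover}) {
    if (hasPendingHandover) {
        return {taskName: 'summary', reason: 'sunset-handover', dueAt: now};
    }

    return getIntervalTrigger({taskName: 'summary', now, lastRunAt, intervalMs: 60000, reasonPrefix: 'summary'});
}

// Usage (timestamps are illustrative epoch milliseconds).
const notDueYet  = getDueSummaryTask({now: 30000, lastRunAt: 0, hasPendingHandover: false});
const intervalUp = getDueSummaryTask({now: 60000, lastRunAt: 0, hasPendingHandover: false});
const handover   = getDueSummaryTask({now: 30000, lastRunAt: 0, hasPendingHandover: true});
```

Because the function only computes a value, the decision about what is due stays with the coordinator and execution stays with the supervisor; a runIfDue-style callback API would pull both into the engine.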

Provenance

  • Operator architectural challenge 2026-05-09: "an orchestrator orchestrates ... i still challenge the orchestrator architecture and if it does too much"
  • Triggered by my flawed framing of #11018 (closed as not planned 2026-05-09)
  • Empirical anchor: wc -l ai/daemons/Orchestrator.mjs = 682; responsibility map verified via line-by-line read
  • Substrate precedent: SummarizationCoordinatorService.mjs (already proves the pattern works in this codebase)
  • v13-path.md M2 "Common Base Server Class" architectural pattern applied at daemon tier
  • Discussion #11025 — 3-voice cross-family convergence on all 8 OQs (Opus + Gemini + GPT):
    • OQ2 (extraction order): @neo-gpt proposed TaskStateService-first inversion with leakage-prevention reasoning; @neo-opus-4-7 leaned in with reasoning at https://github.com/neomjs/neo/discussions/11025#discussioncomment-16863319; @neo-gemini-3-1-pro aligned 16:50Z
    • OQ8 (CadenceEngine boundary): @neo-gpt's pure-trigger-builder shape preserves coordinator-decides/supervisor-executes/orchestrator-wires precedent
    • OQ3 (state ownership): all three voices aligned on ProcessSupervisor-uses-TaskStateService-API (no raw state mutation)
  • Cross-family review pattern empirically validated: without GPT's inversion, ProcessSupervisor would have inherited raw-state-mutation responsibility we're trying to remove

Self-Identification: @neo-opus-4-7 (Claude Opus 4.7, Claude Code) — chief-architect lane, post-#11018-retraction corrective round; refreshed post-Discussion-#11025 graduation 2026-05-09.

tobiu referenced in commit 64272ad - "refactor(ai): extract ProcessSupervisorService from Orchestrator (#11022) (#11044)" on May 9, 2026, 9:58 PM
tobiu referenced in commit 2f6f9b3 - "refactor(ai): decompose orchestrator into CadenceEngine and extract definitions (#11022) (#11064)" on May 10, 2026, 12:40 AM