Frontmatter
id: 11487
title: Add orchestrator maintenance backpressure
state: Closed
labels: bug, ai, architecture, performance, model-experience
assignees: neo-gpt
createdAt: May 16, 2026, 10:09 PM
updatedAt: May 16, 2026, 11:32 PM
githubUrl: https://github.com/neomjs/neo/issues/11487
author: neo-gpt
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 16, 2026, 11:32 PM

Add orchestrator maintenance backpressure

Closed · bug · ai · architecture · performance · model-experience
neo-gpt
neo-gpt commented on May 16, 2026, 10:09 PM

Context

Operator logs from 2026-05-16 show npm run ai:orchestrator restarting on an already pressured laptop, adopting the existing long-lived daemons, and then immediately starting both periodic maintenance lanes:

[2026-05-16T19:36:21.072Z] [PID:42037] [INFO] [Orchestrator] Started. summaryInterval=600000ms kbSyncInterval=1800000ms poll=3000ms.
[2026-05-16T19:36:21.578Z] [PID:42037] [INFO] [ProcessSupervisor] Starting session summarization (periodic-sweep:600000).
[2026-05-16T19:36:21.579Z] [PID:42037] [INFO] [ProcessSupervisor] Starting knowledge base sync (periodic-sync:1800000).

During the same window, add_memory from @neo-gpt timed out/froze. A later Memory Core healthcheck showed the memory count had increased by one, so the write path may commit late while the request/response path remains unusable under pressure. The operator therefore instructed @neo-gpt to skip add_memory until the current pressure issue is fixed.

The Problem

The orchestrator currently treats each scheduled maintenance task independently. After restart, if multiple interval tasks are overdue, the first poll() can launch them in the same tick. That is safe for per-task singleton correctness but unsafe for shared-substrate pressure: summary, KB sync, graph ingestion, Dream/Sandman, and future heavy lanes compete for Chroma, SQLite, MLX, and the same laptop memory budget.

This is not solved by #11459. #11459 fixed child stderr severity and duplicate already-running skip log spam. It did not add cross-task scheduling backpressure.

This is also not a duplicate of #11065. #11065 is closed and explicitly listed cross-coordinator scheduling synchronization as out of scope, noting that a follow-up ("don't spawn TWO heavy tasks at once even when both are due") would be filable once contention was observed. The 2026-05-16 restart log is that observation.

The Architectural Reality

Empirical code anchors from current dev:

  • ai/daemons/Orchestrator.mjs:326-354 configures/recovers tasks and then calls this.poll() immediately after logging Started.
  • ai/daemons/Orchestrator.mjs:400-438 evaluates summary and kbSync with independent cadenceEngine.runIfDue(...) calls in the same poll() pass.
  • ai/daemons/services/CadenceEngine.mjs:52-54 uses pure interval due logic: intervalMs > 0 && now - lastRunAt >= intervalMs.
  • ai/daemons/services/CadenceEngine.mjs:64-70 executes every truthy trigger passed to runIfDue; it does not know whether another heavy task already started earlier in the poll.
  • ai/daemons/services/SummarizationCoordinatorService.mjs:25-43 returns a summary trigger for unread handovers or overdue interval; it does not account for KB sync or other active heavy tasks.
  • ai/daemons/services/ProcessSupervisorService.mjs:266-270 dedupes repeated starts for the same already-running task, but this does not prevent starting a different heavy task in the same poll.

learn/agentos/v13-path.md:115 already states the cadence model must be block-aware because graph-processing tasks can block add_memory while running. The current implementation still lacks the shared backpressure primitive that applies that direction across maintenance lanes.
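
The restart behavior described by these anchors can be reproduced in isolation. The following is a minimal sketch, not the actual CadenceEngine or Orchestrator code: isDue mirrors only the interval expression quoted above, while pollIndependently, the task objects, and the one-hour-stale lastRunAt values are hypothetical stand-ins chosen for illustration.

```javascript
// Mirrors the interval expression quoted from CadenceEngine.mjs:52-54.
// The function name and argument shape are assumptions for this sketch.
function isDue({intervalMs, lastRunAt}, now) {
    return intervalMs > 0 && now - lastRunAt >= intervalMs;
}

// Hypothetical poll pass: every task is evaluated on its own,
// with no knowledge of what already started earlier in the same pass.
function pollIndependently(tasks, now) {
    const started = [];
    for (const task of tasks) {
        if (isDue(task, now)) {
            started.push(task.name);
        }
    }
    return started;
}

// Intervals match the logged summaryInterval / kbSyncInterval.
// lastRunAt is assumed: both lanes last ran one hour before the restart.
const restartTime = Date.parse('2026-05-16T19:36:21.072Z');
const overdueTasks = [
    {name: 'summary', intervalMs: 600000,  lastRunAt: restartTime - 3600000},
    {name: 'kbSync',  intervalMs: 1800000, lastRunAt: restartTime - 3600000}
];

const startedTogether = pollIndependently(overdueTasks, restartTime);
console.log(startedTogether); // both lanes start in the same pass
```

Because each check only compares a task against its own interval, any restart after a long enough pause makes every lane due at once.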

The Fix

Add a narrow orchestrator maintenance backpressure primitive that gates heavy periodic work across task names.

Suggested shape:

  1. Define a heavy-maintenance task set in the orchestrator scheduling layer. Initial candidates: summary, kbSync, primaryDevSync, dream, goldenPath; include backup only if implementation evidence shows it is materially contending, otherwise leave it out as a fast safety task.
  2. During one poll() pass, allow at most one due heavy maintenance task to start. If a heavy task is already running at task-state level, defer other due heavy tasks.
  3. Preserve continuous daemon supervision (chroma, memoryCoreChroma, bridgeDaemon, mlx) outside this backpressure gate.
  4. Preserve per-task failure isolation and existing cadenceEngine.runIfDue(...) ergonomics where practical. The gate can live in Orchestrator.mjs or a small helper/service only if that reduces complexity without over-abstracting.
  5. Log deferrals sparingly and below error severity. The log should answer "why did a due task not start?" without reintroducing per-poll noise.
  6. Add unit tests that prove restart/overdue conditions start only one heavy task per poll while non-heavy/continuous supervision remains unaffected.
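
Steps 1 to 3 could be sketched as a small pure function. This is one possible shape under stated assumptions, not the merged implementation: HEAVY_TASKS uses the initial candidates from step 1, and applyBackpressure, its arguments, and its return shape are hypothetical names introduced here. Continuous daemon supervision is assumed to never pass through this function at all.

```javascript
// Assumed classification, taken from the initial candidates in step 1.
const HEAVY_TASKS = new Set(['summary', 'kbSync', 'primaryDevSync', 'dream', 'goldenPath']);

// Hypothetical helper: decides which due tasks may start in one poll() pass.
//   dueTasks:     names of tasks whose trigger is due, in evaluation order
//   runningTasks: names of tasks currently running at task-state level
function applyBackpressure(dueTasks, runningTasks) {
    let heavyBusy = runningTasks.some(name => HEAVY_TASKS.has(name));
    const start = [];
    const deferred = [];

    for (const name of dueTasks) {
        if (!HEAVY_TASKS.has(name)) {
            start.push(name);    // non-heavy work bypasses the gate
        } else if (heavyBusy) {
            deferred.push(name); // stays due and is retried on a later poll
        } else {
            start.push(name);
            heavyBusy = true;    // at most one heavy start per pass
        }
    }

    return {start, deferred};
}

// Restart case: both heavy lanes are overdue, plus a fast non-heavy task.
const firstPoll = applyBackpressure(['summary', 'kbSync', 'backup'], []);
console.log(firstPoll); // { start: [ 'summary', 'backup' ], deferred: [ 'kbSync' ] }

// Later poll: summary is still running, so kbSync keeps waiting.
const whileRunning = applyBackpressure(['kbSync'], ['summary']);
console.log(whileRunning); // { start: [], deferred: [ 'kbSync' ] }
```

Keeping the decision pure (names in, names out) would let the unit tests from step 6 cover restart and overdue conditions without spawning child processes. Evaluation order decides which heavy task wins a pass, so a real implementation may want an explicit priority.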

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| Orchestrator maintenance scheduler | This ticket + learn/agentos/v13-path.md:115 + operator 2026-05-16 restart log | Heavy scheduled tasks are backpressured across task names so restart does not launch every overdue lane at once | Existing per-task singleton guard still prevents duplicate starts of the same task | JSDoc on helper/config if the task classification is not self-evident | Unit test with overdue summary + kbSync proves one start and one deferral |
| Deferral observability | #11459 logging discipline + this ticket | Deferrals are visible but sparse, non-error logs | Health outcome may record skipped/deferred state if existing health semantics support it | Inline comment only where needed | Unit test asserts no duplicate per-poll log flood |
| Continuous daemon supervision | Existing orchestrator daemon contract | chroma, memoryCoreChroma, bridgeDaemon, and mlx restart checks remain outside heavy-maintenance backpressure | Current cooldown remains authoritative | Existing docs unchanged | Existing orchestrator tests continue passing |

Acceptance Criteria

  • Restart/first-poll scheduling cannot start both summary and kbSync in the same poll when both are due.
  • A running heavy maintenance task defers other due heavy maintenance tasks instead of starting them concurrently.
  • Continuous daemon restart checks remain unaffected by the heavy-maintenance gate.
  • Deferral logs are non-error and deduped or naturally sparse enough to avoid per-poll noise.
  • Unit coverage proves the cross-task gate and preserves existing scheduling/failure-isolation behavior.
  • No changes reduce child stderr visibility or undo #11459’s child log severity classification.
  • No changes alter add_memory semantics directly; this ticket reduces orchestrator-created pressure around it.
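
The "deduped or naturally sparse" criterion for deferral logs could be met by remembering the last reported blocker per task. The sketch below is illustrative only: createDeferralLogger, its methods, and the log line wording are hypothetical and do not reflect the orchestrator's actual logger API or message format.

```javascript
// Hypothetical deferral logger: emits one line when a task first becomes
// deferred behind a given blocker, then stays quiet on subsequent polls.
function createDeferralLogger(log) {
    const lastLogged = new Map(); // task name -> blocker name last reported

    return {
        deferred(task, blocker) {
            if (lastLogged.get(task) !== blocker) {
                lastLogged.set(task, blocker);
                // Illustrative wording, deliberately below error severity.
                log(`[INFO] Deferred ${task}: heavy task ${blocker} is running.`);
            }
        },
        started(task) {
            lastLogged.delete(task); // the next deferral of this task logs again
        }
    };
}

const lines = [];
const deferrals = createDeferralLogger(line => lines.push(line));

deferrals.deferred('kbSync', 'summary'); // logs once
deferrals.deferred('kbSync', 'summary'); // silent on the next poll
deferrals.deferred('kbSync', 'summary'); // still silent
deferrals.started('kbSync');
deferrals.deferred('kbSync', 'dream');   // new deferral episode, logs again

console.log(lines.length); // 2
```

With a 3000ms poll and a multi-minute summary run, this would produce one line per deferral episode instead of one per poll.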

Out of Scope

  • Rewriting Memory Core write handling or Chroma internals.
  • Changing the embedding model, vector dimensions, or KB sync content semantics.
  • Reintroducing boot-time auto-summarize/auto-dream behavior.
  • Implementing SandmanCoordinatorService itself; #11065 covered that lane and is already closed.
  • Adding adaptive latency thresholds. A simple deterministic backpressure gate is enough for this incident.

Avoided Traps

  • Naive interval-only scheduling: already shown to start overdue lanes together after restart.
  • A global stop-the-world lock: continuous daemons still need supervision and should not be blocked by a KB sync.
  • Hiding deferrals entirely: silent non-starts make operator diagnosis worse.
  • Treating #11459 as sufficient: logging fixes reduce noise, not shared resource pressure.

Related

  • #11459 — fixed child stderr severity and already-running skip log spam.
  • #11065 — closed SandmanCoordinatorService ticket; explicitly left cross-task heavy scheduling synchronization as future scope if observed.
  • #10576 — prior KB sync volume/backpressure and add_memory contention evidence.
  • learn/agentos/v13-path.md:115 — block-aware cadence direction.

Handoff Retrieval Hint: query_raw_memories(query="orchestrator restart summary kbSync add_memory freeze heavy task backpressure")

tobiu referenced in commit 1c9ed50 - "fix(ai): add orchestrator maintenance backpressure (#11487) (#11489)" on May 16, 2026, 11:32 PM
tobiu closed this issue on May 16, 2026, 11:32 PM