Context
Operator logs on 2026-05-16 showed npm run ai:orchestrator restart from an already pressured laptop state, adopt existing long-lived daemons, then immediately start both periodic maintenance lanes:
[2026-05-16T19:36:21.072Z] [PID:42037] [INFO] [Orchestrator] Started. summaryInterval=600000ms kbSyncInterval=1800000ms poll=3000ms.
[2026-05-16T19:36:21.578Z] [PID:42037] [INFO] [ProcessSupervisor] Starting session summarization (periodic-sweep:600000).
[2026-05-16T19:36:21.579Z] [PID:42037] [INFO] [ProcessSupervisor] Starting knowledge base sync (periodic-sync:1800000).
During the same window, add_memory from @neo-gpt timed out/froze. A later Memory Core healthcheck showed the memory count had increased by one, so the write path may commit late while the request/response path remains unusable under pressure. The operator therefore instructed @neo-gpt to skip add_memory until the current pressure issue is fixed.
The Problem
The orchestrator currently treats each scheduled maintenance task independently. After restart, if multiple interval tasks are overdue, the first poll() can launch them in the same tick. That is safe for per-task singleton correctness but unsafe for shared-substrate pressure: summary, KB sync, graph ingestion, Dream/Sandman, and future heavy lanes compete for Chroma, SQLite, MLX, and the same laptop memory budget.
This is not solved by #11459. #11459 fixed child stderr severity and duplicate already-running skip log spam. It did not add cross-task scheduling backpressure.
This is also not a duplicate of #11065. #11065 is closed and explicitly listed cross-coordinator scheduling synchronization as out of scope: "don't spawn TWO heavy tasks at once even when both are due" was filable once contention was observed. The 2026-05-16 restart log is that observation.
The Architectural Reality
Empirical code anchors from current dev:
- ai/daemons/Orchestrator.mjs:326-354 configures/recovers tasks and then calls this.poll() immediately after logging Started.
- ai/daemons/Orchestrator.mjs:400-438 evaluates summary and kbSync with independent cadenceEngine.runIfDue(...) calls in the same poll() pass.
- ai/daemons/services/CadenceEngine.mjs:52-54 uses pure interval due logic: intervalMs > 0 && now - lastRunAt >= intervalMs.
- ai/daemons/services/CadenceEngine.mjs:64-70 executes every truthy trigger passed to runIfDue; it does not know whether another heavy task already started earlier in the poll.
- ai/daemons/services/SummarizationCoordinatorService.mjs:25-43 returns a summary trigger for unread handovers or an overdue interval; it does not account for KB sync or other active heavy tasks.
- ai/daemons/services/ProcessSupervisorService.mjs:266-270 dedupes repeated starts for the same already-running task, but this does not prevent starting a different heavy task in the same poll.
- learn/agentos/v13-path.md:115 already states the cadence model must be block-aware because graph-processing tasks can block add_memory while running. The current implementation still lacks the shared backpressure primitive that applies that direction across maintenance lanes.
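The restart behavior described by these anchors can be reproduced in a few lines. This is a simplified sketch, not the code in Orchestrator.mjs or CadenceEngine.mjs: `isIntervalDue` mirrors only the interval expression quoted above, while `naivePoll`, the task objects, and `lastRunAt: 0` (standing in for "no recent run recorded after restart") are assumptions made for illustration.

```javascript
// Mirrors the quoted due check: intervalMs > 0 && now - lastRunAt >= intervalMs.
function isIntervalDue(intervalMs, lastRunAt, now) {
    return intervalMs > 0 && now - lastRunAt >= intervalMs;
}

// Hypothetical, simplified poll(): every task is evaluated on its own,
// with no knowledge of what else was started earlier in the same pass.
function naivePoll(tasks, now) {
    const started = [];
    for (const task of tasks) {
        if (isIntervalDue(task.intervalMs, task.lastRunAt, now)) {
            started.push(task.name);
            task.lastRunAt = now;
        }
    }
    return started;
}

// Intervals taken from the 2026-05-16 "Started" log line.
// lastRunAt = 0 is an assumption that makes both lanes overdue at restart.
const now = Date.parse('2026-05-16T19:36:21.072Z');
const tasks = [
    {name: 'summary', intervalMs: 600000,  lastRunAt: 0},
    {name: 'kbSync',  intervalMs: 1800000, lastRunAt: 0}
];

const startedOnFirstPoll = naivePoll(tasks, now);
console.log(startedOnFirstPoll); // [ 'summary', 'kbSync' ]
```

Both lanes start in the first pass because nothing in the due check refers to any other task, which matches the two "Starting ..." log lines one millisecond apart.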
The Fix
Add a narrow orchestrator maintenance backpressure primitive that gates heavy periodic work across task names.
Suggested shape:
- Define a heavy-maintenance task set in the orchestrator scheduling layer. Initial candidates: summary, kbSync, primaryDevSync, dream, goldenPath; include backup only if implementation evidence shows it is materially contending, otherwise leave it out as a fast safety task.
- During one poll() pass, allow at most one due heavy maintenance task to start. If a heavy task is already running at task-state level, defer other due heavy tasks.
- Preserve continuous daemon supervision (chroma, memoryCoreChroma, bridgeDaemon, mlx) outside this backpressure gate.
- Preserve per-task failure isolation and existing cadenceEngine.runIfDue(...) ergonomics where practical. The gate can live in Orchestrator.mjs or a small helper/service only if that reduces complexity without over-abstracting.
- Log deferrals sparingly and below error severity. The log should answer "why did a due task not start?" without reintroducing per-poll noise.
- Add unit tests that prove restart/overdue conditions start only one heavy task per poll while non-heavy/continuous supervision remains unaffected.
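One possible shape for the gate, offered as a sketch rather than the intended implementation. `MaintenanceBackpressureGate`, `beginPoll`, `tryAcquire`, and the `runningTasks` argument are hypothetical names; the heavy set is copied from the candidate list above with backup left out. How the gate would be wired around cadenceEngine.runIfDue(...) is not shown, because that call's exact signature is not given here.

```javascript
// Assumption: initial heavy set from the candidate list; `backup` is excluded.
const HEAVY_TASKS = new Set(['summary', 'kbSync', 'primaryDevSync', 'dream', 'goldenPath']);

// Hypothetical gate; one instance would be shared across poll() passes.
class MaintenanceBackpressureGate {
    constructor(heavyTasks = HEAVY_TASKS) {
        this.heavyTasks = heavyTasks;
        this.startedThisPoll = null;
    }

    // Call once at the top of every poll() pass.
    beginPoll() {
        this.startedThisPoll = null;
    }

    // runningTasks: names of tasks currently running at task-state level.
    // Returns {allowed: true} or {allowed: false, blockedBy: <task name>}.
    tryAcquire(taskName, runningTasks = []) {
        if (!this.heavyTasks.has(taskName)) {
            return {allowed: true}; // continuous daemons and light tasks bypass the gate
        }
        // Blocked by a heavy task started earlier in this pass, or by a
        // different heavy task that is still running from an earlier pass.
        const blockedBy = this.startedThisPoll
            ?? runningTasks.find(name => name !== taskName && this.heavyTasks.has(name));
        if (blockedBy) {
            return {allowed: false, blockedBy};
        }
        this.startedThisPoll = taskName;
        return {allowed: true};
    }
}

// Restart scenario: both lanes are due in the first poll.
const gate = new MaintenanceBackpressureGate();
gate.beginPoll();
const first  = gate.tryAcquire('summary'); // allowed
const second = gate.tryAcquire('kbSync');  // deferred behind summary
const daemon = gate.tryAcquire('chroma');  // allowed: not a heavy task

// Next poll: summary is still running, so kbSync stays deferred.
gate.beginPoll();
const whileRunning = gate.tryAcquire('kbSync', ['summary']);

// Later poll: summary has finished, so kbSync may start.
gate.beginPoll();
const afterFinish = gate.tryAcquire('kbSync', []);

console.log(first.allowed, second.allowed, daemon.allowed, whileRunning.allowed, afterFinish.allowed);
// true false true false true
```

In this sketch a deferred task is not rescheduled; it simply stays due and is re-evaluated on the next poll, so the existing interval logic and the per-task singleton guard remain unchanged. The same five calls could serve as the skeleton of the unit test for the restart/overdue case.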
Contract Ledger Matrix
| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| Orchestrator maintenance scheduler | This ticket + learn/agentos/v13-path.md:115 + operator 2026-05-16 restart log | Heavy scheduled tasks are backpressured across task names so restart does not launch every overdue lane at once | Existing per-task singleton guard still prevents duplicate starts of the same task | JSDoc on helper/config if the task classification is not self-evident | Unit test with overdue summary + kbSync proves one start and one deferral |
| Deferral observability | #11459 logging discipline + this ticket | Deferrals are visible but sparse, non-error logs | Health outcome may record skipped/deferred state if existing health semantics support it | Inline comment only where needed | Unit test asserts no duplicate per-poll log flood |
| Continuous daemon supervision | Existing orchestrator daemon contract | chroma, memoryCoreChroma, bridgeDaemon, and mlx restart checks remain outside heavy-maintenance backpressure | Current cooldown remains authoritative | Existing docs unchanged | Existing orchestrator tests continue passing |
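The "visible but sparse" deferral row could be met by logging only when a task's deferral reason changes. The sketch below is one way to do that under stated assumptions: `DeferralLogThrottle`, its method names, and the message format are all hypothetical and do not reflect the existing logger or the #11459 changes.

```javascript
// Hypothetical helper: remembers the last deferral reason logged per task.
class DeferralLogThrottle {
    constructor(log = console.info) {
        this.log = log;
        this.lastReason = new Map();
    }

    // Logs once per (task, blocker) episode; returns true if a line was emitted.
    noteDeferred(taskName, blockedBy) {
        if (this.lastReason.get(taskName) === blockedBy) {
            return false; // same reason as the previous poll: stay quiet
        }
        this.lastReason.set(taskName, blockedBy);
        // Message format is illustrative only.
        this.log(`[Orchestrator] Deferred ${taskName}: heavy task ${blockedBy} is active.`);
        return true;
    }

    // Clears the episode so a later deferral is logged again.
    noteStarted(taskName) {
        this.lastReason.delete(taskName);
    }
}

// Five consecutive polls with the same blocker produce a single line.
const lines = [];
const throttle = new DeferralLogThrottle(line => lines.push(line));
for (let poll = 0; poll < 5; poll++) {
    throttle.noteDeferred('kbSync', 'summary');
}
throttle.noteStarted('kbSync');
throttle.noteDeferred('kbSync', 'dream'); // new episode, logged again

console.log(lines.length); // 2
```

With a 3000ms poll interval, an unthrottled deferral log would repeat roughly twenty times per minute while a heavy task runs, which is the per-poll flood the Evidence column asks the unit test to rule out.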
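Acceptance Criteria
- […] summary and kbSync in the same poll when both are due.
- […] add_memory semantics directly; this ticket reduces orchestrator-created pressure around it.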
Out of Scope
- Rewriting Memory Core write handling or Chroma internals.
- Changing the embedding model, vector dimensions, or KB sync content semantics.
- Reintroducing boot-time auto-summarize/auto-dream behavior.
- Implementing SandmanCoordinatorService itself; #11065 covered that lane and is already closed.
- Adding adaptive latency thresholds. A simple deterministic backpressure gate is enough for this incident.
Avoided Traps
- Naive interval-only scheduling: already shown to start overdue lanes together after restart.
- A global stop-the-world lock: continuous daemons still need supervision and should not be blocked by a KB sync.
- Hiding deferrals entirely: silent non-starts make operator diagnosis worse.
- Treating #11459 as sufficient: logging fixes reduce noise, not shared resource pressure.
Related
- #11459 — fixed child stderr severity and already-running skip log spam.
- #11065 — closed SandmanCoordinatorService ticket; explicitly left cross-task heavy scheduling synchronization as future scope if observed.
- #10576 — prior KB sync volume/backpressure and add_memory contention evidence.
- learn/agentos/v13-path.md:115 — block-aware cadence direction.
Handoff Retrieval Hint: query_raw_memories(query="orchestrator restart summary kbSync add_memory freeze heavy task backpressure")