The operator surfaced a current daemon-pressure incident: npm run ai:orchestrator can leave the laptop under high memory pressure while heavy maintenance work overlaps or appears to overlap. The initial visible symptom was startup cadence pressure (session summarization and knowledge base sync both due), but the scope is broader: the orchestrator can run for days, and all substrate-heavy jobs must be mutually exclusive over daemon lifetime, not only during boot.
This ticket exists because the current codebase already has partial protection but not a complete heavy-substrate mutex:
ai/daemons/Orchestrator.mjs defines DEFAULT_HEAVY_MAINTENANCE_TASK_NAMES containing summary, kbSync, primary-dev-sync, dream, and golden-path.
Orchestrator.poll() uses a shared activeHeavyTask to defer later heavy tasks when one is active.
test/playwright/unit/ai/daemons/Orchestrator.spec.mjs verifies same-poll backpressure for summary vs KB sync.
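The same-poll backpressure described above can be sketched roughly as follows. This is a simplified illustration, not the actual `Orchestrator.poll()` code: `planPoll` and the task shape (`name`, `due`, `running`) are assumptions made for the example; only the five task names and the idea of a shared `activeHeavyTask` come from this ticket.

```javascript
// Hypothetical sketch; not the real Orchestrator.poll() implementation.
const HEAVY_TASK_NAMES = new Set(['summary', 'kbSync', 'primary-dev-sync', 'dream', 'golden-path']);

function planPoll(tasks) {
    // A heavy task that is already running blocks every other heavy task.
    let activeHeavyTask = tasks.find(t => t.running && HEAVY_TASK_NAMES.has(t.name))?.name ?? null;
    const started = [];
    const deferred = [];

    for (const task of tasks) {
        if (!task.due || task.running) continue;

        if (HEAVY_TASK_NAMES.has(task.name)) {
            if (activeHeavyTask) {
                deferred.push({name: task.name, blockedBy: activeHeavyTask});
                continue;
            }
            activeHeavyTask = task.name; // claimed for the rest of this poll
        }
        started.push(task.name);
    }
    return {started, deferred};
}

const plan = planPoll([
    {name: 'summary', due: true, running: false},
    {name: 'kbSync', due: true, running: false},
    {name: 'backup', due: true, running: false}
]);
console.log(plan);
```

In this sketch `kbSync` is deferred behind `summary`, while `backup` still starts because it is not in the heavy set, which is the gap this ticket targets.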
Historical #11065 explicitly listed cross-coordinator scheduling synchronization as out of scope and “filable as scope-extension if observed contention shows need.” That need is now observed.
Duplicate sweep / V-B-A notes:
Live GitHub issue search for "orchestrator daemon heavy maintenance Sandman Chroma kbSync" returned no open duplicate.
Live GitHub PR search for the same scope returned no open duplicate.
Local resources/content/issues, resources/content/discussions, and resources/content/pulls sweeps found adjacent closed work only: #11065, #11077, #11487, #11496, #11459, #11471, #11469.
Chroma-backed semantic KB search was intentionally skipped for this creation pass because the active incident is Memory/Chroma pressure and the operator explicitly asked agents to avoid memory paths that freeze. Exact GitHub + local-content sweeps were used as the safe fallback.
The Problem
The current implementation protects only part of the invariant:
Orchestrator-owned heavy task starts are mostly serialized. Summary, KB sync, primary-dev-sync, DreamService, and Golden Path run through the same scheduler guard.
Backup is not in the heavy set. The backup task can overlap with summary, KB sync, DreamService, or Golden Path even though it exports KB Chroma, Memory Core Chroma, and SQLite graph state.
Manual heavy scripts bypass the orchestrator guard. npm run ai:run-sandman, npm run ai:sync-kb, npm run ai:backup, and npm run ai:sync-github-workflow can run while the daemon is already doing heavy work.
Primary dev sync has a nested KB-sync path. PrimaryRepoSyncService.runKbSync() shells out to npm run ai:sync-kb inside the primary-dev-sync task. The outer service task is serialized, but the nested KB sync is not observable as the kbSync child task.
Recovered stale children can inherit bad state. If an older broken daemon started two heavy children, recoverTasks() can adopt both. The new scheduler should not start new collisions, but the operator still needs visible diagnostics for inherited collision states.
The load-bearing invariant is:
At any time, across orchestrator-owned tasks and approved manual maintenance scripts, no more than one substrate-heavy maintenance job may hold the heavy-maintenance lease.
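One way to make the invariant checkable, for example against tasks adopted by `recoverTasks()`, is a small collision detector. The sketch below is illustrative only: `findHeavyCollisions`, the `{name, pid}` task shape, and the non-heavy `heartbeat` task are assumptions, and the heavy set already includes `backup` as this ticket proposes.

```javascript
// Hypothetical invariant check; the task shape ({name, pid}) is assumed.
const HEAVY_MAINTENANCE = new Set(['summary', 'kbSync', 'backup', 'primary-dev-sync', 'dream', 'golden-path']);

function findHeavyCollisions(runningTasks) {
    const heavy = runningTasks.filter(task => HEAVY_MAINTENANCE.has(task.name));
    // Zero or one heavy task satisfies the invariant; anything more is a collision.
    return heavy.length > 1 ? heavy : [];
}

const adopted = [
    {name: 'summary', pid: 4101},
    {name: 'dream', pid: 4102},
    {name: 'heartbeat', pid: 4103} // made-up non-heavy task for contrast
];
const collisions = findHeavyCollisions(adopted);

if (collisions.length > 0) {
    // One summary line rather than a warning per poll keeps the log sparse.
    console.log(`inherited heavy-task collision: ${collisions.map(t => `${t.name} (pid ${t.pid})`).join(', ')}`);
}
```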
Precedents / prior scope anchors:
#11496: removed the second Memory Core Chroma daemon from orchestrator supervision after ADR 0003 unified Chroma direction.
#11077: M4 orchestrator convergence epic; closed, but useful historical context rather than the parent for new v13 work.
#10960: current open v13 release-focus parent. This ticket should be linked under #10960.
Responsibility Map / Lane Split
This is intentionally an umbrella ticket. Recommended sub-lanes are independent enough for team self-selection after this ticket lands:
| Lane | Scope | Likely owner surface | Collision risk |
| --- | --- | --- | --- |
| A: Scheduler completion | Add backup to the heavy-maintenance set and strengthen tests for dream / golden-path / backup deferral across polls | Orchestrator.mjs, Orchestrator.spec.mjs | Low; localized daemon scheduler |
| B: Shared lease primitive | Introduce a reusable heavy-maintenance lease/mutex used by orchestrator and CLI scripts | new/existing daemon service helper + tests | Medium; may touch buildScripts and daemon services |
| C: Manual script adoption | Wrap runSandman, syncKnowledgeBase, backup, and syncGithubWorkflow with the shared lease or a fail-fast/deferral mode | buildScripts/ai/*.mjs, package docs | Medium; operator workflow impact |
| D: Nested KB sync visibility | Replace or annotate PrimaryRepoSyncService.runKbSync() so nested KB sync participates in the same lease and is observable as KB work | PrimaryRepoSyncService.mjs, tests | Medium; avoids hidden heavy work |
| E: Observability / stale adoption | Surface inherited heavy-task collisions and lease owner in logs / health output | HealthService, ProcessSupervisorService, task state | Low-to-medium; observability contract |
This map is the convergence artifact for this ticket. Do not file all child tickets blindly in one burst; split only after intake confirms current code and peer capacity.
The Fix
Define and enforce one heavy-maintenance mutex contract across the Agent OS maintenance surfaces.
Minimum viable shape:
Treat backup as heavy in the orchestrator scheduler.
Add unit coverage proving summary, kbSync, backup, primary-dev-sync, dream, and golden-path cannot start while another heavy task is running or was started earlier in the same poll.
Introduce a reusable lease/mutex primitive with stale-owner metadata. The primitive must be usable from both long-lived orchestrator code and CLI scripts.
Wrap manual heavy scripts so they either acquire the lease or exit/defer with a clear non-error message naming the active owner.
Make PrimaryRepoSyncService's nested KB sync use the same lease, or refactor it into an orchestrator-visible task handoff.
Add health/log observability for active lease owner, deferred task, stale lease cleanup, and inherited collision states.
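A minimal sketch of what such a lease could look like, assuming a record with owner, reason, PID, timestamp, and TTL. All names here (`createMemoryStore`, `tryAcquire`, `release`, the record fields) are hypothetical, and storage is injected so the same logic could sit on top of a lock file or an in-memory slot.

```javascript
// Hypothetical lease logic; names and record fields are assumptions for illustration.
function createMemoryStore() {
    let record = null;
    return {
        read: () => record,
        write: next => { record = next; },
        clear: () => { record = null; }
    };
}

function tryAcquire(store, {owner, reason, pid, ttlMs}, now = Date.now()) {
    const current = store.read();
    // A lease older than its own TTL is treated as abandoned by a dead owner.
    const isStale = current !== null && now - current.acquiredAt > current.ttlMs;

    if (current !== null && !isStale) {
        return {acquired: false, heldBy: current};
    }
    const lease = {owner, reason, pid, acquiredAt: now, ttlMs};
    store.write(lease);
    return {acquired: true, lease, replacedStale: isStale ? current : null};
}

function release(store, owner) {
    // Only the current owner may release, so a late finisher cannot drop another job's lease.
    const current = store.read();
    if (current !== null && current.owner === owner) {
        store.clear();
        return true;
    }
    return false;
}

const store = createMemoryStore();
const first = tryAcquire(store, {owner: 'kbSync', reason: 'scheduled', pid: 101, ttlMs: 60000}, 0);
const second = tryAcquire(store, {owner: 'backup', reason: 'manual', pid: 102, ttlMs: 60000}, 1000);
const third = tryAcquire(store, {owner: 'backup', reason: 'manual', pid: 102, ttlMs: 60000}, 61001);
console.log(first.acquired, second.acquired, third.acquired);
```

The in-memory store is not safe across processes. A cross-process variant would need an atomic create step, for instance opening the lock file with Node's exclusive `'wx'` flag, which is left out here.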
Contract Ledger Matrix
Manual heavy scripts (runSandman, syncKnowledgeBase, backup, syncGithubWorkflow): Acquire shared heavy-maintenance lease before doing Chroma/SQLite/LLM-heavy work. Exit/defer clearly when lease held. CLI JSDoc / README where applicable. Unit or smoke tests with temp lock path.
Nested KB sync cascade (PrimaryRepoSyncService.runKbSync()): Participates in the same lease and exposes KB-sync ownership. Skip/defer if lease held. Service JSDoc. Unit test for nested path.
Operator observability (Orchestrator logs + HealthService task outcomes): Logs owner/deferred task; health can expose active lease/collision. Sparse logs, no WARN flood. Health block if added. Tests assert log/health projection.
Acceptance Criteria
AC1 — backup participates in heavy-maintenance backpressure; it cannot overlap with summary, kbSync, primary-dev-sync, dream, or golden-path when scheduled by the orchestrator.
AC2 — Focused orchestrator tests prove daemon-lifetime deferral across polls for every heavy task class, including dream and golden-path, not just same-poll summary-vs-KB.
AC3 — A reusable heavy-maintenance lease/mutex exists with owner, reason, PID/process identity where available, timestamp, stale TTL, and safe release semantics.
AC4 — npm run ai:run-sandman / buildScripts/ai/runSandman.mjs respects the lease so Sandman cannot run while summary, KB sync, backup, primary-dev-sync, or another Sandman/GoldenPath lane is active.
AC5 — npm run ai:sync-kb / buildScripts/ai/syncKnowledgeBase.mjs respects the lease.
AC6 — npm run ai:backup / buildScripts/ai/backup.mjs respects the lease.
AC7 — npm run ai:sync-github-workflow / buildScripts/ai/syncGithubWorkflow.mjs is either classified as heavy and guarded, or explicitly documented as outside the Chroma/SQLite heavy-maintenance mutex with V-B-A evidence.
AC8 — PrimaryRepoSyncService.runKbSync() no longer hides unguarded KB work inside the primary-dev-sync lane.
AC9 — Logs/health clearly distinguish deferred due to active heavy-maintenance lease from task failure.
AC10 — Stale/inherited collision state is visible to the operator without causing a WARN/ERROR flood.
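The acquire-or-defer behavior asked of the manual scripts, and the deferred-versus-failed distinction in AC9, could look something like the sketch below. `runWithLease`, `createFakeLease`, the result shape, and the choice of exit code 0 for a deferral are all assumptions; the job is synchronous only to keep the example short.

```javascript
// Hypothetical wrapper; the lease object shape is an assumption for this sketch.
function runWithLease(lease, owner, job) {
    const attempt = lease.tryAcquire(owner);

    if (!attempt.acquired) {
        // Deferral is an expected outcome, so it is reported separately from failure.
        return {
            status: 'deferred',
            exitCode: 0,
            message: `${owner} deferred due to active heavy-maintenance lease held by ${attempt.heldBy}`
        };
    }
    try {
        job();
        return {status: 'completed', exitCode: 0, message: `${owner} completed`};
    } catch (error) {
        return {status: 'failed', exitCode: 1, message: `${owner} failed: ${error.message}`};
    } finally {
        lease.release(owner); // always hand the lease back, even after a failure
    }
}

// Minimal stand-in lease with a single owner slot.
function createFakeLease() {
    let holder = null;
    return {
        tryAcquire(owner) {
            if (holder !== null) return {acquired: false, heldBy: holder};
            holder = owner;
            return {acquired: true};
        },
        release(owner) {
            if (holder === owner) holder = null;
        }
    };
}

const lease = createFakeLease();
lease.tryAcquire('kbSync'); // the daemon is already syncing the KB
const deferredRun = runWithLease(lease, 'backup', () => {});
lease.release('kbSync');
const completedRun = runWithLease(lease, 'backup', () => {});
const failedRun = runWithLease(lease, 'backup', () => { throw new Error('disk full'); });
console.log(deferredRun.message);
```

Returning a structured result rather than throwing on a held lease lets the caller log a deferral at a lower level than a real failure.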
Out of Scope
Reintroducing a second Memory Core Chroma daemon. ADR 0003 / #11496 direction stands.
Changing the embedding model/provider/vector dimension.
Running heavy scripts during implementation on the operator laptop unless explicitly approved.
Adding a new external daemon supervisor such as systemd or pm2.
Changing project-board semantics beyond attaching this ticket to ProjectV2 #12 and linking it under #10960.
Avoided Traps
Startup-only fix: rejected. The daemon can run for days; the invariant is lifetime mutual exclusion.
Scheduler-only fix: rejected. Manual scripts bypass scheduler-local state today.
Per-script private locks: rejected. That would prevent two Sandman runs but still allow Sandman + KB sync collisions.
Treating backup as cheap: rejected. The backup exports Chroma and SQLite state and can contend with summary/KB/Dream workloads.
Filing five child tickets immediately: rejected for lead-role fan-out discipline. This umbrella preserves the map; child tickets should split after intake/peer convergence.
tobiu referenced in commit a5c6380 - "feat(buildScripts): wrap manual heavy-maintenance scripts with shared lease — Lane C of #11503 (#11507) (#11509) on May 17, 2026, 3:08 AM
tobiu referenced in commit 4e0d0db - "feat(ai): add backup to heavy-maintenance set + cross-poll deferral tests — Lane A of #11503 (#11513) (#11514) on May 17, 2026, 8:22 AM
tobiu referenced in commit 55dadc7 - "feat(ai): annotate runKbSync cascade as kbSync lifecycle (#11520) (#11521) on May 17, 2026, 8:23 AM
tobiu referenced in commit 0cedd3d - "feat(ai): orchestrator shared-lease wrap + cascade env inheritance (#11519) (#11527) on May 17, 2026, 10:19 AM
Architectural Reality
Relevant current surfaces:
ai/daemons/Orchestrator.mjs: daemon scheduler, heavy-task list, active-heavy deferral.
ai/daemons/TaskDefinitions.mjs: task definitions for summary, kbSync, backup, primary-dev-sync, dream, and golden-path.
ai/daemons/services/ProcessSupervisorService.mjs: child-process spawn/adoption and running-state projection.
ai/daemons/services/TaskStateService.mjs: persisted per-task running/lastRunAt state.
ai/daemons/services/PrimaryRepoSyncService.mjs: nested npm run ai:sync-kb cascade.
ai/daemons/DreamService.mjs + ai/daemons/services/GoldenPathSynthesizer.mjs: Sandman/Dream/Golden Path heavy graph and LLM work.
buildScripts/ai/runSandman.mjs, buildScripts/ai/syncKnowledgeBase.mjs, buildScripts/ai/backup.mjs, buildScripts/ai/syncGithubWorkflow.mjs: manual/operator CLI paths that can bypass scheduler-local backpressure.
Origin Session ID: unavailable. Memory Core add_memory is intentionally skipped during this Chroma/memory-pressure incident per operator instruction.
Handoff Retrieval Hint:
query_raw_memories(query="orchestrator heavy maintenance mutex Sandman backup kbSync primary-dev-sync Chroma memory pressure")