Frontmatter
id: 11503
title: Enforce heavy-maintenance mutex across Agent OS tasks
state: Closed
labels: epic, ai, architecture, performance, model-experience
assignees: []
createdAt: May 17, 2026, 12:30 AM
updatedAt: May 17, 2026, 9:05 PM
githubUrl: https://github.com/neomjs/neo/issues/11503
author: neo-gpt
commentsCount: 4
parentIssue: 10960
subIssues:
11505 Add shared heavy-maintenance lease primitive
11511 Classify Golden Path as light orchestrator maintenance
11513 Lane A of #11503 — add `backup` to DEFAULT_HEAVY_MAINTENANCE_TASK_NAMES + cross-poll deferral tests
11519 Cross-daemon lease coverage: orchestrator-side shared-lease adoption + env-var child inheritance
11520 Lane D of #11503 — annotate PrimaryRepoSyncService.runKbSync cascade as observable kbSync task
subIssuesCompleted: 5
subIssuesTotal: 5
blockedBy: []
blocking: []
closedAt: May 17, 2026, 9:05 PM

Enforce heavy-maintenance mutex across Agent OS tasks

neo-gpt commented on May 17, 2026, 12:30 AM

Context

The operator surfaced a current daemon-pressure incident: npm run ai:orchestrator can leave the laptop under high memory pressure while heavy maintenance work overlaps or appears to overlap. The initial visible symptom was startup cadence pressure (session summarization and knowledge base sync both due), but the scope is broader: the orchestrator can run for days, and all substrate-heavy jobs must be mutually exclusive over daemon lifetime, not only during boot.

This ticket exists because the current codebase already has partial protection but not a complete heavy-substrate mutex:

  • ai/daemons/Orchestrator.mjs defines DEFAULT_HEAVY_MAINTENANCE_TASK_NAMES containing summary, kbSync, primary-dev-sync, dream, and golden-path.
  • Orchestrator.poll() uses a shared activeHeavyTask to defer later heavy tasks when one is active.
  • test/playwright/unit/ai/daemons/Orchestrator.spec.mjs verifies same-poll backpressure for summary vs KB sync.
  • Historical #11065 explicitly listed cross-coordinator scheduling synchronization as out of scope and “filable as scope-extension if observed contention shows need.” That need is now observed.
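The scheduler-local protection described in the bullets above can be pictured with a small sketch. It is not the real Orchestrator.poll(): the function name planPoll, its parameters, and its return shape are assumptions made for illustration. Only the task names are taken from the ticket.

```javascript
// Hypothetical sketch of scheduler-local backpressure; not the real Orchestrator.poll().
// The task names mirror the heavy set quoted in the ticket (backup is absent on purpose).
const HEAVY_TASK_NAMES = new Set(['summary', 'kbSync', 'primary-dev-sync', 'dream', 'golden-path']);

/**
 * Decide which due tasks may start in this poll.
 * @param {string[]} dueTasks     task names that are due, in priority order
 * @param {string[]} runningTasks task names still running from earlier polls
 * @returns {{started: string[], deferred: {task: string, blockedBy: string}[]}}
 */
function planPoll(dueTasks, runningTasks) {
    // A heavy task left over from an earlier poll already owns the slot.
    let activeHeavyTask = runningTasks.find(name => HEAVY_TASK_NAMES.has(name)) ?? null;
    const started  = [];
    const deferred = [];

    for (const task of dueTasks) {
        if (!HEAVY_TASK_NAMES.has(task)) {
            started.push(task); // tasks outside the heavy set are never gated
        } else if (activeHeavyTask) {
            deferred.push({task, blockedBy: activeHeavyTask});
        } else {
            activeHeavyTask = task; // first heavy task in this poll takes the slot
            started.push(task);
        }
    }

    return {started, deferred};
}

const samePoll = planPoll(['summary', 'kbSync', 'backup'], []);
console.log(samePoll.started);  // summary and backup both start
console.log(samePoll.deferred); // kbSync is deferred behind summary
```

In this sketch backup starts next to summary because it is missing from the heavy set, which is the gap the ticket describes.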

Duplicate sweep / V-B-A notes:

  • Live GitHub issue search for orchestrator daemon heavy maintenance Sandman Chroma kbSync returned no open duplicate.
  • Live GitHub PR search for the same scope returned no open duplicate.
  • Local resources/content/issues, resources/content/discussions, and resources/content/pulls sweeps found adjacent closed work only: #11065, #11077, #11487, #11496, #11459, #11471, #11469.
  • Chroma-backed semantic KB search was intentionally skipped for this creation pass because the active incident is Memory/Chroma pressure and the operator explicitly asked agents to avoid memory paths that freeze. Exact GitHub + local-content sweeps were used as the safe fallback.

The Problem

The current implementation protects only part of the invariant:

  1. Orchestrator-owned heavy task starts are mostly serialized. Summary, KB sync, primary-dev-sync, DreamService, and Golden Path run through the same scheduler guard.
  2. Backup is not in the heavy set. backup can overlap with summary, KB sync, DreamService, or Golden Path even though it exports KB Chroma, Memory Core Chroma, and SQLite graph state.
  3. Manual heavy scripts bypass the orchestrator guard. npm run ai:run-sandman, npm run ai:sync-kb, npm run ai:backup, and npm run ai:sync-github-workflow can run while the daemon is already doing heavy work.
  4. Primary dev sync has a nested KB-sync path. PrimaryRepoSyncService.runKbSync() shells out to npm run ai:sync-kb inside the primary-dev-sync task. The outer service task is serialized, but the nested KB sync is not observable as the kbSync child task.
  5. Recovered stale children can inherit bad state. If an older broken daemon started two heavy children, recoverTasks() can adopt both. The new scheduler should not start new collisions, but the operator still needs visible diagnostics for inherited collision states.

The load-bearing invariant is:

At any time, across orchestrator-owned tasks and approved manual maintenance scripts, no more than one substrate-heavy maintenance job may hold the heavy-maintenance lease.
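One way to make this invariant checkable in tests is to record when each heavy job held the lease and look for overlapping intervals. The sketch below assumes a hypothetical record shape ({name, start, end} with numeric timestamps and a half-open interval); nothing in it comes from the codebase.

```javascript
// Hypothetical invariant check: heavy runs are {name, start, end} with end exclusive.
// Returns pairs [earlierRun, laterRun] whose lease intervals overlap.
function findHeavyOverlaps(runs) {
    const sorted   = [...runs].sort((a, b) => a.start - b.start);
    const overlaps = [];
    let latest     = null; // the run that reaches furthest into the future so far

    for (const run of sorted) {
        // Each run is compared only with the furthest-reaching earlier run,
        // which is enough to answer "was the invariant ever violated?".
        if (latest && run.start < latest.end) {
            overlaps.push([latest.name, run.name]);
        }
        if (!latest || run.end > latest.end) {
            latest = run;
        }
    }

    return overlaps;
}

const timeline = [
    {name: 'kbSync',  start: 10, end: 20},
    {name: 'summary', start: 0,  end: 10},
    {name: 'backup',  start: 5,  end: 8}
];
console.log(findHeavyOverlaps(timeline)); // [ [ 'summary', 'backup' ] ]
```

An empty result means the recorded history satisfied the invariant. Back-to-back runs such as summary ending at 10 and kbSync starting at 10 do not count as a collision.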

Architectural Reality

Relevant current surfaces:

  • ai/daemons/Orchestrator.mjs — daemon scheduler, heavy-task list, active-heavy deferral.
  • ai/daemons/TaskDefinitions.mjs — task definitions for summary, kbSync, backup, primary-dev-sync, dream, and golden-path.
  • ai/daemons/services/ProcessSupervisorService.mjs — child-process spawn/adoption and running-state projection.
  • ai/daemons/services/TaskStateService.mjs — persisted per-task running/lastRunAt state.
  • ai/daemons/services/PrimaryRepoSyncService.mjs — nested npm run ai:sync-kb cascade.
  • ai/daemons/DreamService.mjs + ai/daemons/services/GoldenPathSynthesizer.mjs — Sandman/Dream/Golden Path heavy graph and LLM work.
  • buildScripts/ai/runSandman.mjs, buildScripts/ai/syncKnowledgeBase.mjs, buildScripts/ai/backup.mjs, buildScripts/ai/syncGithubWorkflow.mjs — manual/operator CLI paths that can bypass scheduler-local backpressure.

Precedents / prior scope anchors:

  • #11065 — SandmanCoordinatorService ticket named contention-aware scheduling and explicitly deferred cross-coordinator synchronization.
  • #11487 — added scheduler-local maintenance backpressure.
  • #11496 — removed the second Memory Core Chroma daemon from orchestrator supervision after ADR 0003 unified Chroma direction.
  • #11077 — M4 orchestrator convergence epic; closed, but useful historical context rather than the parent for new v13 work.
  • #10960 — current open v13 release-focus parent. This ticket should be linked under #10960.

Responsibility Map / Lane Split

This is intentionally an umbrella ticket. Recommended sub-lanes are independent enough for team self-selection after this ticket lands:

| Lane | Scope | Likely owner surface | Collision risk |
| --- | --- | --- | --- |
| A: Scheduler completion | Add backup to the heavy-maintenance set and strengthen tests for dream / golden-path / backup deferral across polls | Orchestrator.mjs, Orchestrator.spec.mjs | Low; localized daemon scheduler |
| B: Shared lease primitive | Introduce a reusable heavy-maintenance lease/mutex used by orchestrator and CLI scripts | new/existing daemon service helper + tests | Medium; may touch buildScripts and daemon services |
| C: Manual script adoption | Wrap runSandman, syncKnowledgeBase, backup, and syncGithubWorkflow with the shared lease or a fail-fast/deferral mode | buildScripts/ai/*.mjs, package docs | Medium; operator workflow impact |
| D: Nested KB sync visibility | Replace or annotate PrimaryRepoSyncService.runKbSync() so nested KB sync participates in the same lease and is observable as KB work | PrimaryRepoSyncService.mjs, tests | Medium; avoids hidden heavy work |
| E: Observability / stale adoption | Surface inherited heavy-task collisions and lease owner in logs / health output | HealthService, ProcessSupervisorService, task state | Low-to-medium; observability contract |

This map is the convergence artifact for this ticket. Do not file all child tickets blindly in one burst; split only after intake confirms current code and peer capacity.

The Fix

Define and enforce one heavy-maintenance mutex contract across the Agent OS maintenance surfaces.

Minimum viable shape:

  1. Treat backup as heavy in the orchestrator scheduler.
  2. Add unit coverage proving summary, kbSync, backup, primary-dev-sync, dream, and golden-path cannot start while another heavy task is running or was started earlier in the same poll.
  3. Introduce a reusable lease/mutex primitive with stale-owner metadata. The primitive must be usable from both long-lived orchestrator code and CLI scripts.
  4. Wrap manual heavy scripts so they either acquire the lease or exit/defer with a clear non-error message naming the active owner.
  5. Make PrimaryRepoSyncService's nested KB sync use the same lease, or refactor it into an orchestrator-visible task handoff.
  6. Add health/log observability for active lease owner, deferred task, stale lease cleanup, and inherited collision states.
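A minimal sketch of the lease primitive from step 3 is shown below. It is an in-memory illustration under stated assumptions: createLease, its options, and the record fields are hypothetical names, and the clock is injected so that stale-owner handling can be exercised without waiting.

```javascript
// Hypothetical in-memory lease. A real primitive would persist the record
// (for example in a lock file) so that separate processes can see it.
function createLease({ttlMs, now = Date.now}) {
    let record = null; // {owner, reason, pid, acquiredAt} while held

    const isStale = () => record !== null && now() - record.acquiredAt > ttlMs;

    return {
        acquire({owner, reason, pid}) {
            let reclaimedFrom = null;
            if (isStale()) {
                reclaimedFrom = record.owner; // remember whose stale lease was cleaned up
                record = null;
            }
            if (record) {
                return {acquired: false, heldBy: {...record}};
            }
            record = {owner, reason, pid, acquiredAt: now()};
            return {acquired: true, reclaimedFrom};
        },
        // Safe release: only the current owner may drop the lease.
        release(owner) {
            if (!record || record.owner !== owner) return false;
            record = null;
            return true;
        },
        current: () => (record ? {...record} : null)
    };
}

let fakeTime = 0;
const demoLease = createLease({ttlMs: 1000, now: () => fakeTime});

console.log(demoLease.acquire({owner: 'summary', reason: 'scheduled', pid: 101}).acquired); // true
console.log(demoLease.acquire({owner: 'backup', reason: 'manual', pid: 202}).acquired);     // false
fakeTime = 2000; // the summary owner never released and is now past its TTL
console.log(demoLease.acquire({owner: 'backup', reason: 'manual', pid: 202}));
// { acquired: true, reclaimedFrom: 'summary' }
```

Because this state lives in one process, it cannot cover CLI scripts by itself. One option for cross-process use is an exclusive file create (Node's fs.openSync with the 'wx' flag fails if the file already exists), with the same metadata written into the file.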

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| Orchestrator heavy task scheduler | Orchestrator.mjs, #11487 | At most one heavy task starts or remains active per poll / daemon lifetime | Defers non-owner with non-error log | JSDoc + focused tests | Unit tests across all heavy task names |
| Manual heavy scripts | runSandman, syncKnowledgeBase, backup, syncGithubWorkflow | Acquire shared heavy-maintenance lease before doing Chroma/SQLite/LLM-heavy work | Exit/defer clearly when lease held | CLI JSDoc / README where applicable | Unit or smoke tests with temp lock path |
| Nested KB sync cascade | PrimaryRepoSyncService.runKbSync() | Participates in the same lease and exposes KB-sync ownership | Skip/defer if lease held | Service JSDoc | Unit test for nested path |
| Operator observability | Orchestrator logs + HealthService task outcomes | Logs owner/deferred task; health can expose active lease/collision | Sparse logs, no WARN flood | Health block if added | Tests assert log/health projection |
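The "Manual heavy scripts" row can be sketched as a thin wrapper around a script's main work. The names runWithHeavyLease and createStubLease, the owner strings, and the result shape are all hypothetical. The job is synchronous here only to keep the sketch short; a real script would await its work.

```javascript
// Hypothetical wrapper for a manual heavy script. `lease` is any object with
// acquire(owner) -> {acquired, heldBy?} and release(owner).
function runWithHeavyLease(lease, owner, job, log = console.log) {
    const attempt = lease.acquire(owner);

    if (!attempt.acquired) {
        // Deferral is not a failure: name the active owner and exit cleanly.
        log(`[${owner}] deferred: heavy-maintenance lease is held by ${attempt.heldBy}`);
        return {status: 'deferred', exitCode: 0};
    }

    try {
        job();
        return {status: 'completed', exitCode: 0};
    } catch (error) {
        log(`[${owner}] failed: ${error.message}`);
        return {status: 'failed', exitCode: 1};
    } finally {
        lease.release(owner); // always release, even when the job throws
    }
}

// Minimal single-process stand-in for the shared lease, for illustration only.
function createStubLease() {
    let holder = null;
    return {
        acquire(owner) {
            if (holder) return {acquired: false, heldBy: holder};
            holder = owner;
            return {acquired: true};
        },
        release(owner) {
            if (holder === owner) holder = null;
        }
    };
}

const lease = createStubLease();
lease.acquire('orchestrator:kbSync'); // the daemon is in the middle of a KB sync
const result = runWithHeavyLease(lease, 'cli:ai:backup', () => { /* backup work */ });
console.log(result); // { status: 'deferred', exitCode: 0 }
```

Returning exit code 0 on deferral is one possible reading of "clear non-error message"; a distinct non-zero code reserved for deferral would be an equally valid design if callers need to tell the two apart.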

Acceptance Criteria

  • AC1 — backup participates in heavy-maintenance backpressure; it cannot overlap with summary, kbSync, primary-dev-sync, dream, or golden-path when scheduled by the orchestrator.
  • AC2 — Focused orchestrator tests prove daemon-lifetime deferral across polls for every heavy task class, including dream and golden-path, not just same-poll summary-vs-KB.
  • AC3 — A reusable heavy-maintenance lease/mutex exists with owner, reason, PID/process identity where available, timestamp, stale TTL, and safe release semantics.
  • AC4 — npm run ai:run-sandman / buildScripts/ai/runSandman.mjs respects the lease so Sandman cannot run while summary, KB sync, backup, primary-dev-sync, or another Sandman/GoldenPath lane is active.
  • AC5 — npm run ai:sync-kb / buildScripts/ai/syncKnowledgeBase.mjs respects the lease.
  • AC6 — npm run ai:backup / buildScripts/ai/backup.mjs respects the lease.
  • AC7 — npm run ai:sync-github-workflow / buildScripts/ai/syncGithubWorkflow.mjs is either classified as heavy and guarded, or explicitly documented as outside the Chroma/SQLite heavy-maintenance mutex with V-B-A evidence.
  • AC8 — PrimaryRepoSyncService.runKbSync() no longer hides unguarded KB work inside the primary-dev-sync lane.
  • AC9 — Logs/health clearly distinguish deferred due to active heavy-maintenance lease from task failure.
  • AC10 — Stale/inherited collision state is visible to the operator without causing a WARN/ERROR flood.
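For AC10, one possible way to keep inherited collisions visible without flooding the log is to report only when the set of colliding heavy tasks changes. The sketch below assumes a hypothetical createCollisionReporter helper that is called once per poll with the names of running tasks. The heavy set is the one listed in AC1.

```javascript
// Hypothetical reporter: logs a collision once, stays quiet while it persists,
// and logs once more when it clears.
const HEAVY_TASKS = new Set(['summary', 'kbSync', 'backup', 'primary-dev-sync', 'dream', 'golden-path']);

function createCollisionReporter(log = console.log) {
    let lastKey = ''; // empty string means "no collision reported"

    return function report(runningTaskNames) {
        const heavy = runningTaskNames.filter(name => HEAVY_TASKS.has(name)).sort();
        const key   = heavy.length > 1 ? heavy.join('+') : '';

        if (key !== lastKey) {
            log(key
                ? `inherited heavy-task collision: ${heavy.join(', ')}`
                : 'heavy-task collision cleared');
            lastKey = key;
        }

        return heavy.length > 1;
    };
}

const report = createCollisionReporter();
report(['summary', 'dream']); // logs: inherited heavy-task collision: dream, summary
report(['summary', 'dream']); // silent, same collision as the previous poll
report(['summary']);          // logs: heavy-task collision cleared
```

The same change-detection idea could feed a health field instead of a log line, so that operators can query the current collision state without relying on log history.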

Out of Scope

  • Reintroducing a second Memory Core Chroma daemon. ADR 0003 / #11496 direction stands.
  • Changing the embedding model/provider/vector dimension.
  • Running heavy scripts during implementation on the operator laptop unless explicitly approved.
  • Adding a new external daemon supervisor such as systemd or pm2.
  • Changing project-board semantics beyond attaching this ticket to ProjectV2 #12 and linking it under #10960.

Avoided Traps

  • Startup-only fix: rejected. The daemon can run for days; the invariant is lifetime mutual exclusion.
  • Scheduler-only fix: rejected. Manual scripts bypass scheduler-local state today.
  • Per-script private locks: rejected. That would prevent two Sandman runs but still allow Sandman + KB sync collisions.
  • Treating backup as cheap: rejected. The backup exports Chroma and SQLite state and can contend with summary/KB/Dream workloads.
  • Filing five child tickets immediately: rejected for lead-role fan-out discipline. This umbrella preserves the map; child tickets should split after intake/peer convergence.

Related

  • Parent focus: #10960 v13 Release Tracking — main-focus-items canonical sub-issue tree
  • Historical orchestrator epic: #11077 M4 Architectural Convergence — closed, context only
  • Prior scheduler backpressure: #11487 / PR #11489
  • Sandman scheduling scope anchor: #11065
  • Memory Core Chroma cleanup: #11496 / PR #11499
  • Child stderr / skip-noise adjacent fix: #11459
  • Gh-workflow CLI heavy script: #11469

Origin Session ID: unavailable — Memory Core add_memory is intentionally skipped during this Chroma/memory-pressure incident per operator instruction.

Handoff Retrieval Hint: query_raw_memories(query="orchestrator heavy maintenance mutex Sandman backup kbSync primary-dev-sync Chroma memory pressure")

tobiu referenced in commit a5c6380 - "feat(buildScripts): wrap manual heavy-maintenance scripts with shared lease — Lane C of #11503 (#11507) (#11509)" on May 17, 2026, 3:08 AM
tobiu referenced in commit 4e0d0db - "feat(ai): add backup to heavy-maintenance set + cross-poll deferral tests — Lane A of #11503 (#11513) (#11514)" on May 17, 2026, 8:22 AM
tobiu referenced in commit 55dadc7 - "feat(ai): annotate runKbSync cascade as kbSync lifecycle (#11520) (#11521)" on May 17, 2026, 8:23 AM
tobiu referenced in commit 0cedd3d - "feat(ai): orchestrator shared-lease wrap + cascade env inheritance (#11519) (#11527)" on May 17, 2026, 10:19 AM