Frontmatter
id: 11505
title: Add shared heavy-maintenance lease primitive
state: Closed
labels: enhancement, ai, architecture, performance, build, model-experience
assignees: neo-gpt
createdAt: May 17, 2026, 1:06 AM
updatedAt: May 17, 2026, 1:27 AM
githubUrl: https://github.com/neomjs/neo/issues/11505
author: neo-gpt
commentsCount: 0
parentIssue: 11503
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 17, 2026, 1:27 AM

Add shared heavy-maintenance lease primitive

neo-gpt commented on May 17, 2026, 1:06 AM

Context

#11503 established the prio-0 umbrella for daemon / manual-script heavy-maintenance mutual exclusion after the operator observed Chroma / Memory pressure and overlapping maintenance work. This child ticket isolates Lane B: the reusable lease primitive that later lanes can consume without inventing per-script private locks.

Duplicate sweep / V-B-A notes:

  • Remote GitHub search for "heavy-maintenance lease", "heavy maintenance mutex", and "backup kbSync Sandman" found #11503 as the only open matching umbrella; adjacent issues #11065, #11062, #11018, and #11487 are closed.
  • Local content sweep found related mutex precedents only: heartbeat lock #10319 / PR #10598, identity inflight lock #10674 / PR #10683, and closed scheduler backpressure #11487. None provide a current reusable heavy-maintenance lease for orchestrator plus CLI scripts.
  • Chroma-backed semantic KB search was intentionally skipped because the active incident is Chroma / Memory pressure and the operator instructed agents to avoid memory paths that freeze.

Structural pre-flight:

Pre-Flight (structural fast-path): authoring ai/daemons/services/HeavyMaintenanceLeaseService.mjs matches the sibling pattern of ai/daemons/services/TaskStateService.mjs and ai/daemons/services/ProcessSupervisorService.mjs in ai/daemons/services/; this is daemon-owned coordination state consumed by the orchestrator and operator-runnable maintenance scripts. buildScripts/ai/*.mjs already import daemon-side services where needed, so no novel directory choice is introduced. Map-maintenance: not-needed; the existing Agent OS daemon-services structural row covers this service class.

The Problem

The current repo has scheduler-local backpressure, but no shared lease substrate:

  1. ai/daemons/Orchestrator.mjs:43 defines the heavy-maintenance task list without backup.
  2. ai/daemons/Orchestrator.mjs:497 implements scheduler-local createMaintenanceExecutor(...) and applies it to summary, kbSync, primary-dev-sync, dream, and golden-path, but backup is still scheduled through raw executeTask at ai/daemons/Orchestrator.mjs:575.
  3. Manual scripts bypass the scheduler guard entirely: buildScripts/ai/syncKnowledgeBase.mjs:16, buildScripts/ai/runSandman.mjs:157, buildScripts/ai/backup.mjs:118, and buildScripts/ai/syncGithubWorkflow.mjs:47.
  4. PrimaryRepoSyncService.runKbSync() shells out to npm run ai:sync-kb at ai/daemons/services/PrimaryRepoSyncService.mjs:483-486, hiding KB work inside the primary-dev-sync lane.
  5. syncGithubWorkflow is not only GitHub IO. SyncService.runFullSync() triggers Stage 2 Native Graph ingestion at ai/services/github-workflow/SyncService.mjs:160-172, and IssueIngestor uses StorageRouter.getGraphCollection() plus graph writes at ai/daemons/services/IssueIngestor.mjs:93-94, :210, :235, :289, and :335.

Without a shared primitive, each later lane has to choose between duplicating lock behavior or continuing to trust process-local state. Both shapes preserve the collision class #11503 exists to close.

The Architectural Reality

Existing precedents show two different lock classes:

  • ai/scripts/heartbeatLock.mjs is a wrapper-level file lock for expensive heartbeat-adjacent work. It proves the file-lock pattern and stale-lock inspection shape, but it is not tied to orchestrator task ownership or health reporting.
  • ai/scripts/inflightLock.mjs is per-identity wake / restart state. It proves durable diagnostic lock payloads, but its release semantics are memory-observation based and not appropriate for maintenance task ownership.
  • ai/daemons/Orchestrator.mjs already owns in-process heavy deferral semantics, but those semantics are not visible to manual scripts or nested child processes.

The new primitive belongs in daemon services because the lease is daemon-owned coordination state, not a one-shot operator script. Manual scripts should consume the same primitive rather than each writing a private lock.

The Fix

Introduce a reusable heavy-maintenance lease service with testable, deterministic semantics.

Minimum implementation shape:

  1. Add ai/daemons/services/HeavyMaintenanceLeaseService.mjs exposing pure / injectable helpers and a default singleton service for runtime use.
  2. Store lease state in .neo-ai-data/orchestrator-daemon/heavy-maintenance-lease.json by default, with test seams for alternate paths and time sources.
  3. Capture owner, reason, pid, process identity where available, acquiredAt, expiresAt or staleAfterMs, and optional metadata.
  4. Use atomic create semantics where available (wx write or equivalent) so two processes cannot both acquire the lease.
  5. Provide acquire / inspect / release / withLease helpers with safe release semantics: only the current owner token may release unless stale cleanup is explicitly invoked.
  6. Return non-throwing status objects for normal contention so callers can log "deferred due to active heavy-maintenance lease" without treating deferral as task failure.
  7. Add focused unit coverage for acquire success, held-lease deferral, stale replacement, token-guarded release, and malformed-lock recovery.
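
A minimal sketch of items 4 to 6 above, assuming Node's built-in fs module and its 'wx' write flag. The helper names (readHolder, acquireLease, releaseLease), the leasePath parameter, and the status strings are hypothetical and not taken from the merged service. Stale and malformed-lock replacement is deliberately left out of this sketch.

```javascript
import fs     from 'node:fs';
import path   from 'node:path';
import crypto from 'node:crypto';

// Best-effort read of the current holder; returns null when the file is absent or unreadable.
function readHolder(leasePath) {
    try {
        return JSON.parse(fs.readFileSync(leasePath, 'utf8'));
    } catch {
        return null;
    }
}

// Hypothetical acquire helper: the 'wx' flag makes the write fail with EEXIST
// when the file already exists, so only one process can create the lease.
function acquireLease({leasePath, owner, reason, staleAfterMs, metadata = {}, now = Date.now}) {
    const lease = {
        owner,
        reason,
        pid       : process.pid,
        token     : crypto.randomUUID(),
        acquiredAt: now(), // injectable time source for tests
        staleAfterMs,
        metadata
    };

    fs.mkdirSync(path.dirname(leasePath), {recursive: true});

    try {
        fs.writeFileSync(leasePath, JSON.stringify(lease, null, 4), {flag: 'wx'});
    } catch (err) {
        if (err.code !== 'EEXIST') throw err; // real I/O failures still surface
        // Normal contention: report the holder instead of throwing.
        return {acquired: false, status: 'deferred', heldBy: readHolder(leasePath)};
    }

    return {acquired: true, status: 'acquired', lease};
}

// Hypothetical release helper: only the token that acquired the lease may remove it.
function releaseLease({leasePath, token}) {
    const holder = readHolder(leasePath);

    if (!holder || holder.token !== token) {
        return {released: false, status: holder ? 'not-owner' : 'not-held'};
    }

    fs.rmSync(leasePath, {force: true});
    return {released: true, status: 'released'};
}
```

A caller would keep the token from a successful acquireLease(...) result and pass it back to releaseLease(...). A second caller that hits the existing file gets status 'deferred' together with the holder's payload, which is the information the operator needs during a pressure incident.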

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| HeavyMaintenanceLeaseService | #11503 AC3 | Single reusable lease primitive for orchestrator and manual maintenance scripts | Return deferred / held status with active owner metadata | JSDoc + ticket body | Unit tests for acquire / release / stale / malformed state |
| Lease file payload | .neo-ai-data/orchestrator-daemon state pattern | JSON payload records owner, reason, pid, token, acquiredAt, staleAfterMs, metadata | Malformed payload is treated as recoverable stale state with diagnostic | JSDoc schema comments | Unit test corrupt payload handling |
| Caller contention semantics | #11503 AC9 | Normal active-lease contention is a non-error deferral, not WARN/ERROR flood | Sparse log with owner and reason | JSDoc on return statuses | Unit tests assert status object, not thrown error |

Acceptance Criteria

  • AC1 — HeavyMaintenanceLeaseService exists under ai/daemons/services/ with JSDoc / @summary anchors explaining #11503 and the heavy-maintenance mutex contract.
  • AC2 — The service can acquire a lease with owner, reason, pid, token, timestamp, stale TTL, and metadata.
  • AC3 — Concurrent acquisition attempts converge: one owner succeeds; other callers receive a structured held/deferred status naming the active owner.
  • AC4 — Release is token-guarded so a non-owner cannot clear a live lease accidentally.
  • AC5 — Stale lease inspection / replacement is deterministic and testable without wall-clock sleeps.
  • AC6 — Malformed or partially written lease files recover safely without permanently wedging maintenance work.
  • AC7 — Unit tests cover acquire, held contention, stale replacement, guarded release, idempotent cleanup, and malformed payload recovery.
  • AC8 — No manual maintenance script adoption lands in this ticket beyond the minimal seams needed for tests; Lane C owns script wrapping after this API exists.
  • AC9 — No heavy maintenance scripts are executed as validation on the operator laptop.
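
AC5 and AC6 can be met by keeping the stale / malformed decision in a pure function that receives the clock reading as an argument. The sketch below is one possible shape. The payload field names come from this ticket, while classifyLease, the state names, and the diagnostic strings are hypothetical.

```javascript
/**
 * Hypothetical payload shape, assembled from the fields this ticket lists.
 * @typedef {Object} LeasePayload
 * @property {String} owner
 * @property {String} reason
 * @property {Number} pid
 * @property {String} token
 * @property {Number} acquiredAt   epoch milliseconds
 * @property {Number} staleAfterMs
 * @property {Object} [metadata]
 */

/**
 * Classifies raw lease file content against an injected timestamp.
 * @param {String|null} raw   file content, or null when no lease file exists
 * @param {Number}      nowMs caller-supplied clock reading in epoch milliseconds
 * @returns {Object} {state: 'free'|'held'|'stale'|'malformed', ...}
 */
function classifyLease(raw, nowMs) {
    if (raw == null) {
        return {state: 'free'};
    }

    let payload;

    try {
        payload = JSON.parse(raw);
    } catch {
        // Covers empty, truncated, and partially written files
        return {state: 'malformed', diagnostic: 'invalid JSON'};
    }

    const invalid = [
        ...['owner', 'reason', 'token'].filter(key => typeof payload?.[key] !== 'string'),
        ...['pid', 'acquiredAt', 'staleAfterMs'].filter(key => !Number.isFinite(payload?.[key]))
    ];

    if (invalid.length > 0) {
        return {state: 'malformed', diagnostic: `missing or invalid: ${invalid.join(', ')}`};
    }

    const ageMs = nowMs - payload.acquiredAt;

    return {state: ageMs >= payload.staleAfterMs ? 'stale' : 'held', ageMs, payload};
}

const raw = JSON.stringify({
    owner: 'orchestrator', reason: 'kbSync', pid: 4242, token: 't-1', acquiredAt: 1000, staleAfterMs: 60000
});

console.log(classifyLease(raw, 30000).state);         // held
console.log(classifyLease(raw, 61000).state);         // stale
console.log(classifyLease('{"owner":', 30000).state); // malformed
```

Because the function never reads the system clock, a unit test can step a lease from held to stale by changing one argument, with no sleeps involved.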

Out of Scope

  • Wrapping runSandman, syncKnowledgeBase, backup, or syncGithubWorkflow; that is Lane C under #11503.
  • Refactoring PrimaryRepoSyncService.runKbSync(); that is Lane D / #11503 AC8 unless a minimal seam is needed to prove the API.
  • Reintroducing a second Memory Core Chroma daemon.
  • Changing the embedding model, vector dimensions, or Chroma config.
  • Running live Chroma, KB sync, Sandman, backup, or gh-workflow sync during implementation.

Avoided Traps

  • Per-script private locks: rejected because they only prevent duplicate runs of the same script and still allow Sandman + KB sync + backup collisions.
  • In-process-only mutex: rejected because manual scripts and nested child processes bypass orchestrator-local memory.
  • Heartbeat lock reuse: rejected because heartbeat skipping and heavy-maintenance ownership are different contracts; reuse would couple unrelated stale TTLs and observability.
  • Ownerless lock file: rejected because the operator needs active-owner diagnostics during pressure incidents.
  • Throw-on-contention: rejected because contention is expected deferral, not task failure.
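
The throw-on-contention trap can be illustrated with a small wrapper sketch. withLease is named in this ticket, but the signature below, the injected acquire / release functions, and the in-memory createMemoryLease stand-in are assumptions made so the example runs without touching disk.

```javascript
// Hypothetical withLease wrapper: contention yields a status object,
// while errors thrown by the task itself still propagate to the caller.
async function withLease({acquire, release, log = console.log}, task) {
    const attempt = await acquire();

    if (!attempt.acquired) {
        const {owner, reason} = attempt.heldBy ?? {};
        log(`deferred due to active heavy-maintenance lease (owner=${owner}, reason=${reason})`);
        return {ran: false, status: 'deferred', heldBy: attempt.heldBy ?? null};
    }

    try {
        return {ran: true, status: 'completed', result: await task()};
    } finally {
        // Release runs on success and on task failure alike
        await release(attempt.lease.token);
    }
}

// In-memory stand-in for the shared lease, for illustration only.
function createMemoryLease() {
    let current = null;

    return {
        acquireAs(owner, reason) {
            return () => {
                if (current) return {acquired: false, heldBy: current};
                current = {owner, reason, token: `${owner}-${Date.now()}`};
                return {acquired: true, lease: current};
            };
        },
        release(token) {
            // Token-guarded: a non-owner token leaves the lease in place
            if (current?.token === token) current = null;
        },
        holder: () => current
    };
}

const memoryLease = createMemoryLease();
const sandman     = {acquire: memoryLease.acquireAs('runSandman', 'dream'), release: memoryLease.release};
const backup      = {acquire: memoryLease.acquireAs('backup', 'manual backup'), release: memoryLease.release};

withLease(sandman, async () => {
    // While Sandman holds the lease, the backup attempt is deferred rather than failed.
    return withLease(backup, async () => 'backup done');
}).then(outcome => console.log(outcome.result.status)); // deferred
```

With this shape a caller can distinguish "did not run because another owner is active" from "ran and failed" without a try / catch around every contention case.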

Related

Origin Session ID: unavailable — Memory Core add_memory is intentionally skipped during this Chroma / memory-pressure incident per operator instruction.

Handoff Retrieval Hint: rg "HeavyMaintenanceLeaseService|heavy-maintenance lease|maintenance mutex|backup.*executeTask|runKbSync" ai buildScripts test resources/content

tobiu referenced in commit efede3a - "feat(ai): add heavy-maintenance lease service (#11505) (#11506)" on May 17, 2026, 1:27 AM
tobiu closed this issue on May 17, 2026, 1:27 AM