Frontmatter
id: 11640
title: Phase 4B — Manifest Reconciliation Daemon: Tenant-State vs Chroma-Actual Sync
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-ada
createdAt: May 19, 2026, 1:57 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11640
author: neo-opus-ada
commentsCount: 2
parentIssue: 11628
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: [x] 11639 Phase 4A — Per-Tenant Ingestion Observability Daemon (KBRecorderService Extension)
blocking: [x] 11641 Phase 4C — Stale-Chunk Garbage Collection Daemon: Orphan Detection + Retention Enforcement
closedAt: May 21, 2026, 11:33 AM

Phase 4B — Manifest Reconciliation Daemon: Tenant-State vs Chroma-Actual Sync

Closed · v13.0.0/archive-v13-0-0-chunk-12 · enhancement, ai, architecture
neo-opus-ada
neo-opus-ada commented on May 19, 2026, 1:57 PM

Context

Sub of Phase 4 Epic #11628 (meta-Epic #11624).

Periodic state-reconciliation daemon — catches missed tombstones, handles force-push detection beyond per-push payload.

The Problem

Phase 0/1A defines tombstone / manifest / revision-boundary deletion-signaling, but production failure modes exist:

  • Client hooks fail mid-push (some files pushed, deletes lost)
  • Force-push history rewrite (per-push revision-boundary insufficient for branch-rewrite scenarios)
  • Tenant config changes invalidate previously-pushed chunks
  • Network errors cause partial-push state

Without periodic reconciliation, a tenant's Chroma state silently drifts from the tenant's actual repo state.

The Fix

New daemon: ai/scripts/kb-reconciliation-daemon.mjs (sibling to existing daemons).

Per scheduled tick (configurable; default hourly):

  1. For each active tenant:
    • Fetch tenant's claimed-state manifest (from KnowledgeBaseTenantConfig Phase 2E + last-received revision boundary)
    • Fetch Chroma's actual-state chunks for the tenant (via where: {tenantId} filter from Phase 0/1D)
    • Diff: chunks in Chroma but not in claimed-state → orphaned (queue for GC or auto-tombstone)
    • Diff: claimed-state paths not represented in Chroma → re-ingestion candidates (notify tenant via observability surface)
  2. Reconciliation actions:
    • Auto-tombstone orphans older than retention threshold (configurable per-tenant)
    • Emit observability events via Phase 4A
    • Alert operator if drift exceeds threshold (Phase 4D)
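
The two diffs in step 1 amount to a pair of set differences. The following is a minimal sketch, assuming the claimed state is available as a plain list of paths and that each Chroma chunk exposes a `path`; `diffClaimVsActual`, `claimedPaths`, and the `{id, path}` row shape are hypothetical names chosen for illustration and do not come from the codebase.

```javascript
// Hypothetical helper: compares a tenant's claimed paths with the chunks found in Chroma.
// Assumed input shapes (not confirmed by this ticket):
//   claimedPaths: string[]
//   actualChunks: Array<{id: string, path: string}>
function diffClaimVsActual(claimedPaths, actualChunks) {
    const claimed = new Set(claimedPaths);
    const present = new Set(actualChunks.map(chunk => chunk.path));

    // In Chroma but not in the claimed state: orphaned chunk ids
    const orphanIds = actualChunks
        .filter(chunk => !claimed.has(chunk.path))
        .map(chunk => chunk.id);

    // Claimed but not represented in Chroma: re-ingestion candidates
    const missingPaths = claimedPaths.filter(path => !present.has(path));

    return {orphanIds, missingPaths};
}

const result = diffClaimVsActual(
    ['src/a.mjs', 'src/b.mjs'],
    [{id: 'c1', path: 'src/a.mjs'}, {id: 'c2', path: 'src/old.mjs'}]
);

console.log(result); // → { orphanIds: [ 'c2' ], missingPaths: [ 'src/b.mjs' ] }
```

Keeping the comparison free of I/O means the diff can be unit-tested without a running Chroma instance.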

Acceptance Criteria

  • ai/scripts/kb-reconciliation-daemon.mjs exists; follows existing daemon pattern
  • Per-tenant reconciliation logic implemented (claim vs actual diff)
  • Configurable tick interval (aiConfig.knowledgeBase.reconciliationIntervalMs; default 3600000 = 1h)
  • Configurable per-tenant retention threshold (orphan auto-tombstone after N days)
  • Reconciliation events emitted via Phase 4A observability daemon
  • Alerts emitted via Phase 4D when drift exceeds threshold
  • Unit tests: diff logic, auto-tombstone, retention threshold
  • Integration test: simulate tenant force-push → reconciliation removes orphans
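
A defensive read of the reconciliation settings could look like the sketch below. The key names and defaults are the ones listed in the reconciliation config block row of the Contract Ledger; the helper name `readReconciliationConfig` and the type-matching rule are assumptions made for illustration, not the merged implementation.

```javascript
// Defaults as stated in the Contract Ledger's config row.
const RECONCILIATION_DEFAULTS = {
    reconciliationEnabled         : false,
    reconciliationIntervalMs      : 3600000, // 1h
    reconciliationAutoTombstone   : false,
    reconciliationOrphanVersionGap: 2
};

// Hypothetical helper: tolerates a config object with no knowledgeBase key
// or with only some of the reconciliation keys present.
function readReconciliationConfig(aiConfig) {
    const kb     = aiConfig?.knowledgeBase ?? {};
    const config = {};

    for (const [key, fallback] of Object.entries(RECONCILIATION_DEFAULTS)) {
        // Assumption: a value is accepted only when its type matches the default's type
        config[key] = typeof kb[key] === typeof fallback ? kb[key] : fallback;
    }

    return config;
}

const config = readReconciliationConfig({knowledgeBase: {reconciliationIntervalMs: 60000}});

console.log(config.reconciliationIntervalMs, config.reconciliationEnabled); // → 60000 false
```

With this shape, a stale config file that predates the reconciliation keys yields the detect-only, disabled-by-default posture.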

Out of Scope

  • Initial observability daemon → Phase 4A
  • Operator alerting infrastructure → Phase 4D
  • Per-chunk garbage collection scheduling → Phase 4C (separate ticket for GC-specific concerns)

Contract Ledger

Added at intake by @neo-opus-ada (Claude Code) 2026-05-21 — satisfies the ticket-intake §7 Contract Completeness readiness gate (intake comment: https://github.com/neomjs/neo/issues/11640#issuecomment-4504572192). The original-author session is inactive; per ticket-intake §7 the claiming maintainer authors the missing ledger. Tier target: T3 (Explicit Matrix). The ledger is the binding contract; the loose Acceptance Criteria above are refined by these rows.

V1 scope — substrate-grounded. A fresh sweep of the merged Phase 2 code shows the ticket's "claim-vs-actual path-manifest diff" envisions a persisted claimed-state manifest that Phase 2 does not store: KnowledgeBaseTenantConfig carries no path manifest, and revision boundaries (baseRevision / headRevision) are per-push payload parameters, not persisted state. A periodic daemon with no repo access and no stored manifest therefore cannot detect arbitrary drift. The substrate-real V1 reconciliation signal is config-invalidation reconciliation — the tenantConfigVersion chunk stamp (#11637 / VectorService.resolveTenantStamp) compared against the tenant's current getTenantConfig().version. This delivers the ticket's failure-mode #3 ("tenant config changes invalidate previously-pushed chunks") completely; failure-modes #1/#2/#4 (force-push, mid-push, partial-push) are V1.x-deferred — Row 6.
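
The config-invalidation signal can be sketched as a pure function. The signature and the returned keys follow the KbReconciliationEngine row of the ledger; the `{id, metadata}` row shape and the fields inside each `staleOrphans` entry are assumptions, so read this as an illustration of the rule and not as the merged code.

```javascript
// Sketch of the stale-chunk rule: a chunk is a config-stale orphan when its numeric
// tenantConfigVersion stamp is lower than the tenant's current config version.
function diffTenantChunks({rows, currentVersion, orphanVersionGap}) {
    const staleOrphans  = [];
    const actionableIds = [];

    for (const row of rows) {
        const stamp = row.metadata?.tenantConfigVersion;

        // Missing or non-numeric stamp: the chunk cannot be classified, so it is never flagged.
        // With currentVersion === 0 no numeric stamp can be lower, so nothing is stale.
        if (typeof stamp !== 'number' || !(stamp < currentVersion)) {
            continue;
        }

        const versionGap = currentVersion - stamp;

        staleOrphans.push({id: row.id, tenantConfigVersion: stamp, versionGap});

        // Only orphans at or beyond the gap threshold are eligible for auto-tombstoning
        if (versionGap >= orphanVersionGap) {
            actionableIds.push(row.id);
        }
    }

    return {
        staleOrphans,
        staleCount     : staleOrphans.length,
        actionableIds,
        actionableCount: actionableIds.length
    };
}

const diff = diffTenantChunks({
    rows: [
        {id: 'a', metadata: {tenantConfigVersion: 1}},
        {id: 'b', metadata: {tenantConfigVersion: 2}},
        {id: 'c', metadata: {tenantConfigVersion: 3}},
        {id: 'd', metadata: {}}
    ],
    currentVersion  : 3,
    orphanVersionGap: 2
});

console.log(diff.staleCount, diff.actionableIds); // → 2 [ 'a' ]
```

Chunks `a` and `b` are stale, but only `a` reaches the gap of 2; `c` is current and `d` has no stamp, so neither is flagged.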

Refined 2026-05-21 per @neo-gpt's #11640 pre-PR peer review: V1 Phase 4D integration is drift-presence/frequency alerting via reconcileEvents (Row 5) — drift-volume threshold alerting needs a chunks_total rollup metric and is V1.x (Row 6).

Row 1. Target Surface: aiConfig.knowledgeBase reconciliation config block (4 new keys)

  • Source of Authority: #11628 Phase 4B; this ticket AC; #11642's aiConfig.knowledgeBase precedent
  • Proposed Behavior:
    • reconciliationEnabled (Boolean, default false): master opt-in; the daemon exits early when false.
    • reconciliationIntervalMs (Number, default 3600000 = 1h): poll-tick interval.
    • reconciliationAutoTombstone (Boolean, default false): opt-in for the destructive auto-tombstone action; default-off ⇒ the daemon detects + reports only.
    • reconciliationOrphanVersionGap (Number, default 2): a config-stale chunk is auto-tombstone-eligible when currentVersion − chunk.tenantConfigVersion ≥ this.
  • Fallback / Edge Case: A stale gitignored config.mjs predating #11640 has no knowledgeBase key (or lacks the reconciliation keys) → every key is read defensively against its default (mirrors #11642's defensive read).
  • Docs: Yes (ai/config.template.mjs block + JSDoc)
  • Evidence: Unit: config-defaulting + the daemon's opt-in gate

Row 2. Target Surface: Config-stale orphan detection (KbReconciliationEngine pure core)

  • Source of Authority: #11637 tenantConfigVersion stamp; getTenantConfig().version
  • Proposed Behavior: The pure, dependency-free diff engine (mirrors #11642's KbAlertRuleEngine). diffTenantChunks({rows, currentVersion, orphanVersionGap}) → a chunk is a config-stale orphan when typeof metadata.tenantConfigVersion === 'number' && metadata.tenantConfigVersion < currentVersion; its versionGap = currentVersion − tenantConfigVersion; it is actionable when versionGap ≥ orphanVersionGap. Returns {staleOrphans, staleCount, actionableIds, actionableCount}. No I/O, no clock.
  • Fallback / Edge Case: currentVersion === 0 (tenant on the yaml/default config tier, no graph node) → no chunk can be stale (v < 0 is never true); zero orphans, no special-case. A chunk with a missing / non-numeric tenantConfigVersion (pre-#11637 ingest) → not flagged (fail-safe: never auto-action a chunk we cannot classify).
  • Docs: Yes (JSDoc)
  • Evidence: Unit: stale-detection, version-gap partition, currentVersion: 0 no-op, missing-stamp skip

Row 3. Target Surface: The auto-tombstone reconciliation action (knowledge-base Chroma collection delete)

  • Source of Authority: this ticket AC ("auto-tombstone orphans"); the destructive-action conservatism principle
  • Proposed Behavior: When reconciliationAutoTombstone is true, the daemon deletes a tenant's actionableIds via collection.delete({ids}). The delete is tenant-scoped: actionableIds derive only from rows fetched with where: {tenantId} (the getTenantRows batched-collection.get pattern); tenant A's reconciliation never touches tenant B's chunks.
  • Fallback / Edge Case: reconciliationAutoTombstone is false (the default) → no delete is ever issued; the daemon detects + emits telemetry only. A collection.delete throw → logger.error, best-effort; the daemon continues to the next tenant.
  • Docs: Yes (daemon JSDoc)
  • Evidence: Unit: delete gated by the opt-in flag; tenant-scoped id set; delete-throw tolerance

Row 4. Target Surface: Phase 4A telemetry emission (KBRecorderService.recordIngestionMetric)

  • Source of Authority: #11639 / #11665 Phase 4A; recordIngestionMetric's 'reconcile' event type
  • Proposed Behavior: When a tenant has ≥ 1 config-stale orphan, the daemon emits exactly one recordIngestionMetric({tenantId, repoSlug, eventType: 'reconcile', chunksTotal: staleCount, chunksDeleted: <count tombstoned this tick; 0 when auto-tombstone is off>, detail: {staleCount, actionableCount, currentVersion, autoTombstone}}). A clean tenant (zero orphans) emits nothing, so reconcileEvents > 0 genuinely means "drift was found".
  • Fallback / Edge Case: recordIngestionMetric is already best-effort (never throws into the caller). The recorder unavailable → the metric is silently dropped; reconciliation still runs.
  • Docs: Yes (JSDoc)
  • Evidence: Unit: a reconcile metric is emitted for a drifting tenant, suppressed for a clean one

Row 5. Target Surface: Phase 4D alerting integration (telemetry-only seam)

  • Source of Authority: #11642 Phase 4D; KbAlertRuleEngine.KNOWN_METRICS
  • Proposed Behavior: The ticket AC "alerts emitted via Phase 4D when drift exceeds threshold" is satisfied with no #11642 code change as drift-presence/frequency alerting. The daemon emits one reconcile event per tenant-with-drift per tick; reconcileEvents is already a KbAlertRuleEngine.KNOWN_METRICS field, so an operator's aiConfig.knowledgeBase.alertRules entry on reconcileEvents fires when a tenant shows persistent / frequent drift across the alert window. chunksDeleted is also KNOWN_METRICS-covered: alertable as action volume, but non-zero only when reconciliationAutoTombstone is on (0 in the default detect-only posture). The existing #11642 KbAlertingService rolls up the events this daemon emits and fires; the reconciliation daemon does not dispatch alerts itself, so there is no duplication of #11642's channel logic.
  • Fallback / Edge Case: Drift-volume thresholding (alert when the stale-chunk count exceeds N) is not available in V1: getTenantIngestionRollup does not aggregate chunks_total, and chunksTotal / staleCount are not in KNOWN_METRICS. The daemon still records chunksTotal: staleCount + a detail payload (raw-row and detail-visible), so the data is captured, but rollup-aggregated volume alerting needs a chunks_total rollup metric + KNOWN_METRICS coverage → Row 6 (V1.x). No alertRules configured → telemetry is still recorded; only alert dispatch is absent.
  • Docs: Yes (JSDoc cross-ref)
  • Evidence: Verified by inspection (@neo-gpt #11640 pre-PR peer review, 2026-05-21): reconcileEvents is in KNOWN_METRICS; getTenantIngestionRollup SQL has no chunks_total aggregation

Row 6. Target Surface: V1 scope boundary (manifest / force-push detection + per-tenant threshold override + drift-volume alerting)

  • Source of Authority: this ticket "The Fix" (failure-modes #1/#2/#4) + AC; ticket-intake "challenge prescribed fixes"; @neo-gpt #11640 pre-PR peer review
  • Proposed Behavior: V1.x-deferred, documented.
    • (a) Manifest / force-push / partial-push orphan detection requires a persisted claimed-state manifest or a persisted last-received revision boundary; neither exists in Phase 2 substrate.
    • (b) A per-tenant orphanVersionGap override requires extending getTenantConfig's fixed 8-field projection, a #11637-surface change.
    • (c) Drift-volume threshold alerting requires a chunks_total rollup metric in getTenantIngestionRollup + chunksTotal in KbAlertRuleEngine.KNOWN_METRICS, a #11639 / #11642-surface change.
    • V1 ships drift-presence/frequency alerting via reconcileEvents (Row 5) and touches zero merged Phase 2 code (purely additive: 3 new files + the aiConfig template block).
  • Fallback / Edge Case: V1.x is a separate follow-up ticket: it adds the persisted-manifest substrate (or a daemon-side repo walk), the per-tenant override, and the chunks_total rollup metric. The PR body files it.
  • Docs: Yes (PR body "Deltas" + a V1.x follow-up ticket)
  • Evidence: N/A (explicit scope boundary)
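
The decision logic behind the auto-tombstone and Phase 4A telemetry rows of the ledger can be kept separate from I/O. The sketch below assumes a diff result shaped like the ledger's `{staleCount, actionableIds, actionableCount}`; `selectTombstoneIds` and `buildReconcileMetric` are hypothetical helper names, and the payload fields mirror the ones the ledger lists for `recordIngestionMetric`.

```javascript
// Auto-tombstone gate: ids to delete this tick. Empty unless the opt-in flag is exactly true.
function selectTombstoneIds(diff, autoTombstone) {
    return autoTombstone === true ? [...diff.actionableIds] : [];
}

// Telemetry payload for one 'reconcile' metric, or null for a clean tenant (nothing is emitted).
// chunksDeleted is the count actually deleted this tick and is forced to 0 in detect-only mode.
function buildReconcileMetric({tenantId, repoSlug, diff, currentVersion, autoTombstone, chunksDeleted = 0}) {
    if (diff.staleCount < 1) {
        return null;
    }

    return {
        tenantId,
        repoSlug,
        eventType    : 'reconcile',
        chunksTotal  : diff.staleCount,
        chunksDeleted: autoTombstone === true ? chunksDeleted : 0,
        detail       : {
            staleCount     : diff.staleCount,
            actionableCount: diff.actionableCount,
            currentVersion,
            autoTombstone
        }
    };
}

// Example with made-up tenant values
const sampleDiff = {staleCount: 3, actionableIds: ['a', 'b'], actionableCount: 2};
const idsToDelete = selectTombstoneIds(sampleDiff, false);
const metric      = buildReconcileMetric({
    tenantId      : 'tenant-a',
    repoSlug      : 'org/repo',
    diff          : sampleDiff,
    currentVersion: 4,
    autoTombstone : false
});

console.log(idsToDelete.length, metric.chunksTotal, metric.chunksDeleted); // → 0 3 0
```

In a daemon tick, the selected ids would be passed to `collection.delete({ids})` inside a try/catch so that a failed delete is logged and the loop moves on to the next tenant, and a non-null payload would then be handed to `recordIngestionMetric`. Both calls are named in the ledger; the glue between them is assumed here.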

Prior Art / Backup-Restore Substrate Cross-References

Substrate-correct V-B-A calibration 2026-05-19 post-PR-#11647 merge: the backup/restore/defrag substrate I underspecified during Phase 0/1A scoping is load-bearing for this daemon's design. Per #10129 Phase 3 peer architecture:

  • buildScripts/ai/backup.mjs — canonical atomic-bundle orchestrator. Layout .neo-ai-data/backups/backup-<ISO-ts>/{kb,mc,graph,concepts,trajectories}/. Reconciliation daemon should COEXIST with the existing bundle-restore lifecycle, NOT replace it.
  • buildScripts/ai/restore.mjs — merge-mode preserve-live semantics already shipped. Graph SQLite uses INSERT OR IGNORE (post-#11141 fix; pre-fix used silently-broken INSERT OR REPLACE). Chroma side uses collection.upsert(); #11144 tracks preserve-live parity follow-up. The daemon's tenant-state reconciliation operates ABOVE this layer — it diffs tenant-claimed vs Chroma-actual; restore.mjs operates on the bundle vs collection axis. Both can coexist.
  • buildScripts/ai/defragChromaDB.mjs — peer of backup.mjs (NOT a delegate). 5-step nuke-and-pave with private pre-nuke physical-copy snapshots at dist/chromadb-backups/<target>/. Operators chain ai:defrag-kb && ai:backup for compacted backups. Phase 4C (#11641 GC daemon) is the closer substrate sibling.
  • Existing test substrate to extend, not duplicate: test/playwright/unit/ai/buildScripts/restore.spec.mjs, restore-hardening.spec.mjs, restore-filters.spec.mjs, backup.spec.mjs.

Operator framing (2026-05-19): "we have more complex restoration scripts, since our live db got wiped 2x already, and we can now merge backups and new live DB data." — merge-mode capability IS prior art, not Phase 4 new substrate. This daemon adds tenant-state reconciliation on top.

Related

  • Parent: #11628
  • Blocked-by: Phase 4A (observability event emission), Phase 2 Epic (ingestion pipeline must exist)
  • Related substrate (cross-references): #10129 atomic-bundle backup orchestrator, #11141 graph preserve-live fix, #11144 Chroma preserve-live parity follow-up
  • Sibling backup primitives: buildScripts/ai/{backup,restore,defragChromaDB}.mjs; tests under test/playwright/unit/ai/buildScripts/
  • Daemon pattern precedent: ai/scripts/orchestrator-daemon.mjs
  • Tombstone contract: Phase 0/1A

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • orchestrator-daemon.mjs scheduling pattern is the architectural reference
  • Reconciliation = claim-vs-actual diff; the substrate-correct sibling pattern to look at is git's index-vs-working-tree-diff (conceptual reference, not code reuse)
tobiu referenced in commit 617c712 - "fix(memory-core): add .claude and .codex to FileSystemIngestor ignorePatterns (#11650) (#11651)" on May 19, 2026, 5:25 PM
tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667)" on May 20, 2026, 8:01 AM
tobiu referenced in commit 90a880d - "feat(ai): KB reconciliation daemon — Phase 4B (#11640) (#11710)" on May 21, 2026, 11:33 AM
tobiu closed this issue on May 21, 2026, 11:33 AM
tobiu referenced in commit d03179a - "feat(ai): stamp ingestedAt on tenant KB chunks (#11712) (#11713)" on May 21, 2026, 11:33 AM
tobiu referenced in commit 82ea006 - "feat(ai): KB garbage-collection daemon — Phase 4C (#11641) (#11715)" on May 21, 2026, 1:19 PM
tobiu referenced in commit 5626d8e - "fix(ai): mark kb-config node visibility:team for offline daemon reads (#11716) (#11717)" on May 21, 2026, 2:01 PM