Context
Sub of Phase 4 Epic #11628 (meta-Epic #11624).
Periodic state-reconciliation daemon — catches missed tombstones, handles force-push detection beyond per-push payload.
The Problem
Phase 0/1A defines tombstone / manifest / revision-boundary deletion-signaling, but production failure modes exist:
- Client hooks fail mid-push (some files pushed, deletes lost)
- Force-push history rewrite (per-push revision-boundary insufficient for branch-rewrite scenarios)
- Tenant config changes invalidate previously-pushed chunks
- Network errors cause partial-push state
Without periodic reconciliation, a tenant's Chroma state silently drifts from the tenant's actual repo state.
The Fix
New daemon: ai/scripts/kb-reconciliation-daemon.mjs (sibling to existing daemons).
Per scheduled tick (configurable; default hourly):
- For each active tenant:
  - Fetch tenant's claimed-state manifest (from KnowledgeBaseTenantConfig Phase 2E + last-received revision boundary)
  - Fetch Chroma's actual-state chunks for the tenant (via where: {tenantId} filter from Phase 0/1D)
  - Diff: chunks in Chroma but not in claimed-state → orphaned (queue for GC or auto-tombstone)
  - Diff: claimed-state paths not represented in Chroma → re-ingestion candidates (notify tenant via observability surface)
- Reconciliation actions:
  - Auto-tombstone orphans older than retention threshold (configurable per-tenant)
  - Emit observability events via Phase 4A
  - Alert operator if drift exceeds threshold (Phase 4D)
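The two diff steps above amount to a set difference in each direction. A minimal sketch, assuming the claimed state is available as a list of repo paths and each Chroma row carries a path field; the function name diffClaimedVsActual and the row shape {id, path} are hypothetical, not part of the ticket. The Contract Ledger further down scopes V1 to a narrower signal because Phase 2 does not persist such a manifest.

```javascript
// Hypothetical helper (not an existing API): computes both diff directions.
// claimedPaths: iterable of repo paths the tenant claims to have pushed.
// actualChunks: array of {id, path} rows read from Chroma (assumed shape).
function diffClaimedVsActual(claimedPaths, actualChunks) {
    const claimed          = new Set(claimedPaths),
          seenPaths        = new Set(),
          orphanedChunkIds = [];

    for (const chunk of actualChunks) {
        seenPaths.add(chunk.path);

        // In Chroma but not claimed: orphan candidate
        if (!claimed.has(chunk.path)) {
            orphanedChunkIds.push(chunk.id);
        }
    }

    // Claimed but with no chunk in Chroma: re-ingestion candidate
    const reingestPaths = [...claimed].filter(path => !seenPaths.has(path));

    return {orphanedChunkIds, reingestPaths};
}

const result = diffClaimedVsActual(
    ['docs/a.md', 'docs/b.md'],
    [{id: 'c1', path: 'docs/a.md'}, {id: 'c2', path: 'docs/gone.md'}]
);
console.log(result);
```

Here chunk c2 is reported as an orphan and docs/b.md as a re-ingestion candidate. Keeping the diff free of I/O makes it easy to unit-test independently of Chroma.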
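Acceptance Criteria
- ai/scripts/kb-reconciliation-daemon.mjs exists; follows existing daemon pattern
- Tick interval configurable (aiConfig.knowledgeBase.reconciliationIntervalMs; default 3600000 = 1h)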
Out of Scope
- Initial observability daemon → Phase 4A
- Operator alerting infrastructure → Phase 4D
- Per-chunk garbage collection scheduling → Phase 4C (separate ticket for GC-specific concerns)
Contract Ledger
Added at intake by @neo-opus-ada (Claude Code) 2026-05-21 — satisfies the ticket-intake §7 Contract Completeness readiness gate (intake comment: https://github.com/neomjs/neo/issues/11640#issuecomment-4504572192). The original-author session is inactive; per ticket-intake §7 the claiming maintainer authors the missing ledger. Tier target: T3 (Explicit Matrix). The ledger is the binding contract; the loose Acceptance Criteria above are refined by these rows.
V1 scope — substrate-grounded. A fresh sweep of the merged Phase 2 code shows the ticket's "claim-vs-actual path-manifest diff" envisions a persisted claimed-state manifest that Phase 2 does not store: KnowledgeBaseTenantConfig carries no path manifest, and revision boundaries (baseRevision / headRevision) are per-push payload parameters, not persisted state. A periodic daemon with no repo access and no stored manifest therefore cannot detect arbitrary drift. The substrate-real V1 reconciliation signal is config-invalidation reconciliation — the tenantConfigVersion chunk stamp (#11637 / VectorService.resolveTenantStamp) compared against the tenant's current getTenantConfig().version. This delivers the ticket's failure-mode #3 ("tenant config changes invalidate previously-pushed chunks") completely; failure-modes #1/#2/#4 (force-push, mid-push, partial-push) are V1.x-deferred — Row 6.
Refined 2026-05-21 per @neo-gpt's #11640 pre-PR peer review: V1 Phase 4D integration is drift-presence/frequency alerting via reconcileEvents (Row 5) — drift-volume threshold alerting needs a chunks_total rollup metric and is V1.x (Row 6).
| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
|---|---|---|---|---|---|
| `aiConfig.knowledgeBase` reconciliation config block: 4 new keys | #11628 Phase 4B; this ticket AC; #11642's `aiConfig.knowledgeBase` precedent | `reconciliationEnabled` (Boolean, default `false`): master opt-in; the daemon exits early when false. `reconciliationIntervalMs` (Number, default `3600000` = 1h): poll-tick interval. `reconciliationAutoTombstone` (Boolean, default `false`): opt-in for the destructive auto-tombstone action; default-off ⇒ the daemon detects + reports only. `reconciliationOrphanVersionGap` (Number, default `2`): a config-stale chunk is auto-tombstone-eligible when `currentVersion − chunk.tenantConfigVersion ≥ this`. | A stale gitignored `config.mjs` predating #11640 has no `knowledgeBase` key (or lacks the reconciliation keys) → every key is read defensively against its default (mirrors #11642's defensive read). | Yes: `ai/config.template.mjs` block + JSDoc | Unit: config-defaulting + the daemon's opt-in gate |
| Config-stale orphan detection: `KbReconciliationEngine` pure core | #11637 `tenantConfigVersion` stamp; `getTenantConfig().version` | The pure, dependency-free diff engine (mirrors #11642's `KbAlertRuleEngine`). `diffTenantChunks({rows, currentVersion, orphanVersionGap})` → a chunk is a config-stale orphan when `typeof metadata.tenantConfigVersion === 'number' && metadata.tenantConfigVersion < currentVersion`; its `versionGap = currentVersion − tenantConfigVersion`; it is actionable when `versionGap ≥ orphanVersionGap`. Returns `{staleOrphans, staleCount, actionableIds, actionableCount}`. No I/O, no clock. | `currentVersion === 0` (tenant on the yaml/default config tier, no graph node) → no chunk can be stale (`v < 0` is never true); zero orphans, no special-case. A chunk with a missing / non-numeric `tenantConfigVersion` (pre-#11637 ingest) → not flagged (fail-safe: never auto-action a chunk we cannot classify). | Yes: JSDoc | Unit: stale-detection, version-gap partition, `currentVersion: 0` no-op, missing-stamp skip |
| The auto-tombstone reconciliation action: `knowledge-base` Chroma collection delete | this ticket AC ("auto-tombstone orphans"); the destructive-action conservatism principle | When `reconciliationAutoTombstone` is `true`, the daemon deletes a tenant's `actionableIds` via `collection.delete({ids})`. The delete is tenant-scoped: `actionableIds` derive only from rows fetched with `where: {tenantId}` (the `getTenantRows` batched-`collection.get` pattern); tenant A's reconciliation never touches tenant B's chunks. | `reconciliationAutoTombstone` is `false` (the default) → no delete is ever issued; the daemon detects + emits telemetry only. A `collection.delete` throw → `logger.error`, best-effort; the daemon continues to the next tenant. | Yes: daemon JSDoc | Unit: delete gated by the opt-in flag; tenant-scoped id set; delete-throw tolerance |
| Phase 4A telemetry emission: `KBRecorderService.recordIngestionMetric` | #11639 / #11665 Phase 4A; `recordIngestionMetric`'s `'reconcile'` event type | When a tenant has ≥ 1 config-stale orphan, the daemon emits exactly one `recordIngestionMetric({tenantId, repoSlug, eventType: 'reconcile', chunksTotal: staleCount, chunksDeleted: <count tombstoned this tick, 0 when auto-tombstone is off>, detail: {staleCount, actionableCount, currentVersion, autoTombstone}})`. A clean tenant (zero orphans) emits nothing, so `reconcileEvents > 0` genuinely means "drift was found". | `recordIngestionMetric` is already best-effort (never throws into the caller). The recorder unavailable → the metric is silently dropped; reconciliation still runs. | Yes: JSDoc | Unit: a `reconcile` metric is emitted for a drifting tenant, suppressed for a clean one |
| Phase 4D alerting integration: telemetry-only seam | #11642 Phase 4D; `KbAlertRuleEngine.KNOWN_METRICS` | The ticket AC "alerts emitted via Phase 4D when drift exceeds threshold" is satisfied with no #11642 code change as drift-presence/frequency alerting. The daemon emits one `reconcile` event per tenant-with-drift per tick; `reconcileEvents` is already a `KbAlertRuleEngine.KNOWN_METRICS` field, so an operator's `aiConfig.knowledgeBase.alertRules` entry on `reconcileEvents` fires when a tenant shows persistent / frequent drift across the alert window. `chunksDeleted` is also `KNOWN_METRICS`-covered: alertable as action volume, but non-zero only when `reconciliationAutoTombstone` is on (`0` in the default detect-only posture). The existing #11642 `KbAlertingService` rolls up the events this daemon emits and fires; the reconciliation daemon does not dispatch alerts itself, so there is no duplication of #11642's channel logic. | Drift-volume thresholding (alert when the stale-chunk count exceeds N) is not available in V1: `getTenantIngestionRollup` does not aggregate `chunks_total`, and `chunksTotal` / `staleCount` are not in `KNOWN_METRICS`. The daemon still records `chunksTotal: staleCount` + a `detail` payload (raw-row and `detail`-visible), so the data is captured, but rollup-aggregated volume alerting needs a `chunks_total` rollup metric + `KNOWN_METRICS` coverage → Row 6 (V1.x). No `alertRules` configured → telemetry is still recorded; only alert dispatch is absent. | Yes: JSDoc cross-ref | Verified by inspection (@neo-gpt #11640 pre-PR peer review, 2026-05-21): `reconcileEvents` ∈ `KNOWN_METRICS`; `getTenantIngestionRollup` SQL has no `chunks_total` aggregation |
| V1 scope boundary: manifest / force-push detection + per-tenant threshold override + drift-volume alerting | this ticket "The Fix" (failure-modes #1/#2/#4) + AC; ticket-intake "challenge prescribed fixes"; @neo-gpt #11640 pre-PR peer review | V1.x-deferred, documented. (a) Manifest / force-push / partial-push orphan detection requires a persisted claimed-state manifest or a persisted last-received revision boundary; neither exists in Phase 2 substrate. (b) A per-tenant `orphanVersionGap` override requires extending `getTenantConfig`'s fixed 8-field projection, a #11637-surface change. (c) Drift-volume threshold alerting requires a `chunks_total` rollup metric in `getTenantIngestionRollup` + `chunksTotal` in `KbAlertRuleEngine.KNOWN_METRICS`, a #11639 / #11642-surface change. V1 ships drift-presence/frequency alerting via `reconcileEvents` (Row 5) and touches zero merged Phase 2 code (purely additive: 3 new files + the `aiConfig` template block). | V1.x is a separate follow-up ticket: it adds the persisted-manifest substrate (or a daemon-side repo walk), the per-tenant override, and the `chunks_total` rollup metric. The PR body files it. | Yes: PR body "Deltas" + a V1.x follow-up ticket | N/A: explicit scope boundary |
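To make ledger Rows 1 to 4 concrete, here is a minimal sketch of how the defensive config read, the pure diff and the gated per-tenant action could fit together. It is an illustration under stated assumptions, not the shipped implementation: readReconciliationConfig and reconcileTenant are hypothetical names, the row shape {id, metadata} is assumed, and collection, recorder and logger are injected stand-ins for the Chroma collection, KBRecorderService and the logger. Only diffTenantChunks, its return fields, collection.delete({ids}) and the recordIngestionMetric payload come from the ledger.

```javascript
// Defaults from Row 1 of the ledger.
const RECONCILIATION_DEFAULTS = Object.freeze({
    reconciliationEnabled         : false,
    reconciliationIntervalMs      : 3600000,
    reconciliationAutoTombstone   : false,
    reconciliationOrphanVersionGap: 2
});

// Hypothetical helper: a missing knowledgeBase block or a wrongly typed key
// falls back to the documented default.
function readReconciliationConfig(aiConfig) {
    const kb  = aiConfig?.knowledgeBase ?? {},
          out = {};
    for (const [key, fallback] of Object.entries(RECONCILIATION_DEFAULTS)) {
        out[key] = typeof kb[key] === typeof fallback ? kb[key] : fallback;
    }
    return out;
}

// Pure diff per Row 2: no I/O, no clock. Row shape {id, metadata} is an assumption.
function diffTenantChunks({rows, currentVersion, orphanVersionGap}) {
    const staleOrphans = [], actionableIds = [];
    for (const row of rows) {
        const stamp = row.metadata?.tenantConfigVersion;
        // Missing, non-numeric or current stamps are never flagged (fail-safe).
        if (typeof stamp !== 'number' || !(stamp < currentVersion)) continue;
        const versionGap = currentVersion - stamp;
        staleOrphans.push({id: row.id, tenantConfigVersion: stamp, versionGap});
        if (versionGap >= orphanVersionGap) actionableIds.push(row.id);
    }
    return {staleOrphans, staleCount: staleOrphans.length, actionableIds, actionableCount: actionableIds.length};
}

// Hypothetical per-tenant step covering Rows 3 and 4; all collaborators are injected.
async function reconcileTenant({tenantId, repoSlug, rows, currentVersion, config, collection, recorder, logger}) {
    const diff = diffTenantChunks({rows, currentVersion, orphanVersionGap: config.reconciliationOrphanVersionGap});
    if (diff.staleCount === 0) return {...diff, chunksDeleted: 0}; // clean tenant: emit nothing

    let chunksDeleted = 0;
    if (config.reconciliationAutoTombstone && diff.actionableCount > 0) {
        try {
            await collection.delete({ids: diff.actionableIds});
            chunksDeleted = diff.actionableCount;
        } catch (err) {
            // Best-effort: log and let the caller move on to the next tenant.
            logger.error(`reconcile delete failed for tenant ${tenantId}`, err);
        }
    }

    // Exactly one 'reconcile' metric per drifting tenant per tick.
    recorder.recordIngestionMetric({
        tenantId, repoSlug,
        eventType  : 'reconcile',
        chunksTotal: diff.staleCount,
        chunksDeleted,
        detail     : {
            staleCount     : diff.staleCount,
            actionableCount: diff.actionableCount,
            currentVersion,
            autoTombstone  : config.reconciliationAutoTombstone
        }
    });
    return {...diff, chunksDeleted};
}
```

With the default config (auto-tombstone off), reconcileTenant never calls collection.delete and reports chunksDeleted: 0, which matches the detect-only posture described in Row 3. Because the rows passed in are expected to come from a where: {tenantId} fetch, the id set handed to delete stays tenant-scoped by construction.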
Prior Art / Backup-Restore Substrate Cross-References
Substrate-correct V-B-A calibration 2026-05-19 post-PR-#11647 merge: the backup/restore/defrag substrate I underspecified during Phase 0/1A scoping is load-bearing for this daemon's design. Per #10129 Phase 3 peer architecture:
- buildScripts/ai/backup.mjs: canonical atomic-bundle orchestrator. Layout .neo-ai-data/backups/backup-<ISO-ts>/{kb,mc,graph,concepts,trajectories}/. Reconciliation daemon should COEXIST with the existing bundle-restore lifecycle, NOT replace it.
- buildScripts/ai/restore.mjs: merge-mode preserve-live semantics already shipped. Graph SQLite uses INSERT OR IGNORE (post-#11141 fix; pre-fix used silently-broken INSERT OR REPLACE). Chroma side uses collection.upsert(); #11144 tracks preserve-live parity follow-up. The daemon's tenant-state reconciliation operates ABOVE this layer: it diffs tenant-claimed vs Chroma-actual; restore.mjs operates on the bundle vs collection axis. Both can coexist.
- buildScripts/ai/defragChromaDB.mjs: peer of backup.mjs (NOT a delegate). 5-step nuke-and-pave with private pre-nuke physical-copy snapshots at dist/chromadb-backups/<target>/. Operators chain ai:defrag-kb && ai:backup for compacted backups. Phase 4C (#11641 GC daemon) is the closer substrate sibling.
- Existing test substrate to extend, not duplicate: test/playwright/unit/ai/buildScripts/restore.spec.mjs, restore-hardening.spec.mjs, restore-filters.spec.mjs, backup.spec.mjs.
Operator framing (2026-05-19): "we have more complex restoration scripts, since our live db got wiped 2x already, and we can now merge backups and new live DB data." — merge-mode capability IS prior art, not Phase 4 new substrate. This daemon adds tenant-state reconciliation on top.
Related
- Parent: #11628
- Blocked-by: Phase 4A (observability event emission), Phase 2 Epic (ingestion pipeline must exist)
- Related substrate (cross-references): #10129 atomic-bundle backup orchestrator, #11141 graph preserve-live fix, #11144 Chroma preserve-live parity follow-up
- Sibling backup primitives: buildScripts/ai/{backup,restore,defragChromaDB}.mjs; tests under test/playwright/unit/ai/buildScripts/
- Daemon pattern precedent: ai/scripts/orchestrator-daemon.mjs
- Tombstone contract: Phase 0/1A
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
- orchestrator-daemon.mjs scheduling pattern is the architectural reference
- Reconciliation = claim-vs-actual diff; the substrate-correct sibling pattern to look at is git's index-vs-working-tree-diff (conceptual reference, not code reuse)