Frontmatter
id: 11641
title: Phase 4C — Stale-Chunk Garbage Collection Daemon: Orphan Detection + Retention Enforcement
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-ada
createdAt: May 19, 2026, 1:57 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11641
author: neo-opus-ada
commentsCount: 3
parentIssue: 11628
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: [x] 11712 Add a server-stamped ingestedAt timestamp to tenant KB chunk metadata, [x] 11640 Phase 4B — Manifest Reconciliation Daemon: Tenant-State vs Chroma-Actual Sync
blocking: []
closedAt: May 21, 2026, 1:19 PM

Phase 4C — Stale-Chunk Garbage Collection Daemon: Orphan Detection + Retention Enforcement

Closed · v13.0.0/archive-v13-0-0-chunk-12 · enhancement · ai · architecture
neo-opus-ada commented on May 19, 2026, 1:57 PM

Context

Sub of Phase 4 Epic #11628 (meta-Epic #11624).

Stale-chunk garbage collection is distinct from reconciliation (Phase 4B). Reconciliation diffs claimed state against actual state; GC enforces the retention policy (time-based, count-based, or version-based expiration).

The Problem

Without active GC:

  • Tombstoned chunks keep inflating the Chroma collection size (Chroma soft-deletes; physical reclaim needs explicit GC)
  • Source-config changes (parser swap, source-type removed) orphan chunks that no longer match tenant config
  • Per-tenant retention policies (e.g., "retain only last 90 days of historical chunks") need enforcement
  • A ChromaDB defrag path already exists (ai:defrag-kb precedent: 10s wall / 321→189MB / 41% reduction on ~10k chunks), but defrag is operator-triggered, not automatic per tenant

The Fix

New daemon: ai/scripts/kb-gc-daemon.mjs (sibling to existing daemons).

Per scheduled tick (configurable; default daily):

  1. For each active tenant:
    • Identify retention-expired chunks (per aiConfig.knowledgeBase.tenantRetention policy)
    • Identify config-orphaned chunks (source-config change leaves chunks unmatched)
    • Identify reconciliation-tombstoned chunks past grace period (Phase 4B integration)
  2. Garbage collection actions:
    • Delete from Chroma (via VectorService.delete({ids}))
    • Emit observability events (Phase 4A)
    • Trigger ai:defrag-kb if cumulative deletion > threshold (10% collection size)
  3. Cross-tenant safety: tenant A's GC cannot touch tenant B's chunks (RLS via Phase 0/1D read-side filter)
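
The three steps above can be sketched as a single tick function. This is a minimal illustration under stated assumptions, not the daemon's actual code: runGcTick and every injected dependency (detectors, deleteChunks, emitEvent, triggerDefrag) are hypothetical names, and each detector is assumed to return chunk ids scoped to the one tenant it is given.

```javascript
// Hypothetical sketch of one GC tick. All dependencies are injected, so the
// loop itself performs no Chroma or filesystem access.
async function runGcTick({
    tenantIds,
    detectors,            // assumed: async tenantId => array of chunk ids for that tenant
    deleteChunks,         // assumed: async ({tenantId, ids}) => void
    emitEvent,            // assumed: async ({tenantId, deletedCount}) => void
    triggerDefrag,        // assumed: async () => void
    collectionSize,
    defragThreshold = 0.10
}) {
    let totalDeleted = 0;

    for (const tenantId of tenantIds) {
        // Step 1: union the ids from every detector, each scoped to this tenant.
        const ids = new Set();

        for (const detect of detectors) {
            for (const id of await detect(tenantId)) {
                ids.add(id);
            }
        }

        if (ids.size === 0) continue;

        // Step 2: delete, then report what was removed for this tenant.
        await deleteChunks({tenantId, ids: [...ids]});
        await emitEvent({tenantId, deletedCount: ids.size});

        totalDeleted += ids.size;
    }

    // Step 2 (continued): defrag only when the tick removed more than the
    // threshold share of the collection.
    const defragTriggered = collectionSize > 0 && totalDeleted / collectionSize > defragThreshold;

    if (defragTriggered) {
        await triggerDefrag();
    }

    return {totalDeleted, defragTriggered};
}
```

Because a detector only ever receives one tenantId and deleteChunks is called once per tenant with that tenant's own id set, the loop has no code path that mixes ids across tenants. The real isolation guarantee still depends on the read-side filter named in step 3.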

Acceptance Criteria

  • ai/scripts/kb-gc-daemon.mjs exists; follows existing daemon pattern
  • Retention policy structure: per-tenant config (time-based / count-based / version-based)
  • Config-orphan detection: chunks whose source-config no longer matches tenant config
  • Tombstone-grace-period integration with Phase 4B
  • Defrag trigger when cumulative deletion > threshold (default 10% collection size)
  • Cross-tenant safety: GC respects RLS; tenant A cannot affect tenant B
  • Configurable tick interval (aiConfig.knowledgeBase.gcIntervalMs; default 86400000 = 24h)
  • Unit tests: retention enforcement, orphan detection, RLS isolation
  • Integration test: tenant pushes content, retention expires, GC removes, defrag triggers
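
For the configurable-interval criterion, the Contract Ledger further down names five gc* keys and their defaults, and asks for each key to be read defensively against its default. A minimal sketch of such a read follows. GC_DEFAULTS and resolveGcConfig are hypothetical names, and the "accept only a value whose type matches the default" rule is an assumption of this sketch rather than something the ticket specifies.

```javascript
// Defaults as listed in the Contract Ledger's GC config row.
const GC_DEFAULTS = Object.freeze({
    gcEnabled        : false,
    gcIntervalMs     : 86400000, // 24h
    gcRetention      : {},
    gcAutoDelete     : false,
    gcDefragThreshold: 0.10
});

// Hypothetical helper: resolves the gc* keys from a (possibly stale)
// knowledgeBase config object, falling back per key to the default.
function resolveGcConfig(knowledgeBase) {
    const resolved = {};

    for (const [key, fallback] of Object.entries(GC_DEFAULTS)) {
        const value = knowledgeBase?.[key];

        // Assumption: a value counts only when its type matches the default's type.
        const usable = value !== null && typeof value === typeof fallback;

        resolved[key] = usable ? value : fallback;
    }

    // Copy so that callers never mutate the shared default object.
    resolved.gcRetention = {...resolved.gcRetention};

    return resolved;
}
```

With this shape, a config file that predates the gc* keys resolves to gcEnabled: false and an empty gcRetention, which matches the conservative defaults the ledger describes.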

Out of Scope

  • Reconciliation logic → Phase 4B
  • Observability event collection → Phase 4A
  • Initial defrag substrate (ai:defrag-kb already exists; this daemon TRIGGERS it conditionally)

Contract Ledger

Added at intake by @neo-opus-ada (Claude Code) 2026-05-21 — satisfies the ticket-intake §7 Contract Completeness gate. The original-author session is inactive; per §7 the claiming maintainer authors the ledger. Tier target: T3 (Explicit Matrix). The ledger is the binding contract; the loose Acceptance Criteria above are refined by these rows.

V1 scope. V1 = retention-policy chunk expiration + physical-reclaim signalling. The intake scope-refinement dropped config-orphan detection — it overlaps #11640's config-invalidation reconciliation (the same tenantConfigVersion signal). Time- and count-based retention are substrate-backed by the ingestedAt chunk stamp (#11712, merged to dev). Full design rationale: the intake-update comment https://github.com/neomjs/neo/issues/11641#issuecomment-4506820763.

Refined 2026-05-21 per @neo-gpt's #11641 pre-branch peer review: explicit OR-expiry semantics, a deterministic count-ordering tie-breaker, {tenantId, repoSlug} count-bucketing, and V1 emits a defrag-recommended signal rather than spawning ai:defrag-kb (Rows 2 + 4 + 6).

Columns: Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence

Row 1
  • Target Surface: aiConfig.knowledgeBase GC config block (5 new keys)
  • Source of Authority: #11628 Phase 4C; this ticket AC; #11640 / #11642's aiConfig.knowledgeBase precedent
  • Proposed Behavior:
    • gcEnabled (Boolean, default false): master opt-in; the daemon exits early when false.
    • gcIntervalMs (Number, default 86400000 = 24h): poll-tick interval.
    • gcRetention (Object, default {}): {maxAgeMs?, maxCount?} retention policy.
    • gcAutoDelete (Boolean, default false): opt-in for the destructive Chroma delete; default-off ⇒ detect + emit telemetry only.
    • gcDefragThreshold (Number, default 0.10): the cumulative-deletion fraction above which the daemon emits a defrag-recommended signal; 0 disables the signal.
  • Fallback / Edge Case: A stale gitignored config.mjs predating #11641 lacks the gc* keys → each is read defensively against its default (the #11640 / #11642 defensive-read pattern). An empty gcRetention {} ⇒ no chunk is ever retention-expired (conservative, per the ticket's "default conservative" handoff hint).
  • Docs: Yes, ai/config.template.mjs block + JSDoc
  • Evidence: Unit: config-defaulting + the daemon opt-in gate

Row 2
  • Target Surface: Retention-expiry classification (KbGarbageCollectionEngine pure core)
  • Source of Authority: #11712 ingestedAt chunk stamp; this ticket's retention AC; @neo-gpt #11641 peer review
  • Proposed Behavior: The pure, dependency-free classifier (mirrors #11640's KbReconciliationEngine). selectExpiredChunks({rows, retention, now}) → a chunk is retention-expired under OR-expiry: expired if time-expired OR count-expired (the union, i.e. the broader set).
    • Time-expiry: typeof metadata.ingestedAt === 'number' && now − ingestedAt > maxAgeMs.
    • Count-expiry: rows are bucketed by {tenantId, repoSlug}, each bucket sorted ingestedAt desc, then chunk id asc (a deterministic tie-breaker, since batch-ingested chunks share an ingestedAt); a chunk ranked at or beyond maxCount within its bucket is count-expired.
    • Returns {expiredIds, expiredCount, evaluatedCount}. No I/O, no clock: the caller passes now.
  • Fallback / Edge Case: A chunk with a missing / non-numeric ingestedAt (a pre-#11712 ingest) is never flagged. This is fail-safe for both time (age uncomputable) and count (unrankable → excluded from the expired set), mirroring #11640's missing-tenantConfigVersion skip. An empty / absent retention policy → empty result.
  • Docs: Yes, JSDoc
  • Evidence: Unit: time-expiry, count-expiry per {tenantId, repoSlug} bucket, the OR-union, the deterministic tie-break on equal ingestedAt, missing-ingestedAt skip, empty-policy no-op

Row 3
  • Target Surface: The destructive GC delete (knowledge-base Chroma collection delete)
  • Source of Authority: this ticket AC ("GC removes"); the destructive-action conservatism principle
  • Proposed Behavior: When gcAutoDelete is true, the daemon deletes a tenant's expiredIds via collection.delete({ids}). Tenant-scoped: expiredIds derive only from rows fetched with where: {tenantId} (the getTenantRows batched-collection.get pattern); tenant A's GC never touches tenant B's chunks (the ticket's RLS-safety AC).
  • Fallback / Edge Case: gcAutoDelete is false (the default) → no delete is ever issued; the daemon detects + emits telemetry only. A collection.delete throw → logger.error, best-effort; the daemon continues to the next tenant.
  • Docs: Yes, daemon JSDoc
  • Evidence: Unit: delete gated by the opt-in flag; tenant-scoped id set; delete-throw tolerance

Row 4
  • Target Surface: Defrag-recommended signal (physical-reclaim observability)
  • Source of Authority: this ticket AC ("trigger defrag"); @neo-gpt #11641 peer review
  • Proposed Behavior: When a tick's cumulative deletion (summed across tenants) exceeds gcDefragThreshold of the collection's chunk count, the daemon emits a defrag-recommended signal (a logger.warn plus a telemetry detail flag), surfacing that an operator should run ai:defrag-kb. V1 does not spawn ai:defrag-kb; see Row 6.
  • Fallback / Edge Case: gcDefragThreshold is 0 → no signal. The daemon spawns no subprocess in V1 → there is no defrag-vs-ingest concurrency surface.
  • Docs: Yes, daemon JSDoc
  • Evidence: Unit: the signal fires when cumulative deletion exceeds the threshold, stays silent below it

Row 5
  • Target Surface: Phase 4A telemetry emission (KBRecorderService.recordIngestionMetric)
  • Source of Authority: #11639 Phase 4A; recordIngestionMetric's 'tombstone' event type
  • Proposed Behavior: When a tenant has ≥ 1 retention-expired chunk, the daemon emits one recordIngestionMetric({tenantId, repoSlug, eventType: 'tombstone', chunksTotal: expiredCount, chunksDeleted: <count deleted this tick; 0 when gcAutoDelete is off>, detail: {expiredCount, deletedCount, retention, gcAutoDelete, defragRecommended}}). A clean tenant (zero expired) emits nothing.
  • Fallback / Edge Case: recordIngestionMetric is best-effort (never throws into the caller). A GC delete is a 'tombstone'-class event: recordIngestionMetric's taxonomy has no dedicated 'gc' type, and 'tombstone' is the honest fit (a logical deletion).
  • Docs: Yes, JSDoc
  • Evidence: Unit: a tombstone metric is emitted for a tenant with expired chunks, suppressed for a clean one

Row 6
  • Target Surface: V1 scope boundary (config-orphan detection · per-tenant retention override · auto-defrag spawn)
  • Source of Authority: this ticket "The Fix" / ACs; the intake scope-refinement; @neo-gpt #11641 peer review
  • Proposed Behavior: Documented V1 deltas.
    • (a) Config-orphan detection is dropped. #11640's KbReconciliationService already detects + opt-in-tombstones config-stale chunks; 4C re-detecting them is double-handling (the intake de-dup).
    • (b) Per-tenant retention override: V1 applies one global gcRetention per tenant; a per-tenant override needs extending getTenantConfig's fixed projection (#11637 surface) → V1.x.
    • (c) Auto-defrag spawn: V1 emits a defrag-recommended signal only (Row 4); the automated ai:defrag-kb spawn + its defrag-vs-ingest concurrency-coordination story are V1.x (auto-spawning a whole-collection nuke-and-pave from a poll-loop daemon is a separable, concurrency-sensitive design).
  • Fallback / Edge Case: V1.x is a separate follow-up ticket; the PR body "Deltas" documents each.
  • Docs: Yes, PR body "Deltas" + a V1.x follow-up
  • Evidence: N/A (explicit scope boundary)
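
The retention-expiry rules in the ledger's classifier row (OR-expiry, {tenantId, repoSlug} bucketing, the id tie-breaker, the missing-ingestedAt skip) can be illustrated with a small pure function. This is a sketch under assumptions, not the shipped KbGarbageCollectionEngine: the row shape {id, metadata}, the reading of evaluatedCount as "rows passed in", and the rejection of negative or non-integer maxCount values are all choices made here for illustration.

```javascript
// Sketch of the classifier contract. Pure: no I/O, and `now` is passed in.
// Assumed row shape: {id: string, metadata: {tenantId, repoSlug, ingestedAt}}.
function selectExpiredChunks({rows = [], retention, now}) {
    const {maxAgeMs, maxCount} = retention || {};
    const hasAge   = typeof maxAgeMs === 'number';
    const hasCount = Number.isInteger(maxCount) && maxCount >= 0; // assumption: ignore invalid counts
    const expired  = new Set();

    // Fail-safe: a row without a numeric ingestedAt is never flagged,
    // neither by age nor by rank.
    const rankable = rows.filter(row => typeof row.metadata?.ingestedAt === 'number');

    if (hasAge) {
        for (const row of rankable) {
            if (now - row.metadata.ingestedAt > maxAgeMs) {
                expired.add(row.id);
            }
        }
    }

    if (hasCount) {
        // Bucket by the {tenantId, repoSlug} pair.
        const buckets = new Map();

        for (const row of rankable) {
            const key = JSON.stringify([row.metadata.tenantId, row.metadata.repoSlug]);

            if (!buckets.has(key)) buckets.set(key, []);
            buckets.get(key).push(row);
        }

        for (const bucket of buckets.values()) {
            // Newest first; equal timestamps fall back to ascending chunk id.
            bucket.sort((a, b) =>
                b.metadata.ingestedAt - a.metadata.ingestedAt ||
                (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)
            );

            // Everything ranked at or beyond maxCount is count-expired.
            for (const row of bucket.slice(maxCount)) {
                expired.add(row.id);
            }
        }
    }

    // The Set gives the OR-union of both rules without duplicates.
    return {
        expiredIds    : [...expired],
        expiredCount  : expired.size,
        evaluatedCount: rows.length // assumption: counts every row passed in
    };
}
```

The id tie-breaker matters because a batch ingest stamps many chunks with the same ingestedAt. Without it, which of those chunks survives a maxCount cut would depend on the order Chroma happened to return the rows in.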

Prior Art / Defrag-Backup Substrate Cross-References

Substrate-correct V-B-A calibration 2026-05-19: per #10129 Phase 3 peer architecture, defragChromaDB.mjs and backup.mjs are peer scripts with orthogonal responsibilities, NOT delegates. Phase 4C extends/triggers existing defrag substrate:

  • buildScripts/ai/defragChromaDB.mjs — 5-step "Nuke and Pave": (1) Pre-Nuke Snapshot via fs.copy() to dist/chromadb-backups/<target>/backup-<numeric-ts>/ (HNSW state preserved); (2) Extract all collections to in-memory; (3) Nuke collections via API; (4) Load (recreate + reinsert; forces HNSW rebuild); (5) Cleanup orphan UUID directories. Existing retention: keep last 3, delete others older than 7 days.
  • buildScripts/ai/backup.mjs — JSONL bundle peer. Operators chain ai:defrag-kb && ai:backup for compacted bundles. Phase 4C daemon triggers defrag automatically based on cumulative-deletion threshold.
  • Tenant-aware GC requires read-side filter — Phase 0/1D's where: {tenantId} filter on QueryService ensures cross-tenant safety; the GC daemon enumerates orphans WITHIN a tenant's scope, never cross-tenant.
  • Existing test substrate to extend, not duplicate: test/playwright/unit/ai/buildScripts/backup.spec.mjs (defrag-trigger logic; retention pattern); KB DatabaseService.backup.spec.mjs (export/import lifecycle).
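
The snapshot retention rule quoted above for defragChromaDB.mjs ("keep last 3, delete others older than 7 days") is easy to misread, so here is a small sketch of one reading of it: a snapshot is pruned only if it is outside the newest three and also older than seven days. selectBackupsToPrune is a hypothetical name, and treating <numeric-ts> as epoch milliseconds is an assumption; the script's actual implementation may differ.

```javascript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical helper: given snapshot directory names of the form
// backup-<numeric-ts>, returns the names that the retention rule would prune.
// Assumption: <numeric-ts> is an epoch timestamp in milliseconds.
function selectBackupsToPrune(dirNames, now, {keepLast = 3, maxAgeMs = SEVEN_DAYS_MS} = {}) {
    const backups = dirNames
        .map(name => ({name, match: /^backup-(\d+)$/.exec(name)}))
        .filter(entry => entry.match !== null)            // ignore unrelated entries
        .map(entry => ({name: entry.name, ts: Number(entry.match[1])}))
        .sort((a, b) => b.ts - a.ts);                     // newest first

    return backups
        .slice(keepLast)                                  // the newest N are always kept
        .filter(backup => now - backup.ts > maxAgeMs)     // the rest only when old enough
        .map(backup => backup.name);
}
```

Under this reading a fourth snapshot that is only a few days old survives, which keeps recent recovery points available even after a burst of defrag runs.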

Related

  • Parent: #11628
  • Blocked-by: Phase 4A (observability), Phase 4B (reconciliation tombstones)
  • Existing defrag substrate: npm run ai:defrag-kb (memory anchor: 10s wall / 321→189MB / 41% reduction on ~10k chunks); peer script buildScripts/ai/defragChromaDB.mjs
  • Related substrate: #10129 atomic-bundle backup peer architecture
  • RLS substrate: Phase 0/1D read-side filter (cross-tenant GC safety)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • ai:defrag-kb script is the existing defrag substrate to integrate with
  • Retention policies are deployment-specific; default conservative (long retention) acceptable for V1

tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667)" on May 20, 2026, 8:01 AM
tobiu referenced in commit d03179a - "feat(ai): stamp ingestedAt on tenant KB chunks (#11712) (#11713)" on May 21, 2026, 11:33 AM
tobiu referenced in commit 82ea006 - "feat(ai): KB garbage-collection daemon — Phase 4C (#11641) (#11715)" on May 21, 2026, 1:19 PM
tobiu closed this issue on May 21, 2026, 1:19 PM