Context
Graduated from Discussion #11677 — low-blast, @neo-gpt [GRADUATION_APPROVED] (DC_kwDODSospM4BA0NO). Surfaced 2026-05-20 from the #11631 tenant-aware-chunk-ID re-embed: for the entire ~5h rebuild the live knowledge-base collection was gutted to near-empty and ask_knowledge_base returned degraded/incomplete results with no signal to the querying agent. #11677 is the completeness sibling to Discussion #11676 (the contention axis).
The Problem
VectorService.embed, on a full re-embed triggered by a chunk-ID-derivation change, computes idsToDelete = existingIds − allIds and runs collection.delete(idsToDelete) before the embed loop. When the chunk-ID formula changes (e.g. #11631's tenant-aware IDs fold {tenantId, repoSlug} into the id), every chunk's id changes → idsToDelete = the entire old corpus. The live collection drops to ~zero, then refills batch-by-batch over hours. For that whole window the KB is incomplete and ask_knowledge_base silently degrades.
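The set difference described above can be reproduced in a few lines. This is a minimal sketch, not the VectorService code: `chunkIdV1`, `chunkIdV2`, and the sample chunk fields are hypothetical stand-ins for the old and the tenant-aware ID formulas.

```javascript
// Hypothetical ID formulas: v1 keys on path and index only, v2 folds in tenant and repo.
const chunkIdV1 = chunk => `${chunk.path}#${chunk.index}`;
const chunkIdV2 = chunk => `${chunk.tenantId}/${chunk.repoSlug}/${chunk.path}#${chunk.index}`;

const chunks = [
    {tenantId: 't1', repoSlug: 'org/repo', path: 'docs/a.md', index: 0},
    {tenantId: 't1', repoSlug: 'org/repo', path: 'docs/a.md', index: 1},
    {tenantId: 't1', repoSlug: 'org/repo', path: 'docs/b.md', index: 0}
];

// existingIds: what the live collection holds, written under the old formula.
const existingIds = chunks.map(chunkIdV1);
// allIds: what the current run wants to write, computed with the new formula.
const allIds = new Set(chunks.map(chunkIdV2));

// idsToDelete = existingIds − allIds
const idsToDelete = existingIds.filter(id => !allIds.has(id));

console.log(`${idsToDelete.length} of ${existingIds.length} existing ids are stale`);
```

Because no v2 id can equal a v1 id, the difference is the full set of existing ids, which is why the up-front delete empties the collection before the first batch is embedded.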
Empirical anchor: the #11631 re-embed — 06:32:10 Deleted 24545 stale chunks, then ~493 batches over ~5h (operator-confirmed complete 2026-05-20T13:19Z, 24623 items).
The Architectural Reality
- VectorService.embed (ai/services/knowledge-base/VectorService.mjs): the up-front collection.delete plus the batch embed loop. PR #11678 (#11633) already added a deleteStale opt, the first parameterization of the stale-delete behavior; this ticket evolves that surface.
- KB ChromaManager (ai/services/knowledge-base/ChromaManager.mjs): memoizes the knowledge-base collection handle (_knowledgeBaseCollectionPromise / knowledgeBaseCollection, :129) with no proper cache-invalidation method (only an ad-hoc external null-assignment inside VectorService.deleteCollection).
- chromadb@3.3.1: exposes collection.modify({name}) (in-place rename, chromadb.d.ts:1640) and collection.fork({name}) (full copy, :1653), but no single atomic-swap primitive.
The Fix
Evolve VectorService.embed's stale-handling option into a staleStrategy for full-corpus re-embeds — 'delete-upfront' (today's delete-all-then-rebuild; the default, backward-compatible) | 'shadow-swap' — while preserving the skip-the-full-sweep behavior PR #11678's incremental-ingestion caller relies on.
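As a rough illustration of the option surface, the strategy could be resolved and validated up front. This is a sketch under assumptions: `resolveStaleStrategy` and `STALE_STRATEGIES` are hypothetical names, not existing code.

```javascript
// Only strategies with implemented behavior are listed (no speculative 'incremental').
const STALE_STRATEGIES = Object.freeze(['delete-upfront', 'shadow-swap']);

/**
 * Hypothetical helper: resolves the strategy from embed's opts.
 * @param {Object} [opts]
 * @param {String} [opts.staleStrategy='delete-upfront']
 * @returns {String}
 */
function resolveStaleStrategy({staleStrategy = 'delete-upfront'} = {}) {
    if (!STALE_STRATEGIES.includes(staleStrategy)) {
        throw new TypeError(`Unknown staleStrategy: ${staleStrategy}`);
    }
    return staleStrategy;
}

console.log(resolveStaleStrategy());                               // default
console.log(resolveStaleStrategy({staleStrategy: 'shadow-swap'})); // explicit opt-in
```

How the existing deleteStale opt maps onto this is deliberately left out of the sketch, since the ticket treats that coexistence as an implementation decision.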
Implement 'shadow-swap' (Discussion #11677's recommended Option 3):
- Build the re-embedded corpus in a fresh shadow collection: create it and embed the new-id corpus. Do not use collection.fork(), which would copy the stale corpus.
- Swap via a 2-step collection.modify({name}): rename the live knowledge-base → a parking name, then the shadow → knowledge-base. (There is no single atomic primitive, so the 2-step rename leaves a sub-second window with no canonical-name holder; bounded + handled.)
- Add a proper cache-invalidation method to the KB ChromaManager (it nulls _knowledgeBaseCollectionPromise / knowledgeBaseCollection), replacing the ad-hoc poke; the swap orchestration calls it post-rename, in every process holding a KB handle (lazy re-resolve on next call).
The live collection serves the old (complete, consistent) corpus untouched throughout; the swap is near-instantaneous.
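The steps above can be sketched end to end against an in-memory stand-in. Everything here is an assumption made for illustration: `FakeClient` and `FakeCollection` only mimic the call shapes the sketch needs, and `shadowSwap`, the shadow/parking naming scheme, and the rollback in the `catch` branch are one possible shape rather than the prescribed implementation.

```javascript
// In-memory stand-ins for a Chroma client and collection. Only the call shapes
// used by the sketch are mimicked; none of this is the real chromadb client.
class FakeCollection {
    constructor(registry, name) {
        this.registry = registry;
        this.name = name;
        this.ids = new Set();
    }
    async add({ids}) {
        ids.forEach(id => this.ids.add(id));
    }
    async count() {
        return this.ids.size;
    }
    async modify({name}) {
        if (this.registry.has(name)) throw new Error(`Collection ${name} already exists`);
        this.registry.delete(this.name);
        this.registry.set(name, this);
        this.name = name;
    }
}

class FakeClient {
    constructor() {
        this.registry = new Map();
    }
    async createCollection({name}) {
        if (this.registry.has(name)) throw new Error(`Collection ${name} already exists`);
        const collection = new FakeCollection(this.registry, name);
        this.registry.set(name, collection);
        return collection;
    }
    async getCollection({name}) {
        if (!this.registry.has(name)) throw new Error(`Collection ${name} not found`);
        return this.registry.get(name);
    }
}

// Hypothetical orchestration: build the shadow, then swap names in two steps.
async function shadowSwap(client, {liveName, batches, invalidate}) {
    const suffix = Date.now();
    const shadow = await client.createCollection({name: `${liveName}-shadow-${suffix}`});

    // Slow phase: the live collection keeps serving the old corpus untouched.
    for (const batch of batches) {
        await shadow.add(batch);
    }

    const live = await client.getCollection({name: liveName});

    await live.modify({name: `${liveName}-parked-${suffix}`}); // step 1: nobody holds liveName now
    try {
        await shadow.modify({name: liveName});                 // step 2: the shadow becomes canonical
    } catch (error) {
        await live.modify({name: liveName});                   // one possible handling: undo step 1
        throw error;
    }

    invalidate(); // drop memoized handles so readers re-resolve lazily
    return {live: shadow, parked: live};
}

async function demo() {
    const client = new FakeClient();
    const old = await client.createCollection({name: 'knowledge-base'});
    await old.add({ids: ['old-1', 'old-2']});

    let invalidated = false;
    const {parked} = await shadowSwap(client, {
        liveName: 'knowledge-base',
        batches: [{ids: ['new-1']}, {ids: ['new-2', 'new-3']}],
        invalidate: () => { invalidated = true; }
    });

    const live = await client.getCollection({name: 'knowledge-base'});
    return {liveCount: await live.count(), parkedCount: await parked.count(), invalidated};
}

demo().then(result => console.log(result));
```

The sketch leaves the parked collection in place after the swap; when and whether to drop it is not covered here.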
Contract Ledger Matrix
| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| VectorService.embed(knowledgeBasePath, {staleStrategy}), which evolves the deleteStale opt added by PR #11678 | Discussion #11677 (graduated; OQ1a/OQ1b/OQ2/OQ3 [RESOLVED_TO_AC]) | staleStrategy: 'delete-upfront' \| 'shadow-swap'. Default 'delete-upfront' = current behavior. 'shadow-swap' builds a fresh shadow collection + modify()-swaps it. | PR #11678's KnowledgeBaseIngestionService incremental-push caller (which skips the full-corpus stale sweep) keeps working; exact deleteStale→staleStrategy coexistence is an implementation decision, and the AC pins behavior-preservation. | JSDoc on embed's opts. | Unit tests per staleStrategy value; integration test of a 'shadow-swap' re-embed. |
| KB ChromaManager cache-invalidation method (new) | This ticket; #11677 OQ1b [RESOLVED_TO_AC] | A method nulling _knowledgeBaseCollectionPromise / knowledgeBaseCollection; the swap calls it post-rename. | Absent → readers serve the stale pre-swap handle (the current gap). | JSDoc. | Unit test: invalidation nulls the memoized handle; next getKnowledgeBaseCollection() re-resolves. |
Discussion Criteria Mapping
Per ideation-sandbox-workflow.md §6.6 — mapping Discussion #11677's [RESOLVED_TO_AC] resolutions to this ticket's ACs:
- #11677 OQ1a (Chroma rename primitive) → the swap uses collection.modify({name}); the shadow is built fresh, not fork()ed; the 2-step rename's sub-second no-canonical-name window is bounded + handled.
- #11677 OQ1b (cached-reader cutover) → a proper KB-ChromaManager cache-invalidation method; the swap triggers it post-rename, cross-process via lazy re-resolve. Scope-bounded (GPT graduation guardrail): invalidate only the KB collection handle, not the Memory Core manager's memory/summary/graph caches. If a cross-server invalidation bus / shared QoS arbiter / daemon orchestration / MCP contract mutation turns out to be necessary, that is scope expansion → split into a new ticket, not part of this low-blast ticket.
- #11677 OQ2 (mixed-id ranking skew) → N/A under 'shadow-swap' (no coexistence window). It would apply only if a future ticket implements an 'incremental' strategy.
- #11677 OQ3 (trigger scope) → the strategy applies to every full-corpus re-embed trigger (chunk-ID formula, content-hash, chunk-boundary, parser version, parsed-chunk-v1 schema, tenant-stamp shape), not just chunk-ID-formula changes.
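The ranking skew that OQ2 refers to can be illustrated with a toy accumulation. This is a simplified stand-in for per-source score accumulation, not the QueryService code; `scoreBySource` and the sample hits are hypothetical.

```javascript
// Hypothetical query hits: each row carries the source it was chunked from and a similarity score.
function scoreBySource(hits) {
    const totals = new Map();
    for (const {source, score} of hits) {
        totals.set(source, (totals.get(source) ?? 0) + score);
    }
    // Highest accumulated score first.
    return [...totals.entries()].sort((a, b) => b[1] - a[1]);
}

const steady = [
    {id: 'a.md#0', source: 'a.md', score: 0.6},
    {id: 'b.md#0', source: 'b.md', score: 0.8}
];

// Mid-migration under an incremental strategy: a.md has both its old-id and new-id chunk present.
const coexisting = [...steady, {id: 't1/org/repo/a.md#0', source: 'a.md', score: 0.6}];

console.log(scoreBySource(steady)[0][0]);     // b.md
console.log(scoreBySource(coexisting)[0][0]); // a.md
```

With a shadow collection the reader only ever sees one generation of ids, so the doubled row for a.md never appears.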
Acceptance Criteria
- VectorService.embed accepts staleStrategy: 'delete-upfront' | 'shadow-swap'; 'delete-upfront' is the default and preserves current behavior.
- PR #11678's KnowledgeBaseIngestionService incremental-push caller (skip-the-full-sweep) keeps working (behavior-preserved).
- 'shadow-swap' builds a fresh shadow collection, then 2-step collection.modify({name})-swaps it to knowledge-base; the live collection is never gutted.
- The KB ChromaManager gains a proper cache-invalidation method; the swap calls it post-rename; readers re-resolve the new collection.
- Unit tests cover each staleStrategy value and the ChromaManager invalidation method.
- Integration test: a 'shadow-swap' re-embed leaves the live collection queryable throughout.
Out of Scope
- Discussion #11676's contention axis (the shared QoS arbiter / pending-memory worker): separate high-blast Epic.
- A full 'incremental' strategy implementation: 'shadow-swap' is this ticket's prescribed strategy; 'incremental' is not added as a speculative enum value (truth-in-code: enum values reflect implemented behavior only).
- Scope-expansion guardrail (GPT graduation condition): a cross-server invalidation bus, a shared QoS arbiter, daemon/process orchestration, or an MCP contract mutation. If implementation needs any of those, split into a new ticket and reclassify; do not expand this low-blast ticket.
Avoided Traps
- Option 2 (incremental delete-after-add) as the strategy: rejected in favor of 'shadow-swap'. Incremental has a window where old + new chunks of a source coexist, and QueryService (QueryService.mjs:188) accumulates score per source-keyed metadata row → transient ranking skew (#11677 OQ2). Shadow-swap has no coexistence window. Per truth-in-code, 'incremental' is not even added as an unimplemented enum value.
- collection.fork() to build the shadow: rejected, because fork() copies the stale corpus; the shadow must be the new corpus, built fresh.
Related
- Discussion #11677: graduated source (OQ1a/OQ1b/OQ2/OQ3 [RESOLVED_TO_AC]; @neo-gpt [GRADUATION_APPROVED]).
- Discussion #11676: sibling friction (the contention axis); complementary, not a substitute.
- PR #11678 / #11633: added VectorService.embed(..., {deleteStale}), the surface this ticket evolves.
- ADR 0003 (Chroma Topology Unified Only): the shadow collection is a second collection in the same unified daemon; no topology mutation.
Origin Session ID
c4505aed-b48d-4aed-ba11-a1db410744df
Handoff Retrieval Hints
- query_raw_memories: "11677 shadow-collection swap VectorService.embed staleStrategy KB re-embed"
- Code anchors: VectorService.mjs embed (the up-front collection.delete); ChromaManager.mjs:129 (_knowledgeBaseCollectionPromise); chromadb@3.3.1 collection.modify / fork.