Frontmatter
id: 11683
title: KB re-embed: shadow-collection swap, not in-place gut-and-rebuild
state: Closed
labels: enhancement, ai, architecture
assignees: neo-gpt
createdAt: May 20, 2026, 4:08 PM
updatedAt: May 20, 2026, 4:59 PM
githubUrl: https://github.com/neomjs/neo/issues/11683
author: neo-opus-ada
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 20, 2026, 4:59 PM

KB re-embed: shadow-collection swap, not in-place gut-and-rebuild

Closed v13.0.0/archive-v13-0-0-chunk-12 enhancement ai architecture
neo-opus-ada commented on May 20, 2026, 4:08 PM

Context

Graduated from Discussion #11677: low-blast, @neo-gpt [GRADUATION_APPROVED] (DC_kwDODSospM4BA0NO). Surfaced 2026-05-20 from the #11631 tenant-aware-chunk-ID re-embed: for the entire ~5h rebuild, the live knowledge-base collection was gutted to near-empty and ask_knowledge_base returned degraded, incomplete results with no signal to the querying agent. #11677 is the completeness sibling to Discussion #11676 (the contention axis).

The Problem

VectorService.embed, on a full re-embed triggered by a chunk-ID-derivation change, computes idsToDelete = existingIds − allIds and runs collection.delete(idsToDelete) before the embed loop. When the chunk-ID formula changes (e.g. #11631's tenant-aware IDs fold {tenantId, repoSlug} into the id), every chunk's id changes → idsToDelete = the entire old corpus. The live collection drops to ~zero, then refills batch-by-batch over hours. For that whole window the KB is incomplete and ask_knowledge_base silently degrades.
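
To make the failure mode concrete, here is a minimal sketch of the set difference described above. The two ID formulas and the sample chunks are hypothetical stand-ins for illustration; only the idsToDelete = existingIds − allIds relation comes from the ticket.

```javascript
// Hypothetical chunk-ID formulas, for illustration only.
const oldChunkId = chunk => `${chunk.path}#${chunk.index}`;
const newChunkId = chunk => `${chunk.tenantId}/${chunk.repoSlug}/${chunk.path}#${chunk.index}`;

// idsToDelete = existingIds − allIds
function computeIdsToDelete(existingIds, allIds) {
    const keep = new Set(allIds);
    return existingIds.filter(id => !keep.has(id));
}

// Sample corpus (made-up values).
const chunks = [
    {tenantId: 't1', repoSlug: 'neomjs/neo', path: 'src/Button.mjs', index: 0},
    {tenantId: 't1', repoSlug: 'neomjs/neo', path: 'src/Button.mjs', index: 1},
    {tenantId: 't1', repoSlug: 'neomjs/neo', path: 'learn/Intro.md', index: 0}
];

// IDs currently stored in the live collection, derived with the old formula.
const existingIds = chunks.map(oldChunkId);

// Re-embed with the same formula: nothing is stale.
const unchangedSweep = computeIdsToDelete(existingIds, chunks.map(oldChunkId));

// Re-embed after the formula changed: every stored ID is stale.
const changedSweep = computeIdsToDelete(existingIds, chunks.map(newChunkId));

console.log(unchangedSweep.length);                      // 0
console.log(changedSweep.length === existingIds.length); // true
```

Because the delete runs before the embed loop, the second case empties the live collection before a single new chunk has been written.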

Empirical anchor: the #11631 re-embed — 06:32:10 Deleted 24545 stale chunks, then ~493 batches over ~5h (operator-confirmed complete 2026-05-20T13:19Z, 24623 items).

The Architectural Reality

  • VectorService.embed (ai/services/knowledge-base/VectorService.mjs): the up-front collection.delete plus the batch embed loop. PR #11678 (#11633) already added a deleteStale opt, the first parameterization of the stale-delete behavior; this ticket evolves that surface.
  • KB ChromaManager (ai/services/knowledge-base/ChromaManager.mjs) memoizes the knowledge-base collection handle (_knowledgeBaseCollectionPromise / knowledgeBaseCollection, :129) with no proper cache-invalidation method (only an ad-hoc external null-assignment inside VectorService.deleteCollection).
  • chromadb@3.3.1 exposes collection.modify({name}) (in-place rename, chromadb.d.ts:1640) and collection.fork({name}) (full copy, :1653) — but no single atomic-swap primitive.
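
The memoization gap in the second bullet can be illustrated with a small in-memory model. This is a sketch, not the real ChromaManager: FakeClient is a stand-in for the Chroma client, and it assumes a resolved handle stays bound to the collection it was resolved from rather than to the name.

```javascript
// In-memory stand-in for a Chroma client; only name lookup is modelled.
class FakeClient {
    constructor() { this.collections = []; }

    async getOrCreateCollection({name}) {
        let collection = this.collections.find(c => c.name === name);
        if (!collection) {
            collection = {id: this.collections.length + 1, name};
            this.collections.push(collection);
        }
        return collection;
    }
}

// Simplified model of the memoizing manager (field name taken from the ticket).
class KbChromaManager {
    constructor(client) {
        this.client = client;
        this._knowledgeBaseCollectionPromise = null;
    }

    getKnowledgeBaseCollection() {
        if (!this._knowledgeBaseCollectionPromise) {
            this._knowledgeBaseCollectionPromise =
                this.client.getOrCreateCollection({name: 'knowledge-base'});
        }
        return this._knowledgeBaseCollectionPromise;
    }
}

async function demo() {
    const client  = new FakeClient();
    const manager = new KbChromaManager(client);
    const before  = await manager.getKnowledgeBaseCollection();

    // Simulate a rename done elsewhere: park the live collection, promote another one.
    before.name = 'knowledge-base-parked';
    const shadow = await client.getOrCreateCollection({name: 'knowledge-base-shadow'});
    shadow.name = 'knowledge-base';

    const cached = await manager.getKnowledgeBaseCollection(); // memoized: still the parked collection
    manager._knowledgeBaseCollectionPromise = null;            // today's ad-hoc external poke
    const fresh  = await manager.getKnowledgeBaseCollection(); // re-resolved by name

    return {cachedName: cached.name, freshName: fresh.name};
}

demo().then(result => console.log(result));
// { cachedName: 'knowledge-base-parked', freshName: 'knowledge-base' }
```

Without an invalidation step, every reader that resolved the handle before the rename keeps querying the parked collection.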

The Fix

Evolve VectorService.embed's stale-handling option into a staleStrategy for full-corpus re-embeds — 'delete-upfront' (today's delete-all-then-rebuild; the default, backward-compatible) | 'shadow-swap' — while preserving the skip-the-full-sweep behavior PR #11678's incremental-ingestion caller relies on.
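
One way the two options could coexist is sketched below. The ticket leaves the exact deleteStale / staleStrategy interplay as an implementation decision, so the rule here (deleteStale === false skips the sweep, otherwise staleStrategy picks how the sweep runs), the default of deleteStale, and the helper name resolveStaleHandling are all assumptions.

```javascript
// Only implemented strategies are listed; 'incremental' is deliberately absent.
const STALE_STRATEGIES = ['delete-upfront', 'shadow-swap'];

// Hypothetical helper: normalizes embed() opts into a single decision.
// Assumption: deleteStale defaults to true and deleteStale === false means
// "skip the full-corpus sweep" (the incremental-ingestion caller's path).
function resolveStaleHandling({deleteStale = true, staleStrategy = 'delete-upfront'} = {}) {
    if (!STALE_STRATEGIES.includes(staleStrategy)) {
        throw new RangeError(`Unknown staleStrategy: ${staleStrategy}`);
    }
    if (deleteStale === false) {
        return {sweep: false, strategy: null};
    }
    return {sweep: true, strategy: staleStrategy};
}

console.log(resolveStaleHandling());
// { sweep: true, strategy: 'delete-upfront' }
console.log(resolveStaleHandling({deleteStale: false}));
// { sweep: false, strategy: null }
console.log(resolveStaleHandling({staleStrategy: 'shadow-swap'}));
// { sweep: true, strategy: 'shadow-swap' }
```

Under this rule an existing caller that passes only {deleteStale: false} sees no behavior change, and a caller that passes nothing still gets today's delete-upfront sweep.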

Implement 'shadow-swap' (Discussion #11677's recommended Option 3):

  1. Build the re-embedded corpus in a fresh shadow collection (create + embed the new-id corpus — not collection.fork(), which would copy the stale corpus).
  2. Swap via a 2-step collection.modify({name}): rename the live knowledge-base → a parking name, then the shadow → knowledge-base. (No single atomic primitive — the 2-step rename leaves a sub-second window with no canonical-name holder; bounded + handled.)
  3. Add a proper cache-invalidation method to the KB ChromaManager (nulls _knowledgeBaseCollectionPromise / knowledgeBaseCollection), replacing the ad-hoc poke; the swap orchestration calls it post-rename, in every process holding a KB handle (lazy re-resolve on next call).

The live collection serves the old (complete, consistent) corpus untouched throughout; the swap is near-instantaneous.
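
The three steps above can be sketched end to end against an in-memory fake. FakeClient and FakeCollection are stand-ins whose method names mirror the calls named in this ticket (collection.modify({name}), plus create/get/add/count); the shadowSwap function, its options, and the parking/shadow name suffixes are hypothetical, and embedding generation is omitted.

```javascript
// In-memory stand-ins for the Chroma client and collection; behaviour is simplified.
class FakeCollection {
    constructor(client, name) { this.client = client; this.name = name; this.items = new Map(); }
    async add({ids, documents}) { ids.forEach((id, i) => this.items.set(id, documents[i])); }
    async count() { return this.items.size; }
    async modify({name}) {
        if (this.client.collections.some(c => c !== this && c.name === name)) {
            throw new Error(`Collection name already in use: ${name}`);
        }
        this.name = name;
    }
}

class FakeClient {
    constructor() { this.collections = []; }
    async createCollection({name}) {
        if (this.collections.some(c => c.name === name)) throw new Error(`Collection exists: ${name}`);
        const collection = new FakeCollection(this, name);
        this.collections.push(collection);
        return collection;
    }
    async getCollection({name}) {
        const collection = this.collections.find(c => c.name === name);
        if (!collection) throw new Error(`Collection not found: ${name}`);
        return collection;
    }
}

const LIVE = 'knowledge-base';

// Hypothetical orchestration; `invalidate` stands for the new ChromaManager method.
async function shadowSwap(client, chunks, {batchSize = 2, invalidate = () => {}} = {}) {
    const suffix = Date.now();

    // 1. Build the new-ID corpus in a fresh shadow collection (not a fork of the live one).
    const shadow = await client.createCollection({name: `${LIVE}-shadow-${suffix}`});
    for (let i = 0; i < chunks.length; i += batchSize) {
        const batch = chunks.slice(i, i + batchSize);
        await shadow.add({ids: batch.map(c => c.id), documents: batch.map(c => c.text)});
    }

    // 2. Two-step rename: live -> parking name, then shadow -> canonical name.
    const live = await client.getCollection({name: LIVE});
    await live.modify({name: `${LIVE}-parked-${suffix}`});
    await shadow.modify({name: LIVE});

    // 3. Drop memoized handles so readers re-resolve the canonical name.
    invalidate();
    return {parked: live, promoted: shadow};
}

async function demo() {
    const client = new FakeClient();
    const old    = await client.createCollection({name: LIVE});
    await old.add({ids: ['old-1', 'old-2'], documents: ['a', 'b']});

    let invalidated = 0;
    const newChunks = [
        {id: 't1/new-1', text: 'a'}, {id: 't1/new-2', text: 'b'}, {id: 't1/new-3', text: 'c'}
    ];
    const {parked} = await shadowSwap(client, newChunks, {invalidate: () => invalidated++});
    const current  = await client.getCollection({name: LIVE});

    return {liveCount: await current.count(), parkedCount: await parked.count(), invalidated};
}

demo().then(result => console.log(result));
// { liveCount: 3, parkedCount: 2, invalidated: 1 }
```

The old corpus stays intact under the parking name throughout. A real implementation would likely also need to restore the canonical name if the second rename fails; that rollback is not shown here.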

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| VectorService.embed(knowledgeBasePath, {staleStrategy}); evolves the deleteStale opt added by PR #11678 | Discussion #11677 (graduated; OQ1a/OQ1b/OQ2/OQ3 [RESOLVED_TO_AC]) | staleStrategy: 'delete-upfront' or 'shadow-swap'. Default 'delete-upfront' = current behavior. 'shadow-swap' builds a fresh shadow collection and modify()-swaps it. | PR #11678's KnowledgeBaseIngestionService incremental-push caller (which skips the full-corpus stale sweep) keeps working. The exact deleteStale / staleStrategy coexistence is an implementation decision; the AC pins behavior-preservation. | JSDoc on embed's opts. | Unit tests per staleStrategy value; integration test of a 'shadow-swap' re-embed. |
| KB ChromaManager cache-invalidation method (new) | This ticket; #11677 OQ1b [RESOLVED_TO_AC] | A method nulling _knowledgeBaseCollectionPromise / knowledgeBaseCollection; the swap calls it post-rename. | Absent → readers serve the stale pre-swap handle (the current gap). | JSDoc. | Unit test: invalidation nulls the memoized handle; next getKnowledgeBaseCollection() re-resolves. |

Discussion Criteria Mapping

Per ideation-sandbox-workflow.md §6.6 — mapping Discussion #11677's [RESOLVED_TO_AC] resolutions to this ticket's ACs:

  • #11677 OQ1a (Chroma rename primitive) → the swap uses collection.modify({name}); the shadow is built fresh, not fork()ed; the 2-step rename's sub-second no-canonical-name window is bounded + handled.
  • #11677 OQ1b (cached-reader cutover) → a proper KB-ChromaManager cache-invalidation method; the swap triggers it post-rename, cross-process via lazy re-resolve. Scope-bounded (GPT graduation guardrail): invalidate only the KB collection handle — not the Memory Core manager's memory/summary/graph caches; if a cross-server invalidation bus / shared QoS arbiter / daemon orchestration / MCP contract mutation turns out necessary, that is scope expansion → split into a new ticket, not part of this low-blast ticket.
  • #11677 OQ2 (mixed-id ranking skew) → N/A under 'shadow-swap' (no coexistence window). It would apply only if a future ticket implements an 'incremental' strategy.
  • #11677 OQ3 (trigger scope) → the strategy applies to every full-corpus re-embed trigger — chunk-ID formula, content-hash, chunk-boundary, parser version, parsed-chunk-v1 schema, tenant-stamp shape — not just chunk-ID-formula changes.

Acceptance Criteria

  • VectorService.embed accepts staleStrategy: 'delete-upfront' | 'shadow-swap'; 'delete-upfront' is the default and preserves current behavior.
  • PR #11678's KnowledgeBaseIngestionService incremental-push caller (skip-the-full-sweep) keeps working — behavior-preserved.
  • 'shadow-swap' builds a fresh shadow collection, then 2-step collection.modify({name})-swaps it to knowledge-base; the live collection is never gutted.
  • The KB ChromaManager gains a proper cache-invalidation method; the swap calls it post-rename; readers re-resolve the new collection.
  • The 2-step-swap sub-second no-canonical-name window is bounded + handled (read-pause, or accepted with rationale).
  • The re-embed strategy applies to all full-corpus re-embed triggers (per Discussion-Criteria-Mapping OQ3).
  • Unit tests: each staleStrategy value; the ChromaManager invalidation method.
  • Integration test: a 'shadow-swap' re-embed leaves the live collection queryable throughout.
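
For the sub-second no-canonical-name window, one possible reader-side handling is a short, bounded retry on name resolution. This is only a sketch of that option: resolveWithRetry, its parameters, and the flaky resolver used to exercise it are hypothetical, and the ticket equally allows a read-pause or accepting the window with rationale.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical guard: retry name resolution briefly while the canonical name has no holder.
// `resolve` is any async function that throws when the name cannot be resolved.
// Note: this sketch retries on every error; a real guard would retry only on "not found".
async function resolveWithRetry(resolve, {attempts = 5, delayMs = 50} = {}) {
    let lastError;
    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            return await resolve();
        } catch (error) {
            lastError = error;
            if (attempt < attempts) await sleep(delayMs);
        }
    }
    throw lastError;
}

// Test double: fails `failures` times, then resolves.
function makeFlakyResolver(failures) {
    let calls = 0;
    return async () => {
        calls++;
        if (calls <= failures) throw new Error('Collection not found: knowledge-base');
        return {name: 'knowledge-base', resolvedOnCall: calls};
    };
}

resolveWithRetry(makeFlakyResolver(2), {delayMs: 10})
    .then(collection => console.log(collection.resolvedOnCall)); // 3
```

With the defaults shown, a reader waits at most about 200 ms (four delays of 50 ms) before surfacing the error, which keeps the window bounded rather than hiding a genuinely missing collection.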

Out of Scope

  • Discussion #11676's contention axis (the shared QoS arbiter / pending-memory worker) — separate high-blast Epic.
  • A full 'incremental' strategy implementation — 'shadow-swap' is this ticket's prescribed strategy; 'incremental' is not added as a speculative enum value (truth-in-code: enum values reflect implemented behavior only).
  • Scope-expansion guardrail (GPT graduation condition): a cross-server invalidation bus, a shared QoS arbiter, daemon/process orchestration, or an MCP contract mutation — if implementation needs any of those, split into a new ticket and reclassify; do not expand this low-blast ticket.

Avoided Traps

  • Option 2 (incremental delete-after-add) as the strategy — rejected in favor of 'shadow-swap': incremental has a window where old + new chunks of a source coexist, and QueryService (QueryService.mjs:188) accumulates score per source-keyed metadata row → transient ranking skew (#11677 OQ2). Shadow-swap has no coexistence window. Per truth-in-code, 'incremental' is not even added as an unimplemented enum value.
  • collection.fork() to build the shadow — rejected: fork() copies the stale corpus; the shadow must be the new corpus, built fresh.
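
The ranking skew behind the first trap can be shown with a toy model of per-source score accumulation. This is a simplified illustration, not the actual QueryService logic; the sources, IDs, and scores are made up.

```javascript
// Toy model: sum hit scores per source, then rank sources by total.
function rankSources(hits) {
    const totals = new Map();
    for (const {source, score} of hits) {
        totals.set(source, (totals.get(source) || 0) + score);
    }
    return [...totals.entries()]
        .sort((a, b) => b[1] - a[1])
        .map(([source, total]) => ({source, total}));
}

// Steady state: one matching chunk per source.
const steady = [
    {id: 'a#0', source: 'guides/A.md', score: 0.6},
    {id: 'b#0', source: 'guides/B.md', score: 0.8}
];

// Mid-migration under an incremental strategy: A's new-ID chunk has been added,
// but its old-ID chunk is not deleted yet, so A is counted twice.
const mixed = [...steady, {id: 't1/a#0', source: 'guides/A.md', score: 0.6}];

console.log(rankSources(steady)[0].source); // guides/B.md
console.log(rankSources(mixed)[0].source);  // guides/A.md
```

The better match (B) loses the top spot only because A is transiently duplicated. Under 'shadow-swap' readers see either the complete old corpus or the complete new one, so this mixed state never occurs.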

Related

  • Discussion #11677 — graduated source (OQ1a/OQ1b/OQ2/OQ3 [RESOLVED_TO_AC]; @neo-gpt [GRADUATION_APPROVED]).
  • Discussion #11676 — sibling friction (the contention axis); complementary, not a substitute.
  • PR #11678 / #11633 — added VectorService.embed(..., {deleteStale}), the surface this ticket evolves.
  • ADR 0003 — Chroma Topology Unified Only — the shadow collection is a second collection in the same unified daemon; no topology mutation.

Origin Session ID

c4505aed-b48d-4aed-ba11-a1db410744df

Handoff Retrieval Hints

  • query_raw_memories: "11677 shadow-collection swap VectorService.embed staleStrategy KB re-embed"
  • Code anchors: VectorService.mjs embed (the up-front collection.delete); ChromaManager.mjs:129 (_knowledgeBaseCollectionPromise); chromadb@3.3.1 collection.modify / fork.
tobiu referenced in commit 391cb26 - "feat(kb): add shadow-swap re-embed strategy (#11683) (#11684)" on May 20, 2026, 4:59 PM
tobiu closed this issue on May 20, 2026, 4:59 PM