Frontmatter
id: 11685
title: Harden KB shadow-swap before activation
state: Closed
labels: bug, ai, testing, architecture
assignees: neo-gpt
createdAt: May 20, 2026, 4:57 PM
updatedAt: May 20, 2026, 7:11 PM
githubUrl: https://github.com/neomjs/neo/issues/11685
author: neo-gpt
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 20, 2026, 7:11 PM

Harden KB shadow-swap before activation

Closed v13.0.0/archive-v13-0-0-chunk-12 bug ai testing architecture
neo-gpt
neo-gpt commented on May 20, 2026, 4:57 PM

Context

PR #11684 implements the #11683 Knowledge Base re-embed staleStrategy: 'shadow-swap' path. Claude-family review cycle 1 approved the PR as merge-eligible, but surfaced three non-blocking follow-ups that should be tracked before operators or scripts activate the new strategy broadly.

Evidence checked before filing:

  • Live PR state for #11684: reviewDecision: APPROVED, CI green, state open.
  • Duplicate sweep: live GitHub searches for "shadow-swap", "staleStrategy", and "getOrCreateCollection shadow collection" found only parent #11683 and no dedicated hardening issue.
  • Local content sweep across resources/content/issues and resources/content/pulls found no active shadow-swap follow-up beyond #11683.
  • KB ticket semantic sweep for the query "shadow-swap staleStrategy getOrCreateCollection follow-up ticket no canonical window orphan shadow collection" did not surface an equivalent ticket.

The Problem

The shadow-swap path is currently opt-in and not wired into the normal KB sync entry point, so #11684 can merge without blocking. Before activation, the promote path needs hardening around two correctness hazards, the promote-window collision (items 1 and 2) and the pre-promote shadow leak (item 3), plus one explicit activation gap (item 4):

  1. Two-rename promote window: embedViaShadowSwap() renames live canonical -> parking, then shadow -> canonical. During the interval where no collection has the canonical name, another process or cold-cache read can call getKnowledgeBaseCollection().
  2. getOrCreateCollection collision risk: ChromaManager.getKnowledgeBaseCollection() uses Chroma getOrCreateCollection(). If called during the no-canonical-name window, it can create an empty canonical collection. The later shadow rename can then collide, and rollback can also collide with the empty canonical.
  3. Pre-promote leak hygiene: if embedChunks() fails after the shadow collection is created but before any rename, the current path can leave an orphaned shadow collection.
  4. Activation gap: #11684 introduces the strategy, but DatabaseService.embedKnowledgeBase() still calls VectorService.embed(aiConfig.dataPath, {viaMcp}) without staleStrategy, so the original #11677 friction is not operationally closed until an entry point opts in.
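
The collision described in items 1 and 2 can be reproduced in isolation. The following is a minimal sketch, not the repository's code: FakeClient and FakeCollection are hypothetical in-memory stand-ins for a Chroma-like client, the collection names are made up, and the fake assumes that renaming onto an already-taken name is rejected.

```javascript
// Hypothetical in-memory stand-ins; these are NOT the real Chroma client classes.
class FakeCollection {
    constructor(client, name) {
        this.client = client;
        this.name   = name;
        this.count  = 0; // number of stored records, for illustration only
    }

    // Rename the collection. Assumption: a rename onto an existing name is rejected.
    async modify({name}) {
        if (this.client.collections.has(name)) {
            throw new Error(`Collection "${name}" already exists`);
        }
        this.client.collections.delete(this.name);
        this.name = name;
        this.client.collections.set(name, this);
    }
}

class FakeClient {
    constructor() {
        this.collections = new Map();
    }

    async createCollection({name}) {
        if (this.collections.has(name)) {
            throw new Error(`Collection "${name}" already exists`);
        }
        const collection = new FakeCollection(this, name);
        this.collections.set(name, collection);
        return collection;
    }

    // Returns the existing collection, or silently creates an empty one.
    async getOrCreateCollection({name}) {
        return this.collections.get(name) ?? await this.createCollection({name});
    }
}

async function reproduceNameGapCollision() {
    const client    = new FakeClient();
    const canonical = 'knowledge-base'; // assumed name, for illustration

    const live = await client.createCollection({name: canonical});
    live.count = 100;
    const shadow = await client.createCollection({name: `${canonical}-shadow`});
    shadow.count = 120;

    // Rename 1: live canonical -> parking. No collection holds the canonical name now.
    await live.modify({name: `${canonical}-parking`});

    // A cold-cache read lands inside the gap and creates an EMPTY canonical collection.
    const intruder = await client.getOrCreateCollection({name: canonical});

    // Rename 2 (shadow -> canonical) and the rollback (parking -> canonical) both collide.
    let promoteError = null, rollbackError = null;
    try { await shadow.modify({name: canonical}); } catch (e) { promoteError = e; }
    try { await live.modify({name: canonical}); } catch (e) { rollbackError = e; }

    return {
        intruderCount: intruder.count,
        promoteError,
        rollbackError,
        names: [...client.collections.keys()].sort()
    };
}
```

In this sketch the canonical name ends up bound to the empty intruder collection while the real data sits under the parking and shadow names, which is the stranded state the ticket is concerned with.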

The Architectural Reality

  • ai/services/knowledge-base/VectorService.mjs:343 defines embedViaShadowSwap({liveCollection, knowledgeBase, idsToDeleteCount}).
  • ai/services/knowledge-base/VectorService.mjs:349-362 creates the shadow collection, embeds the full corpus, then promotes via live -> parking and shadow -> canonical renames.
  • ai/services/knowledge-base/VectorService.mjs:380-389 invalidates cache and attempts rollback only after liveParked is true.
  • ai/services/knowledge-base/ChromaManager.mjs:128-135 resolves the canonical KB collection with client.getOrCreateCollection({name: aiConfig.collectionName, ...}).
  • ai/services/knowledge-base/DatabaseService.mjs:519-520 keeps the normal KB embed entry point on the default strategy because it does not pass staleStrategy.

This is a KB service-layer hardening ticket. It should not reopen the broad Memory Core/Chroma contention design in Discussion #11676, and it should not mutate the MCP server tool shape.

The Fix

Harden shadow-swap before activation:

  1. Add deterministic protection for the no-canonical-name promote window. Acceptable shapes include a scoped promotion marker/lock, an internal promote-aware canonical resolver, avoiding getOrCreateCollection() during known promote windows, or another repo-consistent service-layer guard. The fix must prove that an empty canonical collection cannot strand the KB during promote.
  2. Add tested cleanup or explicit parking semantics for a shadow collection created before embedChunks() fails.
  3. Add an explicit activation path only after the hardening tests pass. The likely owner is the KB bulk re-embed/sync path that reaches DatabaseService.embedKnowledgeBase() / VectorService.embed().
  4. Preserve operator-gated destructive behavior. If cleanup requires deletion, tests must show the guard is constrained to safe/test-owned shadow artifacts, not arbitrary canonical data.
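
One of the acceptable shapes named in item 1, a promote-aware canonical resolver with a scoped promotion marker, might look roughly like this. It is a sketch under stated assumptions: PromoteAwareResolver, withPromotion(), and resolveCanonical() are hypothetical names, the client is injected and only needs a getOrCreateCollection({name}) method, and the marker is in-process only, so it does not by itself protect against a second process resolving the canonical name.

```javascript
// Hypothetical service-layer guard; names and shape are illustrative, not the repo's API.
class PromoteAwareResolver {
    constructor(client, canonicalName) {
        this.client        = client;        // anything exposing getOrCreateCollection({name})
        this.canonicalName = canonicalName;
        this.promoting     = false;         // in-process promotion marker
    }

    // Resolve the canonical collection, but never create it while a promote is running.
    async resolveCanonical() {
        if (this.promoting) {
            throw new Error(
                `Promotion in progress: refusing to get-or-create "${this.canonicalName}"`
            );
        }
        return this.client.getOrCreateCollection({name: this.canonicalName});
    }

    // Run the two-rename promote inside a scoped marker that is always cleared.
    async withPromotion(promote) {
        if (this.promoting) {
            throw new Error('A promotion is already running');
        }
        this.promoting = true;
        try {
            return await promote();
        } finally {
            this.promoting = false;
        }
    }
}
```

Failing loudly inside the window, rather than waiting or retrying, keeps the resolver from ever creating an empty canonical collection. A cross-process variant would need the marker persisted somewhere all readers consult, which this sketch does not attempt.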

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| VectorService.embedViaShadowSwap() promote transaction | #11683, PR #11684 review follow-up | Promotion cannot create or collide with an empty canonical collection during the live -> parking / shadow -> canonical interval | Fail loudly with recoverable parking/shadow names and no silent empty canonical | JSDoc near promote path | Unit or integration test forcing cold-cache canonical resolve between renames |
| Shadow collection lifecycle | #11684 review follow-up | A failed pre-promote embed does not leave an untracked orphan shadow collection | Explicitly parked/deleted test-owned shadow artifact with logged recovery handle | JSDoc/comment only if behavior is non-obvious | Unit test with embedChunks() failure before rename |
| KB sync activation path | #11677 -> #11683 lineage | A caller can intentionally opt into staleStrategy: 'shadow-swap' only after hardening is present | Default remains non-shadow-swap until the activation caller is explicit | PR body / service JSDoc for the opted-in entry point | Targeted unit + dockerized integration evidence |

Acceptance Criteria

  • A test forces a cold-cache getKnowledgeBaseCollection() call between liveCollection.modify({name: parkingName}) and shadowCollection.modify({name: aiConfig.collectionName}) and proves no empty canonical collection strands the KB.
  • The promote path is hardened so the no-canonical-name interval cannot silently create a conflicting canonical collection, or else fails loudly with operator-recoverable parking/shadow identifiers.
  • A failure in embedChunks() before promotion cleans up or explicitly parks the created shadow collection with tested, constrained semantics.
  • The normal/default KB sync path does not switch to shadow-swap until the promote-window and leak-hygiene hardening are green.
  • At least one explicit activation caller or config path can pass staleStrategy: 'shadow-swap' after hardening, with behavior documented at the owning service boundary.
  • Targeted unit tests and dockerized Chroma integration coverage prove both the cold-cache collision case and the pre-promote failure case.
  • PR #11684 / #11683 closure remains interpretable as "strategy implemented", not "strategy broadly activated"; the activation boundary is linked back to this ticket.
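
The pre-promote failure criterion could be exercised against a helper along these lines. This is a hedged sketch, not the shipped fix: embedIntoShadow() and the SHADOW_PREFIX naming convention are hypothetical, and the injected client is only assumed to expose createCollection({name}) and deleteCollection({name}).

```javascript
// Hypothetical naming convention that marks a collection as a disposable shadow artifact.
const SHADOW_PREFIX = 'kb-shadow-';

// Create a shadow collection, embed into it, and clean it up if embedding fails
// before any rename has happened. Deletion is constrained to shadow-prefixed names.
async function embedIntoShadow({client, shadowName, embedChunks}) {
    if (!shadowName.startsWith(SHADOW_PREFIX)) {
        throw new Error(`Refusing to manage non-shadow collection "${shadowName}"`);
    }

    const shadow = await client.createCollection({name: shadowName});

    try {
        await embedChunks(shadow);
        return shadow;
    } catch (embedError) {
        try {
            await client.deleteCollection({name: shadowName});
        } catch (cleanupError) {
            // Cleanup failed: surface a recovery handle instead of hiding the orphan.
            embedError.orphanedShadow = shadowName;
        }
        throw embedError;
    }
}
```

The prefix check runs before anything is created, so the delete in the failure branch can only ever target a collection this call created under a shadow name, never canonical or parking data. The original embed error is always rethrown so the failure stays visible to the operator.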

Out of Scope

  • Reopening the broader Memory Core lightweight-operation resilience work from Discussion #11676.
  • Changing MCP tool schemas or OpenAPI/YAML surfaces.
  • Introducing a new Chroma topology.
  • Making shadow-swap the default strategy before the hardening criteria above pass.

Avoided Traps

  • Do not paper over this with retry-only behavior. The failure mode is a semantic collision created by getOrCreateCollection() during a name-gap, not ordinary transient Chroma flakiness.
  • Do not solve it by deleting arbitrary canonical/parking collections. Deletion must remain constrained and auditable.
  • Do not treat #11684 approval as proof of broad activation readiness. The review explicitly approved the implementation with follow-up, not default rollout.

Related

  • Parent implementation ticket: #11683
  • Implementation PR: #11684
  • Source ideation lineage: Discussion #11677
  • Sibling broader resilience discussion: Discussion #11676
  • Review handoff: PR #11684 Claude-family review cycle 1 (PRR_kwDODSospM8AAAABAhDLAA)

Origin Session ID: 019e44ba-d309-7e91-a819-36911fbf4e10

Handoff Retrieval Hints:

  • query_raw_memories("PR 11684 shadow-swap approved follow-up no canonical window orphan shadow activation")
  • query_raw_memories("Harden KB shadow-swap before activation")
  • GitHub search: shadow-swap staleStrategy getOrCreateCollection
tobiu referenced in commit 8711d05 - "fix(kb): harden shadow-swap promotion (#11685) (#11686)" on May 20, 2026, 7:11 PM
tobiu closed this issue on May 20, 2026, 7:11 PM