Frontmatter
id: 8341
title: Robust Knowledge Base Delta Updates (Hash-based IDs + Deterministic Sorting)
state: Closed
labels: bug, enhancement, ai
assignees: tobiu
createdAt: Jan 5, 2026, 11:27 PM
updatedAt: Jan 5, 2026, 11:36 PM
githubUrl: https://github.com/neomjs/neo/issues/8341
author: tobiu
commentsCount: 1
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: Jan 5, 2026, 11:36 PM

Robust Knowledge Base Delta Updates (Hash-based IDs + Deterministic Sorting)

Closed v11.19.0 bug enhancement ai
tobiu commented on Jan 5, 2026, 11:27 PM

The current Knowledge Base embedding logic relies on positional IDs (id_${index}) and OS-dependent file ordering (fs.readdir).

The Problem:

  1. Nondeterminism: fs.readdir does not guarantee a sort order; the result depends on the OS and filesystem. If the order changes (e.g., on a different OS), indices shift, causing a full re-embedding.
  2. Cascading Invalidations: Inserting a single document shifts the index of all subsequent documents. This forces a re-upload of potentially the entire dataset, defeating the purpose of delta updates.
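To make the cascade concrete, here is a minimal sketch. The chunk objects and their hash values are hypothetical, and countChanged is an illustrative helper, not code from VectorService.mjs. It assigns positional IDs the way id_${index} would and counts how many records no longer match the stored state after one document is inserted at the front.

```javascript
// Hypothetical chunks; `hash` stands in for the content hash each chunk carries.
const storedChunks = [
    {hash: 'a1f3', content: 'Button docs'},
    {hash: 'c9d2', content: 'Grid docs'},
    {hash: 'e771', content: 'Tree docs'}
];

// One new document lands at the front (a new file, or a different readdir order).
const nextChunks = [{hash: '0b5e', content: 'Accordion docs'}, ...storedChunks];

// Positional scheme: the record ID is derived from the array index.
const withPositionalIds = chunks =>
    chunks.map((chunk, index) => ({id: `id_${index}`, hash: chunk.hash}));

// Count records whose ID now points at different content (or at nothing).
function countChanged(storedRecords, nextRecords) {
    const stored = new Map(storedRecords.map(record => [record.id, record.hash]));
    return nextRecords.filter(record => stored.get(record.id) !== record.hash).length;
}

const changedCount = countChanged(
    withPositionalIds(storedChunks),
    withPositionalIds(nextChunks)
);

console.log(changedCount); // 4
```

Only one document is new, yet all four records look changed because every index moved by one.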

The Solution:

  1. Hash-based IDs: Update VectorService.mjs to use chunk.hash as the ChromaDB Record ID instead of id_${index}. This makes IDs content-addressable and position-independent.
  2. Deterministic Sorting: Enforce alphabetical sorting in all Source extractors (ApiSource, TicketSource, ReleaseNotesSource, TestSource, LearningSource). This ensures the generated ai-knowledge-base.jsonl file remains stable in Git history.
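Both parts of the solution can be sketched in a few lines. The file names, the chunk shape, and the helper names sortNames and toRecord are assumptions for illustration; the real extractors may sort with a different comparator. The default Array sort compares UTF-16 code units, so it is not locale-dependent, though it orders uppercase before lowercase rather than strictly alphabetically.

```javascript
// Assumption: the same directory as fs.readdir might list it on two machines.
const listingA = ['Tree.md', 'Button.md', 'Grid.md'];
const listingB = ['Grid.md', 'Tree.md', 'Button.md'];

// Copy, then sort by UTF-16 code units so the output order is machine-independent.
const sortNames = names => [...names].sort();

// Hypothetical record shape: the ID is the content hash, never the position.
const toRecord = chunk => ({id: chunk.hash, document: chunk.content});

console.log(sortNames(listingA)); // [ 'Button.md', 'Grid.md', 'Tree.md' ]
console.log(sortNames(listingB)); // [ 'Button.md', 'Grid.md', 'Tree.md' ]
console.log(toRecord({hash: 'c9d2', content: 'Grid docs'}).id); // c9d2
```

Sorting keeps the generated file stable across runs, while the hash-based ID keeps each record stable even if the order does change.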

Implementation Plan:

  1. Refactor VectorService.embed() to index existing documents by ID (which will now be the hash).
  2. Update Sources to sort files/keys before writing.
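Once IDs are hashes, the delta can be computed with plain set membership. The sketch below is one possible shape for that step, not the actual VectorService.embed() code: diffByHash and the sample data are hypothetical, and removing stale IDs is an assumption that the issue does not spell out.

```javascript
// existingIds: record IDs already in the collection (now content hashes).
// chunks: the freshly generated knowledge base entries, each with a `hash`.
function diffByHash(existingIds, chunks) {
    const existing = new Set(existingIds);
    const incoming = new Set(chunks.map(chunk => chunk.hash));

    return {
        // New or modified content: its hash is not stored yet.
        toAdd: chunks.filter(chunk => !existing.has(chunk.hash)),
        // Assumption: stored hashes that no longer appear are stale and can go.
        toDelete: existingIds.filter(id => !incoming.has(id))
    };
}

const delta = diffByHash(
    ['a1f3', 'c9d2', 'e771'],
    [
        {hash: '0b5e', content: 'Accordion docs'}, // newly inserted document
        {hash: 'a1f3', content: 'Button docs'},
        {hash: 'c9d2', content: 'Grid docs'}
    ]
);

console.log(delta.toAdd.map(chunk => chunk.hash)); // [ '0b5e' ]
console.log(delta.toDelete); // [ 'e771' ]
```

The position of the new document does not matter here: only the one unseen hash needs embedding.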
tobiu added the bug label on Jan 5, 2026, 11:27 PM
tobiu added the enhancement label on Jan 5, 2026, 11:27 PM
tobiu added the ai label on Jan 5, 2026, 11:27 PM
tobiu assigned to @tobiu on Jan 5, 2026, 11:34 PM
tobiu referenced in commit 6100d61 - "Enhancement: Robust Knowledge Base Delta Updates #8341" on Jan 5, 2026, 11:36 PM
tobiu commented on Jan 5, 2026, 11:36 PM

Input from Gemini 3 Pro:

✦ I have implemented the proposed changes:

  1. VectorService.mjs:

    • Switched from positional IDs (id_${index}) to Hash-based IDs (chunk.hash). This ensures that inserting a new document doesn't invalidate the IDs of all subsequent documents.
    • Implemented batched fetching (limit: 2000) for retrieving existing IDs from ChromaDB to ensure scalability with large collections (addressing the 10k+ document scenario).
    • Simplified the diffing logic to check for hash existence directly.
  2. Source Extractors:

    • ApiSource.mjs: Added sorting by name.
    • TicketSource.mjs: Added sorting for both release version directories and ticket files.
    • ReleaseNotesSource.mjs: Added sorting for release note files.
    • TestSource.mjs: Added sorting for directory entries.

These changes ensure that the Knowledge Base generation is deterministic and that updates are true deltas, minimizing unnecessary re-embeddings.
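The batched fetching described above might look roughly like the following sketch. The get({limit, offset}) call shape is an assumption modeled on the ChromaDB client, and fakeCollection is a hypothetical in-memory stand-in so the example runs without a database; only the batch size of 2000 comes from the comment.

```javascript
const BATCH_SIZE = 2000;

// Page through the collection until a page comes back shorter than the batch size.
async function fetchAllIds(collection, batchSize = BATCH_SIZE) {
    const ids = [];

    for (let offset = 0; ; offset += batchSize) {
        const page = await collection.get({limit: batchSize, offset});
        ids.push(...page.ids);

        if (page.ids.length < batchSize) {
            break; // last (possibly empty) page reached
        }
    }

    return ids;
}

// Hypothetical stand-in for a collection holding `total` records.
function fakeCollection(total) {
    const all = Array.from({length: total}, (_, i) => `hash_${i}`);

    return {
        callCount: 0,
        async get({limit, offset}) {
            this.callCount++;
            return {ids: all.slice(offset, offset + limit)};
        }
    };
}

fetchAllIds(fakeCollection(4500)).then(ids => console.log(ids.length)); // 4500
```

With 4,500 records this takes three requests (2000, 2000, 500) instead of one unbounded fetch.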

tobiu closed this issue on Jan 5, 2026, 11:36 PM