Frontmatter
| id | 8341 |
| title | Robust Knowledge Base Delta Updates (Hash-based IDs + Deterministic Sorting) |
| state | Closed |
| labels | bug, enhancement, ai |
| assignees | tobiu |
| createdAt | Jan 5, 2026, 11:27 PM |
| updatedAt | Jan 5, 2026, 11:36 PM |
| githubUrl | https://github.com/neomjs/neo/issues/8341 |
| author | tobiu |
| commentsCount | 1 |
| parentIssue | null |
| subIssues | [] |
| subIssuesCompleted | 0 |
| subIssuesTotal | 0 |
| blockedBy | [] |
| blocking | [] |
| closedAt | Jan 5, 2026, 11:36 PM |
Robust Knowledge Base Delta Updates (Hash-based IDs + Deterministic Sorting)
tobiu assigned to @tobiu on Jan 5, 2026, 11:34 PM

tobiu
Jan 5, 2026, 11:36 PM
Input from Gemini 3 Pro:
✦ I have implemented the proposed changes:
VectorService.mjs:
- Switched from positional IDs (id_${index}) to hash-based IDs (chunk.hash). This ensures that inserting a new document doesn't invalidate the IDs of all subsequent documents.
- Implemented batched fetching (limit: 2000) for retrieving existing IDs from ChromaDB to ensure scalability with large collections (addressing the 10k+ document scenario).
- Simplified the diffing logic to check for hash existence directly.
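The batched fetch and the hash-existence check listed above could look roughly like the sketch below. This is not the actual VectorService.mjs code: fetchExistingIds, diffByHash and the fetchPage(limit, offset) callback are hypothetical names, and fetchPage only stands in for whatever paged read the ChromaDB client offers. Only the hash field on chunks and the batch size of 2000 are taken from the comment.

```javascript
// Hypothetical sketch: page through existing record IDs, then diff by hash.
const BATCH_SIZE = 2000; // batch limit mentioned in the comment above

// fetchPage(limit, offset) is an assumed callback that resolves to an array
// of record IDs; it stands in for the real ChromaDB client call.
async function fetchExistingIds(fetchPage, batchSize = BATCH_SIZE) {
    const ids = new Set();
    let offset = 0;

    while (true) {
        const page = await fetchPage(batchSize, offset);
        page.forEach(id => ids.add(id));

        // A short (or empty) page means the collection is exhausted.
        if (page.length < batchSize) {
            break;
        }
        offset += batchSize;
    }
    return ids;
}

// Because record IDs are content hashes, the diff is a set-membership check.
function diffByHash(chunks, existingIds) {
    // Keep the first chunk per hash so identical content is only added once.
    const byHash = new Map();
    for (const chunk of chunks) {
        if (!byHash.has(chunk.hash)) {
            byHash.set(chunk.hash, chunk);
        }
    }

    return {
        toAdd   : [...byHash.values()].filter(chunk => !existingIds.has(chunk.hash)),
        toDelete: [...existingIds].filter(id => !byHash.has(id))
    };
}
```

Under these assumptions, unchanged chunks appear in neither list, so only new or modified content would be sent for embedding.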
Source Extractors:
- ApiSource.mjs: Added sorting by name.
- TicketSource.mjs: Added sorting for both release version directories and ticket files.
- ReleaseNotesSource.mjs: Added sorting for release note files.
- TestSource.mjs: Added sorting for directory entries.
These changes ensure that the Knowledge Base generation is deterministic and that updates are true deltas, minimizing unnecessary re-embeddings.
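To illustrate the sorting step, here is a minimal sketch rather than the extractors' actual code. sortEntries and sortByName are hypothetical helpers, and the input arrays stand in for whatever fs.readdir returned on a given machine. The sketch compares strings by code unit instead of using localeCompare, on the assumption that the order should not depend on the machine's locale.

```javascript
// Hypothetical helpers; the real extractors may sort differently.

// Sort directory entry names (e.g. the result of fs.readdir) in a
// locale-independent way. The default Array.prototype.sort comparator
// compares strings by UTF-16 code units, so the result is the same on
// every OS and locale.
function sortEntries(names) {
    return [...names].sort(); // copy first, so the caller's array is untouched
}

// Sort objects (e.g. API entries) by their name property, same ordering rule.
function sortByName(items) {
    return [...items].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

// Two machines returning the same files in different orders...
const linuxOrder = ['v10.0.0', 'v9.1.0', 'README.md'];
const macOrder   = ['README.md', 'v9.1.0', 'v10.0.0'];

// ...produce the same sequence after sorting.
console.log(sortEntries(linuxOrder).join() === sortEntries(macOrder).join()); // true
```

Note that this ordering is lexicographic, so v10.0.0 sorts before v9.1.0. That is acceptable here because the goal is a stable order, not a semantically meaningful one.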
tobiu closed this issue on Jan 5, 2026, 11:36 PM
The current Knowledge Base embedding logic relies on positional IDs (id_${index}) and OS-dependent file ordering (fs.readdir).
The Problem:
fs.readdir returns files in arbitrary order. If the order changes (e.g., on a different OS), indices shift, causing a full re-embedding.
The Solution:
- Update VectorService.mjs to use chunk.hash as the ChromaDB Record ID instead of id_${index}. This makes IDs content-addressable and position-independent.
- Sort the output of every source extractor (ApiSource, TicketSource, ReleaseNotesSource, TestSource, LearningSource). This ensures the generated ai-knowledge-base.jsonl file remains stable in Git history.
Implementation Plan:
- Update VectorService.embed() to index existing documents by ID (which will now be the hash).
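The effect of moving from positional to hash-based IDs can be illustrated with a small sketch. It assumes each chunk already carries a hash field, and the helper names (positionalIds, hashIds, countReusedIds) are made up for this example and do not come from the Neo.mjs code base.

```javascript
// Illustration only: compare how many record IDs survive an insertion.

// Old scheme: the ID depends on the chunk's position in the list.
function positionalIds(chunks) {
    return new Map(chunks.map((chunk, index) => [`id_${index}`, chunk.hash]));
}

// New scheme: the ID is the content hash itself.
function hashIds(chunks) {
    return new Map(chunks.map(chunk => [chunk.hash, chunk.hash]));
}

// Count IDs that exist in both runs AND still point at the same content,
// i.e. records that would not need re-embedding.
function countReusedIds(before, after) {
    let reused = 0;
    for (const [id, hash] of after) {
        if (before.get(id) === hash) {
            reused++;
        }
    }
    return reused;
}

const before = [{hash: 'aaa'}, {hash: 'bbb'}, {hash: 'ccc'}];
// A new document sorts to the front, shifting every later index by one.
const after  = [{hash: 'new'}, ...before];

console.log(countReusedIds(positionalIds(before), positionalIds(after))); // 0
console.log(countReusedIds(hashIds(before), hashIds(after)));             // 3
```

With positional IDs, a single inserted document leaves no reusable record in this example, while with hash IDs only the new document needs embedding.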