LearnNewsExamplesServices
Frontmatter
id8485
titleImplement and Integrate Knowledge Base Defragmentation Script
stateClosed
labels
enhancementaibuild
assigneestobiu
createdAtJan 10, 2026, 12:12 AM
updatedAtJan 10, 2026, 12:17 AM
githubUrlhttps://github.com/neomjs/neo/issues/8485
authortobiu
commentsCount1
parentIssuenull
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtJan 10, 2026, 12:17 AM

Implement and Integrate Knowledge Base Defragmentation Script

Closed v11.20.0 enhancementaibuild
tobiu
tobiu commented on Jan 10, 2026, 12:12 AM

The knowledge base vector index (ChromaDB) suffers from severe fragmentation (bloat) when large numbers of documents are updated or deleted, as HNSW indices do not immediately reclaim disk space. This resulted in the DB growing from ~100MB to ~500MB despite constant content size.

This ticket tracks the implementation of buildScripts/defragKnowledgeBase.mjs to:

  1. Connect to the local ChromaDB.
  2. Buffer all data into memory.
  3. Recreate the collection (wiping the old index).
  4. Restore the data.
  5. Physically delete orphaned index directories.
  6. Log the final database size.

This script will be exposed as npm run ai:defrag-kb and should be run before releases to ensure compact artifacts.

tobiu added the enhancement label on Jan 10, 2026, 12:12 AM
tobiu added the ai label on Jan 10, 2026, 12:12 AM
tobiu added the build label on Jan 10, 2026, 12:12 AM
tobiu referenced in commit c1b86a2 - "feat: Add Knowledge Base Defragmentation Script (#8485)" on Jan 10, 2026, 12:16 AM
tobiu assigned to @tobiu on Jan 10, 2026, 12:16 AM
tobiu
tobiu Jan 10, 2026, 12:17 AM

Input from Gemini 3 Pro:

✦ I have implemented the knowledge base defragmentation script (buildScripts/defragKnowledgeBase.mjs) and added the ai:defrag-kb script to package.json.

Implementation Details:

  • In-Memory Buffer: The script loads all vector data into memory before resetting the database, avoiding complex temp files for now (safe for datasets < 1GB).
  • Orphan Cleanup: It automatically identifies and removes orphaned HNSW index directories, which were the primary cause of the 400MB+ bloat.
  • Data Validation: It ensures that documents passed to ChromaDB are strictly strings, handling null values that previously caused crashes.
  • Verification: The script logs the final database directory size, confirming the reduction from ~466MB to ~56MB.

The script is now available for use in the release pipeline.

tobiu closed this issue on Jan 10, 2026, 12:17 AM