Frontmatter
id: 11653
title: KB importDatabase rejects backups with `document: null` (steady-state KB shape)
state: Closed
labels: bug, ai
assignees: neo-opus-ada
createdAt: May 19, 2026, 5:37 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11653
author: neo-opus-ada
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 20, 2026, 2:50 AM

KB importDatabase rejects backups with document: null (steady-state KB shape)

neo-opus-ada commented on May 19, 2026, 5:37 PM

Context

KB restore from backup-2026-05-16T13-08-06.565Z and backup-2026-05-19T13-08-14.283Z fails with:

❌ Restore failed: Error: DATABASE_IMPORT_ERROR: Expected each document to be a string, but got object
    at DatabaseService.importDatabase (file:///Users/Shared/github/neomjs/neo/ai/services/knowledge-base/DatabaseService.mjs:345:33)

Discovered while executing operator-directed MC recovery on 2026-05-19. KB restore is the first embedded substrate in the bundle order; the error fired before MC could be touched. Worked around with --only-substrate=mc for the urgent recovery, but KB restore is now structurally broken for any bundle.

The Problem

KB chunks are stored in Chroma with the chunk body in metadata.content, NOT in Chroma's primary document field. This is the intentional architectural pattern — verified empirically:

Inspection of backup backup-2026-05-19T13-08-14.283Z/kb/...jsonl (24,418 records)
document: null — 24,418 records (100%)
document: string — 0 records
metadata contains content, source, name, hash, kind, className, etc.
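The tally above can be reproduced with a few lines of JavaScript. This is a minimal sketch, not the tooling actually used for the inspection: `countDocumentShapes` is a hypothetical helper, and it assumes the backup file holds one JSON record per line with the `id` / `embedding` / `metadata` / `document` fields written by the export side.

```javascript
// Hypothetical helper (not part of DatabaseService.mjs): tally the shape of
// the `document` field across JSONL backup records, one JSON object per line.
function countDocumentShapes(jsonlText) {
    const counts = {total: 0, nullDocs: 0, stringDocs: 0, other: 0};

    for (const line of jsonlText.split('\n')) {
        if (line.trim() === '') continue; // tolerate blank lines and a trailing newline

        const record = JSON.parse(line);
        counts.total++;

        if (record.document == null) {
            counts.nullDocs++;            // null or missing: the KB shape
        } else if (typeof record.document === 'string') {
            counts.stringDocs++;          // string body: the MC-style shape
        } else {
            counts.other++;               // anything else would be unexpected
        }
    }

    return counts;
}

// Synthetic records for illustration only; real input would be the file contents.
const sample = [
    JSON.stringify({id: 'a', embedding: [0.1], metadata: {content: 'chunk body'}, document: null}),
    JSON.stringify({id: 'b', embedding: [0.2], metadata: {}, document: 'memory body'})
].join('\n');

console.log(countDocumentShapes(sample));
```

For a real bundle the text would come from reading the `.jsonl` file; a streaming reader may be preferable for large backups.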

The export side at ai/services/knowledge-base/DatabaseService.mjs:200 correctly writes document: batch.documents[i] (the actual Chroma state — null). The import side at DatabaseService.mjs:332 forwards null into Chroma's upsert() documents array; Chroma rejects with the error above (Chroma requires document strings, not nulls).

This means every KB backup ever taken cannot be restored with the current importDatabase code path, since KB's steady-state Chroma shape always has null documents.

The Architectural Reality

Export-side (correct, no change):

// ai/services/knowledge-base/DatabaseService.mjs:195-201
const record = {
    id       : batch.ids[i],
    embedding: batch.embeddings[i],
    metadata : batch.metadatas[i],
    document : batch.documents[i]  // null for KB chunks — by design
};

Import-side (broken at the boundary):

// ai/services/knowledge-base/DatabaseService.mjs:326-334
const BATCH_SIZE = 500;
for (let i = 0; i < records.length; i += BATCH_SIZE) {
    const batch = records.slice(i, i + BATCH_SIZE);
    await collection.upsert({
        ids       : batch.map(r => r.id),
        embeddings: batch.map(r => r.embedding),
        metadatas : batch.map(r => r.metadata),
        documents : batch.map(r => r.document)  // ← null array → Chroma rejects
    });
    imported += batch.length;
}

Architectural anchor: KB chunks live alongside the recent Phase 0/1A parsed-chunk-v1 contract which also routes chunk text through metadata.content. The pattern is uniform — Chroma's document field is unused on the KB side. MC memories are the inverse (document is the memory body string), which is why MC restore works fine.

The Fix

Make importDatabase Chroma-shape-agnostic at the boundary: pass documents only when at least one record has a non-null document; otherwise omit the field. This handles both shapes uniformly:

// ai/services/knowledge-base/DatabaseService.mjs:326-334
const BATCH_SIZE = 500;
for (let i = 0; i < records.length; i += BATCH_SIZE) {
    const batch      = records.slice(i, i + BATCH_SIZE);
    const upsertArgs = {
        ids       : batch.map(r => r.id),
        embeddings: batch.map(r => r.embedding),
        metadatas : batch.map(r => r.metadata)
    };
    // Only send `documents` when at least one record in the batch carries one.
    const hasDocs = batch.some(r => r.document != null);
    if (hasDocs) {
        // Mixed batch: Chroma needs a string per element, so nulls become ''.
        upsertArgs.documents = batch.map(r => r.document ?? '');
    }
    await collection.upsert(upsertArgs);
    imported += batch.length;
}

The empty-string fallback (?? '') inside the hasDocs branch handles mixed batches (rare but possible — a single non-null forces the field; remaining nulls become empty strings to satisfy Chroma's per-array-element string requirement). Pure-null batches skip the documents key entirely so Chroma never sees a null.
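To make the three batch shapes concrete, the branch can be isolated as a pure function and exercised without a Chroma instance. This is only a sketch: `buildUpsertArgs` is a hypothetical name introduced for illustration and does not exist in DatabaseService.mjs; the record fields follow the export-side shape shown earlier.

```javascript
// Hypothetical extraction of the proposed branch into a pure function.
function buildUpsertArgs(batch) {
    const args = {
        ids       : batch.map(r => r.id),
        embeddings: batch.map(r => r.embedding),
        metadatas : batch.map(r => r.metadata)
    };

    // Add `documents` only if some record has one; nulls in a mixed batch become ''.
    if (batch.some(r => r.document != null)) {
        args.documents = batch.map(r => r.document ?? '');
    }

    return args;
}

// Synthetic records: one KB-shaped (null document), one MC-shaped (string document).
const kbRecord = {id: 'kb-1', embedding: [0.1], metadata: {content: 'chunk'}, document: null};
const mcRecord = {id: 'mc-1', embedding: [0.2], metadata: {}, document: 'memory body'};

console.log('documents' in buildUpsertArgs([kbRecord]));      // false
console.log(buildUpsertArgs([mcRecord]).documents);           // [ 'memory body' ]
console.log(buildUpsertArgs([kbRecord, mcRecord]).documents); // [ '', 'memory body' ]
```

Keeping the shape decision in a pure function like this is one way to unit-test the import boundary without a live collection.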

Acceptance Criteria

  • ai/services/knowledge-base/DatabaseService.mjs:326-334 importDatabase handles document: null records without throwing.
  • End-to-end: npm run ai:restore -- <bundle-path> --mode merge --only-substrate=kb succeeds against backup-2026-05-19T13-08-14.283Z (24,418 KB chunks restored). [L3-deferred — operator handoff needed] (agent sandbox cannot drive a real chromadb-restore against the live bundle at PR-build time; PR #11657 ships L2 unit-test coverage at the import-boundary including a real-backup-shape reproducer).
  • Unit test: new spec under test/playwright/unit/ai/services/knowledge-base/ covers three batch shapes — all-null documents (KB shape), all-string documents (MC-style hypothetical), mixed null+string.
  • Round-trip parity: export → restore on a fresh test collection produces byte-identical chunk count + identity-tuple membership. [L3-deferred — operator handoff needed] (sandbox cannot drive live KB collection export-then-restore round-trip at PR-build time; PR #11657 verifies the import-boundary semantics that the round-trip depends on).
  • No regression to MC restore path (Memory_DatabaseService.importDatabase is a peer-method; verify whichever import path MC uses has the same fix if it has the same shape).
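The round-trip parity criterion could be checked along the following lines. This is a sketch under stated assumptions: the ticket does not define the identity tuple, so `identityKey` assumes it is the pair of record `id` and `metadata.hash`, and both `identityKey` and `compareRoundTrip` are hypothetical names.

```javascript
// ASSUMPTION: the identity tuple is (id, metadata.hash); the ticket does not define it.
function identityKey(record) {
    return JSON.stringify([record.id, record.metadata?.hash ?? null]);
}

// Compare exported records against the records read back after a restore.
function compareRoundTrip(exported, restored) {
    const exportedKeys = new Set(exported.map(identityKey));
    const restoredKeys = new Set(restored.map(identityKey));

    const missing      = [...exportedKeys].filter(key => !restoredKeys.has(key));
    const unexpected   = [...restoredKeys].filter(key => !exportedKeys.has(key));
    const countMatches = exported.length === restored.length;

    return {
        countMatches,
        missing,
        unexpected,
        ok: countMatches && missing.length === 0 && unexpected.length === 0
    };
}

// Synthetic data for illustration; order differences must not affect the result.
const exportedRecords = [
    {id: 'a', metadata: {hash: 'h1'}, document: null},
    {id: 'b', metadata: {hash: 'h2'}, document: null}
];
const restoredRecords = [exportedRecords[1], exportedRecords[0]];

console.log(compareRoundTrip(exportedRecords, restoredRecords).ok); // true
```

Set membership rather than array equality is used so that batch ordering on either side does not produce false failures.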

Out of Scope

  • Re-engineering KB to store chunk content in Chroma's document field instead of metadata.content (the current pattern is intentional + symmetric with parsed-chunk-v1 / Phase 0/1A contracts).
  • Backup-shape schema validation (backup-record-v1.schema.json introduced in #11647 already declares document as nullable; this ticket fixes the consumer, not the contract).
  • The unrelated MC wipe incident (separate recovery executed this turn; substrate hardening tracked in #11652).

Avoided Traps

| Trap | Why rejected |
| --- | --- |
| Fix export side to substitute null → '' | Mutates a true representation of Chroma state into a synthetic one. Round-trip identity-tuple checks would diverge. Import-side fix preserves source-of-truth. |
| Require document non-null in backup-record-v1.schema.json | Would invalidate every backup ever taken (KB shape always has null). The schema correctly declares document as ["string", "null"]. |
| Throw a clearer error message but require manual fix per backup | Backups are operator-recovery substrate. They must restore without manual intervention. |

Related

  • Sibling architectural pattern: Phase 0/1A parsed-chunk-v1 schema (#11625 / #11647 merged) — chunk text routes through metadata, not Chroma's document.
  • Backup-record schema: ai/services/knowledge-base/parser/backup-record-v1.schema.json (shipped in #11647)
  • Export side (no change needed): ai/services/knowledge-base/DatabaseService.mjs:195-201
  • Discovered during: MC wipe recovery this turn (#11652 hardening covers the underlying wipe-prevention).

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'KB importDatabase document null Chroma upsert error'})
  • ask_knowledge_base({query: 'knowledge base chunk content metadata not document', type: 'src'})
  • Empirical anchor: 24,418/24,418 KB backup records have document: null in backup-2026-05-19T13-08-14.283Z
  • Restore reproducer: npm run ai:restore -- .neo-ai-data/backups/<any-bundle> --mode merge --only-substrate=kb → DATABASE_IMPORT_ERROR
tobiu referenced in commit fdfb48f - "fix(ai): KB importDatabase tolerates null document field (#11653) (#11657)" on May 20, 2026, 2:50 AM
tobiu closed this issue on May 20, 2026, 2:50 AM