Frontmatter
id: 11625
title: KB Ingestion Phase 0/1: Contracts, Source/Parser Registry, KB Tenant Isolation
state: Closed
labels: epic, ai, architecture
assignees: neo-opus-ada
createdAt: May 19, 2026, 1:34 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11625
author: neo-opus-ada
commentsCount: 3
parentIssue: 11624
subIssues
11629 Phase 0/1A — Ingestion Contracts: parsed-chunk-v1 + backup-record-v1 + Path-Identity Tuple + Tombstone Spec
11630 Phase 0/1B — Source/Parser Registry Extraction + Per-Source Path Externalization + Byte-Equivalence Fixture
11631 Phase 0/1C — KB Tenant Isolation Write-Side: VectorService Server-Stamping + Tenant-Aware chunkId + Spoof-Rejection
11632 Phase 0/1D — KB Tenant Isolation Read-Side: QueryService/SearchService where-Filter + Fail-Closed Test Suite
11658 KB Ingestion Phase 0/1B: Source/Parser registry + useDefaultSources/useDefaultParsers configs
11660 KB Ingestion Phase 0/1B-β: Externalize per-source paths to aiConfig.sourcePaths
subIssuesCompleted: 6
subIssuesTotal: 6
blockedBy: []
blocking: [x] 11626 KB Ingestion Phase 2: Ingestion Service + MCP Small-Batch Facade + Bulk Facade
closedAt: May 20, 2026, 12:57 PM

KB Ingestion Phase 0/1: Contracts, Source/Parser Registry, KB Tenant Isolation

Closed v13.0.0/archive-v13-0-0-chunk-12 epic ai architecture
neo-opus-ada commented on May 19, 2026, 1:34 PM

Context

Phase 0/1 Epic (child of meta-Epic #11624, Cloud-Native KB Ingestion for External Workspaces). Graduated from Discussion #11623. This phase ships the contracts (parsed-chunk-v1, backup-record-v1, path-identity tuple, tombstone/manifest/revision-boundary, source/parser registry) and KB tenant isolation (the memorySharing pattern applied to the knowledge-base Chroma collection) BEFORE the Phase 2 ingestion service implementation. This is the substrate-correct ordering per cross-family peer convergence.

Topology anchor: Per ADR 0003 — Chroma Topology Unified Only, KB + MC are SEPARATE MCP servers sharing ONE Chroma daemon but maintaining SEPARATE collections (knowledge-base, neo-agent-memory, neo-agent-sessions). This Phase adds tenant-scoping metadata + filter logic to the knowledge-base collection ONLY — no topology mutation, no storage relocation, no collection sharing.

Standalone win: enables same-server custom workspaces (useDefaultSources / useDefaultParsers configurability + custom source/parser registration) without network substrate. Sets the stable contract floor that Phase 2 facades can build on.

The Problem

The current KB substrate hardcodes a single-repo assumption at multiple layers (Epic #11624 "Architectural Reality" section). This phase removes the structural blockers in the substrate FIRST, defining stable contracts before any new transport endpoint exists:

  1. Hardcoded source-class array at DatabaseService.mjs:460-471
  2. Per-Source hardcoded paths (e.g., ApiSource.sourceMap maps Neo-specific paths)
  3. memorySharing enum is Memory-Core-only today (0 KB references; verified via grep). Pattern reused, infrastructure new.
  4. Path-determinism baked into chunk emit + hydration (ApiSource.mjs:101-105 + SearchService.mjs:118-120)
  5. Content-hash delta only deletes under full-corpus sync (VectorService.mjs:198-207); incremental push needs explicit deletion-signaling
  6. importDatabase conflated with ingest — actually RESTORE-only (skips re-embedding); distinct contract from ingest

The Architectural Reality

This phase touches:

File | Change
DatabaseService.mjs | Replace hardcoded source array with data-driven registry; thread useDefaultSources / useDefaultParsers config
source/Base.mjs | Abstract extract(writeStream, createHashFn) contract PRESERVED (already clean)
source/ApiSource.mjs + 9 sibling sources | Externalize hardcoded paths to config; data-driven registration
VectorService.mjs | Write-side: inject server-derived {tenantId, visibility, originAgentIdentity?} at embed upsert; tenant-aware chunkId hash derivation; reject/overwrite client-supplied tenant fields
QueryService.mjs | Read-side: inject where: {tenantId: {$in: [<requester>, '<team-namespace>']}} from authenticated AgentIdentity into every collection.query() call
SearchService.mjs | Tenant-aware hydration per Q12 (chunk-metadata-embedded vs server-mirror; choice deferred to Phase 2, Phase 0/1 only marks the boundary)
NEW ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json | JSON Schema for client-side parser output
NEW ai/services/knowledge-base/parser/backup-record-v1.schema.json | JSON Schema formalizing existing importDatabase {id, embedding, metadata, document} shape

The Fix

1. Source/Parser Registry Extraction

Replace hardcoded sources array in DatabaseService.createKnowledgeBase() with data-driven registry consumed by aiConfig.knowledgeBase.sources (or similar). Default value preserves current 10-source set when useDefaultSources: true. useDefaultParsers: true similarly preserves current parser binding. Custom sources/parsers register via explicit API.
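A minimal sketch of what such a registry could look like, to make the default-inheritance behaviour concrete. Only useDefaultSources comes from this ticket; SourceRegistry, register() and resolve() are hypothetical names, and the single 'api' entry is a stand-in for the real default source set.

```javascript
// Hypothetical sketch: SourceRegistry, register() and resolve() are
// illustrative names, not confirmed Neo APIs.
class SourceRegistry {
    constructor(defaultSources = []) {
        this.defaults = new Map(defaultSources.map(source => [source.id, source]));
        this.custom   = new Map();
    }

    // Explicit registration API for custom sources.
    register(source) {
        if (!source?.id || typeof source.extract !== 'function') {
            throw new TypeError('source needs an id and an extract() method');
        }
        if (this.defaults.has(source.id) || this.custom.has(source.id)) {
            throw new Error(`source id already registered: ${source.id}`);
        }
        this.custom.set(source.id, source);
        return this;
    }

    // Resolve the active source list from config.
    // useDefaultSources: true keeps the built-in set, false yields custom sources only.
    resolve({useDefaultSources = true} = {}) {
        const defaults = useDefaultSources ? [...this.defaults.values()] : [];
        return [...defaults, ...this.custom.values()];
    }
}

const registry = new SourceRegistry([
    {id: 'api', extract() {}} // stand-in for a default source such as ApiSource
]);

registry.register({id: 'my-docs', extract() {}});

const withDefaults    = registry.resolve({useDefaultSources: true}).map(source => source.id);
const withoutDefaults = registry.resolve({useDefaultSources: false}).map(source => source.id);
```

Rejecting duplicate ids at registration time is a design assumption here: it keeps a custom source from silently shadowing a default one.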

2. parsed-chunk-v1 JSON Schema

Define at ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json:

{
  "$id": "neo:parsed-chunk-v1",
  "schemaVersion": "1.0.0",
  "tenantId": "string",
  "repoSlug": "string",
  "rootKind": "neo-workspace | bare-repo | ...",
  "sourcePath": "string",
  "content": "string",
  "hashInputs": ["type","name","content","..."],
  "parserId": "string",
  "parserVersion": "semver-like",
  "kind": "module-context | class-properties | class-config | method | doc-section | skill-section | ...",
  "name": "string",
  "line_start": "integer?",
  "line_end": "integer?",
  "className": "string?",
  "extends": "string?",
  "customMeta": "object?"
}

Server-side validator rejects records missing required identity fields OR carrying an embedding field (routed to restore-only path).
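The two rejection rules can be sketched as a hand-rolled check. This is an illustration only: it assumes the path-identity tuple is the set of "required identity fields", and validateParsedChunk is a hypothetical name. A real implementation would presumably validate against the JSON Schema file itself.

```javascript
// Assumption: the path-identity tuple doubles as the "required identity fields".
const REQUIRED_IDENTITY_FIELDS = ['tenantId', 'repoSlug', 'rootKind', 'sourcePath'];

// Hypothetical validator; returns all problems instead of failing on the first.
function validateParsedChunk(record) {
    if (record === null || typeof record !== 'object') {
        return {valid: false, errors: ['record must be an object']};
    }

    const errors = [];

    for (const field of REQUIRED_IDENTITY_FIELDS) {
        if (typeof record[field] !== 'string' || record[field].length === 0) {
            errors.push(`missing required identity field: ${field}`);
        }
    }

    // Records with a precomputed embedding belong to the restore-only path.
    if ('embedding' in record) {
        errors.push('embedding not allowed on ingest; use the restore-only import path');
    }

    return {valid: errors.length === 0, errors};
}

const ok = validateParsedChunk({
    tenantId  : 'tenant-a',
    repoSlug  : 'acme/app',
    rootKind  : 'bare-repo',
    sourcePath: 'src/Main.mjs',
    content   : 'class Main {}'
});

const bad = validateParsedChunk({tenantId: 'tenant-a', embedding: [0.1, 0.2]});
```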

3. backup-record-v1 JSON Schema

Define at ai/services/knowledge-base/parser/backup-record-v1.schema.json formalizing existing {id, embedding, metadata, document} shape. Distinct contract from ingest. Used only by manageDatabaseBackup({action: 'import'}) (restore-only path).
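For contrast with the ingest contract, a shape check for the restore-only record might look like the sketch below. The field names come from the existing {id, embedding, metadata, document} shape; the concrete types (string id, numeric embedding array) are assumptions, and isBackupRecord is a hypothetical helper.

```javascript
// Hypothetical shape check for a backup-record-v1 entry.
// Field types are assumptions; the ticket only names the four fields.
function isBackupRecord(record) {
    return record !== null
        && typeof record === 'object'
        && typeof record.id === 'string'
        && Array.isArray(record.embedding)
        && record.embedding.every(Number.isFinite)
        && typeof record.document === 'string'
        && typeof record.metadata === 'object'
        && record.metadata !== null;
}

const restorable = isBackupRecord({
    id       : 'chunk-1',
    embedding: [0.12, -0.4, 0.9],
    metadata : {kind: 'method'},
    document : 'createKnowledgeBase() {}'
});

// A parsed chunk has no embedding, so it does not qualify for the restore path.
const parsedChunk = isBackupRecord({tenantId: 'tenant-a', content: 'class Main {}'});
```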

4. Path-Identity Tuple

Document {tenantId, repoSlug, rootKind, sourcePath} semantics in learn/agentos/cloud-deployment/ (the placeholder doc lands in Phase 3; Phase 0/1 lands inline JSDoc references). chunk.metadata.source becomes the structured tuple, NOT the bare neoRootDir-relative string.
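To illustrate the difference between the structured tuple and a bare path string, here is a small sketch. The helper names and the path normalisation (forward slashes, no leading "./") are assumptions made for this example, not part of the contract.

```javascript
// Hypothetical helper: builds the structured source identity.
// The separator normalisation is an assumption for this sketch.
function toPathIdentity({tenantId, repoSlug, rootKind, sourcePath}) {
    const normalised = sourcePath.replace(/\\/g, '/').replace(/^\.\//, '');

    return Object.freeze({tenantId, repoSlug, rootKind, sourcePath: normalised});
}

// Serialise as a JSON array so no delimiter inside a field can cause
// two different tuples to produce the same key.
function pathIdentityKey(identity) {
    return JSON.stringify([
        identity.tenantId,
        identity.repoSlug,
        identity.rootKind,
        identity.sourcePath
    ]);
}

const a = toPathIdentity({tenantId: 'tenant-a', repoSlug: 'acme/app', rootKind: 'bare-repo', sourcePath: '.\\src\\a.mjs'});
const b = toPathIdentity({tenantId: 'tenant-b', repoSlug: 'acme/app', rootKind: 'bare-repo', sourcePath: 'src/a.mjs'});

const keyA = pathIdentityKey(a);
const keyB = pathIdentityKey(b);
```

The same sourcePath under two tenants yields two different keys, which a bare relative path cannot express.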

5. Tombstone / Manifest / Revision-Boundary Contract

Spec'd in same schema directory: three mutually-supporting deletion-signaling mechanisms:

  • Explicit tombstones ({deleted: [paths]})
  • Manifest snapshot ({manifestSnapshot: {pathsAfterPush}})
  • Revision boundary ({baseRevision, headRevision})

Phase 0/1 spec only; Phase 2 wires into endpoint.
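One way the three signals could combine on the server side is sketched below. The payload keys (deleted, manifestSnapshot.pathsAfterPush, baseRevision, headRevision) come from the list above; resolveDeletions, its parameters, and the stale-base rejection semantics are assumptions for illustration.

```javascript
// Hypothetical resolver: derives the set of paths to delete from one push.
function resolveDeletions(existingPaths, storedRevision, push) {
    const {deleted = [], manifestSnapshot, baseRevision, headRevision} = push;

    // Revision boundary (assumed semantics): refuse a push that was
    // computed against a revision the server no longer holds.
    if (baseRevision !== undefined && baseRevision !== storedRevision) {
        throw new Error(`stale push: base ${baseRevision} != stored ${storedRevision}`);
    }

    const existing = new Set(existingPaths);
    const toDelete = new Set();

    // Explicit tombstones: only paths the server actually knows about.
    for (const path of deleted) {
        if (existing.has(path)) toDelete.add(path);
    }

    // Manifest snapshot: anything stored but absent after the push is gone.
    if (manifestSnapshot) {
        const keep = new Set(manifestSnapshot.pathsAfterPush);

        for (const path of existing) {
            if (!keep.has(path)) toDelete.add(path);
        }
    }

    return {
        toDelete    : [...toDelete].sort(),
        nextRevision: headRevision ?? storedRevision
    };
}

const result = resolveDeletions(['a.mjs', 'b.mjs', 'c.mjs'], 'r1', {
    deleted         : ['a.mjs', 'z.mjs'],
    manifestSnapshot: {pathsAfterPush: ['b.mjs', 'd.mjs']},
    baseRevision    : 'r1',
    headRevision    : 'r2'
});
```

In this example the tombstone for the unknown path z.mjs is ignored, while c.mjs is removed because the manifest no longer lists it.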

6. memorySharing KB Port — Two Halves

Write-side (server-stamping invariant):

  • VectorService.embed upsert injects {tenantId, visibility, originAgentIdentity?} from server-authenticated AgentIdentity context
  • Ingestion path REJECTS or server-OVERWRITES client-supplied tenant/visibility fields (configurable: overwrite + warning log; REJECT escalation on spoof-rate telemetry threshold)
  • Tenant-aware chunkId hash derivation: hash includes tenantId + repoSlug so same source content under two tenants yields distinct ids
  • Neo's curated content tagged with shared tenantId constant (e.g., 'neo-shared'); per-tenant content tagged with <tenantId>
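The server-stamping invariant could be sketched as follows. The stamped field names come from this ticket; stampChunk, the identity shape ({tenantId, visibility, agentId}), the 'private' visibility value and the mode option are hypothetical. Note that client-supplied tenant fields are stripped before stamping, so an optional field the server does not set cannot survive from the payload.

```javascript
const TENANT_FIELDS = ['tenantId', 'visibility', 'originAgentIdentity'];

// Hypothetical stamping helper. `identity` stands for the server-authenticated
// AgentIdentity context; its shape here is an assumption.
function stampChunk(metadata, identity, {mode = 'overwrite', warn = console.warn} = {}) {
    const supplied = TENANT_FIELDS.filter(field => field in metadata);

    if (supplied.length > 0) {
        if (mode === 'reject') {
            throw new Error(`client-supplied tenant fields rejected: ${supplied.join(', ')}`);
        }
        warn(`overwriting client-supplied tenant fields: ${supplied.join(', ')}`);
    }

    // Strip every tenant field first, then add the server-derived values.
    const clean = Object.fromEntries(
        Object.entries(metadata).filter(([key]) => !TENANT_FIELDS.includes(key))
    );

    const stamped = {...clean, tenantId: identity.tenantId, visibility: identity.visibility};

    if (identity.agentId !== undefined) {
        stamped.originAgentIdentity = identity.agentId;
    }

    return stamped;
}

const warnings = [];

const stamped = stampChunk(
    {name: 'Main', tenantId: 'victim-tenant'},                          // forged tenantId
    {tenantId: 'tenant-a', visibility: 'private', agentId: 'agent-1'},  // server-side identity
    {warn: message => warnings.push(message)}
);
```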

Read-side (retrieval filter):

  • QueryService.queryDocuments + SearchService inject where: {tenantId: {$in: [<requester>, '<team-namespace>']}} into every Chroma collection.query call
  • Filter context derived from server-side authenticated AgentIdentity, NOT untrusted client payload
  • Mirrors MemoryService.mjs:391-410 query-time policy filter pattern
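A sketch of the read-side injection, assuming a Chroma-style collection.query({where}) call with the $in and $and filter operators. buildTenantWhere and scopedQuery are hypothetical names, the 'neo-shared' default reuses the example constant from the write-side list, and fakeCollection is a stand-in so the example runs without a Chroma daemon.

```javascript
// Hypothetical: derive the tenant filter from the server-side identity only.
function buildTenantWhere(identity, teamNamespace = 'neo-shared') {
    if (!identity || typeof identity.tenantId !== 'string' || identity.tenantId === '') {
        // Fail closed: no authenticated tenant means no query at all.
        throw new Error('no authenticated tenant: refusing to query');
    }

    return {tenantId: {$in: [identity.tenantId, teamNamespace]}};
}

// Hypothetical wrapper: AND the caller's filter with the tenant filter so a
// client-supplied where clause can only narrow the result set, never widen it.
function scopedQuery(collection, identity, queryArgs) {
    const tenantWhere = buildTenantWhere(identity);
    const where       = queryArgs.where ? {$and: [queryArgs.where, tenantWhere]} : tenantWhere;

    return collection.query({...queryArgs, where});
}

// Stand-in collection that records what it was asked.
const calls          = [];
const fakeCollection = {
    query(args) {
        calls.push(args);
        return {ids: [[]]};
    }
};

scopedQuery(fakeCollection, {tenantId: 'tenant-a'}, {
    queryTexts: ['source registry'],
    nResults  : 5,
    where     : {type: 'src'}
});
```

Routing every facade through one wrapper like this is what the "forgotten filter call" concern is about: a call site that bypasses it would query unscoped.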

7. Byte-Equivalence Fixture

Test fixture: current Neo source output (10 sources × current parsers × current paths) BEFORE registry extraction === output AFTER. This verifies that the migration doesn't regress retrieval quality and that adding the tenantId field doesn't perturb chunk-hash semantics for existing content (the hash-derivation function stays backwards-compatible for Neo's tenantId).
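One possible way to keep the hash derivation backwards-compatible is sketched below, limited to chunk ids rather than the full source output. Everything here is an assumption: the legacy hash inputs (type, name, content), the rule that the shared tenant keeps the legacy derivation, and the function names. The FNV-1a function is only a dependency-free stand-in for whatever createHashFn the sources already receive.

```javascript
// Stand-in 32-bit FNV-1a over UTF-16 code units, used only to keep the
// sketch dependency-free. Not a suggestion for the real hash.
function fnv1a(text) {
    let hash = 0x811c9dc5;

    for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash  = Math.imul(hash, 0x01000193) >>> 0;
    }

    return hash.toString(16).padStart(8, '0');
}

const SHARED_TENANT_ID = 'neo-shared'; // example constant from this ticket

// Assumed pre-migration derivation.
function legacyChunkId(chunk, hashFn) {
    return hashFn(JSON.stringify([chunk.type, chunk.name, chunk.content]));
}

// Assumed post-migration derivation: shared content keeps its old ids,
// tenant content mixes tenantId + repoSlug into the hash inputs.
function tenantAwareChunkId(chunk, {tenantId, repoSlug}, hashFn) {
    if (tenantId === SHARED_TENANT_ID) {
        return legacyChunkId(chunk, hashFn);
    }

    return hashFn(JSON.stringify([tenantId, repoSlug, chunk.type, chunk.name, chunk.content]));
}

const fixture = [
    {type: 'method',      name: 'createKnowledgeBase', content: 'createKnowledgeBase() {}'},
    {type: 'doc-section', name: 'Intro',               content: '# Intro'}
];

const before = fixture.map(chunk => legacyChunkId(chunk, fnv1a));
const after  = fixture.map(chunk => tenantAwareChunkId(chunk, {tenantId: SHARED_TENANT_ID, repoSlug: 'neomjs/neo'}, fnv1a));

const idA = tenantAwareChunkId(fixture[0], {tenantId: 'tenant-a', repoSlug: 'acme/app'}, fnv1a);
const idB = tenantAwareChunkId(fixture[0], {tenantId: 'tenant-b', repoSlug: 'acme/app'}, fnv1a);
```

The fixture assertion is then simply that before and after are identical for shared content, while idA and idB differ for the same chunk under two tenants.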

8. Fail-Closed Test Suite

Required tenant-isolation tests:

  • Forged client tenantId rejected/overwritten through every public KB query facade
  • Forged client visibility rejected/overwritten
  • Tenant A cannot retrieve tenant B private chunks
  • Neo team (shared) chunks visible across tenants
  • Same sourcePath under two tenants → distinct chunk ids (no shadow attack)
  • Chunk-schema-v1 validation: rejects records carrying embedding field outside restore mode (forces routing through manageDatabaseBackup)
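A few of these cases can be exercised against an in-memory stand-in before any Chroma wiring exists. FakeKb below is hypothetical and deliberately tiny; the real suite would run under test/playwright/unit/ai/knowledge-base/ against the actual services.

```javascript
// In-memory stand-in for the knowledge-base collection (not the Chroma client).
class FakeKb {
    constructor() {
        this.rows = [];
    }

    // Write path: tenantId comes from the server-side identity, never from the payload.
    add(identity, payload) {
        this.rows.push({...payload, tenantId: identity.tenantId});
    }

    // Read path: only the requester's tenant plus the shared namespace are visible.
    query(identity) {
        const allowed = new Set([identity.tenantId, 'neo-shared']);

        return this.rows.filter(row => allowed.has(row.tenantId));
    }
}

const kb = new FakeKb();

kb.add({tenantId: 'neo-shared'}, {doc: 'framework docs'});
kb.add({tenantId: 'tenant-a'},   {doc: 'a-private'});
kb.add({tenantId: 'tenant-b'},   {doc: 'b-private', tenantId: 'tenant-a'}); // forged tenantId

const seenByA = kb.query({tenantId: 'tenant-a'}).map(row => row.doc);
const seenByB = kb.query({tenantId: 'tenant-b'}).map(row => row.doc);
```

Tenant A sees the shared docs and its own chunk, and the forged tenantId on tenant B's chunk does not make it visible to tenant A.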

Acceptance Criteria

  • useDefaultSources / useDefaultParsers boolean configs in aiConfig (default true for zero-config Neo deployments)
  • Hardcoded source array at DatabaseService.mjs:460-471 replaced with data-driven registry
  • Per-source paths (ApiSource.sourceMap etc.) externalized to config
  • Custom source / custom parser registration API documented + tested
  • ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json created with validator at service boundary
  • ai/services/knowledge-base/parser/backup-record-v1.schema.json created formalizing existing shape
  • Path-identity tuple {tenantId, repoSlug, rootKind, sourcePath} in chunk metadata
  • Tombstone / manifest / revision-boundary contract spec'd in parsed-chunk-v1 companion schema
  • VectorService.embed write-side server-stamping ({tenantId, visibility, originAgentIdentity?}) from authenticated AgentIdentity
  • VectorService.embed rejects or server-OVERWRITES client-supplied tenant fields (mode configurable; default: overwrite + warning log)
  • Tenant-aware chunkId hash derivation (hash includes tenantId + repoSlug)
  • QueryService.queryDocuments injects tenant/visibility where clause from authenticated AgentIdentity
  • SearchService hydration tenant-aware (Phase 2 must choose the Q12 hydration mode before implementing the retrieval flow)
  • Byte-equivalence fixture passes: current Neo source output stable before/after registry extraction
  • Fail-closed test suite (the 6 cases listed under section 8 above) passes
  • Unit tests under test/playwright/unit/ai/knowledge-base/
  • Integration tests with synthetic external-workspace fixtures (at minimum: mini-neo-workspace/, mini-custom-source/); ES5 + C++ fixtures can defer to Phase 2 if they require client-side parser-runner infrastructure

Out of Scope

  • KnowledgeBaseIngestionService singleton implementation → Phase 2
  • MCP tool ingestSourceFiles → Phase 2
  • Bulk facade (CLI/HTTP/streaming) → Phase 2
  • Q12 hydration mode choice (chunk-metadata-embedded vs server-mirror vs hybrid) → Phase 2 sub-ticket
  • ES5 + C++ workspace fixtures requiring client-side parser-runner → Phase 2 (depends on push pipeline)
  • Cloud deployment guide → Phase 3
  • Runtime tenant-registered server-side parser code (operator-installed/Neo-shipped/signed-package only; future WASM/tree-sitter sandboxing is a separate Discussion)

Avoided Traps

Trap | Why rejected
Implementing Phase 0/1 + Phase 2 in same PR | Contract-risk: endpoint-shape can drift before contracts stabilize. Substrate-correct ordering: contracts first.
Skipping byte-equivalence fixture | Adding tenantId field to chunk-hash inputs without verification = silent retrieval-quality regression for existing content
Application-layer retrieval filter as final shape | Vulnerable to bug-bypass (forgotten filter call = data leak). V1 application-layer is fine; Phase 2 reconsiders Chroma-layer hardening per Q13b lean Option C (hybrid).
Skipping fail-closed test suite | Server-stamping + spoof-rejection is a load-bearing SECURITY invariant; untested = unverified

Related

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'parsed-chunk-v1 schema KB ingestion contract'})
  • query_raw_memories({query: 'memorySharing KB port write-side stamping read-side filter'})
  • ask_knowledge_base({query: 'KB source parser registry default inheritance', type: 'src'})
  • Discussion #11623 §7 Phase 0/1 + §8 test substrate + §10 Graduation Criteria #12+#13 are the architectural source-of-authority
  • Begin with byte-equivalence fixture authoring + parsed-chunk-v1.schema.json draft — these have lowest-implementation-risk + serve as the substrate floor everything else builds on