Context
Phase 0/1 Epic (parent: meta-Epic #11624, Cloud-Native KB Ingestion for External Workspaces). Graduated from Discussion #11623. This phase ships the contracts BEFORE the Phase 2 ingestion service implementation: parsed-chunk-v1, backup-record-v1, the path-identity tuple, the tombstone/manifest/revision-boundary contract, the source/parser registry, and KB tenant isolation (the memorySharing pattern applied to the knowledge-base Chroma collection). Contracts-first is the substrate-correct shape per cross-family peer convergence.
Topology anchor: Per ADR 0003 — Chroma Topology Unified Only, KB + MC are SEPARATE MCP servers sharing ONE Chroma daemon but maintaining SEPARATE collections (knowledge-base, neo-agent-memory, neo-agent-sessions). This Phase adds tenant-scoping metadata + filter logic to the knowledge-base collection ONLY — no topology mutation, no storage relocation, no collection sharing.
Standalone win: enables same-server custom workspaces (useDefaultSources / useDefaultParsers configurability + custom source/parser registration) without network substrate. Sets the stable contract floor that Phase 2 facades can build on.
The Problem
The current KB substrate hardcodes a single-repo assumption at multiple layers (Epic #11624 "Architectural Reality" section). This phase removes the structural blockers in the substrate FIRST, defining stable contracts before any new transport endpoint exists:
- Hardcoded source-class array at DatabaseService.mjs:460-471
- Per-Source hardcoded paths (e.g., ApiSource.sourceMap maps Neo-specific paths)
- memorySharing enum is Memory-Core-only today (0 KB references; verified via grep). Pattern reused, infrastructure new.
- Path-determinism baked into chunk emit + hydration (ApiSource.mjs:101-105 + SearchService.mjs:118-120)
- Content-hash delta only deletes under full-corpus sync (VectorService.mjs:198-207); incremental push needs explicit deletion-signaling
- importDatabase is conflated with ingest but is actually RESTORE-only (skips re-embedding); it is a distinct contract from ingest
The Architectural Reality
This phase touches:
| File | Change |
| --- | --- |
| DatabaseService.mjs | Replace hardcoded source array with data-driven registry; thread useDefaultSources / useDefaultParsers config |
| source/Base.mjs | Abstract extract(writeStream, createHashFn) contract PRESERVED (already clean) |
| source/ApiSource.mjs + 9 sibling sources | Externalize hardcoded paths to config; data-driven registration |
| VectorService.mjs | Write-side: inject server-derived {tenantId, visibility, originAgentIdentity?} at embed upsert; tenant-aware chunkId hash derivation; reject/overwrite client-supplied tenant fields |
| QueryService.mjs | Read-side: inject where: {tenantId: {$in: [<requester>, '<team-namespace>']}} from authenticated AgentIdentity into every collection.query() call |
| SearchService.mjs | Tenant-aware hydration per Q12 (chunk-metadata-embedded vs server-mirror; choice deferred to Phase 2, Phase 0/1 only marks the boundary) |
| NEW ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json | JSON Schema for client-side parser output |
| NEW ai/services/knowledge-base/parser/backup-record-v1.schema.json | JSON Schema formalizing existing importDatabase {id, embedding, metadata, document} shape |
The Fix
1. Source/Parser Registry Extraction
Replace the hardcoded sources array in DatabaseService.createKnowledgeBase() with a data-driven registry driven by aiConfig.knowledgeBase.sources (or similar). With useDefaultSources: true, the default value preserves the current 10-source set; useDefaultParsers: true similarly preserves the current parser binding. Custom sources/parsers register via an explicit API.
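A minimal sketch of how the default-inheritance flag might resolve into an effective source list. The names resolveSources and DEFAULT_SOURCES, the placeholder source names, and the exact config shape are assumptions for illustration; the text above only fixes useDefaultSources and the approximate config location.

```javascript
// Placeholder names standing in for the current built-in set (10 sources in the real code).
const DEFAULT_SOURCES = ['ApiSource', 'PlaceholderSourceB', 'PlaceholderSourceC'];

/**
 * Resolve the effective, de-duplicated source list from config.
 * `sources` and `useDefaultSources` are assumed config keys for this sketch.
 */
function resolveSources({useDefaultSources = true, sources = []} = {}) {
    const resolved = useDefaultSources ? [...DEFAULT_SOURCES] : [];

    for (const name of sources) {
        if (typeof name !== 'string' || name === '') {
            throw new TypeError('Source names must be non-empty strings');
        }
        if (!resolved.includes(name)) {
            resolved.push(name);
        }
    }

    return resolved;
}

// Zero-config deployment keeps the defaults; a custom workspace can opt out.
const neoDefault = resolveSources();
const customOnly = resolveSources({useDefaultSources: false, sources: ['MyWorkspaceSource']});
```

Defaulting useDefaultSources to true keeps a zero-config Neo deployment on the current set, while a custom workspace has to opt out explicitly.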
2. parsed-chunk-v1 JSON Schema
Define at ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json:
{
"$id": "neo:parsed-chunk-v1",
"schemaVersion": "1.0.0",
"tenantId": "string",
"repoSlug": "string",
"rootKind": "neo-workspace | bare-repo | ...",
"sourcePath": "string",
"content": "string",
"hashInputs": ["type","name","content","..."],
"parserId": "string",
"parserVersion": "semver-like",
"kind": "module-context | class-properties | class-config | method | doc-section | skill-section | ...",
"name": "string",
"line_start": "integer?",
"line_end": "integer?",
"className": "string?",
"extends": "string?",
"customMeta": "object?"
}
Server-side validator rejects records missing required identity fields OR carrying an embedding field (routed to restore-only path).
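The rejection rule can be sketched as a small boundary check. This is an illustration rather than the planned validator: the function name validateParsedChunk and the choice of which fields count as "required identity fields" are assumptions, since the text does not enumerate them.

```javascript
// Identity fields treated as required in this sketch (an assumption; the text
// only says "required identity fields" without listing them).
const REQUIRED_IDENTITY_FIELDS = ['tenantId', 'repoSlug', 'rootKind', 'sourcePath'];

/**
 * Validate one parsed-chunk-v1 record at the service boundary.
 * Returns {ok: true} or {ok: false, errors: [...]} instead of throwing,
 * so a caller can collect errors for a whole batch.
 */
function validateParsedChunk(record) {
    if (record === null || typeof record !== 'object' || Array.isArray(record)) {
        return {ok: false, errors: ['record must be an object']};
    }

    const errors = [];

    for (const field of REQUIRED_IDENTITY_FIELDS) {
        if (typeof record[field] !== 'string' || record[field] === '') {
            errors.push(`missing required identity field: ${field}`);
        }
    }

    // Pre-computed embeddings belong to the restore-only path (backup-record-v1).
    if ('embedding' in record) {
        errors.push('embedding not allowed on ingest; use the restore-only path');
    }

    return errors.length === 0 ? {ok: true} : {ok: false, errors};
}

const accepted = validateParsedChunk({
    tenantId: 't1', repoSlug: 'acme/app', rootKind: 'bare-repo', sourcePath: 'src/a.js', content: '...'
});
const rejected = validateParsedChunk({tenantId: 't1', sourcePath: 'src/a.js', embedding: [0.1]});
```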
3. backup-record-v1 JSON Schema
Define at ai/services/knowledge-base/parser/backup-record-v1.schema.json formalizing existing {id, embedding, metadata, document} shape. Distinct contract from ingest. Used only by manageDatabaseBackup({action: 'import'}) (restore-only path).
4. Path-Identity Tuple
Document {tenantId, repoSlug, rootKind, sourcePath} semantics in learn/agentos/cloud-deployment/ (the placeholder doc lands in Phase 3; Phase 0/1 lands inline JSDoc references). chunk.metadata.source becomes the structured tuple, NOT the bare neoRootDir-relative string.
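One way to picture the tuple is as a frozen four-field object with a fixed-order serialization. The helpers makePathIdentity and pathIdentityKey are hypothetical names, and the serialized key is only one option: if the collection metadata accepts scalar values only, the tuple would likely be stored as flat fields or as a key like this.

```javascript
const PATH_IDENTITY_FIELDS = ['tenantId', 'repoSlug', 'rootKind', 'sourcePath'];

/** Build a frozen path-identity tuple, rejecting absent or empty members. */
function makePathIdentity(input) {
    const tuple = {};
    for (const field of PATH_IDENTITY_FIELDS) {
        const value = input[field];
        if (typeof value !== 'string' || value === '') {
            throw new TypeError(`path identity requires a non-empty ${field}`);
        }
        tuple[field] = value;
    }
    return Object.freeze(tuple);
}

/** Serialize in a fixed field order so equal tuples always yield equal keys. */
function pathIdentityKey(tuple) {
    return JSON.stringify(PATH_IDENTITY_FIELDS.map(field => tuple[field]));
}

const identityA = makePathIdentity({
    tenantId: 'tenant-a', repoSlug: 'acme/app', rootKind: 'bare-repo', sourcePath: 'src/Main.mjs'
});
const identityB = makePathIdentity({
    tenantId: 'tenant-b', repoSlug: 'acme/app', rootKind: 'bare-repo', sourcePath: 'src/Main.mjs'
});
```

Two tenants pushing the same sourcePath get different keys because tenantId is part of the identity, not an annotation beside it.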
5. Tombstone / Manifest / Revision-Boundary Contract
Spec'd in same schema directory: three mutually-supporting deletion-signaling mechanisms:
- Explicit tombstones ({deleted: [paths]})
- Manifest snapshot ({manifestSnapshot: {pathsAfterPush}})
- Revision boundary ({baseRevision, headRevision})
Phase 0/1 spec only; Phase 2 wires into endpoint.
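As a rough illustration of how the three signals could combine when a push arrives, the sketch below resolves the set of stored paths to delete. The function name resolveDeletions and the semantics given to each signal (stale-base rejection, set difference against the snapshot) are assumptions; the spec above only names the payload shapes.

```javascript
/**
 * Work out which stored paths a push removes.
 * knownPaths: paths currently stored for this tenant/repo (array of strings)
 * currentRevision: revision the server last ingested (string or null)
 */
function resolveDeletions(push, knownPaths, currentRevision) {
    // Revision boundary: refuse a push built against a stale base (assumed semantics).
    if (push.baseRevision !== undefined && push.baseRevision !== currentRevision) {
        throw new Error(`stale baseRevision: expected ${currentRevision}`);
    }

    const toDelete = new Set();

    // Explicit tombstones: the client names the removed paths.
    for (const path of push.deleted ?? []) {
        toDelete.add(path);
    }

    // Manifest snapshot: anything stored but absent from the snapshot is gone.
    if (push.manifestSnapshot) {
        const after = new Set(push.manifestSnapshot.pathsAfterPush);
        for (const path of knownPaths) {
            if (!after.has(path)) {
                toDelete.add(path);
            }
        }
    }

    // Only paths that are actually stored can be deleted.
    return knownPaths.filter(path => toDelete.has(path));
}

const removedPaths = resolveDeletions(
    {baseRevision: 'r1', headRevision: 'r2', deleted: ['old.md'], manifestSnapshot: {pathsAfterPush: ['a.mjs']}},
    ['a.mjs', 'b.mjs', 'old.md'],
    'r1'
);
```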
6. memorySharing KB Port — Two Halves
Write-side (server-stamping invariant):
- VectorService.embed upsert injects {tenantId, visibility, originAgentIdentity?} from server-authenticated AgentIdentity context
- Ingestion path REJECTS or server-OVERWRITES client-supplied tenant/visibility fields (configurable: overwrite + warning log; REJECT escalation on spoof-rate telemetry threshold)
- Tenant-aware chunkId hash derivation: hash includes tenantId + repoSlug so same source content under two tenants yields distinct ids
- Neo's curated content tagged with shared tenantId constant (e.g., 'neo-shared'); per-tenant content tagged with <tenantId>
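The write-side rules above can be sketched as two small functions: one for overwrite-mode stamping, one for tenant-aware id derivation. Everything named here (stampMetadata, deriveChunkId, the identity shape with tenantId/visibility/agentId, the injected hashFn) is hypothetical. Leaving the shared tenant's hash input untouched is only one possible way to keep existing ids stable.

```javascript
const SHARED_TENANT_ID = 'neo-shared'; // the example constant from the text

/**
 * Overwrite mode: server-derived fields always win; spoof attempts are reported.
 * `identity` stands for the authenticated AgentIdentity context (shape assumed).
 */
function stampMetadata(clientMetadata, identity, warn = console.warn) {
    const serverFields = {tenantId: identity.tenantId, visibility: identity.visibility};
    if (identity.agentId !== undefined) {
        serverFields.originAgentIdentity = identity.agentId;
    }

    for (const [key, value] of Object.entries(serverFields)) {
        if (key in clientMetadata && clientMetadata[key] !== value) {
            warn(`client-supplied ${key} overwritten by server value`);
        }
    }

    // Spread order matters: server fields come last so they cannot be shadowed.
    return {...clientMetadata, ...serverFields};
}

/**
 * Tenant-aware id: tenantId + repoSlug are folded into the hash input.
 * Shared Neo content keeps the legacy input so existing ids do not change.
 */
function deriveChunkId({tenantId, repoSlug, hashInput}, hashFn) {
    if (tenantId === SHARED_TENANT_ID) {
        return hashFn(hashInput);
    }
    return hashFn(JSON.stringify([tenantId, repoSlug, hashInput]));
}

// Toy hash for illustration only; real code would pass the existing hash function.
const toyHash = text => `id:${text}`;
const idA = deriveChunkId({tenantId: 'tenant-a', repoSlug: 'acme/app', hashInput: 'same content'}, toyHash);
const idB = deriveChunkId({tenantId: 'tenant-b', repoSlug: 'acme/app', hashInput: 'same content'}, toyHash);
```

Passing the hash function in keeps the sketch dependency-free and leaves the choice of digest to the caller.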
Read-side (retrieval filter):
- QueryService.queryDocuments + SearchService inject where: {tenantId: {$in: [<requester>, '<team-namespace>']}} into every Chroma collection.query call
- Filter context derived from server-side authenticated AgentIdentity, NOT untrusted client payload
- Mirrors MemoryService.mjs:391-410 query-time policy filter pattern
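A sketch of the read-side filter construction, assuming the filter is built in one helper that every query path calls. The name buildTenantWhere, the identity shape, and the default team namespace value are assumptions. The $in and $and operators are standard Chroma where-filter operators.

```javascript
/**
 * Build the tenant filter from the authenticated identity and AND it with any
 * caller-supplied filter. A client-sent top-level tenantId condition is dropped.
 */
function buildTenantWhere(identity, clientWhere = {}, teamNamespace = 'neo-shared') {
    if (!identity || typeof identity.tenantId !== 'string' || identity.tenantId === '') {
        // Fail closed: no identity means no query, not an unfiltered query.
        throw new Error('authenticated identity required');
    }

    const tenantFilter = {tenantId: {$in: [identity.tenantId, teamNamespace]}};

    const {tenantId: _ignored, ...rest} = clientWhere;
    const extra = Object.entries(rest).map(([key, value]) => ({[key]: value}));

    // $and needs at least two operands, so return the bare filter when alone.
    return extra.length === 0 ? tenantFilter : {$and: [tenantFilter, ...extra]};
}

const scopedWhere = buildTenantWhere({tenantId: 'tenant-a'}, {tenantId: 'tenant-b', type: 'src'});
```

Because the tenant condition is AND-ed at the top level, any condition a client nests deeper can only narrow the result set, never widen it.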
7. Byte-Equivalence Fixture
Test fixture: current Neo source output (10 sources × current parsers × current paths) BEFORE registry extraction === output AFTER. Verifies migration doesn't regress retrieval quality + adding tenantId field doesn't perturb chunk-hash semantics for existing content (hash-derivation function backwards-compatible for Neo's tenantId).
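The comparison at the heart of such a fixture could look like the following, assuming the BEFORE and AFTER extract outputs are captured as arrays of serialized chunk lines. Both the capture format and the helper name firstDivergence are assumptions for illustration.

```javascript
/**
 * Compare two captured extract outputs line by line.
 * Returns null when they are identical, else the first differing index
 * (a length mismatch counts as a difference at the shorter length).
 */
function firstDivergence(beforeLines, afterLines) {
    const length = Math.max(beforeLines.length, afterLines.length);
    for (let i = 0; i < length; i++) {
        if (beforeLines[i] !== afterLines[i]) {
            return i;
        }
    }
    return null;
}

// Illustrative data only; a real fixture would load recorded output files.
const beforeExtraction = ['{"id":"a1","name":"Foo"}', '{"id":"b2","name":"bar"}'];
const afterExtraction = ['{"id":"a1","name":"Foo"}', '{"id":"b2","name":"bar"}'];
const divergence = firstDivergence(beforeExtraction, afterExtraction);
```

Reporting the first differing index, rather than a bare boolean, makes a failing fixture point at the chunk that changed.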
8. Fail-Closed Test Suite
Required tenant-isolation tests:
- Forged client tenantId rejected/overwritten through every public KB query facade
- Forged client visibility rejected/overwritten
- Tenant A cannot retrieve tenant B private chunks
- Neo team (shared) chunks visible across tenants
- Same sourcePath under two tenants → distinct chunk ids (no shadow attack)
- Chunk-schema-v1 validation: rejects records carrying embedding field outside restore mode (forces routing through manageDatabaseBackup)
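Acceptance Criteria
- useDefaultSources / useDefaultParsers boolean configs in aiConfig (default true for zero-config Neo deployments)
- Hardcoded source-class array at DatabaseService.mjs:460-471 replaced with data-driven registry
- Per-Source hardcoded paths (ApiSource.sourceMap etc.) externalized to config
- ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json created with validator at service boundary
- ai/services/knowledge-base/parser/backup-record-v1.schema.json created formalizing existing shape
- Path-identity tuple {tenantId, repoSlug, rootKind, sourcePath} in chunk metadata
- Tombstone / manifest / revision-boundary contract spec'd as parsed-chunk-v1 companion schema
- VectorService.embed write-side server-stamping ({tenantId, visibility, originAgentIdentity?}) from authenticated AgentIdentity
- VectorService.embed rejects or server-OVERWRITES client-supplied tenant fields (mode configurable; default: overwrite + warning log)
- Tenant-aware chunkId hash derivation (hash includes tenantId + repoSlug)
- QueryService.queryDocuments injects tenant/visibility where clause from authenticated AgentIdentity
- SearchService hydration tenant-aware (Phase 2 may not implement retrieval flow before choosing Q12 hydration mode)
- Fixtures (test/playwright/unit/ai/knowledge-base/mini-neo-workspace/, mini-custom-source/); ES5 + C++ fixtures can defer to Phase 2 if they require client-side parser-runner infrastructure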
Out of Scope
- KnowledgeBaseIngestionService singleton implementation → Phase 2
- MCP tool ingestSourceFiles → Phase 2
- Bulk facade (CLI/HTTP/streaming) → Phase 2
- Q12 hydration mode choice (chunk-metadata-embedded vs server-mirror vs hybrid) → Phase 2 sub-ticket
- ES5 + C++ workspace fixtures requiring client-side parser-runner → Phase 2 (depends on push pipeline)
- Cloud deployment guide → Phase 3
- Runtime tenant-registered server-side parser code (operator-installed/Neo-shipped/signed-package only; future WASM/tree-sitter sandboxing is a separate Discussion)
Avoided Traps
| Trap | Why rejected |
| --- | --- |
| Implementing Phase 0/1 + Phase 2 in same PR | Contract-risk: endpoint-shape can drift before contracts stabilize. Substrate-correct ordering: contracts first. |
| Skipping byte-equivalence fixture | Adding tenantId field to chunk-hash inputs without verification = silent retrieval-quality regression for existing content |
| Application-layer retrieval filter as final shape | Vulnerable to bug-bypass (forgotten filter call = data leak). V1 application-layer is fine; Phase 2 reconsiders Chroma-layer hardening per Q13b lean Option C (hybrid). |
| Skipping fail-closed test suite | Server-stamping + spoof-rejection is a load-bearing SECURITY invariant; untested = unverified |
Related
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
- query_raw_memories({query: 'parsed-chunk-v1 schema KB ingestion contract'})
- query_raw_memories({query: 'memorySharing KB port write-side stamping read-side filter'})
- ask_knowledge_base({query: 'KB source parser registry default inheritance', type: 'src'})
- Discussion #11623 §7 Phase 0/1 + §8 test substrate + §10 Graduation Criteria #12+#13 are the architectural source-of-authority
- Begin with byte-equivalence fixture authoring + parsed-chunk-v1.schema.json draft: these have lowest-implementation-risk + serve as the substrate floor everything else builds on