Context
Graduated from Discussion #11623 on 2026-05-19 after 4-cycle cross-family convergence (3/3 explicit [GRADUATION_APPROVED] signals — see Signal Ledger). For v13 cloud deployments of Agent OS, external client workspaces must be able to ingest their own code into the Knowledge Base while continuing to leverage Neo's curated content (guides, ADRs, skills, tickets, demo apps). The current KB substrate hardcodes the assumption "source repo = host repo of the KB."
This Epic is a sibling of Epic #9999, not a sub-epic of it. #9999 owns READ-side multi-tenancy (memorySharing enum, AgentIdentity provenance, RLS) + identity-on-write. This Epic owns the WRITE-side cross-repo INGESTION substrate — the missing pillar for v13 cloud deployments.
Topology Anchor — Unified Chroma (ADR 0003)
Per ADR 0003 — Chroma Topology Unified Only (Accepted 2026-05-09, Resolves #11011, Unblocks v13 Substrate Migration):
- One ChromaDB daemon (KB substrate manages lifecycle; MC defers as downstream client via engines.chroma coordinates)
- Two MCP servers — KB MCP server + Memory Core MCP server — connect to the same Chroma instance
- Three collections within the same Chroma instance:
  - knowledge-base — owned by KB MCP server
  - neo-agent-memory — owned by MC MCP server
  - neo-agent-sessions — owned by MC MCP server
- chromaUnified flag is permanently REMOVED
- ChromaLifecycleService is observability-passthrough only (no daemon spawning)
- HealthService statically reports 'unified'
This Epic's tenant-isolation work (Phase 0/1C/D) adds tenant-scoping metadata fields + filter logic to the knowledge-base collection ONLY. It does NOT:
- introduce a new Chroma instance
- alter the unified topology
- move Memory Core storage into KB or vice-versa
- share collections cross-server (knowledge-base stays KB-exclusive; neo-agent-memory/neo-agent-sessions stay MC-exclusive)
The memorySharing enum (legacy | private | team) from #10010 is reused as a POLICY pattern; KB's INFRASTRUCTURE (metadata fields + where-clause filter on the knowledge-base collection) is new work.
The Problem
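The configurability gap is mechanically located. Substrate audit at dev=6f513ac24 confirms:
- Hardcoded source-class array orchestration at ai/services/knowledge-base/DatabaseService.mjs:460-471
- Hardcoded per-source paths (ApiSource.sourceMap maps src/apps/examples/docs/app/ai, Neo-specific)
- Single aiConfig.neoRootDir (one source repo per KB instance) propagated through ApiSource.mjs:101-105 (neoRootDir-relative chunk metadata) and SearchService.mjs:118-120 (single-root path resolution)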
External workspaces in this epic are Neo workspaces created via npx neo app (where neo is a node_module dependency rather than the repo) AND non-Neo repositories with ES5 / C++ / other-language code that needs to be discoverable by client agents through the same MCP server.
Substrate-grounded blockers caught during Discussion (cross-family peer review compounds empirical anchor):
- GPT Cycle 1: importDatabase ≠ ingest transport (RESTORE-only, skips re-embedding); single-root path determinism breaks cross-tenant; content-hash delta only deletes under full-corpus sync; MCP work-volume gate refuses bulk; Phase 1 carried implicit-contract-risk
- Gemini Cycle 1: memorySharing enum is Memory-Core-only today (0 KB hits via grep); cross-substrate pattern reuse ≠ infrastructure reuse
- GPT Cycle 2: write-side server-stamping invariant + spoof-rejection security boundary; QueryService.mjs:116-128 builds Chroma where only from type, no tenant predicate
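The GPT Cycle 2 finding above (a where built from type only) suggests what a tenant-aware filter could look like. The following is a minimal sketch, not the actual QueryService code: buildWhere, identity.tenantId and identity.teamNamespace are hypothetical names, and only the Chroma operators $and, $eq and $in are taken as given.

```javascript
// Hypothetical helper: illustrates adding a tenant predicate to a
// type-only Chroma `where`. Names and shapes are assumptions.
function buildWhere({type, identity} = {}) {
    // Fail closed: without an authenticated identity there is no safe scope.
    if (!identity || !identity.tenantId) {
        throw new Error('Missing authenticated AgentIdentity; refusing to query');
    }

    // Requester's own tenant plus the shared team namespace (if any).
    const tenantClause = {
        tenantId: {$in: [identity.tenantId, identity.teamNamespace].filter(Boolean)}
    };

    // Keep the existing type filter, but never let it replace the tenant predicate.
    if (type) {
        return {$and: [{type: {$eq: type}}, tenantClause]};
    }

    return tenantClause;
}

const where = buildWhere({
    type    : 'ticket',
    identity: {tenantId: 'acme', teamNamespace: 'neo-team'}
});

console.log(JSON.stringify(where));
```

The identity argument is meant to come from the server-side authenticated context, never from the request payload, so a client cannot widen its own scope.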
The Architectural Reality
Existing substrate touched (consumer sweep from Discussion §6, expanded per peer cycles):
| Substrate | Role | Epic-side change |
|---|---|---|
| DatabaseService.createKnowledgeBase() | Hardcoded source-class array orchestration | Replace with data-driven source registry |
| source/Base.mjs | Abstract extract(writeStream, createHashFn) contract | Preserved; substrate-correct extension point already clean |
| source/*.mjs (10 concrete sources) | Per-source paths + parser binding hardcoded | Externalize paths to config; data-driven registration |
| parser/SourceParser.mjs, DocumentationParser.mjs, TestParser.mjs | Acorn / markdown / test parsers | Default parsers shipped; client-side parsers emit parsed-chunk-v1 |
| DatabaseService.manageDatabaseBackup({action: 'import'}) | RESTORE-only JSONL {id, embedding, metadata, document} | Reframed as backup-record-v1 precedent; distinct from ingest contract |
| VectorService.mjs:188-274 | Content-hash delta + embedding + #10572 MCP volume gate | New write-side tenantId stamping; existing delete-logic preserved for full-corpus path |
| QueryService.mjs:116-128 | Builds Chroma where from type only | Inject tenant/visibility where from authenticated AgentIdentity context |
| SearchService.mjs:118-120 | Single-neoRootDir path resolution for file hydration | Tenant-aware hydration via Q12 options (chunk-metadata-embedded vs server-mirror vs hybrid) |
| memorySharing enum (#10010) | Memory-Core-only today; KB has 0 references | PATTERN REUSED, INFRASTRUCTURE NEW — chunk metadata tenantId injection + retrieval-time filter |
| AgentIdentity (#9999, #10011) | Memory Core graph + RLS | PATTERN REUSED, INFRASTRUCTURE NEW — server-stamped chunk provenance |
| #10572 MCP work-volume gate | Refuses viaMcp syncs > mcpSyncMaxChunks | LOAD-BEARING — forces bulk facade structurally |
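To make the first rows of the table concrete, here is one possible shape for a data-driven source registry. It is a sketch under stated assumptions: StaticSource, sourceConfig, sourceTypes, createSources and runSources are hypothetical names, the config shape is invented for illustration, and only the extract(writeStream, createHashFn) signature comes from the table. What extract returns and what createHashFn receives are also assumptions.

```javascript
// Stand-in for a concrete source. Real sources extend source/Base.mjs;
// only the extract(writeStream, createHashFn) signature is taken from it.
class StaticSource {
    constructor({name, paths}) {
        this.name  = name;
        this.paths = paths;
    }

    // Assumption: extract writes one JSONL record per chunk and returns a count.
    async extract(writeStream, createHashFn) {
        let count = 0;

        for (const path of this.paths) {
            const chunk = {source: this.name, path};
            chunk.hash = createHashFn(chunk);
            writeStream.write(JSON.stringify(chunk) + '\n');
            count++;
        }

        return count;
    }
}

// Hypothetical registry: maps a config key to a source class.
const sourceTypes = {api: StaticSource, tickets: StaticSource};

// Hypothetical config replacing the hardcoded source-class array.
const sourceConfig = [
    {type: 'api',     enabled: true,  paths: ['src', 'apps']},
    {type: 'tickets', enabled: false, paths: ['tickets']}
];

// Instantiate only the enabled sources; unknown types fail loudly.
function createSources(config, types) {
    return config.filter(entry => entry.enabled).map(entry => {
        const SourceClass = types[entry.type];

        if (!SourceClass) {
            throw new Error(`Unknown source type: ${entry.type}`);
        }

        return new SourceClass({name: entry.type, paths: entry.paths});
    });
}

// Run every registered source against the same stream and hash function.
async function runSources(sources, writeStream, createHashFn) {
    let total = 0;

    for (const source of sources) {
        total += await source.extract(writeStream, createHashFn);
    }

    return total;
}
```

With this shape, an external workspace would add or disable sources through configuration instead of editing the orchestration code.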
The Fix — Five-Phase Decomposition (updated 2026-05-19 per operator-directed sub-decomposition; Phase 5 added when integration/unit test parity gap surfaced)
Phase 0/1 Epic #11625 — Ingestion Contract + Registry Extraction + KB Tenant Isolation (contracts before implementation; memorySharing pattern applied to the knowledge-base Chroma collection per ADR 0003 unified topology)
- #11629 Phase 0/1A — Ingestion contracts (parsed-chunk-v1 + backup-record-v1 schemas + path-identity tuple + tombstone spec)
- #11630 Phase 0/1B — Source/Parser registry extraction + byte-equivalence fixture
- #11631 Phase 0/1C — KB Tenant Isolation write-side (VectorService server-stamping + tenant-aware chunkId + spoof-rejection on the knowledge-base collection)
- #11632 Phase 0/1D — KB Tenant Isolation read-side (QueryService/SearchService where-filter on knowledge-base collection) + fail-closed test suite
Standalone win: enables same-server custom workspaces without network substrate.
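The write-side work in Phase 0/1C (server-stamping, tenant-aware chunkId, spoof-rejection) can be illustrated with a small sketch. Everything here is hypothetical: stampChunk, the onSpoof option, identity.defaultVisibility and identity.agentId are invented for illustration, the hash function is injected rather than chosen, and defaulting visibility to 'private' is an assumption.

```javascript
// Fields that only the server may set.
const TENANT_FIELDS = ['tenantId', 'visibility', 'originAgentIdentity'];

// Hypothetical helper: stamps tenant metadata from the authenticated identity
// and derives a chunk id that includes the tenant.
function stampChunk(chunk, identity, hashFn, {onSpoof = 'reject'} = {}) {
    if (!identity || !identity.tenantId) {
        throw new Error('Unauthenticated write refused');
    }

    const metadata = {...(chunk.metadata || {})};
    const spoofed  = TENANT_FIELDS.filter(field => field in metadata);

    // Spoof handling: either reject the record or drop the client values.
    if (spoofed.length > 0 && onSpoof === 'reject') {
        throw new Error(`Client-supplied tenant fields rejected: ${spoofed.join(', ')}`);
    }
    for (const field of spoofed) {
        delete metadata[field];
    }

    // Server-derived values always win. The 'private' default is an assumption.
    metadata.tenantId   = identity.tenantId;
    metadata.visibility = identity.defaultVisibility || 'private';
    if (identity.agentId) {
        metadata.originAgentIdentity = identity.agentId;
    }

    // Tenant-aware id: the same path and content under two tenants must not collide.
    const id = hashFn([identity.tenantId, chunk.sourcePath, chunk.content].join('\u0000'));

    return {...chunk, id, metadata};
}
```

Passing the same sourcePath and content with two different identities yields two different ids, which is the property the isolation tests need.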
Phase 2 Epic #11626 — Ingestion Service + MCP Small-Batch Facade + Bulk Facade (blocked-by Phase 0/1)
- #11633 Phase 2A — KnowledgeBaseIngestionService core (orchestrator + validation)
- #11634 Phase 2B — MCP facade ingestSourceFiles + #10572 gate threading
- #11635 Phase 2C — Bulk facade (CLI ai:ingest-tenant + streaming)
- #11636 Phase 2D — Q12 search-hydration mode resolution + SearchService implementation
- #11637 Phase 2E — Q5 tenant config storage (Native Edge Graph extension)
- #11638 Phase 2F — Test fixture infrastructure (synthetic external workspaces × 4) + E2E multi-tenant suite
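The split between the MCP small-batch facade (Phase 2B) and the bulk facade (Phase 2C) follows from the #10572 gate. A minimal sketch of that decision, assuming a hypothetical checkMcpBatch helper and a plain options object (only the viaMcp and mcpSyncMaxChunks names come from this Epic):

```javascript
// Hypothetical gate check: MCP-originated batches above the limit are refused
// and pointed at the bulk facade. Non-MCP callers are not limited here.
function checkMcpBatch(chunks, {viaMcp, mcpSyncMaxChunks}) {
    if (viaMcp && chunks.length > mcpSyncMaxChunks) {
        return {
            accepted: false,
            reason  : `Batch of ${chunks.length} chunks exceeds mcpSyncMaxChunks (${mcpSyncMaxChunks}); use the bulk facade`
        };
    }

    return {accepted: true};
}

const batch = new Array(5).fill({content: 'example'});

console.log(checkMcpBatch(batch, {viaMcp: true,  mcpSyncMaxChunks: 3}).accepted); // false
console.log(checkMcpBatch(batch, {viaMcp: false, mcpSyncMaxChunks: 3}).accepted); // true
```

Returning a reason instead of throwing is a design choice for this sketch; it lets an MCP tool surface an actionable message to the calling agent.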
Phase 3 sub-ticket #11627 — Cloud Deployment Guide + Worked Examples (blocked-by Phase 2)
Single sub-ticket (same-layer doc work; PR-level decomposition sufficient). New guide tree learn/agentos/cloud-deployment/ (Overview, Configuration, CustomSources, CustomParsers, HookWiring, Security, MigrationPath) + minimal worked example external workspace.
Phase 4 Epic #11628 — Operations + Observability for Cloud-Native Deployments (NEW per operator post-graduation review; blocked-by Phase 2)
- #11639 Phase 4A — Per-tenant ingestion observability daemon (extends KBRecorderService)
- #11640 Phase 4B — Manifest reconciliation daemon
- #11641 Phase 4C — Stale-chunk garbage collection daemon
- #11642 Phase 4D — Operator alerting surface (telemetry thresholds → A2A + external notification)
Phase 4 was not in Discussion #11623 §7 original decomposition; surfaced 2026-05-19 during operator-directed post-graduation review (rationale: cloud deployments need operability substrate; KBRecorderService is daemon-adjacent and extensible).
Phase 5 Epic #11643 — Integration + Unit Test Parity with Memory Core (NEW per operator post-graduation review; Phase 5A independent / 5B blocked-by Phase 0/1D / 5C cross-cutting)
- #11644 Phase 5A — KB MCP Server Integration Test Parity (pre-Phase-0/1 shippable; parallels MC's RemoteMcpTransport / AuthRejection / OidcAuth / BackupRestoreWipe / HeartbeatPropagation specs)
- #11645 Phase 5B — KB Tenant Isolation Integration Tests (blocked-by Phase 0/1D #11632; parallels MC's TeamPrivateRetrieval / CrossTenantIsolation specs)
- #11646 Phase 5C — KB Unit Test Coverage Expansion (cross-cutting; close 32-vs-12 parity gap; target ≥25 specs ≈ 80% of MC's 32)
Phase 5 was not in Discussion #11623 §7 original; surfaced 2026-05-19 during operator's test-substrate review. Empirical parity gap V-B-A'd: MC has 32 unit specs + 8 integration specs; KB has 12 unit specs + 1 integration spec (healthcheck.spec.mjs per #10805 Lane A — dual-server coverage). Operator framing: "this epic needs new tests here. and /unit-tests too."
Acceptance Criteria
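Cross-phase ACs:
- learn/agentos/cloud-deployment/ guide tree exists at Phase 3 closeout; cross-links to #9999, #10010, #10011, #10030
Phase 0/1 ACs (filed as sub-ticket):
- parsed-chunk-v1 JSON Schema at ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json
- backup-record-v1 JSON Schema at ai/services/knowledge-base/parser/backup-record-v1.schema.json (formalizing existing importDatabase shape)
- Path-identity tuple {tenantId, repoSlug, rootKind, sourcePath} spec
- useDefaultSources / useDefaultParsers boolean configs in aiConfig
- Per-source paths externalized (ApiSource.sourceMap etc. → config)
- VectorService.embed upsert path injects server-derived {tenantId, visibility, originAgentIdentity?} from authenticated AgentIdentity context; rejects or server-overwrites client-supplied tenant fields; tenant-aware chunkId hash derivation
- QueryService.queryDocuments + SearchService inject where: {tenantId: {$in: [<requester>, '<team-namespace>']}} from server-side authenticated AgentIdentity, NOT client payload
- […] (tenantId field doesn't perturb chunk-hash semantics for existing content)
- Fail-closed test suite: client-supplied tenantId/visibility rejected/overwritten; tenant A cannot retrieve tenant B private chunks; Neo team chunks cross-tenant-visible; same sourcePath under two tenants isolated; spoof-rejection tested through every public KB query facade
Phase 2 ACs (deferred, filed as sub-ticket):
- KnowledgeBaseIngestionService singleton behind shared service layer
- MCP facade ingestSourceFiles accepts batches, gated by aiConfig.mcpSyncMaxChunks (#10572)
- Bulk facade (CLI npm run ai:ingest-tenant <tenantId> + HTTP/streaming)
- parsed-chunk-v1 validation (rejects records with embedding field outside restore mode)
Phase 3 ACs (deferred, filed as sub-ticket):
- learn/agentos/cloud-deployment/ guide tree (Overview, Configuration, CustomSources, CustomParsers, HookWiring, Security, MigrationPath)
- […] pre-push hook in shell demonstrating ingestSourceFiles contract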
Out of Scope
- Client-side embeddings (operator invariant; KB still owns embeddings)
- Pull-model (KB clones tenant repos) — inverts #9999 push-substrate identity direction
- Per-tenant Chroma storage split (rejected by #9999 Avoided Traps; re-audit only if cross-repo INGEST load empirically forces)
- Macro DB / full-file vectors (rejected by #10030)
- Server-side runtime parser execution from tenant-supplied code (untrusted-code risk; eliminated by client-side fallback for non-default parsers)
- Future WASM/tree-sitter sandboxing lane for server-side custom parsers — separate Discussion when needed
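The first out-of-scope item (KB still owns embeddings) implies that ingest records should not carry an embedding, unlike the restore shape {id, embedding, metadata, document}. One way to enforce that at the validation boundary is sketched below. validateIngestRecord, the mode option and the required-field list are hypothetical; the actual parsed-chunk-v1 schema may define different fields.

```javascript
// Hypothetical required fields for an ingest record.
// The real parsed-chunk-v1 schema may differ.
const REQUIRED_INGEST_FIELDS = ['sourcePath', 'content'];

// Validates a record for the ingest path. Restore mode is exempt because
// backup records legitimately carry precomputed embeddings.
function validateIngestRecord(record, {mode = 'ingest'} = {}) {
    const errors = [];

    if (mode !== 'restore') {
        if ('embedding' in record) {
            errors.push('embedding is not accepted on the ingest path; the server embeds');
        }

        for (const field of REQUIRED_INGEST_FIELDS) {
            if (typeof record[field] !== 'string' || record[field] === '') {
                errors.push(`missing or empty field: ${field}`);
            }
        }
    }

    return {valid: errors.length === 0, errors};
}
```

Keeping the restore exemption explicit makes the two contracts visibly distinct rather than relying on callers to pick the right code path.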
Avoided Traps / Gold Standards Rejected
| Trap | Why rejected |
|---|---|
| Conflating backup-record-v1 (restore) with ingest contract | importDatabase preserves embeddings, skips TextEmbeddingService.embedTexts() — distinct from server-embeds ingest path. Two distinct contracts. |
| Content-hash delta as sole deletion mechanism | Only works under full-corpus sync (VectorService.mjs:198-207); incremental push MUST use tombstone/manifest/revision-boundary |
| Pure-MCP transport for bulk imports | #10572 work-volume gate refuses syncs > mcpSyncMaxChunks via MCP; bulk facade is structurally necessary |
| Implicit single-neoRootDir source-path semantics | SearchService.mjs:118-120 resolves against single root; cross-tenant needs explicit identity tuple |
| Assumed cross-substrate reuse of memorySharing enum | Memory-Core-only today (0 KB references); pattern reuse ≠ infrastructure reuse. Validate cross-substrate assumptions via grep, not memory of design-intent. |
| Client-supplied tenant/visibility trust | Server stamps from authenticated AgentIdentity; clients may not spoof. Load-bearing security invariant. |
| Conflating memorySharing enum (Memory Core policy pattern) with shared-DB topology (already unified per ADR 0003 since 2026-05-09) | KB + MC are SEPARATE MCP servers sharing ONE Chroma DB but maintaining SEPARATE collections. Phase 0/1 adds tenant-scoping metadata to KB's knowledge-base collection only; it does NOT alter unified topology or move storage between servers. Terminology: "KB Tenant Isolation (memorySharing pattern reuse)" — NOT "memorySharing KB Port" which suggests storage relocation. |
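The content-hash delta trap above is easy to see in a toy model. Under a full-corpus sync, anything stored but not resent is stale; under an incremental push, absence only means "not sent", so deletions have to be signalled explicitly. The function names and the push shape (upserts, tombstones) are hypothetical and stand in for whatever the deletion-signaling contract ends up specifying.

```javascript
// Full-corpus sync: every stored id missing from the incoming set is stale.
function fullCorpusDeletes(storedIds, incomingIds) {
    const incoming = new Set(incomingIds);
    return storedIds.filter(id => !incoming.has(id));
}

// Incremental push: only explicitly tombstoned ids that exist are deleted.
function incrementalDeletes(storedIds, push) {
    const stored = new Set(storedIds);
    return (push.tombstones || []).filter(id => stored.has(id));
}

const storedIds = ['a', 'b', 'c'];

// A client pushes one changed chunk and reports one removed file.
const push = {upserts: ['a'], tombstones: ['c']};

// Applying full-corpus logic to a partial push would wrongly delete 'b'.
console.log(fullCorpusDeletes(storedIds, push.upserts)); // [ 'b', 'c' ]

// Tombstones delete only what the client declared removed.
console.log(incrementalDeletes(storedIds, push));        // [ 'c' ]
```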
Discussion Criteria Mapping
Per ideation-sandbox-workflow.md §6.6, mapping Discussion #11623 §10 Graduation Criteria → Epic ACs:
| Discussion #11623 §10 criterion | Epic AC |
|---|---|
| 1. §5.1 Double Diamond Matrix in body | Discussion body archaeological source; not Epic AC |
| 2. §5.2 Step 2.5 cross-substrate sweep | Discussion #11623 GPT/Gemini STEP_BACK comments; Epic ACs reflect sweep findings |
| 3. §6 Consensus Mandate 3× APPROVED | Achieved (see Signal Ledger below) |
| 4. Q1 Parser-Locality (2-axis) | Phase 0/1 ACs: parser-locality dispatch via tenant config + parser-protocol contract |
| 5. Q3 Push-endpoint protocol | Phase 2 ACs: MCP small-batch + bulk facade |
| 6. Q4 Parser-protocol contract | Phase 0/1 AC: parsed-chunk-v1.schema.json |
| 7. Q11 Tombstone/manifest semantics | Phase 0/1 AC: deletion-signaling contract |
| 8. Q12 Search hydration | Phase 2 AC: hydration mode chosen before retrieval flow |
| 9. Test substrate scope | Cross-phase + Phase 0/1 fail-closed test suite ACs |
| 10. Guide deliverable | Phase 3 ACs |
| 11. Sub-ticket boundaries | This Epic + 3 phase sub-tickets |
| 12. Q13a Write-side stamping invariant | Phase 0/1 AC: server-stamped {tenantId, visibility, originAgentIdentity?} + spoof-rejection |
| 13. Q13b Read-side enforcement layer | Phase 0/1 AC: QueryService + SearchService where filter; enforcement-layer hybrid lean (V1 application-layer → V2 Chroma-layer if leak class manifests) |
Signal Ledger
Unresolved Dissent
(empty — positive signal; 3/3 APPROVED, no DEFERRED at convergence)
Unresolved Liveness
(empty — positive signal; all 3 cross-family signals explicit)
Related
- Discussion #11623 — origin Discussion (archaeological source post-graduation)
- #9999 — Cloud-Native Knowledge & Multi-Tenant Memory Core Epic (sibling; this Epic = write-side complement)
- #10010 — Memory Core team/private retrieval policy (memorySharing enum source; pattern ported to KB)
- #10011 — Native Edge Graph RLS (read-side enforcement; KB equivalent in this Epic)
- #10016 — Multi-Tenant Identity & Data Privacy parent (upstream identity substrate)
- #10030 — Concept Ontology (potential downstream; cross-repo ingestion may affect concept extraction)
- #10097 — distributed Chroma zips (related; portability precedent)
- #10572 — MCP work-volume gate (LOAD-BEARING; forces bulk facade)
- #10129 — atomic backup bundle (backup-record-v1 precedent)
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
- query_raw_memories({query: 'cloud-native KB ingestion external workspace parser locality'})
- query_raw_memories({query: 'KB tenant isolation memorySharing pattern unified Chroma'})
- query_summaries({query: '#11623 cross-family graduation'})
- ask_knowledge_base({query: 'KB ingestion source parser registry tenant', type: 'ticket'})
- Discussion #11623 + Phase 0/1 sub-ticket are the entry points for resuming this workstream
- Cross-family peer-cycle empirical anchors: GPT Cycle 1 DC_kwDODSospM4BAwRM + Gemini Cycle 1 DC_kwDODSospM4BAwRW + GPT Cycle 2 DC_kwDODSospM4BAwS0