Frontmatter
id: 11624
title: Cloud-Native KB Ingestion for External Workspaces
state: Closed
labels: epic, ai, architecture
assignees: tobiu
createdAt: May 19, 2026, 1:32 PM
updatedAt: May 21, 2026, 3:06 PM
githubUrl: https://github.com/neomjs/neo/issues/11624
author: neo-opus-4-7
commentsCount: 3
parentIssue: null
subIssues:
  • 11625 KB Ingestion Phase 0/1: Contracts, Source/Parser Registry, KB Tenant Isolation
  • 11626 KB Ingestion Phase 2: Ingestion Service + MCP Small-Batch Facade + Bulk Facade
  • 11627 KB Ingestion Phase 3: Cloud Deployment Guide + Worked Examples
  • 11628 KB Ingestion Phase 4: Operations + Observability for Cloud-Native Deployments
  • 11643 KB Ingestion Phase 5: Integration + Unit Test Parity with Memory Core
subIssuesCompleted: 5
subIssuesTotal: 5
blockedBy: []
blocking: []
closedAt: May 21, 2026, 3:03 PM

Cloud-Native KB Ingestion for External Workspaces

neo-opus-4-7 commented on May 19, 2026, 1:32 PM

Context

Graduated from Discussion #11623 on 2026-05-19 after 4-cycle cross-family convergence (3/3 explicit [GRADUATION_APPROVED] signals — see Signal Ledger). For v13 cloud deployments of Agent OS, external client workspaces must be able to ingest their own code into the Knowledge Base while continuing to leverage Neo's curated content (guides, ADRs, skills, tickets, demo apps). The current KB substrate hardcodes the assumption "source repo = host repo of the KB."

Sibling of Epic #9999, not sub. #9999 owns READ-side multi-tenancy (memorySharing enum, AgentIdentity provenance, RLS) + identity-on-write. This Epic owns the WRITE-side cross-repo INGESTION substrate — the missing pillar for v13 cloud deployments.

Topology Anchor — Unified Chroma (ADR 0003)

Per ADR 0003 — Chroma Topology Unified Only (Accepted 2026-05-09, Resolves #11011, Unblocks v13 Substrate Migration):

  • One ChromaDB daemon (KB substrate manages lifecycle; MC defers as downstream client via engines.chroma coordinates)
  • Two MCP servers — KB MCP server + Memory Core MCP server — connect to the same Chroma instance
  • Three collections within the same Chroma instance:
    • knowledge-base — owned by KB MCP server
    • neo-agent-memory — owned by MC MCP server
    • neo-agent-sessions — owned by MC MCP server
  • chromaUnified flag is permanently REMOVED
  • ChromaLifecycleService is observability-passthrough only (no daemon spawning)
  • HealthService statically reports 'unified'

This Epic's tenant-isolation work (Phase 0/1C/D) adds tenant-scoping metadata fields + filter logic to the knowledge-base collection ONLY. It does NOT:

  • introduce a new Chroma instance
  • alter the unified topology
  • move Memory Core storage into KB or vice-versa
  • share collections cross-server (knowledge-base stays KB-exclusive; neo-agent-memory/neo-agent-sessions stay MC-exclusive)

The memorySharing enum (legacy | private | team) from #10010 is reused as a POLICY pattern; KB's INFRASTRUCTURE (metadata fields + where-clause filter on the knowledge-base collection) is new work.
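The retrieval-time filter described above can be sketched as follows. This is a minimal illustration, not the QueryService implementation: `buildTenantWhere`, `TEAM_NAMESPACE`, and the identity shape are hypothetical names, and the only grounded parts are the `tenantId` metadata field and the Chroma-style `$in` / `$and` operators.

```javascript
// Sketch only: buildTenantWhere, TEAM_NAMESPACE and the identity shape are
// hypothetical names, not taken from the Neo codebase.
const TEAM_NAMESPACE = 'neo-team'; // assumed placeholder for the shared team namespace

/**
 * Builds a Chroma-style `where` clause that always carries a tenant predicate.
 * @param {{tenantId: string}|null} identity Server-side authenticated identity
 * @param {string} [type] Optional chunk type filter (the only predicate used today)
 * @returns {Object}
 */
function buildTenantWhere(identity, type) {
    // Fail closed: without an authenticated tenant there is no query at all.
    if (!identity || typeof identity.tenantId !== 'string' || identity.tenantId === '') {
        throw new Error('Missing authenticated tenant context');
    }

    const tenantClause = {tenantId: {$in: [identity.tenantId, TEAM_NAMESPACE]}};

    // Multiple conditions are combined under $and.
    return type ? {$and: [tenantClause, {type: {$eq: type}}]} : tenantClause;
}

console.log(JSON.stringify(buildTenantWhere({tenantId: 'tenant-a'}, 'ticket')));
```

The tenant predicate is derived from the server-side identity only; nothing in the client payload can widen it.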

The Problem

The configurability gap is mechanically located. Substrate audit at dev=6f513ac24 confirms:

In this epic, "external workspaces" covers both Neo workspaces created via npx neo app (where neo is a node_modules dependency rather than the repo itself) and non-Neo repositories containing ES5, C++, or other-language code that client agents must be able to discover through the same MCP server.

Substrate-grounded blockers caught during Discussion (cross-family peer review compounds empirical anchor):

  • GPT Cycle 1: importDatabase ≠ ingest transport (RESTORE-only, skips re-embedding); single-root path determinism breaks cross-tenant; content-hash delta only deletes under full-corpus sync; MCP work-volume gate refuses bulk; Phase 1 carried implicit-contract-risk
  • Gemini Cycle 1: memorySharing enum is Memory-Core-only today (0 KB hits via grep); cross-substrate pattern reuse ≠ infrastructure reuse
  • GPT Cycle 2: write-side server-stamping invariant + spoof-rejection security boundary; QueryService.mjs:116-128 builds Chroma where only from type, no tenant predicate

The Architectural Reality

Existing substrate touched (consumer sweep from Discussion §6, expanded per peer cycles):

| Substrate | Role | Epic-side change |
| --- | --- | --- |
| DatabaseService.createKnowledgeBase() | Hardcoded source-class array orchestration | Replace with data-driven source registry |
| source/Base.mjs | Abstract extract(writeStream, createHashFn) contract | Preserved; substrate-correct extension point already clean |
| source/*.mjs (10 concrete sources) | Per-source paths + parser binding hardcoded | Externalize paths to config; data-driven registration |
| parser/SourceParser.mjs, DocumentationParser.mjs, TestParser.mjs | Acorn / markdown / test parsers | Default parsers shipped; client-side parsers emit parsed-chunk-v1 |
| DatabaseService.manageDatabaseBackup({action: 'import'}) | RESTORE-only JSONL {id, embedding, metadata, document} | Reframed as backup-record-v1 precedent; distinct from ingest contract |
| VectorService.mjs:188-274 | Content-hash delta + embedding + #10572 MCP volume gate | New write-side tenantId stamping; existing delete-logic preserved for full-corpus path |
| QueryService.mjs:116-128 | Builds Chroma where from type only | Inject tenant/visibility where from authenticated AgentIdentity context |
| SearchService.mjs:118-120 | Single-neoRootDir path resolution for file hydration | Tenant-aware hydration via Q12 options (chunk-metadata-embedded vs server-mirror vs hybrid) |
| memorySharing enum (#10010) | Memory-Core-only today; KB has 0 references | PATTERN REUSED, INFRASTRUCTURE NEW: chunk metadata tenantId injection + retrieval-time filter |
| AgentIdentity (#9999, #10011) | Memory Core graph + RLS | PATTERN REUSED, INFRASTRUCTURE NEW: server-stamped chunk provenance |
| #10572 MCP work-volume gate | Refuses viaMcp syncs > mcpSyncMaxChunks | LOAD-BEARING: forces bulk facade structurally |
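The move from a hardcoded source-class array to a data-driven registry could look roughly like the sketch below. `SourceRegistry`, `register`, `resolve`, and the `isDefault` option are hypothetical names chosen for illustration; the only grounded parts are the `extract()` contract from source/Base.mjs and the `useDefaultSources` boolean.

```javascript
// Hypothetical registry sketch; class, method and option names are illustrative only.
class SourceRegistry {
    constructor() {
        this.entries = new Map(); // name -> {source, isDefault}
    }

    /**
     * Registers a source. Default sources are the ones shipped with the KB,
     * custom sources are supplied by a workspace.
     */
    register(name, source, {isDefault = false} = {}) {
        if (!source || typeof source.extract !== 'function') {
            throw new TypeError(`Source "${name}" must implement extract()`);
        }

        if (this.entries.has(name)) {
            throw new Error(`Source "${name}" is already registered`);
        }

        this.entries.set(name, {source, isDefault});
        return this; // allow chaining
    }

    /** Returns the sources that should run for the given config. */
    resolve({useDefaultSources = true} = {}) {
        return [...this.entries.values()]
            .filter(entry => useDefaultSources || !entry.isDefault)
            .map(entry => entry.source);
    }
}

const registry = new SourceRegistry()
    .register('api', {extract: async () => {}}, {isDefault: true})
    .register('tenant-cpp', {extract: async () => {}});

console.log(registry.resolve({useDefaultSources: false}).length);
```

With this shape, orchestration iterates over whatever `resolve()` returns instead of a fixed class list, so a workspace can opt out of the shipped sources without code changes.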

The Fix — Five-Phase Decomposition (updated 2026-05-19 per operator-directed sub-decomposition; Phase 5 added when integration/unit test parity gap surfaced)

Phase 0/1 Epic #11625 — Ingestion Contract + Registry Extraction + KB Tenant Isolation (contracts before implementation; memorySharing pattern applied to the knowledge-base Chroma collection per ADR 0003 unified topology)

  • #11629 Phase 0/1A — Ingestion contracts (parsed-chunk-v1 + backup-record-v1 schemas + path-identity tuple + tombstone spec)
  • #11630 Phase 0/1B — Source/Parser registry extraction + byte-equivalence fixture
  • #11631 Phase 0/1C — KB Tenant Isolation write-side (VectorService server-stamping + tenant-aware chunkId + spoof-rejection on the knowledge-base collection)
  • #11632 Phase 0/1D — KB Tenant Isolation read-side (QueryService/SearchService where-filter on knowledge-base collection) + fail-closed test suite

Standalone win: enables same-server custom workspaces without network substrate.
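A minimal sketch of the path-identity tuple from Phase 0/1A, assuming all four fields are non-empty strings. The function name `pathIdentityKey`, the JSON-array encoding, and the example values (including the `rootKind` value) are assumptions for illustration; only the four field names come from this Epic.

```javascript
// Hypothetical helper: field names come from the path-identity tuple in this Epic,
// the function name and the JSON-array encoding are assumptions.
const TUPLE_FIELDS = ['tenantId', 'repoSlug', 'rootKind', 'sourcePath'];

function pathIdentityKey(tuple) {
    const values = TUPLE_FIELDS.map(field => {
        const value = tuple[field];

        if (typeof value !== 'string' || value.length === 0) {
            throw new TypeError(`Path identity requires a non-empty "${field}"`);
        }

        return value;
    });

    // A JSON array keeps field boundaries unambiguous, unlike plain concatenation.
    return JSON.stringify(values);
}

// Example values are placeholders, not real tenants or repositories.
const base = {repoSlug: 'acme/app', rootKind: 'workspace', sourcePath: 'src/Main.mjs'};
const keyA = pathIdentityKey({...base, tenantId: 'tenant-a'});
const keyB = pathIdentityKey({...base, tenantId: 'tenant-b'});

console.log(keyA !== keyB); // true: same sourcePath, different tenants
```

A tenant-aware chunkId could then be derived by hashing this key together with the chunk's content hash; the hash function itself is not specified here.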

Phase 2 Epic #11626 — Ingestion Service + MCP Small-Batch Facade + Bulk Facade (blocked-by Phase 0/1)

  • #11633 Phase 2A — KnowledgeBaseIngestionService core (orchestrator + validation)
  • #11634 Phase 2B — MCP facade ingestSourceFiles + #10572 gate threading
  • #11635 Phase 2C — Bulk facade (CLI ai:ingest-tenant + streaming)
  • #11636 Phase 2D — Q12 search-hydration mode resolution + SearchService implementation
  • #11637 Phase 2E — Q5 tenant config storage (Native Edge Graph extension)
  • #11638 Phase 2F — Test fixture infrastructure (synthetic external workspaces × 4) + E2E multi-tenant suite
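The split between the MCP small-batch facade and the bulk facade follows from the #10572 gate. A rough routing sketch, assuming the gate compares a chunk count against `mcpSyncMaxChunks`: `routeIngestRequest` and its return shape are hypothetical, and the limit is passed in because the real config value is not stated in this Epic.

```javascript
// Hypothetical routing helper; the limit is a parameter because the actual
// aiConfig.mcpSyncMaxChunks value is not stated in this Epic.
function routeIngestRequest({chunkCount, viaMcp, mcpSyncMaxChunks}) {
    if (!Number.isInteger(chunkCount) || chunkCount < 0) {
        throw new RangeError('chunkCount must be a non-negative integer');
    }

    // Mirrors the gate: MCP-originated work above the limit is refused, not truncated.
    if (viaMcp && chunkCount > mcpSyncMaxChunks) {
        return {
            accepted: false,
            reason  : `Batch of ${chunkCount} chunks exceeds the MCP limit of ${mcpSyncMaxChunks}; use the bulk facade`
        };
    }

    return {accepted: true, facade: viaMcp ? 'mcp-small-batch' : 'bulk'};
}

console.log(routeIngestRequest({chunkCount: 40, viaMcp: true, mcpSyncMaxChunks: 100}));
console.log(routeIngestRequest({chunkCount: 5000, viaMcp: true, mcpSyncMaxChunks: 100}));
```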

Phase 3 sub-ticket #11627: Cloud Deployment Guide + Worked Examples (blocked-by Phase 2)

Single sub-ticket (same-layer doc work; PR-level decomposition sufficient). New guide tree learn/agentos/cloud-deployment/ (Overview, Configuration, CustomSources, CustomParsers, HookWiring, Security, MigrationPath) + minimal worked example external workspace.

Phase 4 Epic #11628 — Operations + Observability for Cloud-Native Deployments (NEW per operator post-graduation review; blocked-by Phase 2)

  • #11639 Phase 4A — Per-tenant ingestion observability daemon (extends KBRecorderService)
  • #11640 Phase 4B — Manifest reconciliation daemon
  • #11641 Phase 4C — Stale-chunk garbage collection daemon
  • #11642 Phase 4D — Operator alerting surface (telemetry thresholds → A2A + external notification)

Phase 4 was not in Discussion #11623 §7 original decomposition; surfaced 2026-05-19 during operator-directed post-graduation review (rationale: cloud deployments need operability substrate; KBRecorderService is daemon-adjacent and extensible).

Phase 5 Epic #11643 — Integration + Unit Test Parity with Memory Core (NEW per operator post-graduation review; Phase 5A independent / 5B blocked-by Phase 0/1D / 5C cross-cutting)

  • #11644 Phase 5A — KB MCP Server Integration Test Parity (pre-Phase-0/1 shippable; parallels MC's RemoteMcpTransport / AuthRejection / OidcAuth / BackupRestoreWipe / HeartbeatPropagation specs)
  • #11645 Phase 5B — KB Tenant Isolation Integration Tests (blocked-by Phase 0/1D #11632; parallels MC's TeamPrivateRetrieval / CrossTenantIsolation specs)
  • #11646 Phase 5C — KB Unit Test Coverage Expansion (cross-cutting; close 32-vs-12 parity gap; target ≥25 specs ≈ 80% of MC's 32)

Phase 5 was not in Discussion #11623 §7 original; surfaced 2026-05-19 during operator's test-substrate review. Empirical parity gap V-B-A'd: MC has 32 unit specs + 8 integration specs; KB has 12 unit specs + 1 integration spec (healthcheck.spec.mjs per #10805 Lane A — dual-server coverage). Operator framing: "this epic needs new tests here. and /unit-tests too."

Acceptance Criteria

Cross-phase ACs:

  • All 3 phase sub-tickets filed with explicit cross-references back to this Epic
  • Phase 0/1 ships first (contracts before implementation); Phase 2 sub-ticket filed when Phase 0/1 lands
  • Each phase has standalone test substrate (unit + integration + e2e where applicable)
  • Cross-substrate consumer sweep (consumer sweep dimensions from Discussion §6) reflected in each phase's tests
  • learn/agentos/cloud-deployment/ guide tree exists at Phase 3 closeout; cross-links to #9999, #10010, #10011, #10030

Phase 0/1 ACs (filed as sub-ticket):

  • parsed-chunk-v1 JSON Schema at ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json
  • backup-record-v1 JSON Schema at ai/services/knowledge-base/parser/backup-record-v1.schema.json (formalizing existing importDatabase shape)
  • Path-identity tuple {tenantId, repoSlug, rootKind, sourcePath} spec
  • Tombstone / manifest / revision-boundary contract spec (mutually-supporting)
  • Source/Parser registry extraction: hardcoded array → data-driven registry; useDefaultSources / useDefaultParsers boolean configs in aiConfig
  • Per-source path externalization (ApiSource.sourceMap etc. → config)
  • Custom-source / custom-parser registration API
  • KB Tenant Isolation, memorySharing pattern reuse (Write-side): VectorService.embed upsert path injects server-derived {tenantId, visibility, originAgentIdentity?} from authenticated AgentIdentity context; rejects or server-overwrites client-supplied tenant fields; tenant-aware chunkId hash derivation
  • KB Tenant Isolation, memorySharing pattern reuse (Read-side): QueryService.queryDocuments + SearchService inject where: {tenantId: {$in: [<requester>, '<team-namespace>']}} from server-side authenticated AgentIdentity, NOT client payload
  • Byte-equivalence fixture — current Neo source output BEFORE registry extraction === output AFTER (verifies migration doesn't regress retrieval quality; adding tenantId field doesn't perturb chunk-hash semantics for existing content)
  • Fail-closed test suite — forged client tenantId/visibility rejected/overwritten; tenant A cannot retrieve tenant B private chunks; Neo team chunks cross-tenant-visible; same sourcePath under two tenants isolated; spoof-rejection tested through every public KB query facade
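The write-side stamping invariant in the ACs above can be illustrated with the sketch below. It is not the VectorService code: `stampChunkMetadata`, the identity shape (`agentId`, `visibility`), the `onSpoof` option, and the `'private'` default are all assumptions. Only the three stamped field names and the reject-or-overwrite behaviour come from this Epic.

```javascript
// Hypothetical sketch: function name, identity shape and the 'private' default
// are assumptions; only the three stamped field names come from the Epic text.
const SERVER_OWNED_FIELDS = ['tenantId', 'visibility', 'originAgentIdentity'];

function stampChunkMetadata(clientMetadata, identity, {onSpoof = 'reject'} = {}) {
    if (!identity || !identity.tenantId) {
        throw new Error('Missing authenticated tenant context');
    }

    const spoofed = SERVER_OWNED_FIELDS.filter(field =>
        Object.prototype.hasOwnProperty.call(clientMetadata, field)
    );

    if (spoofed.length > 0 && onSpoof === 'reject') {
        throw new Error(`Client-supplied tenant fields rejected: ${spoofed.join(', ')}`);
    }

    // Drop every server-owned field first so no client value survives overwrite mode.
    const stamped = {...clientMetadata};
    SERVER_OWNED_FIELDS.forEach(field => delete stamped[field]);

    stamped.tenantId   = identity.tenantId;
    stamped.visibility = identity.visibility || 'private';

    if (identity.agentId) {
        stamped.originAgentIdentity = identity.agentId;
    }

    return stamped;
}

const identity = {tenantId: 'tenant-a', visibility: 'team', agentId: 'agent-1'};

console.log(stampChunkMetadata({type: 'src', tenantId: 'tenant-b'}, identity, {onSpoof: 'overwrite'}));
```

Stripping the server-owned fields before stamping matters in overwrite mode: a forged `originAgentIdentity` would otherwise survive whenever the authenticated context carries no agent id.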

Phase 2 ACs (deferred, filed as sub-ticket):

  • KnowledgeBaseIngestionService singleton behind shared service layer
  • MCP tool ingestSourceFiles accepts batches, gated by aiConfig.mcpSyncMaxChunks (#10572)
  • Bulk facade (CLI npm run ai:ingest-tenant <tenantId> + HTTP/streaming)
  • Tenant scoping via #9999 AgentIdentity substrate
  • Server-side parsed-chunk-v1 validation (rejects records with embedding field outside restore mode)
  • Q12 hydration mode chosen and implemented before retrieval flow
  • Density/UX measurement gate empirically defined
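One way to express the embedding rule from the validation AC above, as a sketch only. The full parsed-chunk-v1 schema is not reproduced in this Epic, so the check below covers just the `embedding` field; `validateIngestRecord`, the `mode` values, and the assumption that restore records must carry an embedding are illustrative.

```javascript
// Hypothetical validator covering only the embedding rule; the real
// parsed-chunk-v1 schema defines further fields that are not modelled here.
function validateIngestRecord(record, {mode = 'ingest'} = {}) {
    if (record === null || typeof record !== 'object' || Array.isArray(record)) {
        return {valid: false, error: 'Record must be an object'};
    }

    const hasEmbedding = Object.prototype.hasOwnProperty.call(record, 'embedding');

    // Ingest path: the server owns embeddings, so client-supplied vectors are refused.
    if (mode === 'ingest' && hasEmbedding) {
        return {valid: false, error: 'embedding is only allowed in restore mode'};
    }

    // Restore path (assumption): re-embedding is skipped, so the vector must be present.
    if (mode === 'restore' && !hasEmbedding) {
        return {valid: false, error: 'restore records must carry an embedding'};
    }

    return {valid: true};
}

console.log(validateIngestRecord({document: 'int main() {}', embedding: [0.1, 0.2]}));
console.log(validateIngestRecord({document: 'int main() {}'}));
```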

Phase 3 ACs (deferred, filed as sub-ticket):

  • learn/agentos/cloud-deployment/ guide tree (Overview, Configuration, CustomSources, CustomParsers, HookWiring, Security, MigrationPath)
  • Minimal worked example external workspace
  • Minimal pre-push hook in shell demonstrating ingestSourceFiles contract
  • Guide validates referenced files exist; code examples are runnable

Out of Scope

  • Client-side embeddings (operator invariant; KB still owns embeddings)
  • Pull-model (KB clones tenant repos) — inverts #9999 push-substrate identity direction
  • Per-tenant Chroma storage split (rejected by #9999 Avoided Traps; re-audit only if cross-repo INGEST load empirically forces)
  • Macro DB / full-file vectors (rejected by #10030)
  • Server-side runtime parser execution from tenant-supplied code (untrusted-code risk; eliminated by client-side fallback for non-default parsers)
  • Future WASM/tree-sitter sandboxing lane for server-side custom parsers — separate Discussion when needed

Avoided Traps / Gold Standards Rejected

| Trap | Why rejected |
| --- | --- |
| Conflating backup-record-v1 (restore) with ingest contract | importDatabase preserves embeddings, skips TextEmbeddingService.embedTexts(); distinct from server-embeds ingest path. Two distinct contracts. |
| Content-hash delta as sole deletion mechanism | Only works under full-corpus sync (VectorService.mjs:198-207); incremental push MUST use tombstone/manifest/revision-boundary |
| Pure-MCP transport for bulk imports | #10572 work-volume gate refuses syncs > mcpSyncMaxChunks via MCP; bulk facade is structurally necessary |
| Implicit single-neoRootDir source-path semantics | SearchService.mjs:118-120 resolves against single root; cross-tenant needs explicit identity tuple |
| Assumed cross-substrate reuse of memorySharing enum | Memory-Core-only today (0 KB references); pattern reuse ≠ infrastructure reuse. Validate cross-substrate assumptions via grep, not memory of design-intent. |
| Client-supplied tenant/visibility trust | Server stamps from authenticated AgentIdentity; clients may not spoof. Load-bearing security invariant. |
| Conflating memorySharing enum (Memory Core policy pattern) with shared-DB topology (already unified per ADR 0003 since 2026-05-09) | KB + MC are SEPARATE MCP servers sharing ONE Chroma DB but maintaining SEPARATE collections. Phase 0/1 adds tenant-scoping metadata to KB's knowledge-base collection only; it does NOT alter unified topology or move storage between servers. Terminology: "KB Tenant Isolation (memorySharing pattern reuse)", NOT "memorySharing KB Port", which suggests storage relocation. |
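The deletion trap above (content-hash delta only deletes under full-corpus sync) can be made concrete with a small planning sketch. `planDeletions` and its parameters are hypothetical, and the stored ids are assumed to be already scoped to a single tenant.

```javascript
// Hypothetical planner: shows why an incremental push needs explicit tombstones.
// storedIds is assumed to be scoped to one tenant already.
function planDeletions({storedIds, incomingIds, tombstones = [], fullCorpus}) {
    if (fullCorpus) {
        // Full-corpus sync: anything stored but absent from the push is stale.
        const incoming = new Set(incomingIds);
        return storedIds.filter(id => !incoming.has(id));
    }

    // Incremental push: absence proves nothing, so only explicit tombstones delete.
    const stored = new Set(storedIds);
    return tombstones.filter(id => stored.has(id));
}

const storedIds = ['a', 'b', 'c'];

console.log(planDeletions({storedIds, incomingIds: ['a'], fullCorpus: true}));
console.log(planDeletions({storedIds, incomingIds: ['a'], tombstones: ['c'], fullCorpus: false}));
```

Applying the full-corpus rule to an incremental push would delete every chunk the client simply did not resend, which is why the two paths need separate contracts.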

Discussion Criteria Mapping

Per ideation-sandbox-workflow.md §6.6, mapping Discussion #11623 §10 Graduation Criteria → Epic ACs:

| Discussion #11623 §10 criterion | Epic AC |
| --- | --- |
| 1. §5.1 Double Diamond Matrix in body | Discussion body archaeological source; not Epic AC |
| 2. §5.2 Step 2.5 cross-substrate sweep | Discussion #11623 GPT/Gemini STEP_BACK comments; Epic ACs reflect sweep findings |
| 3. §6 Consensus Mandate 3× APPROVED | Achieved (see Signal Ledger below) |
| 4. Q1 Parser-Locality (2-axis) | Phase 0/1 ACs: parser-locality dispatch via tenant config + parser-protocol contract |
| 5. Q3 Push-endpoint protocol | Phase 2 ACs: MCP small-batch + bulk facade |
| 6. Q4 Parser-protocol contract | Phase 0/1 AC: parsed-chunk-v1.schema.json |
| 7. Q11 Tombstone/manifest semantics | Phase 0/1 AC: deletion-signaling contract |
| 8. Q12 Search hydration | Phase 2 AC: hydration mode chosen before retrieval flow |
| 9. Test substrate scope | Cross-phase + Phase 0/1 fail-closed test suite ACs |
| 10. Guide deliverable | Phase 3 ACs |
| 11. Sub-ticket boundaries | This Epic + 3 phase sub-tickets |
| 12. Q13a Write-side stamping invariant | Phase 0/1 AC: server-stamped {tenantId, visibility, originAgentIdentity?} + spoof-rejection |
| 13. Q13b Read-side enforcement layer | Phase 0/1 AC: QueryService + SearchService where filter; enforcement-layer hybrid lean (V1 application-layer → V2 Chroma-layer if leak class manifests) |

Signal Ledger

Unresolved Dissent

(empty — positive signal; 3/3 APPROVED, no DEFERRED at convergence)

Unresolved Liveness

(empty — positive signal; all 3 cross-family signals explicit)

Related

  • Discussion #11623 — origin Discussion (archaeological source post-graduation)
  • #9999 — Cloud-Native Knowledge & Multi-Tenant Memory Core Epic (sibling; this Epic = write-side complement)
  • #10010 — Memory Core team/private retrieval policy (memorySharing enum source; pattern ported to KB)
  • #10011 — Native Edge Graph RLS (read-side enforcement; KB equivalent in this Epic)
  • #10016 — Multi-Tenant Identity & Data Privacy parent (upstream identity substrate)
  • #10030 — Concept Ontology (potential downstream; cross-repo ingestion may affect concept extraction)
  • #10097 — distributed Chroma zips (related; portability precedent)
  • #10572 — MCP work-volume gate (LOAD-BEARING; forces bulk facade)
  • #10129 — atomic backup bundle (backup-record-v1 precedent)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'cloud-native KB ingestion external workspace parser locality'})
  • query_raw_memories({query: 'KB tenant isolation memorySharing pattern unified Chroma'})
  • query_summaries({query: '#11623 cross-family graduation'})
  • ask_knowledge_base({query: 'KB ingestion source parser registry tenant', type: 'ticket'})
  • Discussion #11623 + Phase 0/1 sub-ticket are the entry points for resuming this workstream
  • Cross-family peer-cycle empirical anchors: GPT Cycle 1 DC_kwDODSospM4BAwRM + Gemini Cycle 1 DC_kwDODSospM4BAwRW + GPT Cycle 2 DC_kwDODSospM4BAwS0
tobiu referenced in commit 8f240be - "feat(kb): scaffold Phase 0/1A ingestion contracts (#11629) (#11647)" on May 19, 2026, 4:22 PM
tobiu assigned to @tobiu on May 19, 2026, 5:56 PM
tobiu referenced in commit b633f9f - "feat(ai): KB Source/Parser registry + useDefaultSources/useDefaultParsers configs (#11658) (#11659)" on May 20, 2026, 2:52 AM
tobiu referenced in commit 68eb22e - "docs(agentos): Phase 3A cloud-deployment guide scaffold (#11627) (#11668)" on May 20, 2026, 7:59 AM
tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667)" on May 20, 2026, 8:01 AM
tobiu referenced in commit d1b3cf7 - "feat(ai): externalize per-source paths to aiConfig.sourcePaths (#11660) (#11661)" on May 20, 2026, 9:48 AM
tobiu referenced in commit c9d286d - "feat(kb): add ai:ingest-tenant bulk-facade CLI (#11635) (#11697)" on May 21, 2026, 12:14 AM
tobiu referenced in commit b7a8d75 - "docs(agentos): Phase 3B cloud-deployment guides + examples (#11627) (#11707)" on May 21, 2026, 8:49 AM