Frontmatter
id: 11626
title: KB Ingestion Phase 2: Ingestion Service + MCP Small-Batch Facade + Bulk Facade
state: Closed
labels: epic, ai, architecture
assignees: []
createdAt: May 19, 2026, 1:40 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11626
author: neo-opus-ada
commentsCount: 5
parentIssue: 11624
subIssues:
11633 Phase 2A — KnowledgeBaseIngestionService Core: Orchestrator + parsed-chunk-v1 Validation + Delta Integration
11634 Phase 2B — MCP Facade: ingestSourceFiles Tool + #10572 Work-Volume Gate Threading
11635 Phase 2C — Bulk Facade: CLI ai:ingest-tenant + Streaming Ingestion Path
11636 Phase 2D — Q12 Search-Hydration Mode Resolution + SearchService Implementation
11637 Phase 2E — Q5 Tenant Config Storage: Native Edge Graph Extension
11638 Phase 2F — Test Fixture Infrastructure: Synthetic External Workspaces + Multi-Tenant E2E Suite
11679 Phase 2C — Bulk KB Ingestion Facade with bulk-mode Gate Bypass
subIssuesCompleted: 7
subIssuesTotal: 7
blockedBy: [x] 11625 KB Ingestion Phase 0/1: Contracts, Source/Parser Registry, KB Tenant Isolation
blocking: [x] 11628 KB Ingestion Phase 4: Operations + Observability for Cloud-Native Deployments, [x] 11627 KB Ingestion Phase 3: Cloud Deployment Guide + Worked Examples
closedAt: May 21, 2026, 11:47 AM

KB Ingestion Phase 2: Ingestion Service + MCP Small-Batch Facade + Bulk Facade

Closed v13.0.0/archive-v13-0-0-chunk-12 epic ai architecture
neo-opus-ada commented on May 19, 2026, 1:40 PM

Context

Phase 2 sub-ticket of Epic #11624 (Cloud-Native KB Ingestion for External Workspaces). Graduated from Discussion #11623. Blocked by Phase 0/1 #11625 — contracts MUST stabilize before endpoint implementation per cross-family peer convergence (substrate-correct ordering: contracts before transport).

This phase implements the cross-server push pipeline — the actual ingestion endpoints that consume Phase 0/1 contracts. Two facades behind a shared service, plus the Q12 search-hydration mode decision.

The Problem

After Phase 0/1 lands, the substrate has:

  • Stable parsed-chunk-v1 + backup-record-v1 schemas
  • Source/Parser registry with useDefaultSources / useDefaultParsers
  • Path-identity tuple {tenantId, repoSlug, rootKind, sourcePath}
  • Tombstone / manifest / revision-boundary deletion-signaling contract
  • memorySharing KB port (write-side stamping + read-side filter)

What's missing for cloud deployments is the ingestion endpoint that clients call from their git hooks. A pure-MCP path cannot handle bulk initial imports (VectorService.mjs:216-240 refuses syncs > mcpSyncMaxChunks); a bulk-only path loses the agent-native command-plane affordance for small batches. Two facades behind one service is therefore the structurally necessary shape.

Also: Q12 search hydration (chunk-metadata-embedded vs server-mirror vs hybrid) MUST be resolved before retrieval flow lands — Phase 0/1 marks the boundary; Phase 2 chooses the implementation.

The Architectural Reality

This phase touches:

| File | Change |
|---|---|
| NEW ai/services/knowledge-base/KnowledgeBaseIngestionService.mjs | New singleton service behind shared service layer; orchestrates parsing + content-hash delta + tenant-scoping + Chroma upsert via existing VectorService.embed path |
| NEW ai/mcp/server/knowledge-base/tools/ingestSourceFiles.mjs (or equivalent registration) | MCP tool: small-batch command-plane facade; threads viaMcp per #10572 work-volume gate; returns structured KB_INGEST_VOLUME_EXCEEDED response when batch > threshold, pointing to bulk facade |
| NEW buildScripts/ai/ingest-tenant.mjs (or ai/scripts/ingest-tenant.mjs after structural-pre-flight) | CLI facade for bulk imports + hook bursts; bypasses MCP volume gate; streams parsed-chunk-v1 records |
| NEW ai/mcp/server/knowledge-base/tools/ingestStream.mjs (or HTTP endpoint registration TBD) | HTTP/streaming facade for cross-server tenant push (final transport shape decided during Phase 2 implementation; MCP-tool-only acceptable for V1 if HTTP defers to a follow-up) |
| SearchService.mjs | Implement chosen Q12 hydration mode (chunk-metadata-embedded vs server-mirror vs hybrid) |
| QueryService.mjs | (Already updated in Phase 0/1 with where filter; Phase 2 verifies retrieval flow correctness against new hydration mode) |
| Tenant config storage | Per Q5 from Discussion: tenant-config-node in Native Edge Graph (#10011 substrate) OR kb-config.yaml bootstrap OR both; choice deferred to Phase 2 implementation review |

The Fix

1. KnowledgeBaseIngestionService (singleton)

Neo.ai.services.knowledge-base.KnowledgeBaseIngestionService
  ├─ ingestSourceFiles({tenantId, files: [{path, content, parser?}], deleted?, manifestSnapshot?, baseRevision?, headRevision?})
  │   ├─ Validate tenant via AgentIdentity context (#9999)
  │   ├─ Apply tenant source/parser config (from Phase 0/1 registry)
  │   ├─ Server-shipped parsers → parse raw content server-side
  │   ├─ Client-side parsers → validate incoming `parsed-chunk-v1` records (reject `embedding` field)
  │   ├─ Server-stamp `{tenantId, visibility, originAgentIdentity?}` into chunk metadata
  │   ├─ Apply tombstone/manifest/revision-boundary deletion contract
  │   ├─ Route to VectorService.embed (server embeds + Chroma upsert)
  │   └─ Return ingestion summary `{ingested, deleted, embeddingsGenerated, errors}`
  └─ ingestSourceFilesBulk(...) — same contract, no MCP gate, streams response
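
The validation and stamping steps in the tree above can be sketched as follows. This is a minimal illustration, not the service's actual code: the `content` and `metadata` record fields, the `'private'` default, and the `restoreMode` flag name are assumptions; only `embedding`, `tenantId`, `visibility`, and `originAgentIdentity` come from this ticket.

```javascript
// Sketch only. Record shape and defaults are hypothetical; the real
// parsed-chunk-v1 schema is defined in Phase 0/1 (#11625).
function validateAndStamp(records, {tenantId, visibility = 'private', originAgentIdentity, restoreMode = false}) {
    const accepted = [],
          errors   = [];

    records.forEach((record, index) => {
        // Outside restore mode the server embeds, so client-supplied vectors are rejected.
        if (!restoreMode && 'embedding' in record) {
            errors.push({index, reason: 'embedding field not allowed'});
            return;
        }

        // Drop client-supplied values for server-owned keys, then stamp server values.
        const {tenantId: _t, visibility: _v, originAgentIdentity: _o, ...clientMetadata} = record.metadata ?? {};
        const metadata = {...clientMetadata, tenantId, visibility};

        if (originAgentIdentity) {
            metadata.originAgentIdentity = originAgentIdentity;
        }

        accepted.push({...record, metadata});
    });

    return {accepted, errors};
}

const result = validateAndStamp(
    [
        {content: 'export default {}', metadata: {tenantId: 'spoofed'}},
        {content: 'class Foo {}', embedding: [0.1, 0.2]}
    ],
    {tenantId: 'tenant-a'}
);

console.log(result.accepted[0].metadata.tenantId); // tenant-a
console.log(result.errors.length);                 // 1
```

Stamping after stripping the server-owned keys means a client cannot spoof `tenantId` or `visibility` through its own metadata.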

2. MCP Facade

ingestSourceFiles MCP tool registered via toolService.mjs:

  • Accepts batches up to aiConfig.mcpSyncMaxChunks (default 50; gated by #10572)
  • When exceeded: returns structured {error: 'KB_INGEST_VOLUME_EXCEEDED', message: '...', bulkPath: 'npm run ai:ingest-tenant <tenantId>'} response
  • AgentIdentity-authenticated via standard MCP auth substrate
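
A rough sketch of the gate described above, under stated assumptions: `checkVolumeGate` is a hypothetical helper name, the batch size is assumed to be counted in chunks, and the real tool may build `bulkPath` differently (the ticket shows it with a literal `<tenantId>` placeholder).

```javascript
// Sketch only. The response keys and the default of 50 come from this ticket;
// the function name and parameter shape are illustrative.
function checkVolumeGate({batchSize, viaMcp, tenantId, mcpSyncMaxChunks = 50}) {
    // Only MCP-originated calls are gated; the CLI path passes viaMcp: false.
    if (viaMcp && batchSize > mcpSyncMaxChunks) {
        return {
            error   : 'KB_INGEST_VOLUME_EXCEEDED',
            message : `Batch of ${batchSize} exceeds the MCP limit of ${mcpSyncMaxChunks}; use the bulk facade.`,
            bulkPath: `npm run ai:ingest-tenant ${tenantId}`
        };
    }

    return null; // within limits, or not an MCP call
}

console.log(checkVolumeGate({batchSize: 50, viaMcp: true, tenantId: 'tenant-a'}));        // null
console.log(checkVolumeGate({batchSize: 51, viaMcp: true, tenantId: 'tenant-a'}).error);  // KB_INGEST_VOLUME_EXCEEDED
```

Returning a structured object rather than throwing lets the calling agent read `bulkPath` and switch facades without parsing an error string.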

3. Bulk Facade

CLI command npm run ai:ingest-tenant <tenantId>:

  • Reads parsed-chunk-v1 records from file or stdin
  • Bypasses MCP volume gate (CLI path; viaMcp: false)
  • Streams progress
  • Suitable for initial tenant onboarding + hook bursts > MCP threshold
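
The ticket does not specify the on-disk format for the records the CLI reads. Assuming newline-delimited JSON (one parsed-chunk-v1 record per line), batching for the bulk path might look like the sketch below. `batchRecords` and the batch size of 100 are hypothetical, and a real CLI would read stdin incrementally rather than from an in-memory string.

```javascript
// Sketch only. NDJSON input is an assumption, not a documented contract.
function* batchRecords(ndjsonText, batchSize = 100) {
    let batch = [];

    for (const line of ndjsonText.split('\n')) {
        if (line.trim() === '') {
            continue; // tolerate blank lines and a trailing newline
        }

        batch.push(JSON.parse(line));

        if (batch.length === batchSize) {
            yield batch;
            batch = [];
        }
    }

    if (batch.length > 0) {
        yield batch; // flush the final partial batch
    }
}

const input   = '{"id":1}\n{"id":2}\n\n{"id":3}\n';
const batches = [...batchRecords(input, 2)];

console.log(batches.map(batch => batch.length)); // [ 2, 1 ]
```

Yielding fixed-size batches keeps memory bounded per call into the service and gives the CLI a natural point to report progress.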

Optional V1.5: HTTP/streaming endpoint for true cross-server push (deferred decision based on V1 deployment shape).

4. Q12 Search Hydration Resolution

Required architectural choice before Phase 2 retrieval flow lands. Three options from Discussion §4 Q12:

  • Option A — chunk-metadata-embedded: store full content in chunk.metadata.content (already partially the case); hydrate from chunk, not filesystem. Pro: works regardless of FS layout. Con: chunk metadata size grows ~2-3x.
  • Option B — server-mirror: KB server mirrors tenant content locally (per-tenant mount under /tenants/<tenantId>/<repoSlug>/). Pro: filesystem-native, preserves existing hydration path. Con: storage cost, sync coordination.
  • Option C — hybrid: chunk metadata carries content for small chunks; large chunks reference an on-server mirror. Pro: optimizes both axes. Con: dual-path complexity.

Open — Phase 2 implementation owner decides + V-B-A against per-tenant chunk-size distribution before committing.
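
To make the dual-path cost of Option C concrete, here is a minimal sketch of the per-chunk decision. The 4096-byte cutoff, the `planHydration` name, the `sourcePath` field on the chunk, and appending it to the mirror mount are all assumptions for illustration; the ticket leaves the actual choice open.

```javascript
// Sketch only. Threshold and return shape are hypothetical.
function planHydration(chunk, {tenantId, repoSlug, inlineMaxBytes = 4096}) {
    // Measure UTF-8 bytes, not string length, since metadata size is the concern.
    const bytes = new TextEncoder().encode(chunk.content).length;

    if (bytes <= inlineMaxBytes) {
        // Option A behaviour: hydrate from chunk metadata.
        return {mode: 'embedded', content: chunk.content};
    }

    // Option B behaviour: reference the per-tenant mirror.
    return {mode: 'mirror', mirrorPath: `/tenants/${tenantId}/${repoSlug}/${chunk.sourcePath}`};
}

const scope = {tenantId: 'tenant-a', repoSlug: 'app'};

console.log(planHydration({content: 'short', sourcePath: 'src/a.mjs'}, scope).mode);            // embedded
console.log(planHydration({content: 'x'.repeat(5000), sourcePath: 'src/b.mjs'}, scope).mode);   // mirror
```

Running a function like this over a tenant's real chunk-size distribution would show how many chunks land on each path before committing to a mode.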

5. Tenant Config Storage Resolution (Q5)

Phase 2 implementation also resolves Q5: Native Edge Graph tenant-config-node (#10011) vs kb-config.yaml vs both. Lean (from Discussion): tenant-config-as-graph-node for canonical state, with optional kb-config.yaml bootstrap for first-deploy.

6. Density / UX Measurement Gate

Per Discussion §6 sweep point 5: Phase 2 ships an empirical-evidence-threshold for when per-tenant Chroma sharding would be reopened (e.g., "if median tenant chunk count > X AND query p95 latency > Y at Z tenants → file follow-up Discussion to re-audit per-tenant storage split"). Threshold values empirically tunable during Phase 2 implementation; ticket AC asserts the threshold EXISTS and is documented.
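
The trigger could be expressed as a single predicate, sketched below. The function and field names are hypothetical, "at Z tenants" is read here as "at least Z tenants", and the threshold values in the usage are placeholders rather than proposed numbers.

```javascript
// Sketch only. X, Y, Z map to medianChunkCount, p95LatencyMs, minTenants.
function shouldReopenSharding({tenantChunkCounts, queryP95LatencyMs}, thresholds) {
    const sorted = [...tenantChunkCounts].sort((a, b) => a - b),
          mid    = Math.floor(sorted.length / 2);

    let median = 0;

    if (sorted.length > 0) {
        median = sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    return sorted.length >= thresholds.minTenants
        && median > thresholds.medianChunkCount
        && queryP95LatencyMs > thresholds.p95LatencyMs;
}

// Placeholder thresholds, to be tuned empirically during implementation.
const thresholds = {medianChunkCount: 150, p95LatencyMs: 500, minTenants: 3};

console.log(shouldReopenSharding({tenantChunkCounts: [100, 200, 300], queryP95LatencyMs: 900}, thresholds)); // true
console.log(shouldReopenSharding({tenantChunkCounts: [100, 200, 300], queryP95LatencyMs: 400}, thresholds)); // false
```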

Acceptance Criteria

  • KnowledgeBaseIngestionService singleton implemented behind shared service layer
  • MCP tool ingestSourceFiles registered + gated by aiConfig.mcpSyncMaxChunks (#10572)
  • MCP tool returns structured volume-gate response pointing to bulk facade when threshold exceeded
  • Bulk facade: npm run ai:ingest-tenant <tenantId> CLI command implemented; streams parsed-chunk-v1 records
  • Tenant scoping via #9999 AgentIdentity substrate (auth-on-ingest)
  • Server-side parsed-chunk-v1 validation (Phase 0/1 schema; rejects records with embedding field outside restore mode)
  • Q12 search-hydration mode chosen + implemented in SearchService; AC body documents rationale + per-tenant chunk-size distribution V-B-A used
  • Q5 tenant config storage resolved + documented
  • Density / UX measurement gate threshold documented in code comments + cross-link to follow-up trigger
  • Integration tests: synthetic external workspace fixtures (mini-neo-workspace/, mini-es5-workspace/, mini-cpp-workspace/) push via ingestSourceFiles → ingestion → query → tenant-isolation verified
  • Integration test: MCP threshold gate → structured volume-gate response observed
  • Integration test: bulk facade CLI ingests 500+ chunks successfully
  • E2E test: multi-tenant query — tenant A and tenant B both push content; A queries see A's + Neo's; B queries see B's + Neo's; A cannot see B's private content
  • E2E test: force-push reconciliation — tenant pushes branch rewrite → revision-boundary reconciliation removes orphaned chunks
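
The multi-tenant E2E criterion above reduces to a read-side visibility rule. A minimal in-memory sketch follows, assuming Neo's own content is stored under a shared tenant id (`'neo'` here is a placeholder) and ignoring any cross-tenant sharing of non-private content, which this ticket does not specify. The real filter is applied through the QueryService `where` clause rather than in application code.

```javascript
// Sketch only. `visibleTo` and the 'neo' shared tenant id are hypothetical.
function visibleTo(chunks, tenantId, sharedTenantId = 'neo') {
    return chunks.filter(({metadata}) =>
        metadata.tenantId === tenantId || metadata.tenantId === sharedTenantId
    );
}

const chunks = [
    {id: 'a1',   metadata: {tenantId: 'tenant-a', visibility: 'private'}},
    {id: 'b1',   metadata: {tenantId: 'tenant-b', visibility: 'private'}},
    {id: 'neo1', metadata: {tenantId: 'neo'}}
];

console.log(visibleTo(chunks, 'tenant-a').map(chunk => chunk.id)); // [ 'a1', 'neo1' ]
console.log(visibleTo(chunks, 'tenant-b').map(chunk => chunk.id)); // [ 'b1', 'neo1' ]
```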

Out of Scope

  • Cloud deployment guide → Phase 3 (#TBD, blocked-by this ticket)
  • HTTP/streaming endpoint if MCP + CLI proves sufficient for V1 deployment shape (operator-deferred decision)
  • WASM/tree-sitter server-side custom-parser sandboxing → future Discussion
  • Per-tenant Chroma sharding implementation (Phase 2 ships the measurement gate; sharding triggered only if gate's threshold trips per follow-up Discussion)

Avoided Traps

| Trap | Why rejected |
|---|---|
| MCP-only ingestion path | #10572 work-volume gate refuses bulk; structurally cannot handle initial tenant onboarding |
| Bulk-only ingestion path | Loses agent-native command-plane affordance for small hook batches |
| Skipping Q12 resolution | Per Phase 0/1 AC: retrieval flow MUST NOT land before hydration mode chosen; phase-dependency invariant |
| Building HTTP endpoint without V1 shape evidence | Premature optimization; MCP + CLI may suffice for V1; HTTP can land as V1.5 if measurement justifies |
| Skipping density measurement gate | Re-auditing per-tenant storage split needs empirical anchor; baking in the trigger now prevents re-derivation |

Related

  • Parent Epic: #11624
  • Blocked-by: #11625 (Phase 0/1 — contracts must stabilize)
  • Origin Discussion: #11623 (archaeological source post-graduation)
  • Sibling Epic: #9999 (read-side multi-tenancy substrate)
  • Load-bearing dependency: #10572 (MCP work-volume gate threading)
  • Identity substrate: #10011 (Native Edge Graph RLS), #9999 (AgentIdentity)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'KnowledgeBaseIngestionService MCP small-batch bulk facade'})
  • query_raw_memories({query: 'Q12 search hydration chunk-metadata-embedded server-mirror'})
  • ask_knowledge_base({query: 'MCP work-volume gate ingestion threshold', type: 'src'})
  • Discussion #11623 §7 Phase 2 + §4 Q3 + §4 Q12 + §6 sweep point 5 are the architectural source-of-authority
  • Phase 0/1 (#11625) MUST be merged before this ticket's implementation begins; verify parsed-chunk-v1.schema.json + useDefaultSources/useDefaultParsers configs + memorySharing KB port + byte-equivalence fixture all present in dev

tobiu referenced in commit 68eb22e - "docs(agentos): Phase 3A cloud-deployment guide scaffold (#11627) (#11668)" on May 20, 2026, 7:59 AM
tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667)" on May 20, 2026, 8:01 AM