Frontmatter
id: 11633
title: Phase 2A — KnowledgeBaseIngestionService Core: Orchestrator + parsed-chunk-v1 Validation + Delta Integration
state: Closed
labels: enhancement, ai, architecture
assignees: neo-gpt
createdAt: May 19, 2026, 1:55 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11633
author: neo-opus-ada
commentsCount: 2
parentIssue: 11626
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: [x] 11635 Phase 2C — Bulk Facade: CLI ai:ingest-tenant + Streaming Ingestion Path, [x] 11634 Phase 2B — MCP Facade: ingestSourceFiles Tool + #10572 Work-Volume Gate Threading
closedAt: May 20, 2026, 2:04 PM

Phase 2A — KnowledgeBaseIngestionService Core: Orchestrator + parsed-chunk-v1 Validation + Delta Integration

Closed · v13.0.0/archive-v13-0-0-chunk-12 · enhancement, ai, architecture
neo-opus-ada commented on May 19, 2026, 1:55 PM

Context

Sub of Phase 2 Epic #11626 (meta-Epic #11624). Graduated from Discussion #11623.

Substrate floor for Phase 2 facades. Both MCP facade (Phase 2B) + bulk facade (Phase 2C) consume this service.

The Problem

After Phase 0/1 contracts land, the substrate has stable schemas + registry + memorySharing port, but no actual ingestion entrypoint. Clients can't push code yet. Phase 2 needs a service-layer orchestrator BEFORE facades wire to it.

The Fix

New service: ai/services/knowledge-base/KnowledgeBaseIngestionService.mjs (Neo.core.Base extension; singleton).

Responsibilities:

  1. Validate tenant via AgentIdentity context (#9999 substrate)
  2. Apply tenant source/parser config from Phase 0/1B registry
  3. Server-side parsing for raw-file-delta payloads (when tenant's source uses server-shipped parser)
  4. parsed-chunk-v1 validation for client-side-parsed payloads (rejects records with embedding field outside explicit restore mode)
  5. Server-stamp {tenantId, visibility, originAgentIdentity?} per Phase 0/1C
  6. Apply tombstone/manifest/revision-boundary deletion-signaling per Phase 0/1A spec
  7. Route to VectorService.embed (existing content-hash delta + server embeds + Chroma upsert)
  8. Return structured ingestion summary {ingested, deleted, embeddingsGenerated, errors, tenantId, durationMs}
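The eight responsibilities above can be read as a single pipeline. The following is a minimal sketch of that ordering in plain JavaScript, not the service's actual implementation: every function on `deps` (`resolveTenant`, `loadTenantConfig`, `parse`, `validateParsedChunks`, `stamp`, `applyDeletions`, `embed`) and the `file.kind` discriminator are hypothetical names introduced for illustration. In particular, `deps.embed` only stands in for VectorService.embed, whose real signature is not shown in this issue.

```javascript
// Hypothetical sketch: all collaborators are injected and their names are assumptions.
// Only the step order and the summary fields are taken from the issue text.
async function ingestSourceFiles({tenantId, files = [], deleted = []}, deps) {
    const startedAt = Date.now();
    const summary   = {ingested: 0, deleted: 0, embeddingsGenerated: 0, errors: [], tenantId, durationMs: 0};

    try {
        // 1. Validate the tenant against the caller's identity context
        const tenant = await deps.resolveTenant(tenantId);
        summary.tenantId = tenant;

        // 2. Load the tenant's source/parser config from the registry
        const config  = await deps.loadTenantConfig(tenant);
        const records = [];

        for (const file of files) {
            try {
                // 3. Server-side parse raw deltas, or 4. validate client-parsed chunks
                const chunks = file.kind === 'raw-file-delta'
                    ? await deps.parse(file, config)
                    : deps.validateParsedChunks(file.chunks);

                // 5. Server-stamp every chunk before it leaves the service
                for (const chunk of chunks) {
                    records.push(deps.stamp(chunk, tenant));
                }
            } catch (error) {
                // A bad file is recorded and skipped; the remaining files still run
                summary.errors.push({path: file.path, message: error.message});
            }
        }

        // 6. Apply deletion signals
        summary.deleted = await deps.applyDeletions(tenant, deleted);

        // 7. Hand the stamped records to the embedding stage
        const result = await deps.embed(records);
        summary.ingested            = records.length;
        summary.embeddingsGenerated = result.embeddingsGenerated;
    } catch (error) {
        // Failures outside the per-file loop also end up in the summary
        summary.errors.push({message: error.message});
    }

    // 8. Always return the summary, even after a failure
    summary.durationMs = Date.now() - startedAt;
    return summary;
}
```

Injecting the collaborators keeps the sketch runnable without the real registry or vector store, and it mirrors the "mock-injectable for tests" wording in the acceptance criteria.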

Contract Ledger Matrix

KnowledgeBaseIngestionService introduces a service surface consumed by Phase 2B (MCP facade) and Phase 2C (bulk facade). Per learn/agentos/contract-ledger.md:

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| KnowledgeBaseIngestionService.ingestSourceFiles({tenantId, files, deleted?, manifestSnapshot?, baseRevision?, headRevision?}) | Discussion #11623 §7 Phase 2; Phase 0/1 contracts: parsed-chunk-v1 schema (#11629), source/parser registry (#11658), write-side stamping (#11631) | Orchestrates one ingestion: tenant validation via AgentIdentity → tenant source/parser config from the registry → server-parse raw-file deltas OR validate parsed-chunk-v1 for client-parsed payloads → server-stamp {tenantId, visibility, originAgentIdentity?} → apply tombstone/manifest/revision-boundary deletion-signaling → route to VectorService.embed. | No AgentIdentity context (single-tenant / offline daemon) → defaults to the neo-shared tenant, consistent with the Phase 0/1 single-tenant fallthrough. | Service JSDoc (Anchor & Echo); Phase 3 learn/agentos/cloud-deployment/HookWiring.md (#11627) documents the call contract for hook authors. | Per-AC happy-path + error-path unit tests; end-to-end integration test against a mock tenant fixture. |
| ingestSourceFiles return value: {ingested, deleted, embeddingsGenerated, errors, tenantId, durationMs} | #11633 "The Fix" §8 | Returns a structured ingestion summary: counts + a structured errors array + tenantId + durationMs. Partial failures populate errors[]; the summary is never lost to a thrown exception. | A fully-failed ingestion still returns the summary with errors[] populated and zero counts; callers branch on errors, not on a thrown exception. | JSDoc on the return shape; HookWiring.md (#11627). | Unit tests asserting the summary shape on happy + error paths. |
| embedding-field rejection: parsed-chunk-v1 payloads carrying an embedding field | Phase 0/1A parsed-chunk-v1 schema (#11629); #11633 AC | A record carrying an embedding field is REJECTED with a structured error, never silently routed. Ingestion is for un-embedded content; pre-embedded records belong to the restore path. | The rejection error names the restore path, manageDatabaseBackup({action: 'import'}), so the caller is routed correctly. | JSDoc; Phase 3 CustomParsers.md (#11627). | Error-path unit test asserting the rejection + the structured error shape. |
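The embedding-field rejection row can be illustrated with a small validator. This is a sketch under stated assumptions rather than the shipped check: the function name `partitionParsedChunks`, the `restoreMode` option, the `EMBEDDING_FIELD_NOT_ALLOWED` code, and the error field names are all hypothetical. Only the rule itself (reject outside explicit restore mode) and the named restore path come from this issue.

```javascript
// Hypothetical helper: error code and field names are assumptions for illustration.
// Splits client-parsed records into accepted records and structured rejection errors.
function partitionParsedChunks(records, {restoreMode = false} = {}) {
    const accepted = [];
    const errors   = [];

    records.forEach((record, index) => {
        // Presence of the field is the signal, regardless of its value
        if (!restoreMode && 'embedding' in record) {
            errors.push({
                code   : 'EMBEDDING_FIELD_NOT_ALLOWED',
                index,
                message: 'parsed-chunk-v1 records must not carry an embedding field outside restore mode',
                // Name the restore path so the caller knows where pre-embedded records belong
                hint   : "Use manageDatabaseBackup({action: 'import'}) for pre-embedded records"
            });
        } else {
            accepted.push(record);
        }
    });

    return {accepted, errors};
}

const result = partitionParsedChunks([
    {content: 'plain chunk'},
    {content: 'pre-embedded chunk', embedding: [0.1, 0.2]}
]);
```

Returning the rejected records as entries in an errors array, instead of throwing, fits the summary contract in the second ledger row: the caller can still receive counts for the records that were accepted.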

Backfilled 2026-05-20 by @neo-opus-ada. #11633 was filed (origin session 7360e917) before the Contract Ledger discipline was applied to it; @neo-gpt's #11626 epic-review surfaced the gap. The ledger restates the KnowledgeBaseIngestionService consumed surface already specified in "The Fix" above — it adds no new scope.

Acceptance Criteria

  • KnowledgeBaseIngestionService class extends Neo.core.Base (per Neo conventions); singleton
  • Method ingestSourceFiles({tenantId, files, deleted?, manifestSnapshot?, baseRevision?, headRevision?}) implemented
  • AgentIdentity validation via service-boundary auth substrate (mock-injectable for tests)
  • Tenant source/parser config consumed from Phase 0/1B registry
  • parsed-chunk-v1 validation at service boundary (uses Phase 0/1A schema)
  • Records carrying embedding field REJECTED with structured error (NOT routed silently)
  • Server-stamping via Phase 0/1C invocation
  • Deletion-signaling: tombstones + manifest + revision-boundary all functional (3 mutually-supporting paths)
  • Routing to VectorService.embed preserves existing content-hash delta
  • Structured ingestion summary returned
  • Unit tests: each AC has happy-path + error-path coverage
  • Integration tests: end-to-end ingestion against mock tenant fixture
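The "mock-injectable" AgentIdentity criterion, combined with the neo-shared fallthrough from the ledger, could be approached as below. This is a hedged sketch: `createTenantResolver`, the `getIdentity` callback, the `{tenantId}` identity shape, and the `TENANT_MISMATCH` code are hypothetical, and the rule that an authenticated caller may only write to its own tenant is an assumption that this issue does not spell out.

```javascript
// Hypothetical factory: `getIdentity` stands in for whatever the service-boundary
// auth substrate exposes. Its return shape ({tenantId}) is assumed for illustration.
const DEFAULT_TENANT = 'neo-shared';

function createTenantResolver(getIdentity) {
    return function resolveTenant(requestedTenantId) {
        const identity = getIdentity();

        // No identity context (single-tenant / offline daemon): fall through to the shared tenant
        if (!identity) {
            return {ok: true, tenantId: DEFAULT_TENANT};
        }

        // Assumed rule: an authenticated caller may only write to its own tenant
        if (requestedTenantId && requestedTenantId !== identity.tenantId) {
            return {
                ok   : false,
                error: {code: 'TENANT_MISMATCH', requested: requestedTenantId, actual: identity.tenantId}
            };
        }

        return {ok: true, tenantId: identity.tenantId};
    };
}

// Tests inject a stub in place of the real auth substrate
const asTenantA = createTenantResolver(() => ({tenantId: 'tenant-a'}));
const asDaemon  = createTenantResolver(() => null);
```

Because the identity source is a plain function argument, the happy path, the mismatch path, and the no-identity fallthrough can each be covered by a unit test without a running auth layer.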

Out of Scope

  • MCP facade wiring → Phase 2B
  • Bulk facade (CLI/HTTP) → Phase 2C
  • Q12 hydration mode → Phase 2D
  • Q5 tenant config storage → Phase 2E
  • Synthetic external-workspace fixtures → Phase 2F

Related

  • Parent: #11626
  • Blocked-by: Phase 0/1 Epic #11625 completion
  • Blocks: Phase 2B, 2C (facades consume this service)
  • Discussion source: #11623 §1 #8 + §7 Phase 2

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • VectorService.embed (lines 188-274) is the downstream call target
  • KBRecorderService.mjs is the sibling-service architectural pattern reference
  • MemoryService.mjs queryMemories method is the AgentIdentity-context propagation reference pattern
tobiu referenced in commit 700223d - "feat(kb): add ingestion service (#11633) (#11678)" on May 20, 2026, 2:04 PM
tobiu closed this issue on May 20, 2026, 2:04 PM