Context
Follow-up from PR #11678 review Cycle 1 (Approve+Follow-Up) after #11633 merged. Claude's review approved the Phase 2A KnowledgeBaseIngestionService floor, but surfaced one forward seam that must have a Phase 2C home: KnowledgeBaseIngestionService.embedChunkGroups() currently calls VectorService.embed(..., {deleteStale: false, tenantContext, viaMcp: true}), and ingestSourceFiles() exposes no caller override.
That is correct for Phase 2A/2B small-batch MCP ingestion, where the #10572 work-volume gate should protect the agent command plane. It is wrong for Phase 2C bulk ingestion: a whole external repository can contain thousands of parsed chunks per repoSlug group and would hit KB_SYNC_VOLUME_EXCEEDED on every group if routed through the same hardcoded MCP mode.
Duplicate sweep performed before filing:
- gh issue list --search "Phase 2C bulk facade 11626" surfaced #11626, #11634, #11633, #11624, but no Phase 2C leaf ticket.
- gh issue list --search "viaMcp bulkMode KnowledgeBaseIngestionService" returned no open issues.
- ask_knowledge_base(..., type:'ticket') surfaced #10572 work-volume gate history, not an equivalent bulk-ingestion follow-up.
- rg "Phase 2C|bulk facade|bulkMode|viaMcp|ingest-tenant|KB_INGEST_VOLUME_EXCEEDED" resources/content/... found #10572 references only.
The Problem
The Phase 2 parent #11626 intentionally splits ingestion into two facades behind one shared service:
- Phase 2B: MCP small-batch command-plane facade, viaMcp: true, volume-gated by #10572.
- Phase 2C: bulk facade for initial tenant onboarding and hook bursts, bypassing the MCP volume gate via CLI/bulk mode.
After #11633, the shared service exists, but its only embedding route hardcodes MCP semantics. That means a Phase 2C caller cannot reuse the service for large external-workspace pushes without being rejected by VectorService.embed()'s MCP work-volume gate. The review also noted that the PR #11678 Post-Merge Validation phrase "Phase 2C bulk facade can reuse the same service" becomes true only after this ticket adds the caller-supplied bulk-mode seam.
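To make the failure mode concrete, here is a minimal sketch of the gate as this ticket describes it. `embedStub`, `embedChunkGroupsHardcoded`, the response shape, and the limit of 500 are hypothetical stand-ins for illustration; the real VectorService.embed() and aiConfig.mcpSyncMaxChunks may differ.

```javascript
// Hypothetical stand-in for aiConfig; the real limit is not stated in this ticket.
const aiConfig = { mcpSyncMaxChunks: 500 };

// Simplified model of the #10572 gate: reject only when the call is MCP-routed
// and the work volume exceeds the configured limit. The viaMcp default here is
// an assumption of the sketch.
async function embedStub(chunks, { viaMcp = false } = {}) {
    if (viaMcp === true && chunks.length > aiConfig.mcpSyncMaxChunks) {
        return { ok: false, code: 'KB_SYNC_VOLUME_EXCEEDED', chunksToProcess: chunks.length };
    }
    return { ok: true, embedded: chunks.length };
}

// Today's shape: every repoSlug group is sent with viaMcp: true,
// and the caller has no way to override it.
async function embedChunkGroupsHardcoded(groups) {
    const results = {};
    for (const [repoSlug, chunks] of Object.entries(groups)) {
        results[repoSlug] = await embedStub(chunks, { viaMcp: true });
    }
    return results;
}
```

In this model, any group above the limit is rejected no matter who the caller is, which is the behavior a bulk caller would run into.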
The Architectural Reality
Relevant current surfaces:
| Surface | Current behavior | Phase 2C implication |
| --- | --- | --- |
| KnowledgeBaseIngestionService.ingestSourceFiles() | Public service entrypoint; no viaMcp / bulkMode option | Bulk caller cannot declare non-MCP mode |
| KnowledgeBaseIngestionService.embedChunkGroups() | Calls VectorService.embed(..., {deleteStale: false, tenantContext, viaMcp: true}) | Every group is treated as synchronous MCP work |
| VectorService.embed() | Defaults deleteStale: true; enforces #10572 gate when viaMcp === true and chunksToProcess > aiConfig.mcpSyncMaxChunks | Correct protection for MCP, incorrect for CLI/bulk imports |
| buildScripts/ai/syncKnowledgeBase.mjs | One-shot CLI build script that bypasses MCP gate by calling service code outside MCP dispatch | Sibling pattern for a bulk-ingestion CLI wrapper |
| #11634 | Phase 2B MCP facade; explicitly routes bulk to Phase 2C | Confirms this work is out of scope for 2B |
Structural pre-flight: ticket prescription introduces a likely new CLI script. Fast-path applies because buildScripts/ai/ingestTenant.mjs matches the sibling one-shot AI build-script pattern of buildScripts/ai/syncKnowledgeBase.mjs; service logic stays in KnowledgeBaseIngestionService, while the build script remains a thin facade. No novel directory choice; map-maintenance not needed.
The Fix
Implement Phase 2C as the bulk facade over the already-merged ingestion service:
- Add a caller-supplied mode option to KnowledgeBaseIngestionService.ingestSourceFiles(), e.g. viaMcp or bulkMode, with the safe default preserving current small-batch MCP behavior.
- Thread the mode into embedChunkGroups() so Phase 2B can pass viaMcp: true and Phase 2C can pass viaMcp: false.
- Add a CLI facade, likely buildScripts/ai/ingestTenant.mjs, plus an npm script such as ai:ingest-tenant, for external-workspace bulk pushes.
- Keep deleteStale: false for incremental parsed-source ingestion; this ticket must not reintroduce full-corpus stale deletion into tenant pushes.
- Return progress/error output that distinguishes MCP volume-gate rejection from bulk-mode ingestion failures.
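One possible shape for the mode seam is sketched below. The class name `IngestionServiceSketch`, the injected `embed` function, the options-object signature, and the boolean validation are all assumptions made so the example runs on its own; the real service's signature is not shown in this ticket.

```javascript
// Hypothetical sketch: `embed` is injected so the example runs without the real VectorService.
class IngestionServiceSketch {
    constructor({ embed }) {
        this.embed = embed;
    }

    // Omitting viaMcp keeps the current MCP-safe behavior (viaMcp: true).
    async ingestSourceFiles(groups, { tenantContext, viaMcp = true } = {}) {
        if (typeof viaMcp !== 'boolean') {
            throw new TypeError('viaMcp must be a boolean');
        }
        return this.embedChunkGroups(groups, { tenantContext, viaMcp });
    }

    // Forwards the caller-selected mode instead of hardcoding viaMcp: true.
    async embedChunkGroups(groups, { tenantContext, viaMcp }) {
        const results = [];
        for (const [repoSlug, chunks] of Object.entries(groups)) {
            // deleteStale stays false: incremental ingestion must not trigger stale deletion.
            const result = await this.embed(chunks, { deleteStale: false, tenantContext, viaMcp });
            results.push({ repoSlug, result });
        }
        return results;
    }
}
```

Under this shape a Phase 2B caller can omit the option or pass viaMcp: true, while a Phase 2C caller passes viaMcp: false explicitly, so the unsafe mode is always a deliberate choice.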
Contract Ledger Matrix
| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| KnowledgeBaseIngestionService.ingestSourceFiles() | PR #11678 review follow-up; #11626 Phase 2 parent; #11634 Phase 2B split | Accept caller-supplied viaMcp / bulkMode option; default remains MCP-safe | Existing behavior (viaMcp: true) remains default for callers that omit it | JSDoc @summary + params updated | Unit test proves default and bulk override |
| embedChunkGroups() to VectorService.embed() | #10572 work-volume gate; #11633 service implementation | Thread the chosen mode into VectorService.embed() | Bulk caller receives explicit failure if mode is invalid or rejected | Inline Anchor & Echo comment explaining small-batch vs bulk mode | Test verifies viaMcp:false reaches VectorService for bulk path |
| Bulk CLI facade | #11626 Phase 2C; sibling buildScripts/ai/syncKnowledgeBase.mjs | Read parsed-chunk-v1 records from file/stdin and call service in bulk mode | Print actionable error and non-zero exit on invalid input or ingestion failure | package.json script name + CLI usage comment | CLI/integration test ingests > mcpSyncMaxChunks chunks without gate rejection |
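The CLI facade row could be approached along the lines of the sketch below. It assumes newline-delimited JSON input and a `repoSlug` field on every record; the actual parsed-chunk-v1 schema is not specified here, and `parseRecords`, `groupByRepoSlug`, and `runBulkIngest` are hypothetical names.

```javascript
// Assumption: one JSON record per line. Blank lines are skipped.
function parseRecords(text) {
    const records = [];
    text.split('\n').forEach((line, index) => {
        if (line.trim() === '') return;
        try {
            records.push(JSON.parse(line));
        } catch (error) {
            throw new Error(`Invalid JSON on input line ${index + 1}: ${error.message}`);
        }
    });
    return records;
}

// Assumption: each record carries the repoSlug it belongs to.
function groupByRepoSlug(records) {
    const groups = {};
    for (const record of records) {
        if (typeof record.repoSlug !== 'string' || record.repoSlug === '') {
            throw new Error('Record is missing a repoSlug');
        }
        (groups[record.repoSlug] ??= []).push(record);
    }
    return groups;
}

// Returns an exit code instead of calling process.exit(), so the core stays testable.
// `ingest` stands in for the service call and always receives viaMcp: false.
async function runBulkIngest(inputText, ingest, log = console.error) {
    try {
        const groups = groupByRepoSlug(parseRecords(inputText));
        await ingest(groups, { viaMcp: false });
        return 0;
    } catch (error) {
        log(`Bulk ingestion failed: ${error.message}`);
        return 1;
    }
}
```

A thin wrapper script would then only need to read the file or stdin, call `runBulkIngest`, and assign the result to `process.exitCode`, keeping service logic out of the build script.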
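Acceptance Criteria
- KnowledgeBaseIngestionService.ingestSourceFiles() accepts a caller-supplied viaMcp or bulkMode option; omitted option preserves current viaMcp: true behavior.
- embedChunkGroups() passes the caller-selected mode to VectorService.embed() instead of hardcoding viaMcp: true.
- […] viaMcp: true and still returns a structured volume-gate response when above aiConfig.mcpSyncMaxChunks.
- […] viaMcp: false and can ingest a fixture larger than aiConfig.mcpSyncMaxChunks without KB_SYNC_VOLUME_EXCEEDED / KB_INGEST_VOLUME_EXCEEDED.
- […] buildScripts/ai/ unless implementation V-B-A finds a better sibling-approved home) and registered in package.json.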
Out of Scope
- Reopening #11633 or changing the already-merged Phase 2A service floor except for the mode seam needed here.
- MCP facade registration itself; that is Phase 2B (#11634).
- HTTP/streaming endpoint; #11626 keeps that as a deferred deployment-shape decision.
- Search hydration Q12 and tenant-config storage Q5 unless the Phase 2C CLI needs minimal config discovery to run.
- Shadow-collection / stale-delete strategy work from Discussion #11677.
Avoided Traps
| Trap | Why rejected |
| --- | --- |
| Reusing the MCP path for bulk by raising mcpSyncMaxChunks | Weakens #10572's command-plane safety gate and turns a bulk import into a synchronous MCP freeze risk |
| Adding a second ingestion service for bulk | Splits validation/deletion/telemetry logic that #11633 intentionally centralized |
| Hardcoding viaMcp:false globally | Breaks Phase 2B's small-batch safety semantics |
| Leaving the follow-up only in PR #11678 review prose | Not graph-native enough for Phase 2C pickup; future agents would re-derive the seam |
Related
- Parent Epic: #11626
- Meta Epic: #11624
- Predecessor: #11633 / PR #11678 (merged at 700223dfccd30292c86bb630dd8a5dbc71da42d5)
- Sibling: #11634 (Phase 2B MCP facade)
- Load-bearing dependency: #10572 (MCP work-volume gate)
- Sibling Discussion: #11677 (separate stale-delete strategy axis)
Origin Session ID
d13c94dd-e721-4e28-ac9e-4d0b3c0f66de
Handoff Retrieval Hints
- query_raw_memories({query: 'PR 11678 Approve Follow-Up viaMcp bulkMode Phase 2C'})
- query_raw_memories({query: 'KnowledgeBaseIngestionService embedChunkGroups viaMcp true bulk facade'})
- ask_knowledge_base({query: 'VectorService embed viaMcp work-volume gate', type: 'src'})
- Review anchor: PR #11678 review PRR_kwDODSospM8AAAABAfuHDg