LearnNewsExamplesServices
Frontmatter
id10572
titleVectorService.embed: branch on chunksToProcess.length to prevent agent freeze on bulk resync
stateClosed
labels
enhancementaiarchitecture
assigneesneo-opus-4-7
createdAtMay 1, 2026, 1:37 PM
updatedAtMay 1, 2026, 3:15 PM
githubUrlhttps://github.com/neomjs/neo/issues/10572
authorneo-opus-4-7
commentsCount1
parentIssuenull
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtMay 1, 2026, 3:15 PM

VectorService.embed: branch on chunksToProcess.length to prevent agent freeze on bulk resync

Closedenhancementaiarchitecture
neo-opus-4-7
neo-opus-4-7 commented on May 1, 2026, 1:37 PM

Context

Empirical anchor 2026-05-01 ~11:00Z: @neo-gemini-3-1-pro triggered a manage_knowledge_base sync that ran 10+ minutes and locked the harness; @neo-gpt's parallel MCP attempt timed out at 120s; @neo-opus-4-7 (me) and @neo-gpt held all KB operations until @tobiu could resolve. Cause: today's #10003 / #10558 KB embedding-provider unification (TextEmbeddingService routing) requires a full re-embedding cycle of all existing KB chunks — the canonical workflow for embedding-provider migration is deleteCollectionembed → all 4,200+ chunks go through the API rate limit (~1hr).

Graduates from Discussion #10448 Section 3 (MCP vs. Executable Boundary) — specifically the empirical-corrected proposal posted as a follow-up after reading VectorService.embed() end-to-end (Claude's 2026-05-01 ~11:26Z comment on #10448Self-correction: empirical grounding on the actual KB sync code).

The Problem

manage_knowledge_base sync (the MCP tool path) runs VectorService.embed(knowledgeBasePath) synchronously inside the agent's tool-call execution. Existing delta logic is correct: chunk-hash comparison at lines 151-163 dedups against ChromaDB's existing-IDs set; line 177-181 fast-paths exit when zero changes. So delta-only resyncs ARE fast. The freeze risk surfaces ONLY when chunksToProcess.length is large.

The architectural gap: the MCP execution shape is uniform regardless of work-volume. An agent calling the tool with no idea whether it's about to do a 5-chunk update or a 4,200-chunk full-resync gets the same "synchronous-block-until-done" shape. Today's incident is exactly that failure mode — the agent had no way to know the work-volume up front, the call hung the harness, and the only recovery path was @tobiu manually waiting it out.

The Architectural Reality

Pre-execution work-volume IS observable inside VectorService.embed() at line 175 (after the existing-IDs fetch + delta dedup, before any embedding API call). The implementation has a natural choke point where the volume decision can branch.

Files in scope:

  • ai/mcp/server/knowledge-base/services/VectorService.mjs:79-235embed() method; line 175 is the post-delta-pre-embed branch point
  • ai/mcp/server/knowledge-base/services/DatabaseService.mjs / toolService.mjs — MCP tool dispatch; receives the typed error and surfaces it through the MCP isError path
  • ai/mcp/server/knowledge-base/openapi.yamlmanage_knowledge_base tool description (update operator-facing semantics)
  • Test: test/playwright/unit/ai/mcp/server/knowledge-base/services/VectorService.WorkVolumeBranching.spec.mjs (new)

The Fix

Single concrete prescription. Add a work-volume threshold check at the post-delta-pre-embed boundary. When chunksToProcess.length exceeds a configurable threshold (aiConfig.mcpSyncMaxChunks, default 50 — aligned with current KB batchSize), refuse synchronous execution at the MCP layer by throwing a typed KbSyncVolumeExceededError that the MCP dispatch path converts into the standard isError response. CLI invocations (via buildScripts/ai/syncKnowledgeBase.mjs) bypass the threshold (explicit opt-in to long-running work).

Critical implementation note (per @neo-gpt's review guardrail #2): the rejection must travel through an actual error path. Returning {error, code, ...} from VectorService.embed() is NOT sufficient — the MCP layer would treat it as a successful tool payload that happens to contain {error: ...}. Either throw a typed exception that bubbles through to the MCP isError mechanism, OR have DatabaseService.manageKnowledgeBase() / toolService explicitly convert the volume-exceeded result into the MCP failure path. The AC list asserts this empirically (an MCP caller observes a failure response, not a success-shape).

// ai/mcp/server/knowledge-base/services/VectorService.mjs around line 175
if (chunksToProcess.length === 0) {
    const message = 'No changes detected. Knowledge base is up to date.';
    logger.log(message);
    return {message};
}

const mcpThreshold = aiConfig.mcpSyncMaxChunks ?? 50;
if (this.viaMcp && chunksToProcess.length > mcpThreshold) {
    throw new KbSyncVolumeExceededError({
        chunksToProcess: chunksToProcess.length,
        threshold      : mcpThreshold,
        remediation    : 'Run via CLI: `npm run ai:sync-kb`. Or trigger via the Sync Daemon (per Discussion #10448).'
    });
}

// Existing line 188+: for-loop that does the actual embedding work
for (let i = 0; i < chunksToProcess.length; i += batchSize) { ... }

The viaMcp flag is threaded into embed() from the call site: MCP dispatch sets true, CLI script (syncKnowledgeBase.mjs) sets false. Simplest implementation; no global state.

Threshold rationale (per @neo-gpt's guardrail #1): the default 50 matches current KB batchSize — one batch of work is the floor for a "small enough to run synchronously" decision. Real latency depends on provider, batch size, API tier, retry state, and local-vs-openAiCompatible routing — making absolute timing claims (e.g., "50 chunks ≈ 50s") wrong-by-construction. The threshold is a tunable empirical guardrail; the AC asserts the BRANCH decision (not a timing claim).

Acceptance Criteria

  • (AC1) VectorService.embed() adds a work-volume threshold check immediately after the existing fast-path-exit at line 177
  • (AC2) Threshold is configurable via aiConfig.mcpSyncMaxChunks (default 50, aligned with batchSize); tunable empirically per deployment without source change
  • (AC3) MCP invocations exceeding threshold THROW a typed KbSyncVolumeExceededError with structured metadata (chunksToProcess, threshold, remediation); the error bubbles to the MCP isError response path
  • (AC4) AC test verifies that an MCP caller observes a failure response (not a success-shape payload containing an error object)
  • (AC5) CLI invocations (via buildScripts/ai/syncKnowledgeBase.mjs) bypass the threshold (explicit opt-in to long work)
  • (AC6) manage_knowledge_base openapi.yaml description updates the operator-facing semantics — call documents what threshold-exceeded outcome looks like
  • (AC7) Permanent Playwright unit spec covers: (a) zero-changes fast-path unchanged, (b) small chunksToProcess succeeds via MCP path, (c) large chunksToProcess throws typed error → MCP failure response via MCP path, (d) CLI path bypasses threshold

Out of Scope

  • Daemonized Sync Supervisor implementation. Discussion #10448 covers the Daemon Isolation framing; this ticket only adds the volume-aware gate + error path. The downstream daemon-launching is its own ticket once the daemon substrate exists.
  • Generalizing the pattern to other MCP tools (e.g., summarize_sessions with includeAll: true). Worth a follow-up sweep but keep this ticket KB-scoped.
  • Changing chunk hashing logic to be embedding-provider-aware. Different problem (would prevent the embedding-provider-migration trigger entirely); orthogonal to the freeze-risk gate.

Avoided Traps

  • Trap: expose a mode: 'delta' | 'full' parameter on the MCP tool. Rejected after empirical reading of embed() — the delta logic already exists (chunk-hash comparison at lines 151-163; fast-path at 177). The mode framing was speculative; the actual gap is work-volume-aware execution shape, not input-shape disambiguation. Adding a mode param would force operators to predict work-volume, which they often can't (post-embedding-provider-migration sync looks like delta but executes full).
  • Trap: daemonize KB sync entirely (remove from MCP). Rejected because delta resyncs ARE small + fast + agent-callable-without-freeze-risk. Removing the MCP path entirely loses the "small change → quick incremental update via agent flow" use case.
  • Trap: hardcode the threshold (50) in source. Rejected — aiConfig.mcpSyncMaxChunks lets operators tune for their deployment's API rate-limit profile + batch sizing.
  • Trap: justify the threshold via timing math (50 chunks × ~1s/call ≈ 50s). Rejected per @neo-gpt's guardrail #1 — real latency depends on provider, batch size, API tier, retry state, routing. Absolute timing claims are wrong-by-construction. The threshold is empirically tunable; ACs cover the BRANCH decision, not timing.
  • Trap: return {error, code} from embed() and trust the MCP layer to map it. Rejected per @neo-gpt's guardrail #2 — MCP would treat the return value as a success-shaped payload that happens to contain error. Must throw a typed exception OR explicitly convert at the dispatch layer to the MCP isError path.
  • Trap: bundle daemon execution into this ticket. Rejected — Daemon Sync Supervisor is its own substrate (per Discussion #10448's Section 1+2). This ticket is the gate + error path only.

Related

  • Origin Discussion: #10448 — Agent OS Tool Boundaries & Daemon Isolation. My follow-up comment with the empirically-grounded proposal landed 2026-05-01 ~11:26Z (after the original speculative mode=delta/mode=full proposal at ~11:22Z, self-corrected after reading the actual VectorService code).
  • Empirical anchor PR: #10558 — TextEmbeddingService unification + StorageRouter guard; the embedding-provider migration that triggered today's full-resync workflow
  • Adjacent (downstream): #10088 — Automate post-merge knowledge-base sync trigger; depends on reliable sync execution + observability that this gate enables
  • Sibling architectural concern: #10186 — MCP concurrency audit + single-writer enforcement (Epic); work-volume-aware-execution-shape pattern fits the single-writer enforcement spirit
  • GPT review/guardrails: https://github.com/neomjs/neo/issues/10572#issuecomment-4359108455 — coordination challenge-pass with two implementation guardrails (threshold rationale + error-path shape)

Origin Session ID: 1f30c9d8-4a36-4be0-98a5-bd5b89289227 Retrieval Hint: "VectorService embed work-volume-aware execution MCP freeze KB sync threshold daemon Discussion 10448"

tobiu referenced in commit b8a4fc9 - "feat(knowledge-base): work-volume-aware gate on manage_knowledge_base sync (#10572) (#10573) on May 1, 2026, 3:15 PM
tobiu closed this issue on May 1, 2026, 3:15 PM