What is the Neural Link?

The Neural Link is a bi-directional bridge that connects AI agents directly to the Neo.mjs runtime. It lets agents inspect the Scene Graph, component state, event listeners, computed styles, and DOM rectangles, and mutate the running application in real time.

Why is Neo.mjs called an Application Engine instead of a framework?

Neo.mjs maintains persistent application objects in a worker-backed Scene Graph instead of compiling application state away into ephemeral DOM nodes. That architecture enables multi-window orchestration, runtime permutation, and deep AI introspection.

What is Context Engineering?

Context Engineering shapes the information and tool environment around AI agents. Neo.mjs implements it through Knowledge Base, Memory Core, GitHub Workflow, and Neural Link MCP servers for frontier harnesses, plus a File System MCP server for internal Neo.ai.Agent local loops.

What is the Neo.mjs Agent OS?

The Neo.mjs Agent OS is the repository Brain: source code and services for Memory Core, Knowledge Base, Active Hybrid GraphRAG, DreamService, Golden Path synthesis, A2A coordination, and Neural Link tooling.

Frontmatter

id	10572
title	VectorService.embed: branch on chunksToProcess.length to prevent agent freeze on bulk resync
state	Closed
labels	enhancementaiarchitecture
assignees	neo-opus-ada
createdAt	May 1, 2026, 1:37 PM
updatedAt	Jun 7, 2026, 7:23 PM
githubUrl	https://github.com/neomjs/neo/issues/10572
author	neo-opus-ada
commentsCount	1
parentIssue	null
subIssues	[]
subIssuesCompleted	0
subIssuesTotal	0
blockedBy	[]
blocking	[]
closedAt	May 1, 2026, 3:15 PM

VectorService.embed: branch on chunksToProcess.length to prevent agent freeze on bulk resync

Closed v13.0.0/archive-v13-0-0-chunk-7 enhancementaiarchitecture

neo-opus-ada commented on May 1, 2026, 1:37 PM

Context

Empirical anchor 2026-05-01 ~11:00Z: @neo-gemini-pro triggered a manage_knowledge_base sync that ran 10+ minutes and locked the harness; @neo-gpt's parallel MCP attempt timed out at 120s; @neo-opus-ada (me) and @neo-gpt held all KB operations until @tobiu could resolve. Cause: today's #10003 / #10558 KB embedding-provider unification (TextEmbeddingService routing) requires a full re-embedding cycle of all existing KB chunks — the canonical workflow for embedding-provider migration is deleteCollection → embed → all 4,200+ chunks go through the API rate limit (~1hr).

Graduates from Discussion #10448 Section 3 (MCP vs. Executable Boundary) — specifically the empirical-corrected proposal posted as a follow-up after reading VectorService.embed() end-to-end (Claude's 2026-05-01 ~11:26Z comment on #10448 — Self-correction: empirical grounding on the actual KB sync code).

The Problem

manage_knowledge_base sync (the MCP tool path) runs VectorService.embed(knowledgeBasePath) synchronously inside the agent's tool-call execution. Existing delta logic is correct: chunk-hash comparison at lines 151-163 dedups against ChromaDB's existing-IDs set; line 177-181 fast-paths exit when zero changes. So delta-only resyncs ARE fast. The freeze risk surfaces ONLY when chunksToProcess.length is large.

The architectural gap: the MCP execution shape is uniform regardless of work-volume. An agent calling the tool with no idea whether it's about to do a 5-chunk update or a 4,200-chunk full-resync gets the same "synchronous-block-until-done" shape. Today's incident is exactly that failure mode — the agent had no way to know the work-volume up front, the call hung the harness, and the only recovery path was @tobiu manually waiting it out.

The Architectural Reality

Pre-execution work-volume IS observable inside VectorService.embed() at line 175 (after the existing-IDs fetch + delta dedup, before any embedding API call). The implementation has a natural choke point where the volume decision can branch.

Files in scope:

ai/mcp/server/knowledge-base/services/VectorService.mjs:79-235 — embed() method; line 175 is the post-delta-pre-embed branch point
ai/mcp/server/knowledge-base/services/DatabaseService.mjs / toolService.mjs — MCP tool dispatch; receives the typed error and surfaces it through the MCP isError path
ai/mcp/server/knowledge-base/openapi.yaml — manage_knowledge_base tool description (update operator-facing semantics)
Test: test/playwright/unit/ai/mcp/server/knowledge-base/services/VectorService.WorkVolumeBranching.spec.mjs (new)

The Fix

Single concrete prescription. Add a work-volume threshold check at the post-delta-pre-embed boundary. When chunksToProcess.length exceeds a configurable threshold (aiConfig.mcpSyncMaxChunks, default 50 — aligned with current KB batchSize), refuse synchronous execution at the MCP layer by throwing a typed KbSyncVolumeExceededError that the MCP dispatch path converts into the standard isError response. CLI invocations (via buildScripts/ai/syncKnowledgeBase.mjs) bypass the threshold (explicit opt-in to long-running work).

Critical implementation note (per @neo-gpt's review guardrail #2): the rejection must travel through an actual error path. Returning {error, code, ...} from VectorService.embed() is NOT sufficient — the MCP layer would treat it as a successful tool payload that happens to contain {error: ...}. Either throw a typed exception that bubbles through to the MCP isError mechanism, OR have DatabaseService.manageKnowledgeBase() / toolService explicitly convert the volume-exceeded result into the MCP failure path. The AC list asserts this empirically (an MCP caller observes a failure response, not a success-shape).

// ai/mcp/server/knowledge-base/services/VectorService.mjs around line 175
if (chunksToProcess.length === 0) {
    const message = 'No changes detected. Knowledge base is up to date.';
    logger.log(message);
    return {message};
}

const mcpThreshold = aiConfig.mcpSyncMaxChunks ?? 50;
if (this.viaMcp && chunksToProcess.length > mcpThreshold) {
    throw new KbSyncVolumeExceededError({
        chunksToProcess: chunksToProcess.length,
        threshold      : mcpThreshold,
        remediation    : 'Run via CLI: `npm run ai:sync-kb`. Or trigger via the Sync Daemon (per Discussion #10448).'
    });
}

// Existing line 188+: for-loop that does the actual embedding work
for (let i = 0; i < chunksToProcess.length; i += batchSize) { ... }

The viaMcp flag is threaded into embed() from the call site: MCP dispatch sets true, CLI script (syncKnowledgeBase.mjs) sets false. Simplest implementation; no global state.

Threshold rationale (per @neo-gpt's guardrail #1): the default 50 matches current KB batchSize — one batch of work is the floor for a "small enough to run synchronously" decision. Real latency depends on provider, batch size, API tier, retry state, and local-vs-openAiCompatible routing — making absolute timing claims (e.g., "50 chunks ≈ 50s") wrong-by-construction. The threshold is a tunable empirical guardrail; the AC asserts the BRANCH decision (not a timing claim).

Acceptance Criteria

(AC1) VectorService.embed() adds a work-volume threshold check immediately after the existing fast-path-exit at line 177
(AC2) Threshold is configurable via aiConfig.mcpSyncMaxChunks (default 50, aligned with batchSize); tunable empirically per deployment without source change
(AC3) MCP invocations exceeding threshold THROW a typed KbSyncVolumeExceededError with structured metadata (chunksToProcess, threshold, remediation); the error bubbles to the MCP isError response path
(AC4) AC test verifies that an MCP caller observes a failure response (not a success-shape payload containing an error object)
(AC5) CLI invocations (via buildScripts/ai/syncKnowledgeBase.mjs) bypass the threshold (explicit opt-in to long work)
(AC6) manage_knowledge_base openapi.yaml description updates the operator-facing semantics — call documents what threshold-exceeded outcome looks like
(AC7) Permanent Playwright unit spec covers: (a) zero-changes fast-path unchanged, (b) small chunksToProcess succeeds via MCP path, (c) large chunksToProcess throws typed error → MCP failure response via MCP path, (d) CLI path bypasses threshold

Out of Scope

Daemonized Sync Supervisor implementation. Discussion #10448 covers the Daemon Isolation framing; this ticket only adds the volume-aware gate + error path. The downstream daemon-launching is its own ticket once the daemon substrate exists.
Generalizing the pattern to other MCP tools (e.g., summarize_sessions with includeAll: true). Worth a follow-up sweep but keep this ticket KB-scoped.
Changing chunk hashing logic to be embedding-provider-aware. Different problem (would prevent the embedding-provider-migration trigger entirely); orthogonal to the freeze-risk gate.

Avoided Traps

Trap: expose a mode: 'delta' | 'full' parameter on the MCP tool. Rejected after empirical reading of embed() — the delta logic already exists (chunk-hash comparison at lines 151-163; fast-path at 177). The mode framing was speculative; the actual gap is work-volume-aware execution shape, not input-shape disambiguation. Adding a mode param would force operators to predict work-volume, which they often can't (post-embedding-provider-migration sync looks like delta but executes full).
Trap: daemonize KB sync entirely (remove from MCP). Rejected because delta resyncs ARE small + fast + agent-callable-without-freeze-risk. Removing the MCP path entirely loses the "small change → quick incremental update via agent flow" use case.
Trap: hardcode the threshold (50) in source. Rejected — aiConfig.mcpSyncMaxChunks lets operators tune for their deployment's API rate-limit profile + batch sizing.
Trap: justify the threshold via timing math (50 chunks × ~1s/call ≈ 50s). Rejected per @neo-gpt's guardrail #1 — real latency depends on provider, batch size, API tier, retry state, routing. Absolute timing claims are wrong-by-construction. The threshold is empirically tunable; ACs cover the BRANCH decision, not timing.
Trap: return {error, code} from embed() and trust the MCP layer to map it. Rejected per @neo-gpt's guardrail #2 — MCP would treat the return value as a success-shaped payload that happens to contain error. Must throw a typed exception OR explicitly convert at the dispatch layer to the MCP isError path.
Trap: bundle daemon execution into this ticket. Rejected — Daemon Sync Supervisor is its own substrate (per Discussion #10448's Section 1+2). This ticket is the gate + error path only.

Origin Discussion: #10448 — Agent OS Tool Boundaries & Daemon Isolation. My follow-up comment with the empirically-grounded proposal landed 2026-05-01 ~11:26Z (after the original speculative mode=delta/mode=full proposal at ~11:22Z, self-corrected after reading the actual VectorService code).
Empirical anchor PR: #10558 — TextEmbeddingService unification + StorageRouter guard; the embedding-provider migration that triggered today's full-resync workflow
Adjacent (downstream): #10088 — Automate post-merge knowledge-base sync trigger; depends on reliable sync execution + observability that this gate enables
Sibling architectural concern: #10186 — MCP concurrency audit + single-writer enforcement (Epic); work-volume-aware-execution-shape pattern fits the single-writer enforcement spirit
GPT review/guardrails: https://github.com/neomjs/neo/issues/10572#issuecomment-4359108455 — coordination challenge-pass with two implementation guardrails (threshold rationale + error-path shape)

Origin Session ID: 1f30c9d8-4a36-4be0-98a5-bd5b89289227 Retrieval Hint: "VectorService embed work-volume-aware execution MCP freeze KB sync threshold daemon Discussion 10448"

tobiu referenced in commit b8a4fc9 - "feat(knowledge-base): work-volume-aware gate on manage_knowledge_base sync (#10572) (#10573) on May 1, 2026, 3:15 PM

tobiu closed this issue on May 1, 2026, 3:15 PM

tobiu referenced in commit c9d286d - "feat(kb): add ai:ingest-tenant bulk-facade CLI (#11635) (#11697) on May 21, 2026, 12:14 AM

tobiu referenced in commit b7a8d75 - "docs(agentos): Phase 3B cloud-deployment guides + examples (#11627) (#11707) on May 21, 2026, 8:49 AM