Frontmatter
id: 10915
title: MCP servers fail in deployed CI: singleton-transport reconnect + auth env var mismatch
state: Closed
labels: bug, ai, regression, architecture
assignees: neo-opus-4-7
createdAt: May 7, 2026, 7:00 PM
updatedAt: May 9, 2026, 11:15 PM
githubUrl: https://github.com/neomjs/neo/issues/10915
author: neo-opus-4-7
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 7, 2026, 7:22 PM

MCP servers fail in deployed CI: singleton-transport reconnect + auth env var mismatch

neo-opus-4-7 commented on May 7, 2026, 7:00 PM

Context

Surfaced 2026-05-07 by Lane C #10899 integration row CI on rebased head cd6ab05c6 after #10914 (bash TCP probe) cleared the Chroma healthcheck substrate layer. Empirical diagnosis from CI logs by @neo-gpt (A2A 2026-05-07T16:55:01Z, broadcast MESSAGE:45cd4186).

The Problems (two distinct bugs, surfaced together)

Bug 1: TransportService singleton-transport reconnect

File: ai/mcp/server/shared/services/TransportService.mjs:153

Symptom:

Error: Already connected to a transport. Call close() before connecting to a new transport,
or use a separate Protocol instance per connection.

Root cause: TransportService creates a new StreamableHTTPServerTransport per session (correct for streamable-HTTP MCP) but reconnects the singleton server.mcpServer / Protocol instance via server.mcpServer.connect(transport) for every new client. The MCP SDK's Protocol class is single-transport: the first client consumes it, subsequent clients hit the "Already connected" guard → uncaught exception → Express HTML 500 page.
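The failure mode can be reproduced in miniature. The sketch below does not use the MCP SDK: `FakeProtocol` and `connectClients` are hypothetical stand-ins that model only the single-transport guard described above, and `connect` is synchronous here for brevity (the SDK's is asynchronous).

```javascript
// Hypothetical stand-in for the SDK's Protocol class.
// Only the "one transport per Protocol" guard is modeled.
class FakeProtocol {
    #transport = null;

    connect(transport) {
        if (this.#transport) {
            throw new Error(
                'Already connected to a transport. Call close() before connecting ' +
                'to a new transport, or use a separate Protocol instance per connection.'
            );
        }
        this.#transport = transport;
    }

    close() {
        this.#transport = null;
    }
}

// Mirrors the failing pattern: one shared protocol, a new transport per client.
function connectClients(sharedProtocol, clientCount) {
    const results = [];

    for (let i = 1; i <= clientCount; i++) {
        const transport = {sessionId: `session-${i}`}; // placeholder transport object

        try {
            sharedProtocol.connect(transport);
            results.push({client: i, ok: true});
        } catch (error) {
            results.push({client: i, ok: false, message: error.message});
        }
    }

    return results;
}

const outcome = connectClients(new FakeProtocol(), 2);

// The first client connects; the second one trips the guard.
console.log(outcome.map(r => `client ${r.client}: ${r.ok ? 'connected' : 'rejected'}`).join('\n'));
```

In the deployed server the thrown error is not caught, which is why the second client sees an Express HTML 500 page rather than a structured rejection.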

Empirical anchors: all 5 integration specs fail at the second /mcp POST. Each spec creates fresh client connections; the first connection to the server succeeds, and the next client to connect hits the guard:

  • healthcheck.spec.mjs:9 — KB+MC healthcheck tool call → 500
  • CrossTenantIsolation.spec.mjs:20 — alice/bob clients (2nd client = first failure) → 500
  • HeartbeatPropagation.spec.mjs:11 — repeated assertSustainedHealth probes → 500
  • healthcheck.spec.mjs:40 — sustained-liveness composability → 500
  • AuthRejection.spec.mjs:11 — first client (without identity) succeeds incorrectly (compounded by Bug 2)

Bug 2: Auth env var name mismatch

Compose configuration: ai/deploy/docker-compose.test.yml for mc-server and kb-server:

- NEO_AUTH_TRUST_PROXY_IDENTITY=true

Config template binding: the auth.trustProxyIdentity config reads from AUTH_TRUST_PROXY_IDENTITY (NO NEO_ prefix) per the canonical template in ai/mcp/server/<server>/config.template.mjs.

Result: the trust-proxy-identity gate never activates under the Docker configuration. The auth-rejection layer that should return 401 for requests missing X-PREFERRED-USERNAME never fires, so the AuthRejection test's expected rejection becomes an unexpected success.

Empirical anchor: AuthRejection.spec.mjs:32 expect(rejectionError).toBeTruthy() failure — rejectionError is undefined because the no-identity client succeeded (gate not active).
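A minimal sketch of the mismatch, under stated assumptions: `readFlag` and `authStatus` are hypothetical helpers, not the project's actual config binding or middleware, and the 200 pass-through status is illustrative. Only the two env var names, the header name, and the 401 come from this ticket.

```javascript
// Hypothetical helper: parses a boolean flag from an env-like object.
function readFlag(env, name) {
    return env[name] === 'true';
}

// Hypothetical gate: returns 401 when the gate is enabled and the identity
// header is missing; otherwise lets the request through (200 is illustrative).
function authStatus(trustProxyIdentity, headers) {
    if (trustProxyIdentity && !headers['x-preferred-username']) {
        return 401;
    }
    return 200;
}

// What docker-compose.test.yml sets for mc-server and kb-server.
const composeEnv = {NEO_AUTH_TRUST_PROXY_IDENTITY: 'true'};

// A request without the X-PREFERRED-USERNAME header (Node lowercases header names).
const anonymousRequest = {};

// The binding as described in this ticket reads the unprefixed name.
const unprefixed = readFlag(composeEnv, 'AUTH_TRUST_PROXY_IDENTITY');

// Reading the NEO_-prefixed name matches what compose actually sets.
const prefixed = readFlag(composeEnv, 'NEO_AUTH_TRUST_PROXY_IDENTITY');

console.log(authStatus(unprefixed, anonymousRequest)); // gate inactive, request passes
console.log(authStatus(prefixed, anonymousRequest));   // gate active, request rejected
```

With the unprefixed lookup the flag silently evaluates to false, which matches the observed behavior: no error at startup, only a missing rejection at request time.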

The Architectural Reality

  • MCP SDK's StreamableHTTPServerTransport model: each client gets its own transport, but the McpServer/Protocol can either be (a) singleton with multiplexing OR (b) per-session new instance. Current TransportService does NEITHER — it creates new transports but reuses the singleton Protocol, hitting the SDK's "one transport per Protocol" invariant.
  • The auth env var mismatch is a downstream effect of the env-var-namespacing audit pattern (see #10884 for prior NEO_ prefix rationalization). Compose was updated to canonical NEO_* but the binding in config.template.mjs was missed.

The Fix (two prescriptions; can be 1 or 2 PRs)

Bug 1 fix (TransportService refactor):

Either:

  • (a) Per-session new McpServer instance (one Protocol per client). Session map Map<sessionId, {server, transport}>. Higher memory but simpler invariant.
  • (b) Singleton McpServer + multi-transport multiplexing. Requires cooperation from MCP SDK Protocol class — may not be supported.

Recommend (a) unless SDK provides multiplexing primitive. Per-session McpServer is the canonical streamable-HTTP server pattern.
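Option (a) might be sketched as follows. This is an illustration under assumptions, not the merged implementation: `FakeMcpServer` and `SessionRegistry` are hypothetical names, the transport is a placeholder object, and `connect` is synchronous for brevity. The point is the invariant: each session id maps to its own `{server, transport}` pair, so no server instance ever sees a second transport.

```javascript
// Hypothetical stand-in for a server whose protocol accepts a single transport.
class FakeMcpServer {
    constructor() {
        this.transport = null;
    }

    connect(transport) {
        if (this.transport) {
            throw new Error('Already connected to a transport.');
        }
        this.transport = transport;
    }

    close() {
        this.transport = null;
    }
}

// Hypothetical registry implementing Map<sessionId, {server, transport}>.
class SessionRegistry {
    constructor(createServer) {
        this.createServer = createServer; // factory that returns a fresh server per session
        this.sessions = new Map();
    }

    // Returns the existing entry for a known session, or creates a new pair.
    open(sessionId) {
        if (this.sessions.has(sessionId)) {
            return this.sessions.get(sessionId);
        }

        const server = this.createServer();
        const transport = {sessionId}; // placeholder transport object

        server.connect(transport);

        const entry = {server, transport};
        this.sessions.set(sessionId, entry);
        return entry;
    }

    // Closes the server and drops the entry so memory is released per session.
    close(sessionId) {
        const entry = this.sessions.get(sessionId);

        if (!entry) {
            return false;
        }

        entry.server.close();
        this.sessions.delete(sessionId);
        return true;
    }
}

const registry = new SessionRegistry(() => new FakeMcpServer());
const alice = registry.open('alice');
const bob = registry.open('bob');

// Two sessions, two distinct server instances, no guard violation.
console.log(alice.server !== bob.server, registry.sessions.size);
```

The memory cost noted for option (a) shows up here as one server instance per live session, which is why the registry deletes the entry on close.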

Bug 2 fix (auth env alignment):

Either:

  • (a) Update ai/mcp/server/<server>/config.template.mjs to bind from NEO_AUTH_TRUST_PROXY_IDENTITY.
  • (b) Change ai/deploy/docker-compose.test.yml to set AUTH_TRUST_PROXY_IDENTITY (drop NEO_ prefix).

Recommend (a): the NEO_* prefix is the canonical namespace per #10884; the binding template is the bug.

Acceptance Criteria

  • Bug 1: TransportService refactored so concurrent client connections to /mcp each get their own McpServer/Protocol instance (no singleton reuse).
  • Bug 2: auth.trustProxyIdentity reads from NEO_AUTH_TRUST_PROXY_IDENTITY env var (canonical NEO_* prefix per #10884).
  • All 5 integration specs pass under npm run test-integration once Lane C #10899 re-runs CI on a rebase post-merge.

Out of Scope

  • Refactoring StdioServerTransport (Stdio sessions are inherently 1:1 with a server instance — no multiplexing concern).
  • Audit of other env var name mismatches between compose and config templates (file a follow-up if the Lane C unit-test substrate audit doesn't already cover them).
  • Performance tuning of per-session McpServer creation (file follow-up if profiling shows it's a hotspot).

Related

  • Surfacing context: Lane C CI run 25509545707 (job 74864331326 on PR #10899 head cd6ab05c6) — full server-side log evidence pulled by @neo-gpt.
  • Predecessor in this lineage: Chroma healthcheck substrate fixes (#10902/#10904 → #10908/#10909 → #10911/#10912 → #10913/#10914). Bash TCP probe cleared the compose-up healthcheck layer; this ticket is the next substrate layer (deployed MCP server itself).
  • Related env-var-namespacing work: #10884 (NEO_ prefix canonicalization for SessionService env vars).
  • Author of empirical diagnosis: @neo-gpt (#10901 Lane A author, has #10895 trust-proxy-identity context).
  • In-flight implementation: @neo-gemini-3-1-pro (per @tobiu A2A 2026-05-07T16:58Z) — "knee deep into coding" without ticket; this ticket is being filed concurrently.

Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571

Retrieval Hint: query_raw_memories(query="MCP TransportService singleton reconnect StreamableHTTPServerTransport auth env mismatch deployed CI Lane C")

tobiu referenced in commit 3a6a335 - "fix(mcp): resolve TransportService concurrency and auth config template prefix (#10915) (#10916)" on May 7, 2026, 7:22 PM
tobiu closed this issue on May 7, 2026, 7:22 PM