Frontmatter
id: 10932
title: TransportService.setup() races Express bind — fetch fires before listener accepts
state: Closed
labels: bug, ai, architecture
assignees: neo-opus-4-7
createdAt: May 7, 2026, 11:33 PM
updatedAt: May 9, 2026, 11:15 PM
githubUrl: https://github.com/neomjs/neo/issues/10932
author: neo-opus-4-7
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 7, 2026, 11:39 PM

TransportService.setup() races Express bind — fetch fires before listener accepts

Closed | bug, ai, architecture
neo-opus-4-7
neo-opus-4-7 commented on May 7, 2026, 11:33 PM

Context

Surfaced 2026-05-07 during Bucket G G5 flake triage at #10924 (G5#4 row in triage matrix). One of 4 flaky tests from CI run 25514310691: test/playwright/unit/ai/mcp/server/shared/services/TransportService.spec.mjs:23 — onsessionclosed hook removes transport and calls server.onSessionClosed via actual HTTP request. Failure surface: TypeError: Cannot read properties of undefined (reading 'get') on initResponse.headers.get(...).

Root cause inspection (per G5 triage): port 3125 is unique to this spec (no cross-spec collision, verified via grep). The most likely failure mode is that TransportService.setup() returns before Express has finished binding to port 3125: app.listen() returns synchronously without waiting for the listener to reach its accepting state, so the subsequent fetch() call races the bind.

Filed retroactively to track G5#4 work that landed via PR #10930 commit bf894b00b (authored by @neo-gemini-3-1-pro). The fix shape was correct; this ticket creates the missing close-target so the work has proper graph attribution.

The Problem

TransportService.setup() invokes Express app.listen(port, callback) and returns from the async function before the listener has actually entered the LISTENING state. Under load or under fullyParallel test interleaving, the calling spec's subsequent fetch() against that port can fire before the bind completes, so the response object lacks the expected shape (headers is undefined).

The onsessionclosed test reproducer:

  1. Calls await TransportService.setup({...}) → returns immediately, listener pending
  2. Calls await fetch('http://localhost:3125/mcp', {...POST init...}) → races
  3. Reads initResponse.headers.get('mcp-session-id') → headers undefined when fetch lost the race

This is intermittent (passes on retry) because under low load the bind usually completes within a single turn of the Node.js event loop, before fetch fires.
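The ordering behind the race can be observed in isolation. The sketch below is an illustration under stated assumptions, not code from the repository: it uses node:http directly (Express's app.listen() is a convenience wrapper around http.createServer(app).listen(...)), binds to port 0 instead of 3125, and the name observeListenOrdering is hypothetical. It records that listen() returns to the caller before the listen-callback runs.

```javascript
// Hypothetical helper: records when listen() returns relative to its callback.
async function observeListenOrdering() {
    const http   = await import('node:http');
    const events = [];
    const server = http.createServer((req, res) => res.end('ok'));

    await new Promise((resolve, reject) => {
        server.once('error', reject);

        // Port 0 asks the OS for any free port (the spec in this issue uses 3125).
        server.listen(0, () => {
            events.push('listen callback fired');
            resolve();
        });

        // listen() has already returned here, but its callback has not run yet.
        events.push('listen() returned');
    });

    // Release the port so repeated runs do not leak listeners.
    await new Promise(resolve => server.close(resolve));

    return events;
}
```

Awaiting observeListenOrdering() yields 'listen() returned' followed by 'listen callback fired'. A setup() that returns right after calling listen() therefore hands control back before the readiness signal has been delivered.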

The Architectural Reality

  • ai/mcp/server/shared/services/TransportService.mjs #setup() method
  • Express app.listen() returns an http.Server instance synchronously; the 'listening' event (or the listen-callback) fires when the bind is complete
  • Test consumer: test/playwright/unit/ai/mcp/server/shared/services/TransportService.spec.mjs:23
  • No destroy() teardown method existed previously, so HTTP servers leaked across test runs

The Fix

Wrap app.listen(port, callback) in a Promise that resolves only after the listen-callback fires (or rejects on 'error' event). Capture the returned http.Server instance on this.httpServer for later teardown. Add a destroy() method that closes the HTTP server cleanly.

This guarantees await TransportService.setup({...}) only returns once the listener is ready to accept connections, eliminating the race.
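A minimal sketch of that wrapping, assuming only that app exposes listen(port, callback) and returns an http.Server (true for an Express app and for a plain http.Server). The helper name listenAsync is hypothetical, and this is not the code that landed in PR #10930.

```javascript
// Hypothetical helper: resolves with the http.Server once the listen-callback fires,
// rejects if the server emits 'error' first (for example EADDRINUSE).
function listenAsync(app, port) {
    return new Promise((resolve, reject) => {
        const server = app.listen(port, err => {
            // A plain http.Server calls this with no argument. Assumption: some
            // framework versions may pass a startup error here instead.
            if (err) {
                reject(err);
                return;
            }

            // Readiness reached: stop treating later 'error' events as startup failures.
            server.off('error', reject);
            resolve(server);
        });

        server.once('error', reject);
    });
}
```

Inside an async setup(), a caller could then write this.httpServer = await listenAsync(app, port), so setup() does not resolve until the listener is ready, and a bind failure surfaces as a rejected promise instead of an unhandled 'error' event.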

Acceptance Criteria

  • (AC1) TransportService.setup() wraps app.listen() in a Promise that resolves on the listen-callback (port bound + listening) and rejects on 'error' event
  • (AC2) this.httpServer captures the listener instance for later teardown
  • (AC3) TransportService.destroy() closes the HTTP server cleanly (idempotent on missing server)
  • (AC4) Unit test asserts await TransportService.setup({...}) returns only after the listener is accepting (probe via test-side fetch immediately after setup resolves; expect 200/406, not connection-refused)
  • (AC5) Existing TransportService.spec onsessionclosed test passes deterministically (no longer flaky)
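AC3 and AC4 can be sketched as follows. ServerHolder and probe are hypothetical names used only for illustration; the real service's destroy() may differ. The sketch assumes a Node.js version with a global fetch, and it uses server.closeAllConnections() only where the runtime provides it.

```javascript
// Hypothetical stand-in mirroring the this.httpServer field named in AC2.
class ServerHolder {
    httpServer = null;

    // AC3 sketch: returns false when there is nothing to close, so repeated calls are safe.
    async destroy() {
        const server = this.httpServer;

        if (!server) {
            return false;
        }

        this.httpServer = null;

        await new Promise((resolve, reject) => {
            server.close(err => (err ? reject(err) : resolve()));
            // Drop lingering keep-alive sockets where supported, so close() can finish promptly.
            server.closeAllConnections?.();
        });

        return true;
    }
}

// AC4 sketch: any HTTP status proves the listener accepted the connection;
// a refused connection makes fetch() reject instead of resolving.
async function probe(port) {
    const response = await fetch(`http://127.0.0.1:${port}/mcp`, {method: 'POST'});
    return response.status;
}
```

In a spec, calling probe(port) immediately after setup() resolves and checking for a status such as 200 or 406 matches the intent of AC4, and calling destroy() in the teardown hook keeps the port from leaking into later tests.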

Out of Scope

  • Migrating other services that currently use raw app.listen() — file as separate ticket if discovered
  • Adding port-conflict retry logic — out of substrate-fix scope
  • Replacing Express with a different HTTP substrate — architectural rewrite, not relevant

Avoided Traps

  • Polling for bind via setTimeout loop: rejected — race-prone, adds non-determinism. The Express callback / 'listening' event is the canonical signal.
  • Fixed setTimeout(N) before resolve: rejected — slows fast paths, doesn't guarantee correctness on slow CI hosts.
  • Bind-status check via separate connect() probe: rejected — adds spec complexity; the listen-callback is the canonical signal.

Related

  • Implementation landed: PR #10930 commit bf894b00b (@neo-gemini-3-1-pro authored, bundled with #10931 fix)
  • Triage source: #10924 G5#4 row
  • Bucket G epic: #10924
  • Sibling flake patterns: G5#1 (DiscussionService — resolved via #10929), G5#2 (KBRecorderService — deferred), G5#3 (PermissionService — deferred)
  • Empirical anchor: CI run 25514310691

Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571

Retrieval Hint: query_raw_memories(query="TransportService.setup bind race app.listen Promise listener G5#4 #10783 #10931 PR #10930")

tobiu referenced in commit 21d5cdd - "feat(memory-core): healthcheck features.wake observability block (#10783) (#10930)" on May 7, 2026, 11:39 PM
tobiu closed this issue on May 7, 2026, 11:39 PM