LearnNewsExamplesServices
Frontmatter
id10935
titleTransportService.spec:23 still flakes after #10930 bind-race fix — second race surface
stateClosed
labels
bugaitestingneeds-re-triage
assignees[]
createdAtMay 8, 2026, 12:07 AM
updatedAtMay 12, 2026, 4:09 AM
githubUrlhttps://github.com/neomjs/neo/issues/10935
authorneo-opus-4-7
commentsCount2
parentIssue10924
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtMay 11, 2026, 1:33 AM

TransportService.spec:23 still flakes after #10930 bind-race fix — second race surface

Closedbugaitestingneeds-re-triage
neo-opus-4-7
neo-opus-4-7 commented on May 8, 2026, 12:07 AM

Context

Surfaced 2026-05-08 during PR #10933 (Phase 3 unit-row re-add) CI run 25524203756. The test test/playwright/unit/ai/mcp/server/shared/services/TransportService.spec.mjs:23 — onsessionclosed hook removes transport and calls server.onSessionClosed via actual HTTP request flakes with:

TypeError: Cannot read properties of undefined (reading 'get')
  at: const sessionId = initResponse.headers.get('mcp-session-id');  // line 69

This is the SAME flake originally classified as G5#4 in #10924 triage, and supposedly resolved by #10932 → PR #10930 commit bf894b00b (Promise-wrapped app.listen() + finally restore + new deterministic regression test in setup() awaits listener accept-state before resolving (#10932)).

The fix did land in dev — verified ai/mcp/server/shared/services/TransportService.mjs:224-241 has the new Promise wrap + httpServer capture + destroy method. Yet the flake persists in CI (workers:1 substrate).

The Problem

The setup() await-bind fix was correct for the bind-race surface I tested (single fetch immediately after setup → no ECONNREFUSED). My new regression test [TransportService.spec:94 setup() awaits listener accept-state ... (#10932)](commit 79fa24e67) passes deterministically.

But the original flake's surface is different: it does await TransportService.setup(...) THEN await fetch(/mcp, POST init) and expects initResponse.headers to be defined. Empirically initResponse.headers is undefined — meaning either:

  • initResponse is not a real Response object (impossible per fetch spec UNLESS fetch is mocked/replaced)
  • OR there's a different race surface where the listener accepts the connection but Express middleware chain isn't fully wired before responding

Hypothesis: under workers:1, a sibling spec runs first and replaces global.fetch (or some other process-level state) without restoring it. When TransportService.spec:23 runs and calls fetch, it gets a stub/mock that returns a Response-like object without headers.

The Architectural Reality

  • test/playwright/unit/ai/mcp/server/shared/services/TransportService.spec.mjs:23-92 — the flaky test (uses port 3125, real fetch)
  • test/playwright/unit/ai/mcp/server/shared/services/TransportService.spec.mjs:94+ — my new regression-guard test (uses port 3126, real fetch — passes deterministically)
  • ai/mcp/server/shared/services/TransportService.mjs:224-241 — Promise-wrapped app.listen() (already fixed per #10932)

The asymmetry between the two tests: original uses port 3125 + 50-line POST-DELETE init; my new test uses port 3126 + simple GET. Either the POST-DELETE flow itself has a race (Express middleware setup race on /mcp POST handler) OR a sibling spec is interfering.

The Fix (TBD via investigation)

Investigation steps:

  1. Run TransportService.spec.mjs in isolation with workers:1 — does it flake? If YES → race intra-spec (Express middleware not fully wired). If NO → cross-spec interference.
  2. If cross-spec: identify which sibling replaces global.fetch or holds state on port 3125
  3. If intra-spec: add await new Promise(r => setTimeout(r, 50)) post-setup AS DIAGNOSTIC ONLY to confirm timing-shape, then look for the actual unwired-middleware lag

Acceptance Criteria

  • (AC1) Empirically reproduce the flake locally with workers:1
  • (AC2) Categorize: intra-spec race vs cross-spec interference
  • (AC3) Implement targeted fix; do NOT introduce setTimeout-shaped band-aids
  • (AC4) TransportService.spec:23 runs deterministically across 5 consecutive npm run test-unit invocations on CI substrate

Out of Scope

  • Re-architecting Express middleware setup (out of substrate-fix scope)
  • Refactoring the test to avoid POST-DELETE flow (the test SHAPE is correct; it tests real onsessionclosed hook semantics)

Avoided Traps

  • Skip-guard via NEO_TEST_SKIP_CI: applied as immediate ship-the-PR move on PR #10933, NOT the proper fix. The skip preserves the unit-suite gate while this investigation lands.
  • Bumping retries to 5: papers over without addressing root cause.

Related

Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571

Retrieval Hint: query_raw_memories(query="TransportService bind race residual second surface workers 1 flake PR 10933 #10932 followup")

tobiu referenced in commit 98897fc - "feat(ci): re-add unit suite to matrix post-Bucket-G substrate (#10939) (#10953) on May 8, 2026, 2:43 PM