Frontmatter
id: 14798
title: workspaceSafety CI timeout hides orchestrator boot failure
state: Open
labels: bug, ai, testing, build
assignees: neo-gpt
createdAt: 4:42 PM
updatedAt: 6:09 PM
githubUrl: https://github.com/neomjs/neo/issues/14798
author: neo-gpt
commentsCount: 1
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
contentTrust: projected
quarantined: 0
signals: []
blockedBy: []
blocking: []

workspaceSafety CI timeout hides orchestrator boot failure

Open, Backlog/active-chunk-4, labels: bug, ai, testing, build
neo-gpt commented on 4:42 PM

Context

Two consecutive #14796 CI attempts failed the same unrelated integration-unified job while all #14688-owned checks and focused tests were green. The failed job is test/playwright/integration/ai/daemons/workspaceSafety.spec.mjs, not a file touched by #14796.

Observed on 2026-07-04:

  • PR #14796 head 272c0a053fba3be31729057ade094114a846953a.
  • First failed run: 28709106103, job 85139567984.
  • Rerun failed again: job 85140366859.
  • Both failures: "Timed out after 30000ms waiting for log predicate. Last content (0 bytes)" thrown from waitForLogContent() while waiting for the "[Orchestrator] Started." log line.
  • The same integration job reported 45 passed, 2 skipped, 1 did not run, then this one timeout failure.

Duplicate/freshness sweep:

  • Live latest-open sweep: checked latest 20 open issues at 2026-07-04T14:41:34Z; no equivalent workspaceSafety / orchestrator-startup-log timeout issue found.
  • Targeted open issue searches for workspaceSafety timeout orchestrator log, integration-unified flake, and never use overlays tests returned no exact open duplicate.
  • A2A in-flight sweep: checked latest 30 messages; no [lane-claim] / [lane-intent] for this workspaceSafety failure. Nearby claims are unrelated (#14797, #14794, review routing, #14751 body fix).
  • KB ticket search for workspaceSafety orchestrator startup timeout integration-unified ticket surfaced archived/background orchestrator issues only, not an open duplicate.

The Problem

workspaceSafety.spec.mjs waits only on the daemon log file for the cloud-mode boot proof. When the daemon does not create the log file or exits before writing to it, the test times out with Last content (0 bytes) and does not surface the child process stdout, stderr, exit code, or missing-log/root-cause details at the failing assertion.

That makes the CI failure both sticky and low-signal: a rerun reproduced the same timeout, but the current artifact does not tell reviewers whether the daemon crashed before opening the log, failed config/bootstrap validation, inherited a bad environment, or simply exceeded the boot window.

This blocks unrelated open PRs because integration-unified is a merge gate.

The Architectural Reality

The owning test is test/playwright/integration/ai/daemons/workspaceSafety.spec.mjs. It spawns ai/daemons/orchestrator/daemon.mjs in a fresh temporary workspace with:

  • NEO_AI_DEPLOYMENT_MODE=cloud
  • NEO_AI_ORCHESTRATOR_DIR=<tmp>/orchestrator-daemon
  • NEO_AI_DB_PATH=<tmp>/memory-core-graph.sqlite
  • cwd isolated to the temporary workspace
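The spawn configuration listed above can be sketched as follows. This is a minimal illustration, not the spec's actual code: buildDaemonSpawnOptions is a hypothetical helper name, and inheriting the parent process.env plus piping stdout/stderr are assumptions. It uses require() so it runs as a standalone script; in the .mjs spec these would be import statements.

```javascript
const path = require('node:path');

// Hypothetical helper (not from the spec): builds child_process.spawn options
// for a cloud-mode daemon boot inside an isolated temporary workspace.
function buildDaemonSpawnOptions(tmpRoot, baseEnv = process.env) {
    return {
        // cwd isolated to the temporary workspace
        cwd: tmpRoot,
        env: {
            // Assumption: the parent environment is inherited and then overridden.
            ...baseEnv,
            NEO_AI_DEPLOYMENT_MODE: 'cloud',
            NEO_AI_ORCHESTRATOR_DIR: path.join(tmpRoot, 'orchestrator-daemon'),
            NEO_AI_DB_PATH: path.join(tmpRoot, 'memory-core-graph.sqlite')
        },
        // Pipe stdout/stderr so the test can buffer them for diagnostics.
        stdio: ['ignore', 'pipe', 'pipe']
    };
}

// Usage (daemonPath would point at ai/daemons/orchestrator/daemon.mjs):
// const child = spawn(process.execPath, [daemonPath], buildDaemonSpawnOptions(tmpRoot));
```

Because the parent environment is inherited in this sketch, any stray NEO_AI_* variable on a hosted runner would leak into the child, which is one of the hypotheses the ticket asks to verify.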

The log wait helper in workspaceSafety.spec.mjs polls only orchestrator.log. The first test attaches stdout and stderr buffers, but those buffers are not included in the error when waitForLogContent() times out, because the timeout is thrown before the post-wait assertions execute.

Structure-map gate: npm run --silent ai:structure-map -- --files --loc was run in this turn for Agent OS touched surfaces; this lane belongs to the integration-test harness around ai/daemons/orchestrator/daemon.mjs, not to the GitHub Workflow MCP surface that #14796 modifies.

The Fix

Make the workspaceSafety boot probe diagnostic and deterministic enough for CI:

  • Race the log wait with child process exit/error so a daemon crash produces an immediate failure including exitCode, signal, stdout, and stderr.
  • On timeout, include whether orchestrator.log exists, the dataDir listing, child process state, and captured stdout/stderr in the thrown error.
  • Verify whether the cloud-mode boot needs additional env isolation or a longer boot predicate timeout; do not hide a real daemon-start regression by only increasing the timeout.
  • Preserve the test’s real contract: fresh workspace cloud-mode boot must self-bootstrap sqlite and reach [Orchestrator] Started. without fatal log surfaces.
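One possible shape for the first two bullets is sketched below. It is an assumption-laden illustration, not the existing helper: waitForLogOrExit and its options are hypothetical names, and the real waitForLogContent() signature is not shown in this ticket. The sketch listens for the child's 'close' event rather than 'exit' so that stdout and stderr are fully drained before the error is built.

```javascript
const fs = require('node:fs');

// Hypothetical helper: resolves with the log content once predicate(content) is true,
// and rejects early if the child closes or errors before that happens.
function waitForLogOrExit(child, logPath, predicate, {timeoutMs = 30000, pollMs = 100} = {}) {
    let stdout = '';
    let stderr = '';
    child.stdout?.setEncoding('utf8').on('data', chunk => { stdout += chunk; });
    child.stderr?.setEncoding('utf8').on('data', chunk => { stderr += chunk; });

    const readLog = () => fs.existsSync(logPath) ? fs.readFileSync(logPath, 'utf8') : '';

    return new Promise((resolve, reject) => {
        let poll, timer;

        const cleanup = () => {
            clearInterval(poll);
            clearTimeout(timer);
            child.off('close', onClose);
            child.off('error', onError);
        };

        // Every failure path carries the same diagnostics, not just "0 bytes".
        const fail = headline => {
            cleanup();
            reject(new Error([
                headline,
                `log exists: ${fs.existsSync(logPath)}`,
                `--- stdout ---\n${stdout}`,
                `--- stderr ---\n${stderr}`
            ].join('\n')));
        };

        const onClose = (code, signal) =>
            fail(`Daemon exited before the log predicate matched (exitCode=${code}, signal=${signal})`);
        const onError = err => fail(`Daemon process error: ${err.message}`);

        child.once('close', onClose);
        child.once('error', onError);

        poll = setInterval(() => {
            const content = readLog();
            if (predicate(content)) {
                cleanup();
                resolve(content);
            }
        }, pollMs);

        timer = setTimeout(
            () => fail(`Timed out after ${timeoutMs}ms waiting for log predicate`),
            timeoutMs
        );
    });
}
```

With this shape, a daemon that crashes during bootstrap fails the test within one event-loop turn of the process closing, and the 30000ms window is only consumed when the process is alive but silent.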

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| workspaceSafety.spec.mjs cloud boot probe | Existing AC5 runtime proof for #11948 / #11837 | Fails fast or times out with daemon stdout/stderr/exit/log diagnostics instead of 0 bytes only | If the daemon legitimately needs more than 30s on hosted runners, raise timeout only with evidence from captured diagnostics | Test comments/JSDoc in workspaceSafety.spec.mjs | Re-run targeted integration spec and a CI PR check |

Decision Record impact

none — this is test-harness reliability around an existing orchestrator workspace-safety invariant.

Acceptance Criteria

  • Reproduce or explain the repeated CI failure from run 28709106103 / jobs 85139567984 and 85140366859.
  • workspaceSafety.spec.mjs reports child process stdout, stderr, exit code/signal, log existence, and data-dir listing when the [Orchestrator] Started. predicate is not reached.
  • The cloud-mode boot test still proves sqlite self-bootstrap and the [Orchestrator] Started. log surface in a fresh workspace.
  • Run the targeted integration command: npm run test-integration-unified -- test/playwright/integration/ai/daemons/workspaceSafety.spec.mjs.
  • PR evidence names the hosted CI result, not just local focused tests.
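The diagnostic fields named in the second criterion could be gathered by a small snapshot function like the one below. This is a sketch under stated assumptions: collectBootDiagnostics and its parameter names are hypothetical, and the stdout/stderr strings are assumed to be the buffers the test already captures. The child fields used (pid, exitCode, signalCode, killed) are standard Node.js ChildProcess properties.

```javascript
const fs = require('node:fs');

// Hypothetical helper: formats a plain-text snapshot to append to a failing assertion.
function collectBootDiagnostics({child, logPath, dataDir, stdout = '', stderr = ''}) {
    const logExists = fs.existsSync(logPath);
    // null distinguishes "directory missing" from "directory present but empty".
    const listing = fs.existsSync(dataDir) ? fs.readdirSync(dataDir) : null;

    return [
        `log exists: ${logExists}`,
        `log bytes: ${logExists ? fs.statSync(logPath).size : 0}`,
        `dataDir listing: ${listing ? JSON.stringify(listing) : '<missing>'}`,
        `child pid=${child.pid} exitCode=${child.exitCode} signal=${child.signalCode} killed=${child.killed}`,
        `--- stdout ---\n${stdout}`,
        `--- stderr ---\n${stderr}`
    ].join('\n');
}
```

Separating a missing data directory from an empty one matters here: the first suggests the daemon died before bootstrap, the second that it started bootstrap but never opened the log.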

Out of Scope

  • Reworking the orchestrator daemon architecture.
  • Broad config-lifecycle overlay redesign (#14674 / #14675 already cover that larger class).
  • Retiring integration-unified as a gate.

Avoided Traps

  • Do not label this as a harmless known flake without a ticket and reproducer evidence; it is currently blocking unrelated PRs.
  • Do not only rerun CI indefinitely. Two same-signature failures are enough to require a durable lane.
  • Do not make the test pass by masking daemon stderr or skipping the workspace-safety assertion.

Related

PR #14796 exposed the gate while changing unrelated GitHub Workflow MCP review-preflight files. The failure itself is in workspaceSafety.spec.mjs / orchestrator daemon integration.

Origin Session ID: 6439a7c5-5f2f-4658-9226-835c317c7a0b
Retrieval Hint: workspaceSafety.spec.mjs orchestrator Started timeout Last content 0 bytes integration-unified 28709106103