LearnNewsExamplesServices
Frontmatter
id10913
titleChroma healthcheck times out: needs start_period for cold-start grace on CI
stateClosed
labels
bugaibuild
assigneesneo-opus-4-7
createdAtMay 7, 2026, 6:30 PM
updatedAtMay 9, 2026, 11:15 PM
githubUrlhttps://github.com/neomjs/neo/issues/10913
authorneo-opus-4-7
commentsCount0
parentIssuenull
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtMay 7, 2026, 6:45 PM

Chroma healthcheck times out: needs start_period for cold-start grace on CI

Closedbugaibuild
neo-opus-4-7
neo-opus-4-7 commented on May 7, 2026, 6:30 PM

Context

5th iteration on the Lane C #10899 integration row. After #10902/#10904 (Dockerfile --ignore-scripts), #10908/#10909 (curl→python urllib), and #10911/#10912 (pythonpython3), the integration row STILL fails with the same 60s Chroma healthcheck timeout.

@neo-gpt independently arrived at the same diagnosis (A2A 2026-05-07T16:28:34Z): "likely the Docker healthcheck command path is failing, not Chroma startup itself ... consider start_period if the first probes race startup."

Empirical Evidence

Lane C run 25508089233 (post-#10912 with python3):

  • 16:19:23 — Chroma container output banner: "Connect to Chroma at: http://localhost:8000"
  • 16:19:23 → 16:20:23zero output from chroma-1 for 60s exactly
  • 16:20:23dependency failed to start: container neo-integration-test-chroma-1 is unhealthy
  • 16:22:28 — Playwright webServer 240s timeout

The 60s window matches 5s × 12 retries exactly. The healthcheck has no start_period — failures from the first probe count toward the retry budget. On cold-start GitHub ubuntu-latest hardware (2-CPU, 7GB), Chroma's banner appears before the uvicorn HTTP server has bound to port 8000 + completed startup. The healthcheck fires every 5s during this window, all probes return connection-refused, all count as failures, container marked unhealthy at retry 12.

Empirical verification chain that still holds:

  • python3 IS in PATH (verified via WebFetch on python:3.11-slim-bookworm Dockerfile + chromadb 1.5.9 source).
  • /api/v2/heartbeat IS the canonical endpoint on Chroma 1.5.9 server (verified in chromadb/server/fastapi/__init__.py route registration).
  • ✗ HTTP server is NOT yet bound when the first ~12 probes fire on cold-start CI.

The Fix

Add start_period: 60s to the healthcheck. During this grace period, failed probes do NOT count toward retries — Chroma gets a runway to be HTTP-ready before the retry budget starts.

   healthcheck:
     test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v2/heartbeat', timeout=2)"]
     interval: 5s
     timeout: 5s
     retries: 12
+    start_period: 60s

Total grace + retry window: 60s (start_period) + 60s (retries × interval) = 120s before unhealthy. Comfortably under Playwright's 240s webServer timeout (compose-up cold-start uses ~80s for image pull + container create per observed runs; 80s + 120s = 200s, ample margin).

Single-variable change — addresses one specific hypothesis (cold-start race) so we get unambiguous CI signal. Defensive shapes (CMD-SHELL + python3/python fallback per GPT's note) deferred until/unless this fix doesn't resolve.

Acceptance Criteria

Out of Scope

  • CMD-SHELL with python3→python fallback. Defer to a 6th iteration only if start_period: 60s doesn't resolve.
  • Increasing timeout: 5s to timeout: 10s. Playwright's reference timing for cold-CI Chroma uses 5s without issue once start_period is in place.
  • Switching probe to chroma version CLI (drops HTTP-readiness verification). Defer to last-resort iteration.

Related

  • Predecessors in this lineage: #10902/#10904 (Dockerfile prepare-lifecycle), #10908/#10909 (curl missing), #10911/#10912 (python → python3).
  • Coordination context: @neo-gpt's [A2A 2026-05-07T16:28:34Z] independent diagnosis aligned on cold-start race hypothesis.
  • Surfacing PR: Lane C #10899.

Author-Side Discipline Note

This is the 5th iteration on the same substrate. Per feedback_verify_before_assert.md anchor #8 saved earlier today: each prior iteration was a guess shipped via CI-as-test-bench. This iteration is grounded in empirical evidence (60s timeout matching 5s × 12 retries exactly, plus chromadb source verification of binary + endpoint correctness ruling out those hypotheses). If start_period: 60s doesn't resolve, the next iteration will be a tools-side change to the Lane C workflow that explicitly captures docker inspect <chroma> --format '{{json .State.Health.Log}}' on failure — closing the diagnostic gap GPT correctly named.

Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571

Retrieval Hint: query_raw_memories(query="chroma healthcheck start_period cold-start CI Lane C iteration 5")

tobiu referenced in commit 35bd689 - "fix(deploy): switch Chroma healthcheck to bash TCP probe (#10913) (#10914) on May 7, 2026, 6:45 PM
tobiu closed this issue on May 7, 2026, 6:45 PM