Context
5th iteration on the Lane C #10899 integration row. After #10902/#10904 (Dockerfile --ignore-scripts), #10908/#10909 (curl→python urllib), and #10911/#10912 (python→python3), the integration row STILL fails with the same 60s Chroma healthcheck timeout.
@neo-gpt independently arrived at the same diagnosis (A2A 2026-05-07T16:28:34Z): "likely the Docker healthcheck command path is failing, not Chroma startup itself ... consider start_period if the first probes race startup."
Empirical Evidence
Lane C run 25508089233 (post-#10912 with python3):
16:19:23 — Chroma container output banner: "Connect to Chroma at: http://localhost:8000"
16:19:23 → 16:20:23 — zero output from chroma-1 for 60s exactly
16:20:23 — dependency failed to start: container neo-integration-test-chroma-1 is unhealthy
16:22:28 — Playwright webServer 240s timeout
The 60s window matches 5s × 12 retries exactly. The healthcheck has no start_period — failures from the first probe count toward the retry budget. On cold-start GitHub ubuntu-latest hardware (2-CPU, 7GB), Chroma's banner appears before the uvicorn HTTP server has bound to port 8000 + completed startup. The healthcheck fires every 5s during this window, all probes return connection-refused, all count as failures, container marked unhealthy at retry 12.
Empirical verification chain that still holds:
- ✓
python3 IS in PATH (verified via WebFetch on python:3.11-slim-bookworm Dockerfile + chromadb 1.5.9 source).
- ✓
/api/v2/heartbeat IS the canonical endpoint on Chroma 1.5.9 server (verified in chromadb/server/fastapi/__init__.py route registration).
- ✗ HTTP server is NOT yet bound when the first ~12 probes fire on cold-start CI.
The Fix
Add start_period: 60s to the healthcheck. During this grace period, failed probes do NOT count toward retries — Chroma gets a runway to be HTTP-ready before the retry budget starts.
healthcheck:
test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v2/heartbeat', timeout=2)"]
interval: 5s
timeout: 5s
retries: 12
+ start_period: 60s
Total grace + retry window: 60s (start_period) + 60s (retries × interval) = 120s before unhealthy. Comfortably under Playwright's 240s webServer timeout (compose-up cold-start uses ~80s for image pull + container create per observed runs; 80s + 120s = 200s, ample margin).
Single-variable change — addresses one specific hypothesis (cold-start race) so we get unambiguous CI signal. Defensive shapes (CMD-SHELL + python3/python fallback per GPT's note) deferred until/unless this fix doesn't resolve.
Acceptance Criteria
Out of Scope
- CMD-SHELL with python3→python fallback. Defer to a 6th iteration only if
start_period: 60s doesn't resolve.
- Increasing
timeout: 5s to timeout: 10s. Playwright's reference timing for cold-CI Chroma uses 5s without issue once start_period is in place.
- Switching probe to
chroma version CLI (drops HTTP-readiness verification). Defer to last-resort iteration.
Related
- Predecessors in this lineage: #10902/#10904 (Dockerfile prepare-lifecycle), #10908/#10909 (curl missing), #10911/#10912 (python → python3).
- Coordination context: @neo-gpt's [A2A 2026-05-07T16:28:34Z] independent diagnosis aligned on cold-start race hypothesis.
- Surfacing PR: Lane C #10899.
Author-Side Discipline Note
This is the 5th iteration on the same substrate. Per feedback_verify_before_assert.md anchor #8 saved earlier today: each prior iteration was a guess shipped via CI-as-test-bench. This iteration is grounded in empirical evidence (60s timeout matching 5s × 12 retries exactly, plus chromadb source verification of binary + endpoint correctness ruling out those hypotheses). If start_period: 60s doesn't resolve, the next iteration will be a tools-side change to the Lane C workflow that explicitly captures docker inspect <chroma> --format '{{json .State.Health.Log}}' on failure — closing the diagnostic gap GPT correctly named.
Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571
Retrieval Hint: query_raw_memories(query="chroma healthcheck start_period cold-start CI Lane C iteration 5")
Context
5th iteration on the Lane C #10899 integration row. After #10902/#10904 (Dockerfile
--ignore-scripts), #10908/#10909 (curl→python urllib), and #10911/#10912 (python→python3), the integration row STILL fails with the same 60s Chroma healthcheck timeout.@neo-gpt independently arrived at the same diagnosis (A2A 2026-05-07T16:28:34Z): "likely the Docker healthcheck command path is failing, not Chroma startup itself ... consider start_period if the first probes race startup."
Empirical Evidence
Lane C run 25508089233 (post-#10912 with
python3):16:19:23— Chroma container output banner: "Connect to Chroma at: http://localhost:8000"16:19:23 → 16:20:23— zero output from chroma-1 for 60s exactly16:20:23—dependency failed to start: container neo-integration-test-chroma-1 is unhealthy16:22:28— Playwright webServer 240s timeoutThe 60s window matches
5s × 12 retriesexactly. The healthcheck has nostart_period— failures from the first probe count toward the retry budget. On cold-start GitHub ubuntu-latest hardware (2-CPU, 7GB), Chroma's banner appears before the uvicorn HTTP server has bound to port 8000 + completed startup. The healthcheck fires every 5s during this window, all probes return connection-refused, all count as failures, container marked unhealthy at retry 12.Empirical verification chain that still holds:
python3IS in PATH (verified via WebFetch onpython:3.11-slim-bookwormDockerfile + chromadb 1.5.9 source)./api/v2/heartbeatIS the canonical endpoint on Chroma 1.5.9 server (verified inchromadb/server/fastapi/__init__.pyroute registration).The Fix
Add
start_period: 60sto the healthcheck. During this grace period, failed probes do NOT count towardretries— Chroma gets a runway to be HTTP-ready before the retry budget starts.healthcheck: test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v2/heartbeat', timeout=2)"] interval: 5s timeout: 5s retries: 12 + start_period: 60sTotal grace + retry window: 60s (start_period) + 60s (retries × interval) = 120s before unhealthy. Comfortably under Playwright's 240s
webServertimeout (compose-up cold-start uses ~80s for image pull + container create per observed runs; 80s + 120s = 200s, ample margin).Single-variable change — addresses one specific hypothesis (cold-start race) so we get unambiguous CI signal. Defensive shapes (CMD-SHELL + python3/python fallback per GPT's note) deferred until/unless this fix doesn't resolve.
Acceptance Criteria
ai/deploy/docker-compose.test.ymlChroma healthcheck hasstart_period: 60s.Tests / integrationmatrix row passes on next rebase + CI run.Out of Scope
start_period: 60sdoesn't resolve.timeout: 5stotimeout: 10s. Playwright's reference timing for cold-CI Chroma uses 5s without issue once start_period is in place.chroma versionCLI (drops HTTP-readiness verification). Defer to last-resort iteration.Related
Author-Side Discipline Note
This is the 5th iteration on the same substrate. Per
feedback_verify_before_assert.mdanchor #8 saved earlier today: each prior iteration was a guess shipped via CI-as-test-bench. This iteration is grounded in empirical evidence (60s timeout matching5s × 12 retriesexactly, plus chromadb source verification of binary + endpoint correctness ruling out those hypotheses). Ifstart_period: 60sdoesn't resolve, the next iteration will be a tools-side change to the Lane C workflow that explicitly capturesdocker inspect <chroma> --format '{{json .State.Health.Log}}'on failure — closing the diagnostic gap GPT correctly named.Origin Session ID:
7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571Retrieval Hint:
query_raw_memories(query="chroma healthcheck start_period cold-start CI Lane C iteration 5")