Context
Surfaced 2026-05-07 by Lane C #10899 integration-row CI re-fire after #10904 (Dockerfile fix) merged. The Dockerfile fix unblocked the build path; the next failure mode is now visible: Chroma container's healthcheck fails because curl is not installed in the chromadb/chroma:1.5.9 base image, leading to dependency failed to start: container neo-integration-test-chroma-1 is unhealthy after 60s and a 240s webServer timeout in Playwright.
The Problem
ai/deploy/docker-compose.test.yml:7 declares:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/api/v2/heartbeat"]
interval: 5s
timeout: 5s
retries: 12
The chromadb/chroma:1.5.9 image is debian-slim-based and does not include curl by default. Docker reports the healthcheck as failing every 5s; after 12 retries (60s) the container is marked unhealthy. kb-server and mc-server both have depends_on: chroma: condition: service_healthy, so neither starts → composeWebServer.mjs waits → 240s webServer-timeout → integration suite fails.
Empirical anchor: Lane C CI run 25506320077 integration job, log lines:
15:45:28 — Chroma starts, banner output observed (process is healthy, server listening).
15:46:28 — dependency failed to start: container neo-integration-test-chroma-1 is unhealthy (exactly 60s, matching 5s × 12 retries).
15:48:39 — Error: Timed out waiting 240000ms from config.webServer.
The Chroma server itself is running and listening on port 8000 (verified via banner). The healthcheck just can't probe it because curl is absent.
Why this wasn't caught earlier
Same root pattern as #10902: local Docker layer caching. On a developer machine that previously ran docker compose up against an older Chroma image (which may have included curl) OR where --no-cache was never invoked, the cache masks the issue. Lane C (#10899) is the first PR to run a truly clean ubuntu-latest Docker build → bug surfaces on first --no-cache build of chromadb/chroma:1.5.9.
The Architectural Reality
ai/deploy/docker-compose.test.yml:7 — current healthcheck.
- The
chromadb/chroma:1.5.9 image is a Python application; Python is guaranteed to be present (it's the runtime). curl/wget are not.
- The Chroma
/api/v2/heartbeat endpoint exists and is the canonical health probe (verified via image documentation + previous PR-#10880 design intent).
- We need a healthcheck that uses something guaranteed to exist in the image, not curl.
The Fix
Replace the curl-based healthcheck with a Python urllib-based check (uses the runtime that's already in the image):
healthcheck:
- test: ["CMD", "curl", "-f", "http://localhost:8000/api/v2/heartbeat"]
+ test: ["CMD", "python", "-c", "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://localhost:8000/api/v2/heartbeat',timeout=2).read() else 1)"]
interval: 5s
timeout: 5s
retries: 12
The Python check:
- Uses
urllib.request.urlopen (stdlib, no dependencies).
- 2s socket timeout per probe (well under the 5s
timeout envelope).
- Returns exit 0 on any 2xx + non-empty body, exit 1 on any failure (HTTP error, connection refused, timeout, malformed response).
- Compatible with all chromadb image variants (Python is the runtime).
Contract Ledger (T3)
| Target Surface |
Source of Authority |
Proposed Behavior |
Fallback / Edge Case |
Docs |
Evidence |
ai/deploy/docker-compose.test.yml:7 healthcheck |
This ticket; surfaced by #10897 Lane C CI re-fire post-#10904 |
Healthcheck uses Python urllib.request (stdlib) instead of curl (absent from image). Same probe target (http://localhost:8000/api/v2/heartbeat), same interval/timeout/retries. |
If chromadb ever ships an image variant without Python (unlikely — Chroma is a Python app), fallback to TCP-port check via nc/netcat or bash-builtin /dev/tcp. Document in inline comment. |
Inline docker-compose.test.yml comment cross-linking this ticket. |
L3 — Lane C CI (#10899) integration job runs green on the rebased head after this fix lands. |
Acceptance Criteria
Out of Scope
- Audit of OTHER healthchecks elsewhere in the repo for similar curl-dependency assumptions (only
ai/deploy/docker-compose.test.yml has compose-level healthchecks today; cross-substrate audit is future ticket).
- Switching to a different chromadb base image variant.
- Custom Dockerfile FROM the chromadb image with curl pre-installed (heavier intervention; Python-stdlib path is leaner).
Avoided Traps / Gold Standards Rejected
- Rejected: install curl in a custom Chroma Dockerfile. Adds image-build complexity for a healthcheck-only utility. Python-stdlib path is leaner.
- Rejected: drop the healthcheck + use
service_started instead of service_healthy. Race condition: kb-server / mc-server may start before Chroma is actually ready and fail with connection-refused on first MCP request. Healthcheck is the right substrate.
- Rejected: switch to
wget. Same problem as curl — not guaranteed in chromadb image. Python is the safe-by-construction choice (it's the runtime).
- Rejected: bash
/dev/tcp probe. Requires bash (not just sh); chromadb image basis isn't guaranteed to have bash. Less portable than Python.
Related
- Surfacing context: Lane C CI run 25506320077 integration job, post-#10904 rebase.
- Sibling substrate fix (predecessor): #10902 → PR #10904 — Dockerfile prepare-lifecycle fix. Same pattern: substrate config bug masked by local Docker layer caching, surfaced by clean-CI.
- Original substrate: #10801 → PR #10880 (Docker artifacts). The current healthcheck shape was introduced there.
- Downstream impact: Lane C #10899 (CI workflow gating depends on this), Lane B #10898 (rebased; integration evidence path live but blocked by this), and all future integration-test PRs.
Origin Session ID: 7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571
Retrieval Hint: query_raw_memories(query="chroma healthcheck curl docker-compose Lane C CI integration substrate")
Context
Surfaced 2026-05-07 by Lane C #10899 integration-row CI re-fire after #10904 (Dockerfile fix) merged. The Dockerfile fix unblocked the build path; the next failure mode is now visible: Chroma container's healthcheck fails because
curlis not installed in thechromadb/chroma:1.5.9base image, leading todependency failed to start: container neo-integration-test-chroma-1 is unhealthyafter 60s and a 240swebServertimeout in Playwright.The Problem
ai/deploy/docker-compose.test.yml:7declares:healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8000/api/v2/heartbeat"] interval: 5s timeout: 5s retries: 12The
chromadb/chroma:1.5.9image is debian-slim-based and does not includecurlby default. Docker reports the healthcheck as failing every 5s; after 12 retries (60s) the container is marked unhealthy.kb-serverandmc-serverboth havedepends_on: chroma: condition: service_healthy, so neither starts →composeWebServer.mjswaits → 240s webServer-timeout → integration suite fails.Empirical anchor: Lane C CI run 25506320077 integration job, log lines:
15:45:28— Chroma starts, banner output observed (process is healthy, server listening).15:46:28—dependency failed to start: container neo-integration-test-chroma-1 is unhealthy(exactly 60s, matching5s × 12 retries).15:48:39—Error: Timed out waiting 240000ms from config.webServer.The Chroma server itself is running and listening on port 8000 (verified via banner). The healthcheck just can't probe it because curl is absent.
Why this wasn't caught earlier
Same root pattern as #10902: local Docker layer caching. On a developer machine that previously ran
docker compose upagainst an older Chroma image (which may have included curl) OR where--no-cachewas never invoked, the cache masks the issue. Lane C (#10899) is the first PR to run a truly clean ubuntu-latest Docker build → bug surfaces on first--no-cachebuild ofchromadb/chroma:1.5.9.The Architectural Reality
ai/deploy/docker-compose.test.yml:7— current healthcheck.chromadb/chroma:1.5.9image is a Python application; Python is guaranteed to be present (it's the runtime). curl/wget are not./api/v2/heartbeatendpoint exists and is the canonical health probe (verified via image documentation + previous PR-#10880 design intent).The Fix
Replace the
curl-based healthcheck with a Pythonurllib-based check (uses the runtime that's already in the image):healthcheck: - test: ["CMD", "curl", "-f", "http://localhost:8000/api/v2/heartbeat"] + test: ["CMD", "python", "-c", "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://localhost:8000/api/v2/heartbeat',timeout=2).read() else 1)"] interval: 5s timeout: 5s retries: 12The Python check:
urllib.request.urlopen(stdlib, no dependencies).timeoutenvelope).Contract Ledger (T3)
ai/deploy/docker-compose.test.yml:7healthcheckurllib.request(stdlib) instead ofcurl(absent from image). Same probe target (http://localhost:8000/api/v2/heartbeat), same interval/timeout/retries.nc/netcatorbash-builtin/dev/tcp. Document in inline comment.docker-compose.test.ymlcomment cross-linking this ticket.integrationjob runs green on the rebased head after this fix lands.Acceptance Criteria
ai/deploy/docker-compose.test.yml:7healthcheck uses the Python-stdlib probe per Ledger row 1.Tests / integrationmatrix row passes on next rebase + run.docker compose -f ai/deploy/docker-compose.test.yml upon dev machines).Out of Scope
ai/deploy/docker-compose.test.ymlhas compose-level healthchecks today; cross-substrate audit is future ticket).Avoided Traps / Gold Standards Rejected
service_startedinstead ofservice_healthy. Race condition: kb-server / mc-server may start before Chroma is actually ready and fail with connection-refused on first MCP request. Healthcheck is the right substrate.wget. Same problem as curl — not guaranteed in chromadb image. Python is the safe-by-construction choice (it's the runtime)./dev/tcpprobe. Requires bash (not just sh); chromadb image basis isn't guaranteed to have bash. Less portable than Python.Related
Origin Session ID:
7e897a0b-33ce-4d6c-b1a9-a1ff93e4e571Retrieval Hint:
query_raw_memories(query="chroma healthcheck curl docker-compose Lane C CI integration substrate")