Cloud-Native KB Ingestion — Hook Wiring
Status: Phase 3B + #11743. This guide documents the shipped ingestion surfaces of Epic #11624 plus the #11743 tenant-side repo-push client: the ingest_source_files MCP operation (#11634), the npm run ai:kb-push-client remote MCP wrapper, and the npm run ai:ingest-tenant bulk CLI (#11635). The runnable pre-push hook and worked external workspace it references ship under ai/examples/cloud-deployment/.
Three operator surfaces, one ingestion service
A tenant's source content reaches the Knowledge Base through the same KnowledgeBaseIngestionService.ingestSourceFiles orchestrator and writes to the unified Chroma knowledge-base collection. The operational surfaces differ in who calls them, where they run, and work-volume policy.
For the operator-facing decision model — how to choose repo slugs, treat credentials, inventory source families, and decide when to use each facade — read Tenant Ingestion Model first.
| Surface | Caller | Volume policy | Built for |
|---|---|---|---|
| ingest_source_files | A remote MCP client of the sse / StreamableHTTP deployment: an agent in the tenant workspace, or a tenant push client | Gated: refuses a batch over mcpSyncMaxChunks (default 50); listed/callable only when the KB server runs with transport === 'sse' | Incremental pushes: a commit's worth of changed files |
| npm run ai:kb-push-client | A tenant git hook or CI job; wraps the remote MCP call | Same MCP gate as ingest_source_files | Operator-facing repo-push invocation over StreamableHTTP/SSE |
| npm run ai:ingest-tenant | A shell process co-located with the KB server: the cloud operator, a CI job | Ungated (viaMcp: false) | Initial tenant onboarding (5k–50k chunks), large back-fills |
The fork between them is the #10572 MCP work-volume gate. An MCP tool call holds the calling agent's turn open; embedding tens of thousands of chunks synchronously inside one call is the wrong shape. The gate makes that structural — ingest_source_files refuses an over-volume batch (it returns a refusal payload, it does not block), and the bulk CLI is the sanctioned path for the volume the gate rejects. viaMcp: false — passed only by the CLI — is the single sanctioned gate bypass.
Transport and auth model
The KB MCP server runs dual-transport (ai/mcp/server/knowledge-base/Server.mjs): stdio for a local single-repo deployment, or sse (StreamableHTTP) for a cloud deployment serving remote tenants — selected by aiConfig.transport with aiConfig.mcpHttpPort. ingest_source_files is transport-gated: it is listed and callable only when the KB server runs with transport === 'sse'. Local stdio clients do not see it and direct calls fail closed with guidance to use the local CLI/service path instead. The ai:ingest-tenant CLI is not a remote facade — it imports the KB services directly and runs on the deployment host.
The deployable repo-push path is npm run ai:kb-push-client. It is a tenant-side MCP client wrapper around ingest_source_files:
NEO_KB_MCP_URL="https://agent-os.example.com/kb/mcp" \
NEO_KB_INGEST_TOKEN="$repo_push_token" \
NEO_KB_TENANT_ID="client-org" \
NEO_KB_REPO_SLUG="neomjs/create-app" \
npm run ai:kb-push-client -- --from-stdin < envelope.json

The operator-facing primitive is the repo-push automation identity, not a human agent session:
- Create a service account or machine client in the deployment's OIDC provider, for example kb-repo-push-client-org.
- Configure the token audience/resource to match the KB deployment's public MCP resource. Behind the reference Caddy ingress, that means the canonical KB public URL exposed by NEO_PUBLIC_URL plus the /mcp endpoint path, commonly https://agent-os.example.com/kb/mcp.
- Store the access token or client-credentials exchange output in the tenant hook/CI secret store, exposed to the hook as NEO_KB_INGEST_TOKEN. Do not commit it, put it in repoSlug, or log it.
- Scope the identity to the tenant it represents. The server stamps the authoritative tenant identity from authenticated context; payload tenantId remains a default/claim, not authority.
If a deployment deliberately runs unauthenticated for a local demo, pass --allow-unauthenticated; production tenant ingestion should not use that flag.
The incremental facade — ingest_source_files
ingest_source_files accepts a push envelope:
{
"tenantId": "client-org", // the client's claim; the server stamps the authoritative value
"files": [
{ "sourcePath": "src/foo.mjs", "content": "<raw file text>" }, // raw — server parses
{ "sourcePath": "src/bar.mjs", "parsedChunks": [ /* parsed-chunk-v1 */ ] } // client-parsed
],
"deleted": [ { "sourcePath": "src/gone.mjs", "repoSlug": "main-app" } ],
"manifestSnapshot": { "repoSlug": "main-app", "pathsAfterPush": [ "src/foo.mjs", "src/bar.mjs" ] },
"baseRevision": "<last-pushed-SHA>",
"headRevision": "<current-SHA>"
}

No envelope field is strictly required: the ingestion service validates and returns a structured summary rather than throwing. A files entry is resolved to parsed-chunk-v1 records by one of several paths: a raw {content} payload is parsed server-side by a registered parser (parserId, default raw-text); a {parsedChunks: [...]} payload carries client-side parsed records; an entry that is itself a parsed-chunk-v1 record (schemaVersion: '1.0.0') passes straight through. See Custom Parsers for the parser side.
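The resolution paths described above can be pictured as a small classifier. This is a minimal sketch, not the ingestion service's code: the function name `resolutionPath`, the returned labels, and the order of the checks are all hypothetical. Only the field names (`content`, `parsedChunks`, `schemaVersion`, `parserId`) and the `raw-text` default come from the envelope description.

```javascript
// Hypothetical helper: names the path a `files` entry would take.
// The check order is an assumption; the real service may decide differently.
function resolutionPath(entry) {
  if (entry && entry.schemaVersion === '1.0.0') {
    // Already a parsed-chunk-v1 record: passes straight through.
    return {path: 'passthrough'};
  }
  if (entry && Array.isArray(entry.parsedChunks)) {
    // Client-side parsed records travel inside the entry.
    return {path: 'client-parsed', chunks: entry.parsedChunks.length};
  }
  if (entry && typeof entry.content === 'string') {
    // Raw text: a registered parser runs server-side.
    return {path: 'server-parsed', parserId: entry.parserId ?? 'raw-text'};
  }
  // Nothing recognizable; presumably reported through the summary rather than thrown.
  return {path: 'invalid'};
}
```

A tenant-side tool could run a check like this before submitting, so that an unrecognizable entry is caught locally instead of after a round trip.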
The volume gate. Batch volume is the summed parsedChunks length across files, counting each raw file as 1. Over mcpSyncMaxChunks (default 50), the call refuses up front — before embedding — and returns:
{
"error": "KB ingest work volume exceeds MCP-callable threshold",
"code": "KB_INGEST_VOLUME_EXCEEDED",
"message": "<split-the-batch guidance>",
"batchSize": 312,
"threshold": 50
}

A caller branches on code: split into sub-threshold ingest_source_files calls, or hand the back-fill to the bulk CLI. A successful call returns {ingested, deleted, embeddingsGenerated, errors, tenantId, durationMs}.
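One way a client might pre-compute batch volume and split an oversized push is sketched below. The helper names (`entryVolume`, `batchVolume`, `splitByVolume`) are hypothetical, and counting a pass-through parsed-chunk-v1 entry as 1 is an assumption; the text above only states that `parsedChunks` lengths are summed and each raw file counts as 1.

```javascript
// Volume of one `files` entry: parsedChunks length, otherwise 1.
// Treating a pass-through record as 1 is an assumption of this sketch.
function entryVolume(entry) {
  return Array.isArray(entry.parsedChunks) ? entry.parsedChunks.length : 1;
}

// Summed volume across the batch, as the gate is described to compute it.
function batchVolume(files) {
  return files.reduce((sum, entry) => sum + entryVolume(entry), 0);
}

// Greedy split: start a new batch when the next entry would push the
// running volume over the threshold. An entry that alone exceeds the
// threshold still ends up in its own batch, which the gate would refuse;
// that file belongs in the bulk CLI.
function splitByVolume(files, threshold = 50) {
  const batches = [];
  let current = [];
  let volume = 0;
  for (const entry of files) {
    const v = entryVolume(entry);
    if (current.length > 0 && volume + v > threshold) {
      batches.push(current);
      current = [];
      volume = 0;
    }
    current.push(entry);
    volume += v;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

The sketch only splits `files`. How `deleted`, `manifestSnapshot`, and the revision fields should be distributed across the resulting calls is left open here.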
Tenant push client — npm run ai:kb-push-client
The push client accepts a single JSON envelope and submits it to the remote MCP endpoint:
npm run ai:kb-push-client -- \
--url https://agent-os.example.com/kb/mcp \
--tenant-id client-org \
--repo-slug neomjs/create-app \
--from-file envelope.json

Environment defaults:
| Variable | Meaning |
|---|---|
| NEO_KB_MCP_URL | Remote KB MCP endpoint URL, for example https://agent-os.example.com/kb/mcp. |
| NEO_KB_MCP_TRANSPORT | Client transport; streamable-http by default, sse for older endpoint wiring. |
| NEO_KB_INGEST_TOKEN | Bearer token for the repo-push automation identity. |
| NEO_KB_TENANT_ID | Optional envelope default; server-side auth still stamps the authoritative tenant. |
| NEO_KB_REPO_SLUG | Optional envelope default for records/manifests that omit repoSlug. |
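As an illustration of how these defaults might combine with the command-line flags, here is a minimal sketch. It is not the client's actual code: the function `resolvePushConfig`, the camelCase `flags` keys, the flag-over-environment precedence, and the exact error messages are assumptions. The variable names, the `streamable-http` default, and the `--allow-unauthenticated` escape hatch come from this guide.

```javascript
// Hypothetical resolver: flags win over environment defaults (assumed precedence).
// `flags` keys mirror --url, --tenant-id, --repo-slug, --allow-unauthenticated.
function resolvePushConfig(env, flags = {}) {
  const config = {
    url: flags.url ?? env.NEO_KB_MCP_URL,
    transport: env.NEO_KB_MCP_TRANSPORT ?? 'streamable-http',
    token: env.NEO_KB_INGEST_TOKEN,
    tenantId: flags.tenantId ?? env.NEO_KB_TENANT_ID,
    repoSlug: flags.repoSlug ?? env.NEO_KB_REPO_SLUG
  };
  if (!config.url) {
    throw new Error('No endpoint: set NEO_KB_MCP_URL or pass --url');
  }
  if (!['streamable-http', 'sse'].includes(config.transport)) {
    throw new Error(`Unsupported transport: ${config.transport}`);
  }
  if (!config.token && !flags.allowUnauthenticated) {
    throw new Error('No token: set NEO_KB_INGEST_TOKEN or pass --allow-unauthenticated');
  }
  return config;
}

// Log-safe view: the token itself is never printed.
function describeConfig(config) {
  return {...config, token: config.token ? '<redacted>' : '<none>'};
}
```

Failing fast on a missing URL or token keeps a misconfigured hook from surfacing as an opaque transport error later.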
Failure signatures a hook/CI job should branch on:
| Signature | Meaning | Operator response |
|---|---|---|
| KB_INGEST_VOLUME_EXCEEDED | Batch exceeds mcpSyncMaxChunks. | Split the envelope or use ai:ingest-tenant for bulk onboarding/backfill. |
| HTTP 401 / Unauthorized | Token missing, expired, wrong audience/resource, or rejected by proxy/auth middleware. | Refresh the automation identity token and verify NEO_PUBLIC_URL / audience wiring. |
| Tool not listed / MCP call rejected | The endpoint is not the KB MCP server, transport config is wrong, or the deployment gates ingest tools by transport/auth. | Verify NEO_KB_MCP_URL, transport, and server config before retrying. |
| Non-empty errors array | The ingestion service accepted the call but one or more files failed validation/parsing. | Fail the hook/CI job and surface the structured {code, message} entries. |
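A hook or CI job could branch on these signatures along the following lines. This is a sketch under stated assumptions: the `outcome` shape (`httpStatus`, `toolListed`, `result`) and the action labels are hypothetical and stand in for however the caller observes the HTTP status, the tool listing, and the parsed tool response.

```javascript
// Hypothetical classifier. `outcome` is an assumed shape:
//   {httpStatus?: number, toolListed?: boolean, result?: object}
// `fail` says whether the hook/CI job should exit non-zero as-is.
function classifyPushOutcome(outcome) {
  if (outcome.httpStatus === 401) {
    return {action: 'refresh-token', fail: true};
  }
  if (outcome.toolListed === false) {
    return {action: 'verify-endpoint', fail: true};
  }
  const result = outcome.result ?? {};
  if (result.code === 'KB_INGEST_VOLUME_EXCEEDED') {
    // Not a failure in itself: split the envelope or use the bulk CLI.
    return {
      action: 'split-or-bulk',
      fail: false,
      batchSize: result.batchSize,
      threshold: result.threshold
    };
  }
  if (Array.isArray(result.errors) && result.errors.length > 0) {
    // Sibling files may still have ingested; surface the entries anyway.
    return {action: 'surface-errors', fail: true, errors: result.errors};
  }
  return {action: 'ok', fail: false};
}
```

Keeping the classification separate from the transport call makes the branching easy to unit-test without a running KB server.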
This client is the current #11743 MCP-over-StreamableHTTP answer. A non-MCP HTTP or queue receiver that reuses KnowledgeBaseIngestionService remains a future alternative if the MCP client path proves operationally awkward. Server-side ref-only webhook/clone ingestion remains the separate #11731 exploration.
The bulk facade — npm run ai:ingest-tenant
For the volume the gate rejects, the Phase 2C CLI streams a JSONL file into the ingestion service:
npm run ai:ingest-tenant -- <tenantId> (--from-file <path.jsonl> | --from-stdin) [--batch-size <n>]
- Input: JSONL, one ingestion files entry per line, either a parsed-chunk-v1 record or a raw {sourcePath, content} payload the server will parse. A line that fails JSON.parse is counted as a KB_INGEST_CLI_JSONL_PARSE_FAILED error and skipped; one malformed line never aborts the stream.
- --batch-size: records per ingestSourceFiles call (default 500); the stream is flushed batch-by-batch, so a multi-thousand-record import never materializes the whole corpus in memory.
- Heavy-maintenance lease: the run holds the shared heavy-maintenance lease so a bulk import cannot collide with ai:sync-kb or the orchestrator's kbSync task on the unified knowledge-base collection. If another holder has the lease, the run prints Deferred: heavy-maintenance lease held by '<owner>' and exits 0; re-invoke once the holder completes.
The CLI prints a JSON summary — {tenantId, ingested, embeddingsGenerated, deleted, batches, parseErrors, errors} — and exits non-zero if any error was accumulated.
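The batch-by-batch flushing and skip-on-parse-failure behavior can be illustrated with a generator. This is a sketch of the technique, not the CLI's source: the function `batchJsonl`, the error message format, and the decision to ignore blank lines are assumptions. The error code and the default batch size of 500 come from the description above.

```javascript
// Yields arrays of parsed records, at most `batchSize` each.
// Malformed lines are recorded in `parseErrors` and skipped.
function* batchJsonl(lines, batchSize = 500, parseErrors = []) {
  let batch = [];
  let lineNumber = 0;
  for (const line of lines) {
    lineNumber += 1;
    if (line.trim() === '') continue; // assumption: blank lines are ignored
    let record;
    try {
      record = JSON.parse(line);
    } catch (err) {
      parseErrors.push({
        code: 'KB_INGEST_CLI_JSONL_PARSE_FAILED',
        message: `line ${lineNumber}: ${err.message}`
      });
      continue; // one bad line never aborts the stream
    }
    batch.push(record);
    if (batch.length >= batchSize) {
      yield batch; // flush; only one batch is held in memory at a time
      batch = [];
    }
  }
  if (batch.length > 0) yield batch;
}
```

Because `lines` can be any iterable, the same loop works over an in-memory array in a test and over a lazily read file in a real import.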
The CLI submits each batch as a plain files array; it does not carry deleted / manifestSnapshot / revision-boundary fields. Bulk imports are initial-load or full-resync; per-push deletion signaling is an ingest_source_files concern (see below). Because every chunk carries metadata.ingestedAt, chunks imported by the CLI after the last persisted manifest are outside that manifest's deletion authority; run a later manifest-carrying push or full claimed-state resync when the operator wants to advance the manifest baseline.
Deletion signaling
An incremental push carries only changed files, so the server cannot infer deletions. The ingest_source_files envelope therefore carries explicit deletion signals — a push MAY combine any of three:
| Mechanism | Envelope field | Shape | Trade-off |
|---|---|---|---|
| Tombstones | deleted | [{sourcePath, repoSlug}] | Cheap, single-record granular; the client tracks its own deletes |
| Manifest snapshot | manifestSnapshot | {repoSlug, pathsAfterPush: [...]} | Robust against missed deletes; O(N) payload in post-push file count; durable baseline for daemon reconciliation |
| Revision boundary | baseRevision + headRevision | last-pushed + current SHA | Cheapest signal; the server derives the delete set from the tenant's tracked revision |
When a payload carries more than one, the server applies them in precedence order — revision-boundary computes the expected change set, tombstones extend it, the manifest reconciles surplus chunks as orphans. Revision-boundary deletion additionally requires Phase 2E tenant config storage (#11637): the resolver that maps a SHA range to deleted paths is wired by that phase; until it lands, a revision-boundary-only payload returns KB_REVISION_BOUNDARY_UNAVAILABLE, and tombstones + manifest remain the available signals. The full contract is in deletion-signaling-contract.md.
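The precedence order can be pictured with a small set-based sketch. It is illustrative only and is not taken from deletion-signaling-contract.md: the function `planDeletions` and its input names are hypothetical, and it assumes a single repoSlug so that paths can be compared directly.

```javascript
// Hypothetical inputs:
//   revisionDeleted: paths the revision-boundary resolver reports as deleted,
//                    or null when that resolver is unavailable
//   tombstones:      the envelope's `deleted` array
//   manifest:        the envelope's `manifestSnapshot`, or null
//   storedPaths:     paths that currently have chunks for this repoSlug
function planDeletions({revisionDeleted = null, tombstones = [], manifest = null, storedPaths = []}) {
  // 1. Revision boundary computes the expected change set.
  const toDelete = new Set(revisionDeleted ?? []);
  // 2. Tombstones extend it.
  for (const tombstone of tombstones) toDelete.add(tombstone.sourcePath);
  // 3. The manifest reconciles what is left: stored paths that are absent
  //    from the snapshot and not already scheduled become orphans.
  const orphans = [];
  if (manifest) {
    const keep = new Set(manifest.pathsAfterPush);
    for (const path of storedPaths) {
      if (!keep.has(path) && !toDelete.has(path)) orphans.push(path);
    }
  }
  return {toDelete: [...toDelete].sort(), orphans: orphans.sort()};
}
```

With only tombstones and a manifest supplied, which is the combination available before the revision-boundary resolver lands, the first step contributes nothing and the other two still apply.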
manifestSnapshot is also persisted on the sibling graph node kb-manifest:<tenantId> (#11711), keyed by repoSlug with its updatedAt timestamp. The Phase 4B reconciliation daemon can later classify persisted chunks that are absent from the latest manifest as manifest orphans, but only inside the manifest's freshness window: metadata.ingestedAt must be finite and <= manifest.updatedAt. Chunks missing ingestedAt, or chunks ingested after the manifest was written, are skipped because the manifest cannot speak for content added by a bulk import or a minimal hook after that snapshot.
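The freshness-window rule reduces to a short predicate. The sketch below assumes numeric timestamps (for example epoch milliseconds) and a `metadata.sourcePath` field on each chunk; both are assumptions, as is the function name `isManifestOrphan`. Only `metadata.ingestedAt`, `updatedAt`, and `pathsAfterPush` are named in the text above.

```javascript
// Returns true only when the manifest is allowed to speak for this chunk
// AND the chunk's path is absent from the manifest.
function isManifestOrphan(chunk, manifest) {
  const ingestedAt = chunk.metadata?.ingestedAt;
  // Missing or non-finite ingestedAt: skipped, never classified.
  if (!Number.isFinite(ingestedAt)) return false;
  // Ingested after the snapshot was written: outside its authority.
  if (ingestedAt > manifest.updatedAt) return false;
  // Inside the window: orphan if the snapshot no longer lists the path.
  return !manifest.pathsAfterPush.includes(chunk.metadata.sourcePath);
}
```

A chunk ingested exactly at `updatedAt` is treated as inside the window, matching the `<=` in the rule.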
Enable reconciliationAutoTombstone only when the tenant hook topology sends full manifest snapshots at the reconciliation points that should authorize deletes. Tombstone/revision-boundary-only hooks remain safe and cheap; they just should not rely on an older manifest to delete content created after that older manifest until a later manifest-carrying push advances the baseline.
Wiring a pre-push git hook
A pre-push hook is the recommended trigger: it fires once per git push, receives the pushed ref range on stdin, and runs before the remote updates. The reference implementation is ai/examples/cloud-deployment/pre-push-hook.sh; its shape:
- Read the pushed ref range (<local-ref> <local-sha> <remote-ref> <remote-sha>) from the hook's stdin.
- Enumerate changed files: git diff --name-only --diff-filter=ACMR <remote-sha> <local-sha> for adds/modifies, --diff-filter=D for deletes.
- Assemble the envelope: changed files into files, deleted paths into deleted, the SHA pair into baseRevision / headRevision.
- Submit it: when NEO_KB_MCP_URL is configured, pipe the envelope to ai:kb-push-client, which calls the sse / StreamableHTTP ingest_source_files endpoint; an initial import of an existing repo still goes to the deployment-host bulk CLI. In local stdio mode, use the CLI/service path instead; the MCP tool is intentionally hidden.
- Inspect the returned summary: a non-empty errors array fails the hook so the developer sees it.
The example combines tombstones + revision-boundary — the precise-but-cheap pair for a hook that already runs git diff.
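The envelope-assembly steps can be sketched as two pure functions. This is not the reference hook (which is a shell script); the names `parsePrePushLine` and `buildEnvelope`, and the injected `readFile` callback, are hypothetical. The inputs `changedOutput` and `deletedOutput` stand for the text printed by the two `git diff --name-only` invocations listed above.

```javascript
// Parses one pre-push stdin line:
//   "<local-ref> <local-sha> <remote-ref> <remote-sha>"
function parsePrePushLine(line) {
  const [localRef, localSha, remoteRef, remoteSha] = line.trim().split(/\s+/);
  return {localRef, localSha, remoteRef, remoteSha};
}

// Builds a tombstones + revision-boundary envelope.
// `readFile` is injected so the sketch stays testable; a real hook would
// pass a function that returns the file's text at the pushed revision.
function buildEnvelope({range, changedOutput, deletedOutput, repoSlug, tenantId, readFile}) {
  const toPaths = (output) =>
    output.split('\n').map((path) => path.trim()).filter(Boolean);
  return {
    tenantId,
    files: toPaths(changedOutput).map((sourcePath) => ({
      sourcePath,
      content: readFile(sourcePath)
    })),
    deleted: toPaths(deletedOutput).map((sourcePath) => ({sourcePath, repoSlug})),
    baseRevision: range.remoteSha, // last-pushed SHA
    headRevision: range.localSha   // current SHA
  };
}
```

The sketch does not handle the cases where git reports an all-zero SHA (a ref that does not yet exist on the remote, or a ref being deleted); a hook would need to special-case those before running `git diff`.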
post-commit vs pre-push
| Hook | Fires | Best mechanism | Notes |
|---|---|---|---|
| pre-push | once per git push | tombstones + revision-boundary | Recommended default: batches a push's commits, SHA range on stdin |
| post-commit | every commit | revision-boundary only | High frequency; keep payloads minimal. The Phase 4B reconciliation daemon can catch drift only within the last persisted manifest's freshness window; rows ingested after that manifest are skipped until a later manifest-carrying push advances the baseline. |
Error handling — the structured summary
ingestSourceFiles does not throw for per-file problems. It returns a summary whose errors array accumulates structured {code, message} entries — a non-empty array does not imply total failure (sibling files in the same batch may have ingested cleanly). A caller should:
- treat KB_INGEST_VOLUME_EXCEEDED as "split the batch / use the bulk CLI", not as a failure;
- surface a non-empty errors array to the developer (fail the hook) so a malformed file or an unregistered parser is not silently dropped;
- treat the bulk CLI's non-zero exit code the same way: it exits non-zero whenever errors is non-empty.
Related
- Overview: the contract split, topology anchor, default-source inheritance.
- Tenant Ingestion Model: tenant repo identity, credential boundary, source-family inventory, and push-vs-bulk operational flow.
- Configuration: mcpSyncMaxChunks, transport, and the other aiConfig keys.
- Custom Parsers: authoring a parser that turns a tenant file format into parsed-chunk-v1 records.
- Security: write-side stamping, spoof-rejection, and the parser-execution boundary.
- deletion-signaling-contract.md · identity-tuple.md: the ingestion contracts.
- #11743 repo-push receiver/auth model · #11634 ingest_source_files · #11635 ai:ingest-tenant · #10572 MCP work-volume gate.