Context
A recurring substrate gap surfaced across three Phase 4 / cloud-KB features: tenant KB chunks carry no ingest timestamp. A stamped chunk's metadata (VectorService.resolveTenantStamp → applyTenantStamp) is {tenantId, repoSlug, visibility, tenantConfigVersion, originAgentIdentity, ...parsed-chunk-v1 fields} — there is no ingestedAt / createdAt. parsed-chunk-v1.schema.json has no timestamp field either.
Three features have now hit this gap:
- #11640 (Phase 4B reconciliation daemon): wanted a wall-clock orphan-grace window; had to use a tenantConfigVersion version-gap instead because no per-chunk timestamp exists.
- #11711 (reconciliation V1.x): manifest-grace reasoning inherits the same gap.
- #11641 (Phase 4C GC daemon): blocked. Time-based and count-based retention policies ("retain last 90 days" / "retain last N chunks") have no substrate without a per-chunk timestamp; version-based-only retention would merely duplicate #11640.
Three independent consumers wanting the same one-field stamp is a clear substrate insufficiency.
The Problem
Retention, age-based GC, and time-windowed reconciliation all need to answer "when was this chunk ingested?" Today that question is unanswerable from chunk metadata. Each consuming feature has had to work around it (version-gap proxies) or defer scope. The fix is a single additive field, stamped server-side at embed time.
The Architectural Reality
ai/services/knowledge-base/VectorService.mjs — resolveTenantStamp(tenantContext) builds the server-derived tenant stamp; applyTenantStamp(chunk, stamp) spreads it onto the chunk ({...chunk, ...stamp, hash, id}). The stamp is the natural home for ingestedAt.
TENANT_GUARDED_FIELDS (VectorService.mjs:13) — the server-derived fields validated against client spoofing. ingestedAt is purely server-derived → belongs in this list.
createTenantAwareChunkId hashes {tenantId, repoSlug, hash, type, name, source} — it does not include the stamp's volatile fields, so adding ingestedAt does not perturb chunk IDs (a same-content re-push resolves to the same ID).
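The ID-stability claim can be illustrated with a minimal sketch. sketchChunkId is a hypothetical stand-in for createTenantAwareChunkId, and the FNV-1a hash is an assumption chosen only to keep the example dependency-free; the real hash function is not described in this ticket. The point shown is that a field outside the hashed set cannot change the resulting ID.

```javascript
// Hypothetical stand-in for createTenantAwareChunkId. Only the identity
// fields named in this ticket feed the hash.
const ID_FIELDS = ['tenantId', 'repoSlug', 'hash', 'type', 'name', 'source'];

// FNV-1a (32-bit) is an assumption for illustration, not the real hash.
function fnv1a(text) {
    let h = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        h ^= text.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h.toString(16).padStart(8, '0');
}

function sketchChunkId(chunk) {
    // Serialize the identity fields in a fixed order; everything else is ignored.
    const identity = ID_FIELDS.map(field => String(chunk[field] ?? ''));
    return fnv1a(JSON.stringify(identity));
}

const base = {
    tenantId: 't1',
    repoSlug: 'acme/docs',
    hash    : 'abc123',
    type    : 'guide',
    name    : 'Intro',
    source  : 'intro.md'
};

// Same content, stamped at two different times.
const idEarly = sketchChunkId({...base, ingestedAt: 1700000000000});
const idLate  = sketchChunkId({...base, ingestedAt: 1800000000000});

console.log(idEarly === idLate); // true: ingestedAt is not an ID input
```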
The Fix
Add ingestedAt (epoch ms, Date.now() at stamp time) to resolveTenantStamp's output and to TENANT_GUARDED_FIELDS. It then flows through applyTenantStamp onto every chunk's Chroma metadata, queryable by downstream retention / GC / reconciliation consumers.
Semantics — resolved (PR #11713 Cycle-1, @neo-gpt review): ingestedAt marks when a chunk row is actually embedded / upserted. embed()'s zero-change fast path dedupes already-known content-hash IDs, so a same-content re-push does not re-upsert and the chunk keeps its prior ingestedAt; a content change yields a new content-hash chunk row with its own fresh ingestedAt. This embed-time anchor (≈ "first-embedded-at for this content") is the correct retention semantics — re-pushing identical content must not reset its age.
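The stamping and re-push semantics above can be sketched as follows. All function names ending in Sketch are hypothetical, the Map stands in for the Chroma collection, and chunk IDs are assumed to be precomputed; the real VectorService internals (including the injectable clock used here for determinism) are not shown in this ticket.

```javascript
// Hypothetical sketch of the stamp: ingestedAt is taken from the server clock.
function resolveTenantStampSketch(tenantContext, now = Date.now) {
    return {
        tenantId  : tenantContext.tenantId,
        repoSlug  : tenantContext.repoSlug,
        ingestedAt: now() // epoch ms, server-derived
    };
}

// Stamp fields are spread last, so they win over anything already on the chunk.
function applyTenantStampSketch(chunk, stamp) {
    return {...chunk, ...stamp};
}

// The Map stands in for the Chroma collection, keyed by chunk ID (assumption).
function embedSketch(store, chunks, tenantContext, now = Date.now) {
    const stamp = resolveTenantStampSketch(tenantContext, now);
    let upserted = 0;

    for (const chunk of chunks) {
        // Zero-change fast path: a known ID is skipped, so the prior row
        // (and its ingestedAt) stays untouched.
        if (store.has(chunk.id)) continue;

        store.set(chunk.id, applyTenantStampSketch(chunk, stamp));
        upserted++;
    }
    return upserted;
}

const store = new Map();
const ctx   = {tenantId: 't1', repoSlug: 'acme/docs'};

embedSketch(store, [{id: 'c1', hash: 'abc'}], ctx, () => 1000); // first embed
embedSketch(store, [{id: 'c1', hash: 'abc'}], ctx, () => 2000); // same-content re-push
embedSketch(store, [{id: 'c2', hash: 'def'}], ctx, () => 3000); // changed content, new row

console.log(store.get('c1').ingestedAt); // 1000
console.log(store.get('c2').ingestedAt); // 3000
```

Injecting the clock is a design choice for this sketch only; it lets a unit test assert the retained value without depending on wall-clock time.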
Contract Ledger
| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
|---|---|---|---|---|---|
| VectorService.resolveTenantStamp → chunk metadata ingestedAt | this ticket; the #11641 / #11640 / #11711 consumers | resolveTenantStamp adds ingestedAt: Date.now() (epoch ms); applyTenantStamp spreads it onto every embedded chunk's metadata. It marks the actual embed/upsert; an unchanged re-push (zero-change fast path) retains the prior value. | A chunk embedded before this ticket has no ingestedAt. Consumers MUST treat a missing ingestedAt as "unknown age" and fail safe (never expire / action a chunk with no timestamp), mirroring #11640's missing-tenantConfigVersion skip. | Yes (JSDoc) | Unit: resolveTenantStamp includes a finite ingestedAt; applyTenantStamp propagates it; a same-content re-push retains it |
| TENANT_GUARDED_FIELDS | #11631 write-side tenant stamping | ingestedAt is added; a client-supplied ingestedAt is rejected / overwritten by the server value (it is server-derived, never client-authored). | Per the existing spoofRejectionMode policy. | Yes (JSDoc) | Unit: a client-supplied ingestedAt is overwritten with the server value |
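The two edge cases in the ledger can be sketched from the consumer and write side. hasKnownAge and isOlderThan are hypothetical helper names (actual retention logic belongs to #11641), and the overwrite shown is only one possible outcome; the real behavior follows the existing spoofRejectionMode policy, which this ticket does not detail.

```javascript
// Hypothetical consumer-side guard: only a finite number counts as a known age.
function hasKnownAge(chunk) {
    return Number.isFinite(chunk.ingestedAt);
}

// Fail-safe age check: a chunk with no (or a malformed) ingestedAt never qualifies.
function isOlderThan(chunk, maxAgeMs, nowMs) {
    if (!hasKnownAge(chunk)) return false;
    return nowMs - chunk.ingestedAt > maxAgeMs;
}

const nowMs     = 10000;
const legacy    = {id: 'old'};                 // embedded before this ticket
const stale     = {id: 'a', ingestedAt: 0};
const fresh     = {id: 'b', ingestedAt: 9500};

console.log(isOlderThan(legacy, 1000, nowMs)); // false: unknown age, never actioned
console.log(isOlderThan(stale, 1000, nowMs));  // true
console.log(isOlderThan(fresh, 1000, nowMs));  // false

// Write side, overwrite case (assumption): the server stamp is spread last,
// so a client-supplied ingestedAt cannot survive.
const clientChunk = {id: 'c9', ingestedAt: 1};
const serverStamp = {ingestedAt: 5000};
const stamped     = {...clientChunk, ...serverStamp};

console.log(stamped.ingestedAt); // 5000
```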
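Acceptance Criteria
- resolveTenantStamp stamps ingestedAt (epoch ms) on the tenant stamp.
- ingestedAt propagates onto embedded chunk metadata via applyTenantStamp.
- ingestedAt is in TENANT_GUARDED_FIELDS (server-derived; client-spoof-rejected).
- ingestedAt is stamped at the actual embed / upsert; an unchanged same-content re-push (the zero-change fast path) retains the prior value.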
Out of Scope
- Consuming ingestedAt for retention / GC: that is #11641 (which this ticket unblocks).
- Backfilling ingestedAt onto chunks embedded before this ticket: consumers fail safe on a missing value; a backfill (if ever needed) is a separate concern.
Related
- #11641: Phase 4C GC daemon (blocked by this ticket; time/count-based retention needs ingestedAt).
- #11640 / #11711: reconciliation; both worked around the absent timestamp via a version-gap.
- #11631: write-side tenant stamping (the TENANT_GUARDED_FIELDS precedent).
- #11628: Phase 4 epic (this is a Phase-2-substrate enabler for it).
Origin Session ID
470c38e7-1ffc-4851-867d-d30c1b6fbdb2
Handoff Retrieval Hints
- The gap was surfaced during #11641 Phase 4C intake (the substrate V-B-A sweep) — see the #11641 intake-update comment.
- query_raw_memories: "KB chunk ingestedAt stamp retention prerequisite"