| Surface | Source of authority | Behavior | Edge cases / compatibility | Location | Evidence |
|---|---|---|---|---|---|
| `parsed-chunk-v1.schema.json` (ingest contract) | Discussion #11623 §4 Q4 + §1 #3; meta-Epic #11624 Topology Anchor; ADR 0003 | Validates client-side parsed chunks at the Phase 2 ingestion boundary. Required fields: `schemaVersion` = `'1.0.0'`, `tenantId` (lowercase-kebab AgentIdentity slug), `repoSlug`, `rootKind` (closed enum: `neo-workspace` / `bare-repo` / `external-source`), `sourcePath`, `content`, `hashInputs` (non-empty), `parserId`, `parserVersion` (semver), `kind` (open enum), `name`. Server-side embedding via `TextEmbeddingService.embedTexts()` always triggers; records carrying an `embedding` field are rejected via `additionalProperties: false`. | Records with an `embedding` field → REJECTED (spoof-rejection invariant; routed to the backup-record-v1 path if the intent is a restore). Records with an invalid `tenantId` pattern → REJECTED. Records with an unknown `rootKind` → REJECTED. Records with empty `hashInputs` → REJECTED. Schema-version drift → Phase 2 ingestion service emits a deprecation warning (out of scope here). | `ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json` (JSON Schema 2020-12); cross-referenced from `source/Base.mjs` JSDoc + `identity-tuple.md` + `deletion-signaling-contract.md` | Unit: `test/playwright/unit/ai/services/knowledge-base/parser/Schemas.spec.mjs`, 11 cases (happy path full + minimal, embedding spoof rejection, missing required, schemaVersion mismatch, rootKind enum, tenantId pattern, hashInputs empty, additionalProperties, open-enum kind, JSON round-trip) |
| `backup-record-v1.schema.json` (restore-only contract) | Discussion #11623 §3 audit table + §11 Avoided Trap "Conflating backup-record-v1 with ingest"; #10129 atomic-bundle `bundle-meta.json` shape | Validates restore-only records produced by `DatabaseService.manageDatabaseBackup({action: 'export'})` and consumed by `{action: 'import'}`. Required fields: `id`, `embedding` (non-empty number array), `metadata`, `document`. Embeddings are preserved verbatim; no re-embedding is triggered. | Records without `embedding` → REJECTED (matches the `DatabaseService.importDatabase` runtime constraint: `embeddings: batch.map(r => r.embedding)` produces `undefined` slots, NOT an omitted property; the Chroma client only auto-embeds when the entire `recordSet.embeddings` property is absent). Records with unknown top-level fields → REJECTED. Embedding-dim mismatch with the target collection → hard restore failure (Chroma layer; out-of-schema concern). | `ai/services/knowledge-base/parser/backup-record-v1.schema.json`; cross-referenced from `source/Base.mjs` JSDoc | Unit: same `Schemas.spec.mjs`, 7 cases (happy path with embedding, REJECT without embedding, missing id/metadata/document, additionalProperties, JSON round-trip) |
| Path-identity tuple `{tenantId, repoSlug, rootKind, sourcePath}` | Discussion #11623 §4 Q12 + §6 sweep point 3 (Cycle 2 GPT path-determinism blocker) + ADR 0003 | Replaces the single-`neoRootDir` source-path assumption. `tenantId` is server-stamped from the authenticated AgentIdentity (Phase 0/1C scope); `repoSlug` is tenant-namespaced; `rootKind` is a closed enum; `sourcePath` is forward-slash normalized relative to the `repoSlug` root. Same source content under two tenants → distinct chunk ids (server-side hash prepends `tenantId` + `repoSlug`). | Reserved value `tenantId: 'neo-shared'` tags Neo's curated content (visible across tenants via `memorySharing: 'team'` semantics). `repoSlug: 'neo'` is Neo's own repo identifier. Backward-compat: existing Neo chunks are reformulated with `tenantId='neo-shared'` + `repoSlug='neo'` (a byte-equivalence fixture in Phase 0/1B will validate non-disruption). | `ai/services/knowledge-base/parser/identity-tuple.md` | Doc: spec-only here; runtime tenant-aware chunkId derivation lands in Phase 0/1B (#11630) + Phase 0/1C (#11631) with byte-equivalence fixture |
| Deletion-signaling contract (tombstone / manifest / revision-boundary) | Discussion #11623 §4 Q11 + §6 sweep point 4 (Cycle 2 GPT state-mutability blocker) | Three mutually-supporting incremental-push deletion mechanisms: explicit tombstones (`{deleted: [<sourcePath>...]}`), manifest snapshot (`{manifestSnapshot: {pathsAfterPush}}`), revision boundary (`{baseRevision, headRevision}`). Server applies them in precedence order: revision-boundary → tombstones → manifest reconciliation. | Existing `VectorService.mjs:198-207` full-corpus delete logic is PRESERVED (handles full-resync). Phase 2 incremental-push consumes this contract. Multi-mechanism payload precedence: richer payloads override sparser ones. Force-push history rewrite needs the Phase 4B (#11640) reconciliation daemon. | `ai/services/knowledge-base/parser/deletion-signaling-contract.md` | Doc: spec-only here; runtime endpoint consumption lands in Phase 2A (#11633) `KnowledgeBaseIngestionService.ingestSourceFiles` |
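The ingest-contract rejections listed in the ledger can be illustrated with a small hand-rolled check. This is only a sketch, not the schema file itself: the real contract is a JSON Schema 2020-12 document exercised through AJV in `Schemas.spec.mjs`. The function name `validateParsedChunk`, both regular expressions, the array type assumed for `hashInputs`, and all sample values are hypothetical.

```javascript
// Hand-rolled stand-in for parsed-chunk-v1.schema.json, for illustration only.
// Assumptions: the regexes below and hashInputs being an array.
const REQUIRED = [
    'schemaVersion', 'tenantId', 'repoSlug', 'rootKind', 'sourcePath', 'content',
    'hashInputs', 'parserId', 'parserVersion', 'kind', 'name'
];
const OPTIONAL   = ['line_start', 'line_end', 'className', 'extends', 'customMeta'];
const ROOT_KINDS = ['neo-workspace', 'bare-repo', 'external-source']; // closed enum
const KEBAB      = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;                      // assumed tenantId pattern
const SEMVER     = /^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)?$/;          // simplified semver

function validateParsedChunk(record) {
    const errors  = [];
    const allowed = new Set([...REQUIRED, ...OPTIONAL]);

    // Mirrors additionalProperties: false, so `embedding` is rejected as unknown
    for (const key of Object.keys(record)) {
        if (!allowed.has(key)) errors.push(`unknown property: ${key}`);
    }
    for (const key of REQUIRED) {
        if (!(key in record)) errors.push(`missing required: ${key}`);
    }
    if ('schemaVersion' in record && record.schemaVersion !== '1.0.0') {
        errors.push('schemaVersion must be 1.0.0');
    }
    if ('tenantId' in record && !KEBAB.test(record.tenantId)) {
        errors.push('tenantId must be lowercase-kebab');
    }
    if ('rootKind' in record && !ROOT_KINDS.includes(record.rootKind)) {
        errors.push('rootKind not in closed enum');
    }
    if ('hashInputs' in record &&
        !(Array.isArray(record.hashInputs) && record.hashInputs.length > 0)) {
        errors.push('hashInputs must be non-empty');
    }
    if ('parserVersion' in record && !SEMVER.test(record.parserVersion)) {
        errors.push('parserVersion must be semver');
    }
    // `kind` is an open enum, so any value is accepted here
    return errors; // an empty array means the record is accepted
}

// Hypothetical sample record; every value is invented for this sketch.
const sample = {
    schemaVersion: '1.0.0',
    tenantId     : 'neo-shared',
    repoSlug     : 'neo',
    rootKind     : 'neo-workspace',
    sourcePath   : 'src/core/Base.mjs',
    content      : 'class Base {}',
    hashInputs   : ['content'],
    parserId     : 'example-parser',
    parserVersion: '1.2.3',
    kind         : 'class',
    name         : 'Base'
};

console.log(validateParsedChunk(sample));                        // accepted: no errors
console.log(validateParsedChunk({...sample, embedding: [0.1]})); // flags the `embedding` key
```

Returning a list of errors rather than throwing keeps the sketch close to how a schema validator reports multiple violations for one record.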
Context
Sub of Phase 0/1 Epic #11625 (meta-Epic #11624). Graduated from Discussion #11623.
Substrate floor for everything else in Phase 0/1. All other Phase 0/1 subs depend on this ticket's schemas + contracts being stable. Pure JSON Schema + spec authoring; no runtime code changes.
The Problem
Phase 0/1 + Phase 2 implementations need a stable contract surface BEFORE any service-layer or endpoint code lands. Currently:
- the `ingestSourceFiles` ingest payload has no formal schema
- `importDatabase` restore records (the existing JSONL `{id, embedding, metadata, document}` shape) have no formal schema
The Fix
Author the following at `ai/services/knowledge-base/parser/`:
- `parsed-chunk-v1.schema.json`: JSON Schema for client-side parser output (ingestion path). Fields: `schemaVersion`, `tenantId`, `repoSlug`, `rootKind`, `sourcePath`, `content`, `hashInputs`, `parserId`, `parserVersion`, `kind`, `name`, `line_start?`, `line_end?`, `className?`, `extends?`, `customMeta?`. MUST reject records carrying an `embedding` field (routed to the restore-only path).
- `backup-record-v1.schema.json`: JSON Schema for the existing `importDatabase` restore shape: `{id, embedding, metadata, document}`. Formalizes the existing contract; does NOT change behavior.
- `identity-tuple.md` (or inline schema docs): `{tenantId, repoSlug, rootKind, sourcePath}` semantics. `rootKind` enum: `neo-workspace | bare-repo | external-source`. `repoSlug` format conventions. Cross-link to `SearchService.mjs:118-120` (hydration impact) and `ApiSource.mjs:101-105` (current single-root assumption).
- `deletion-signaling-contract.md`: three mutually-supporting mechanisms specced:
  - Tombstones: `{deleted: [<sourcePath>...]}`
  - Manifest snapshot: `{manifestSnapshot: {pathsAfterPush: [<sourcePath>...]}}`
  - Revision boundary: `{baseRevision: <SHA>, headRevision: <SHA>}`
Acceptance Criteria
- `parsed-chunk-v1.schema.json` exists; passes JSON Schema validation
- `backup-record-v1.schema.json` exists; formalizes the current `importDatabase` shape; backward-compatible
- `identity-tuple.md` documents tuple semantics + `rootKind` enum
- `deletion-signaling-contract.md` specifies tombstone / manifest / revision-boundary payload shapes
- Cross-referenced from `Base.mjs` (chunk shape extension point) for cold-reader discoverability
- Valid `parsed-chunk-v1` records pass; records with an `embedding` field are rejected
- Valid `backup-record-v1` records pass; records WITHOUT an embedding are rejected (matches the `DatabaseService.importDatabase` runtime constraint)
Contract Ledger (T3, backfilled 2026-05-19 per GPT Cycle 1 review on PR #11647)
Per `learn/agentos/contract-ledger.md` Tier-3 requirement for tickets introducing consumed surfaces. The ledgered surfaces (`parsed-chunk-v1.schema.json`, `backup-record-v1.schema.json`, the path-identity tuple, and the deletion-signaling contract) are consumed by Phase 0/1B (#11630 registry), Phase 0/1C (#11631 write-side stamping), Phase 0/1D (#11632 read-side filter), Phase 2A/B/C (#11633-35 ingestion service + facades), Phase 4B/C (#11640-41 reconciliation + GC daemons).
Tier classification: T3 (Explicit Matrix). T4 (Executable Contract) is intentionally deferred: schema validators land as `parsed-chunk-v1` is consumed in Phase 2A/B/C runtime ingestion, not in Phase 0/1A. The schema files + AJV unit tests are the proof of definitional completeness at this phase; runtime enforcement layers per `evidence-ladder.md` L3-L4 are downstream Phase 2 acceptance.
Out of Scope
Related
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
- `query_raw_memories({query: 'parsed-chunk-v1 schema contract'})`
- `parsed-chunk-v1.schema.json`: substrate floor; everything else builds on this
- `MemoryService.mjs:391-410` query-time policy pattern is the read-side reference for tenant fields