Frontmatter
id: 11629
title: Phase 0/1A — Ingestion Contracts: parsed-chunk-v1 + backup-record-v1 + Path-Identity Tuple + Tombstone Spec
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-ada
createdAt: May 19, 2026, 1:53 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11629
author: neo-opus-ada
commentsCount: 0
parentIssue: 11625
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: [x] 11630 Phase 0/1B — Source/Parser Registry Extraction + Per-Source Path Externalization + Byte-Equivalence Fixture
closedAt: May 19, 2026, 4:22 PM

Phase 0/1A — Ingestion Contracts: parsed-chunk-v1 + backup-record-v1 + Path-Identity Tuple + Tombstone Spec

Closed v13.0.0/archive-v13-0-0-chunk-12 enhancement, ai, architecture
neo-opus-ada
neo-opus-ada commented on May 19, 2026, 1:53 PM

Context

Sub of Phase 0/1 Epic #11625 (meta-Epic #11624). Graduated from Discussion #11623.

Substrate floor for everything else in Phase 0/1. All other Phase 0/1 subs depend on this ticket's schemas + contracts being stable. Pure JSON Schema + spec authoring; no runtime code changes.

The Problem

Phase 0/1 + Phase 2 implementations need a stable contract surface BEFORE any service-layer or endpoint code lands. Currently:

  • No formal schema for client-side parsed chunks (the Phase 2 ingestSourceFiles ingest payload)
  • No formal schema for importDatabase restore records (the existing JSONL {id, embedding, metadata, document} shape)
  • No specified shape for tenant/repo/path identity tuple in chunk metadata
  • No specified contract for incremental-push deletion-signaling (tombstone / manifest / revision-boundary)

The Fix

Author the following at ai/services/knowledge-base/parser/:

  • parsed-chunk-v1.schema.json — JSON Schema for client-side parser output (ingestion path). Fields: schemaVersion, tenantId, repoSlug, rootKind, sourcePath, content, hashInputs, parserId, parserVersion, kind, name, line_start?, line_end?, className?, extends?, customMeta?. MUST reject records carrying an embedding field (routed to restore-only path).
  • backup-record-v1.schema.json — JSON Schema for existing importDatabase restore shape: {id, embedding, metadata, document}. Formalizes the existing contract; does NOT change behavior.
  • identity-tuple.md (or inline schema docs) — {tenantId, repoSlug, rootKind, sourcePath} semantics. rootKind enum: neo-workspace | bare-repo | external-source. repoSlug format conventions. Cross-link to SearchService.mjs:118-120 (hydration impact) and ApiSource.mjs:101-105 (current single-root assumption).
  • deletion-signaling-contract.md — three mutually-supporting mechanisms specced:
    1. Explicit tombstones: {deleted: [<sourcePath>...]}
    2. Manifest snapshot: {manifestSnapshot: {pathsAfterPush: [<sourcePath>...]}}
    3. Revision boundary: {baseRevision: <SHA>, headRevision: <SHA>}
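
The three deletion-signaling payload shapes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contract's implementation: `resolveDeletions` and `hasRevisionBoundary` are hypothetical helper names, the union semantics between tombstones and the manifest snapshot are an assumption, and the hex-SHA pattern is a guess at what `<SHA>` means. Precedence between mechanisms is left out on purpose.

```javascript
// Hypothetical helpers, not part of the ticket.

// Collects the sourcePaths an incremental push asks the server to delete.
// Assumption: tombstones and manifest snapshot are combined as a union.
function resolveDeletions(payload, indexedPaths) {
    const toDelete = new Set();

    // 1. Explicit tombstones: every listed sourcePath is deleted.
    for (const path of payload.deleted ?? []) {
        toDelete.add(path);
    }

    // 2. Manifest snapshot: an indexed path missing from pathsAfterPush is stale.
    const snapshot = payload.manifestSnapshot?.pathsAfterPush;
    if (Array.isArray(snapshot)) {
        const alive = new Set(snapshot);
        for (const path of indexedPaths) {
            if (!alive.has(path)) toDelete.add(path);
        }
    }

    return [...toDelete].sort();
}

// 3. Revision boundary: only checks that both revisions look like hex SHAs.
// Assumption: 7 to 64 lowercase hex characters.
function hasRevisionBoundary(payload) {
    const isSha = value => typeof value === 'string' && /^[0-9a-f]{7,64}$/.test(value);
    return isSha(payload.baseRevision) && isSha(payload.headRevision);
}
```

A payload may carry any combination of the three fields; in this sketch a payload with neither `deleted` nor `manifestSnapshot` resolves to an empty deletion list.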

Acceptance Criteria

  • parsed-chunk-v1.schema.json exists; passes JSON Schema validation
  • backup-record-v1.schema.json exists; formalizes current importDatabase shape; backward-compatible
  • identity-tuple.md documents tuple semantics + rootKind enum
  • deletion-signaling-contract.md specifies tombstone / manifest / revision-boundary payload shapes
  • Schemas referenced from inline JSDoc in Base.mjs (chunk shape extension point) for cold-reader discoverability
  • Schema unit test: round-trip valid parsed-chunk-v1 records pass; records with embedding field rejected
  • Schema unit test: round-trip valid backup-record-v1 records pass; records WITHOUT embedding rejected (matches DatabaseService.importDatabase runtime constraint)
  • Contract Ledger matrix below covers all four consumed surfaces
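
The two schema unit-test criteria can be illustrated with a hand-written stand-in for the schemas. The field lists are taken from this ticket; the function names, the error strings, and the decision to check keys by hand instead of loading the JSON Schema files are assumptions made to keep the sketch dependency-free. `hashInputs` is not type-checked because its shape is not stated here.

```javascript
// Field lists copied from this ticket; everything else is a hypothetical
// stand-in for the real JSON Schema files, which are not reproduced here.
const PARSED_CHUNK_REQUIRED = [
    'schemaVersion', 'tenantId', 'repoSlug', 'rootKind', 'sourcePath', 'content',
    'hashInputs', 'parserId', 'parserVersion', 'kind', 'name'
];
const PARSED_CHUNK_OPTIONAL = ['line_start', 'line_end', 'className', 'extends', 'customMeta'];
const ROOT_KINDS = ['neo-workspace', 'bare-repo', 'external-source'];
const BACKUP_RECORD_FIELDS = ['id', 'embedding', 'metadata', 'document'];

// Shared check: required keys present, no keys outside the allowed set
// (mirrors additionalProperties: false).
function checkKeys(record, required, allowed) {
    const errors = [];
    for (const key of required) {
        if (!(key in record)) errors.push(`missing required field: ${key}`);
    }
    for (const key of Object.keys(record)) {
        if (!allowed.includes(key)) errors.push(`unexpected field: ${key}`);
    }
    return errors;
}

// Ingest path: an embedding field is not in the allowed set, so it is rejected.
function validateParsedChunk(record) {
    const errors = checkKeys(record, PARSED_CHUNK_REQUIRED,
        [...PARSED_CHUNK_REQUIRED, ...PARSED_CHUNK_OPTIONAL]);
    if ('rootKind' in record && !ROOT_KINDS.includes(record.rootKind)) {
        errors.push(`unknown rootKind: ${record.rootKind}`);
    }
    return errors;
}

// Restore path: embedding is required and must be a non-empty number array.
function validateBackupRecord(record) {
    const errors = checkKeys(record, BACKUP_RECORD_FIELDS, BACKUP_RECORD_FIELDS);
    const embedding = record.embedding;
    const isVector = Array.isArray(embedding) && embedding.length > 0 &&
        embedding.every(n => typeof n === 'number');
    if ('embedding' in record && !isVector) {
        errors.push('embedding must be a non-empty number array');
    }
    return errors;
}
```

Both validators return a list of error strings, so an empty list means the record is accepted. The asymmetry is the point: the same `embedding` field that makes a backup record valid makes a parsed chunk invalid.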

Contract Ledger (T3 — backfilled 2026-05-19 per GPT Cycle 1 review on PR #11647)

Required by the Tier-3 rule in learn/agentos/contract-ledger.md for tickets that introduce consumed surfaces. The surfaces below are consumed by Phase 0/1B (#11630 registry), Phase 0/1C (#11631 write-side stamping), Phase 0/1D (#11632 read-side filter), Phase 2A/B/C (#11633 to #11635, ingestion service + facades), and Phase 4B/C (#11640 and #11641, reconciliation + GC daemons).

| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| parsed-chunk-v1.schema.json (ingest contract) | Discussion #11623 §4 Q4 + §1 #3; meta-Epic #11624 Topology Anchor; ADR 0003 | Validates client-side parsed chunks at Phase 2 ingestion boundary. Required fields: schemaVersion='1.0.0', tenantId (lowercase-kebab AgentIdentity slug), repoSlug, rootKind (closed enum: neo-workspace/bare-repo/external-source), sourcePath, content, hashInputs (non-empty), parserId, parserVersion (semver), kind (open enum), name. Server-side embedding via TextEmbeddingService.embedTexts() always triggers; records carrying embedding field rejected via additionalProperties: false. | Records with embedding field → REJECTED (spoof-rejection invariant; routed to backup-record-v1 path if restore intent). Records with invalid tenantId pattern → REJECTED. Records with unknown rootKind → REJECTED. Records with empty hashInputs → REJECTED. Schema-version drift → Phase 2 ingestion service emits deprecation warning (out-of-scope here). | ai/services/knowledge-base/parser/parsed-chunk-v1.schema.json (JSON Schema 2020-12); cross-referenced from source/Base.mjs JSDoc + identity-tuple.md + deletion-signaling-contract.md | Unit: test/playwright/unit/ai/services/knowledge-base/parser/Schemas.spec.mjs, 11 cases (happy path full + minimal, embedding spoof rejection, missing required, schemaVersion mismatch, rootKind enum, tenantId pattern, hashInputs empty, additionalProperties, open-enum kind, JSON round-trip) |
| backup-record-v1.schema.json (restore-only contract) | Discussion #11623 §3 audit table + §11 Avoided Trap "Conflating backup-record-v1 with ingest"; #10129 atomic-bundle bundle-meta.json shape | Validates restore-only records produced by DatabaseService.manageDatabaseBackup({action: 'export'}) and consumed by {action: 'import'}. Required fields: id, embedding (non-empty number array), metadata, document. Embeddings are preserved verbatim; no re-embedding triggered. | Records without embedding → REJECTED (matches DatabaseService.importDatabase runtime constraint: embeddings: batch.map(r => r.embedding) produces undefined slots, NOT omitted property; Chroma client only auto-embeds when entire recordSet.embeddings property absent). Records with unknown top-level fields → REJECTED. Embedding-dim mismatch with target collection → hard restore failure (Chroma layer; out-of-schema concern). | ai/services/knowledge-base/parser/backup-record-v1.schema.json; cross-referenced from source/Base.mjs JSDoc | Unit: same Schemas.spec.mjs, 7 cases (happy path with embedding, REJECT without embedding, missing id/metadata/document, additionalProperties, JSON round-trip) |
| Path-identity tuple {tenantId, repoSlug, rootKind, sourcePath} | Discussion #11623 §4 Q12 + §6 sweep point 3 (Cycle 2 GPT path-determinism blocker) + ADR 0003 | Replaces single-neoRootDir source-path assumption. tenantId server-stamped from authenticated AgentIdentity (Phase 0/1C scope); repoSlug tenant-namespaced; rootKind closed enum; sourcePath forward-slash normalized relative to repoSlug root. Same source content under two tenants → distinct chunk ids (server-side hash prepends tenantId+repoSlug). | Reserved value tenantId: 'neo-shared' tags Neo's curated content (visible across tenants via memorySharing: 'team' semantics). repoSlug: 'neo' is Neo's own repo identifier. Backward-compat: existing Neo chunks reformulated with tenantId='neo-shared' + repoSlug='neo' (byte-equivalence fixture in Phase 0/1B will validate non-disruption). | ai/services/knowledge-base/parser/identity-tuple.md | Doc: spec-only here; runtime tenant-aware chunkId derivation lands in Phase 0/1B (#11630) + Phase 0/1C (#11631) with byte-equivalence fixture |
| Deletion-signaling contract (tombstone / manifest / revision-boundary) | Discussion #11623 §4 Q11 + §6 sweep point 4 (Cycle 2 GPT state-mutability blocker) | Three mutually-supporting incremental-push deletion mechanisms: explicit tombstones ({deleted: [<sourcePath>...]}), manifest snapshot ({manifestSnapshot: {pathsAfterPush}}), revision boundary ({baseRevision, headRevision}). Server applies in precedence: revision-boundary → tombstones → manifest reconciliation. | Existing VectorService.mjs:198-207 full-corpus delete logic is PRESERVED (handles full-resync). Phase 2 incremental-push consumes this contract. Multi-mechanism payload precedence: richer payloads override sparser. Force-push history rewrite needs Phase 4B (#11640) reconciliation daemon. | ai/services/knowledge-base/parser/deletion-signaling-contract.md | Doc: spec-only here; runtime endpoint consumption lands in Phase 2A (#11633) KnowledgeBaseIngestionService.ingestSourceFiles |
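
The path-identity entry says that the same source content under two tenants must yield distinct chunk ids because the server-side hash prepends tenantId and repoSlug, and that sourcePath is forward-slash normalized. A minimal sketch of that idea follows. Everything here is an assumption for illustration: the real derivation lands in later phases and is not specified in this ticket, `deriveChunkId` and `normalizeSourcePath` are hypothetical names, and FNV-1a is a dependency-free stand-in for whatever hash the server actually uses.

```javascript
// Stand-in hash (32-bit FNV-1a). Assumption: the production hash is different
// and cryptographic; this one only keeps the sketch free of imports.
function fnv1a(text) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash.toString(16).padStart(8, '0');
}

// Forward-slash normalization relative to the repo root:
// backslashes become slashes and a leading "./" or "/" is dropped.
function normalizeSourcePath(sourcePath) {
    return sourcePath.replace(/\\/g, '/').replace(/^\.?\/+/, '');
}

// Hypothetical derivation: tenantId and repoSlug are prepended to the hash
// input, so identical content under two tenants cannot share an id input.
function deriveChunkId({tenantId, repoSlug, sourcePath}, content) {
    const parts = [tenantId, repoSlug, normalizeSourcePath(sourcePath), content];
    // NUL separator keeps field boundaries unambiguous.
    return fnv1a(parts.join('\u0000'));
}
```

With this shape, a chunk tagged tenantId 'neo-shared' and repoSlug 'neo' hashes differently from the same file pushed by another tenant, while a Windows-style path and its forward-slash form map to the same id.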

Tier classification: T3 (Explicit Matrix). T4 (Executable Contract) is intentionally deferred: schema validators land when parsed-chunk-v1 is consumed by Phase 2A/B/C runtime ingestion, not in Phase 0/1A. At this phase, the schema files + AJV unit tests are the proof of definitional completeness; the runtime enforcement layers per evidence-ladder.md L3-L4 are downstream Phase 2 acceptance.

Out of Scope

  • Source/Parser registry implementation → Phase 0/1B
  • Server-side schema validator wiring → Phase 0/1B / 2
  • memorySharing port → Phase 0/1C/D
  • Runtime tombstone enforcement → Phase 2 ingestion service

Related

  • Parent: #11625
  • Blocks: Phase 0/1B (#TBD), Phase 0/1C (#TBD), Phase 0/1D (#TBD), Phase 2 subs (need schemas to validate against)
  • Discussion source: #11623 §4 Q4 + §7 Phase 0/1

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'parsed-chunk-v1 schema contract'})
  • Start with parsed-chunk-v1.schema.json — substrate floor; everything else builds on this
  • MemoryService.mjs:391-410 query-time policy pattern is the read-side reference for tenant fields
tobiu referenced in commit 8f240be - "feat(kb): scaffold Phase 0/1A ingestion contracts (#11629) (#11647)" on May 19, 2026, 4:22 PM
tobiu closed this issue on May 19, 2026, 4:22 PM
tobiu referenced in commit fdfb48f - "fix(ai): KB importDatabase tolerates null document field (#11653) (#11657)" on May 20, 2026, 2:50 AM