Frontmatter
id: 11630
title: Phase 0/1B — Source/Parser Registry Extraction + Per-Source Path Externalization + Byte-Equivalence Fixture
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-ada
createdAt: May 19, 2026, 1:53 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11630
author: neo-opus-ada
commentsCount: 2
parentIssue: 11625
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: [x] 11629 Phase 0/1A — Ingestion Contracts: parsed-chunk-v1 + backup-record-v1 + Path-Identity Tuple + Tombstone Spec
blocking: [x] 11631 Phase 0/1C — KB Tenant Isolation Write-Side: VectorService Server-Stamping + Tenant-Aware chunkId + Spoof-Rejection
closedAt: May 20, 2026, 12:57 PM

Phase 0/1B — Source/Parser Registry Extraction + Per-Source Path Externalization + Byte-Equivalence Fixture

Closed v13.0.0/archive-v13-0-0-chunk-12 enhancement ai architecture
neo-opus-ada commented on May 19, 2026, 1:53 PM

Context

Sub-issue of Phase 0/1 Epic #11625 (meta-Epic #11624). Graduated from Discussion #11623.

Removes the hardcoded single-source-repo assumption from the substrate WITHOUT changing chunk-shape semantics. Validated by a byte-equivalence fixture: the chunk output for the current Neo sources must be identical before and after the extraction.

The Problem

The configurability gap is mechanically located:

// ai/services/knowledge-base/DatabaseService.mjs:460-471
const sources = [
    AdrSource, ApiSource, ConceptSource, DiscussionSource,
    LearningSource, PullRequestSource, ReleaseNotesSource,
    SkillSource, TicketSource, TestSource
];

In addition, each Source subclass hardcodes its paths (e.g., ApiSource.sourceMap maps src/apps/examples/docs/app/ai, which is Neo-specific).
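
A minimal sketch of what externalizing such a path map could look like. The helper name `resolveSourcePaths` and the default values are hypothetical and are not taken from the Neo codebase; the idea is only that a source keeps its current paths as defaults while a config entry can override individual keys.

```javascript
// Hypothetical defaults standing in for a hardcoded map such as ApiSource.sourceMap.
// Keys and values are illustrative, not copied from the Neo repository.
const defaultApiPaths = Object.freeze({
    src : 'src',
    apps: 'apps'
});

/**
 * Merge per-source path overrides from config over the built-in defaults.
 * @param {Object} defaults Paths the source class ships with
 * @param {Object} [entry]  Registry entry, e.g. {sourceClass: 'ApiSource', paths: {...}}
 * @returns {Object} A new object; neither input is mutated
 */
function resolveSourcePaths(defaults, entry = {}) {
    return {...defaults, ...(entry.paths ?? {})};
}

// Zero-config deployment: no entry, so the defaults apply unchanged.
const neoPaths = resolveSourcePaths(defaultApiPaths);

// Tenant deployment: only the overridden key changes.
const tenantPaths = resolveSourcePaths(defaultApiPaths, {
    sourceClass: 'ApiSource',
    paths      : {apps: 'app'}
});
```

Keeping the defaults inside the source and merging overrides on top is one way to leave the zero-config output untouched, which is the property the byte-equivalence fixture is meant to check.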

Cloud deployments need:

  • Data-driven source registration
  • Per-source path externalization
  • useDefaultSources / useDefaultParsers opt-in/out booleans
  • Custom-source/parser registration API
  • PROOF that the extraction doesn't regress retrieval quality (byte-equivalence fixture)

The Fix

  1. Data-driven source registry in aiConfig.knowledgeBase.sources:
       sources: [
           { sourceClass: 'AdrSource', paths: {...} },
           { sourceClass: 'ApiSource', paths: { src: 'src', apps: 'app', ... } },
           // ...
       ]
  2. useDefaultSources: true / useDefaultParsers: true (the defaults in aiConfig) preserve the current 10-source / 3-parser binding for zero-config Neo deployments
  3. Custom-source registration API: tenants can append/override entries in aiConfig.knowledgeBase.sources and aiConfig.knowledgeBase.parsers
  4. Per-source path externalization: ApiSource.sourceMap etc. → consume from aiConfig.knowledgeBase.sources[X].paths
  5. Byte-equivalence fixture at test/playwright/unit/ai/knowledge-base/byte-equivalence.spec.mjs:
    • Run current 10 sources × current parsers × current paths → capture chunk JSONL output
    • Run new registry-driven path → capture chunk JSONL output
    • Assert chunk-level byte-equivalence: chunk.hash and chunk.content are stable, and chunk.metadata is stable except for the new path-identity fields, which are reformulated as chunk.metadata.source = {tenantId: 'neo-shared', ...}. The fixture validates that this reformulation does not disturb existing content semantics.
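
The comparison in step 5 could be sketched as follows. This is an illustration under assumptions, not the fixture itself: it assumes each JSONL line parses to an object with `hash`, `content`, and `metadata`, it compares canonicalized JSON rather than raw bytes, and the helper names (`canonicalize`, `stripAllowedDiffs`, `diffChunkCaptures`) are hypothetical.

```javascript
/** Serialize with object keys sorted, so key order does not affect the comparison. */
function canonicalize(value) {
    if (Array.isArray(value)) {
        return `[${value.map(canonicalize).join(',')}]`;
    }
    if (value !== null && typeof value === 'object') {
        const body = Object.keys(value).sort()
            .map(key => `${JSON.stringify(key)}:${canonicalize(value[key])}`)
            .join(',');
        return `{${body}}`;
    }
    return JSON.stringify(value);
}

/** Drop the metadata fields that are allowed to differ between the two runs. */
function stripAllowedDiffs(chunk, ignoredMetadataKeys) {
    const metadata = {...(chunk.metadata ?? {})};
    ignoredMetadataKeys.forEach(key => delete metadata[key]);
    return {...chunk, metadata};
}

/**
 * Compare two JSONL captures chunk by chunk (order-sensitive).
 * @returns {Number[]} Zero-based indices of differing chunks; empty means equivalent
 */
function diffChunkCaptures(beforeJsonl, afterJsonl, ignoredMetadataKeys = ['source']) {
    const parse = text => text.split('\n')
        .filter(line => line.trim() !== '')
        .map(line => JSON.parse(line));

    const before     = parse(beforeJsonl);
    const after      = parse(afterJsonl);
    const length     = Math.max(before.length, after.length);
    const mismatches = [];

    for (let i = 0; i < length; i++) {
        // A missing chunk on either side stays undefined and counts as a mismatch.
        const a = before[i] && canonicalize(stripAllowedDiffs(before[i], ignoredMetadataKeys));
        const b = after[i]  && canonicalize(stripAllowedDiffs(after[i], ignoredMetadataKeys));

        if (a !== b) {
            mismatches.push(i);
        }
    }

    return mismatches;
}

// Illustrative captures: the second run only adds the path-identity field.
const beforeCapture = '{"hash":"h1","content":"a","metadata":{"type":"guide"}}\n';
const afterCapture  = '{"content":"a","hash":"h1","metadata":{"type":"guide","source":{"tenantId":"neo-shared"}}}\n';
const mismatches    = diffChunkCaptures(beforeCapture, afterCapture);
```

Returning the mismatching indices rather than a boolean makes a failing run easier to triage across a capture of roughly 24k chunks.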

Acceptance Criteria

  • aiConfig.knowledgeBase.sources data-driven registry shape defined
  • aiConfig.knowledgeBase.parsers data-driven parser registry shape defined
  • useDefaultSources boolean config (default true)
  • useDefaultParsers boolean config (default true)
  • Hardcoded sources array in DatabaseService.mjs:460-471 replaced with registry consumption
  • All 10 existing Source classes consume paths from config (not hardcoded)
  • Custom source/parser registration API documented in JSDoc + Phase 3 guide
  • Byte-equivalence fixture passes: current Neo source output (~24k chunks per recent Chroma defrag data) stable before/after
  • Unit tests: data-driven registry loader (default-true, default-false, custom-additions, custom-overrides)
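
The four loader cases named in the last criterion (default-true, default-false, custom-additions, custom-overrides) can be illustrated with a small resolver. This is a sketch under assumptions: `resolveSources` and the two-entry `defaultSources` list are hypothetical stand-ins, and it assumes that config entries are matched against defaults by `sourceClass`.

```javascript
// Hypothetical stand-in for the built-in registry; the real default set has 10 sources.
// The paths shown are illustrative only.
const defaultSources = [
    {sourceClass: 'AdrSource', paths: {root: 'adr'}},
    {sourceClass: 'ApiSource', paths: {src: 'src'}}
];

/**
 * Build the effective source list from config.
 * An entry whose sourceClass matches a default replaces its fields in place (override);
 * any other entry is appended (addition). The merge is shallow, so an override
 * that supplies paths replaces the whole paths object.
 */
function resolveSources({useDefaultSources = true, sources = []} = {}) {
    const resolved = useDefaultSources ? defaultSources.map(entry => ({...entry})) : [];

    sources.forEach(entry => {
        const index = resolved.findIndex(item => item.sourceClass === entry.sourceClass);

        if (index === -1) {
            resolved.push({...entry});
        } else {
            resolved[index] = {...resolved[index], ...entry};
        }
    });

    return resolved;
}

const defaultTrue  = resolveSources();
const defaultFalse = resolveSources({useDefaultSources: false});
const additions    = resolveSources({sources: [{sourceClass: 'TenantDocsSource', paths: {root: 'docs'}}]});
const overrides    = resolveSources({sources: [{sourceClass: 'ApiSource', paths: {src: 'lib'}}]});
```

The same shape would apply to a parser resolver driven by useDefaultParsers.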

Out of Scope

  • Schema authoring → Phase 0/1A (blocker)
  • memorySharing chunk metadata fields → Phase 0/1C/D
  • Runtime ingestion endpoint → Phase 2

Related

  • Parent: #11625
  • Blocked-by: Phase 0/1A (#TBD — needs schemas stable to validate registry output)
  • Blocks: Phase 0/1C (needs registry to know how chunkId derivation interacts with tenantId)
  • Discussion source: #11623 §7 Phase 0/1, §4 Q2 (registry shape)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • DatabaseService.mjs:460-471 is the mechanical surface
  • ApiSource.mjs:67-73 is the canonical hardcoded-paths example
  • Byte-equivalence fixture is the load-bearing safety net — author it FIRST to capture current behavior, then refactor against it