Frontmatter
id: 11658
title: KB Ingestion Phase 0/1B: Source/Parser registry + useDefaultSources/useDefaultParsers configs
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-ada
createdAt: May 19, 2026, 6:06 PM
updatedAt: May 20, 2026, 2:52 AM
githubUrl: https://github.com/neomjs/neo/issues/11658
author: neo-opus-ada
commentsCount: 0
parentIssue: 11625
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 20, 2026, 2:52 AM

KB Ingestion Phase 0/1B: Source/Parser registry + useDefaultSources/useDefaultParsers configs

Closed | v13.0.0/archive-v13-0-0-chunk-12 | enhancement, ai, architecture
neo-opus-ada commented on May 19, 2026, 6:06 PM

Context

Phase 0/1B sub-ticket of #11625 (Phase 0/1 of Epic #11624 — Cloud-Native KB Ingestion). Continues the increment pattern set by Phase 0/1A (#11629 merged as PR #11647) which shipped parsed-chunk-v1.schema.json + backup-record-v1.schema.json + the path-identity tuple + deletion-signaling contract.

Phase 0/1A defined the data contracts. Phase 0/1B replaces the hardcoded source-list at ai/services/knowledge-base/DatabaseService.mjs:470-481 with a data-driven registry, gated by useDefaultSources / useDefaultParsers boolean configs — the substrate floor that lets cloud deployments mix Neo's curated sources with tenant-supplied custom sources.

The Problem

DatabaseService.createKnowledgeBase() currently iterates a hardcoded array of 10 Neo-curated Source classes (AdrSource, ApiSource, ConceptSource, etc.). Cloud-deployed Agent OS workspaces need to:

  1. Toggle Neo's defaults off — e.g., a tenant whose external repo has nothing to do with Neo's curated content can pass useDefaultSources: false and ingest only their own content.
  2. Register custom Source classes — a tenant with a .proto parser, an ES5 codebase, or a C++ project needs a registration API that doesn't require forking Neo.
  3. Override per-source paths: ApiSource.sourceMap etc. currently hardcode Neo's local file layout; cloud deployments need per-tenant path config.

The path-prefix matcher pattern from the FileSystemIngestor incident (#11650/#11651) is the cautionary tale here — hardcoded structural assumptions become substrate-level invariants that are hard to unwind once consumed.

The Architectural Reality

This phase touches:

| File | Change |
|---|---|
| NEW ai/services/knowledge-base/source/SourceRegistry.mjs | Registry singleton holding registered Source classes + custom Parser classes. Default sources auto-register on import when aiConfig.useDefaultSources !== false. Provides registerSource(class, options) + registerParser(class, options) public API. |
| ai/mcp/server/knowledge-base/config.mjs + config.template.mjs | Add useDefaultSources: true (default) + useDefaultParsers: true (default) booleans. Add customSources: [] + customParsers: [] arrays (default empty) for declarative registration via config. |
| ai/services/knowledge-base/DatabaseService.mjs:470-481 | Replace hardcoded sources array with SourceRegistry.getSources(). Honor aiConfig.useDefaultSources toggle. |
| ai/services/knowledge-base/source/index.mjs (or _export.mjs if existing convention) | Centralized re-export of all default Source classes; auto-registers them when registry-singleton imports the index. |
| Per-source sourceMap / path config | Externalized to aiConfig.knowledgeBase.sourcePaths.<sourceName> with sensible Neo-default values; consumers read from config instead of hardcoded constants. |

The registration API mirrors the established Neo.setupClass pattern — Source classes are singleton-extending Neo.core.Base subclasses, so SourceRegistry.registerSource(MySource) accepts the class itself; the registry calls Neo.setupClass(MySource) if not already done.

The Fix

1. SourceRegistry singleton (new)

class SourceRegistry extends Base {
    static config = {
        className: 'Neo.ai.services.knowledge-base.source.SourceRegistry',
        singleton: true
    }

    #sources = new Map();  // sourceName → Source class
    #parsers = new Map();  // parserId → Parser class

    registerSource(SourceClass, {sourceName} = {}) { /* ... */ }
    registerParser(ParserClass, {parserId} = {}) { /* ... */ }

    getSources()  { return Array.from(this.#sources.values()); }
    getParsers()  { return Array.from(this.#parsers.values()); }
    hasSource(n)  { return this.#sources.has(n); }
    hasParser(id) { return this.#parsers.has(id); }
}
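The method bodies above are elided in the ticket. As a rough illustration of the register/dedup semantics the acceptance criteria mention, here is a minimal plain-JavaScript stand-in. It deliberately omits Neo.core.Base and the Neo.setupClass call so it runs on its own. The class name SourceRegistrySketch, the fallback of sourceName to the class name, and the boolean return value are all assumptions made for illustration, not decided behavior.

```javascript
// Stand-in for the registry: a plain class, not a Neo.core.Base singleton.
class SourceRegistrySketch {
    #sources = new Map(); // sourceName → Source class

    registerSource(SourceClass, {sourceName} = {}) {
        if (typeof SourceClass !== 'function') {
            throw new TypeError('registerSource expects a class');
        }

        // Assumption: fall back to the class name when no sourceName is given.
        const name = sourceName ?? SourceClass.name;

        // Assumption: a repeated registration under the same name is ignored.
        if (this.#sources.has(name)) {
            return false;
        }

        this.#sources.set(name, SourceClass);
        return true;
    }

    getSources() { return Array.from(this.#sources.values()); }
    hasSource(n) { return this.#sources.has(n); }
}

// Hypothetical tenant-supplied source class
class ProtoSource {}

const registry = new SourceRegistrySketch();
const first    = registry.registerSource(ProtoSource); // registered
const second   = registry.registerSource(ProtoSource); // deduplicated
```

Keying the Map by name rather than by class keeps hasSource(n) cheap, and it leaves open whether a later registration under an existing name should be rejected or should replace the earlier one.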

2. Auto-registration of defaults

ai/services/knowledge-base/source/index.mjs imports each default Source class + calls SourceRegistry.registerSource(...) for each, conditionally on aiConfig.useDefaultSources !== false.
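A small sketch of that gate may help. It assumes a Map as a stand-in for the registry and three empty classes in place of Neo's real defaults; registerDefaults is a hypothetical helper name. The point it illustrates is that `!== false` treats a missing flag as "defaults on", so only an explicit false opts out.

```javascript
// Stand-ins for Neo's default Source classes (the real ones live in the Neo repo).
class AdrSource {}
class ApiSource {}
class ConceptSource {}

const defaultSources = [AdrSource, ApiSource, ConceptSource];

// Hypothetical helper: registers the defaults unless the flag is explicitly false.
function registerDefaults(registry, aiConfig) {
    if (aiConfig.useDefaultSources !== false) {
        defaultSources.forEach(SourceClass => registry.set(SourceClass.name, SourceClass));
    }
    return registry;
}

const withDefaults    = registerDefaults(new Map(), {});                         // flag absent
const withoutDefaults = registerDefaults(new Map(), {useDefaultSources: false}); // tenant opts out
```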

3. DatabaseService.createKnowledgeBase() refactor

async createKnowledgeBase() {
    // ...
    const sources = SourceRegistry.getSources();
    // ... existing iteration logic unchanged
}
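To show why the iteration logic can stay unchanged, the sketch below runs a loop over whatever getSources() returns. Everything in it is a stand-in: registrySketch replaces SourceRegistry, the two source classes are hypothetical, and instantiating each class with `new` is an assumption (Neo singletons may be consumed differently). Only the extract(writeStream, createHashFn) signature comes from the ticket.

```javascript
// Stand-in registry; a Map keeps insertion order, so sources come back in registration order.
const registrySketch = {
    sources: new Map(),
    getSources() { return Array.from(this.sources.values()); }
};

// Hypothetical sources using the extract(writeStream, createHashFn) signature.
class AdrSourceSketch {
    async extract(writeStream, createHashFn) {
        const content = 'decision record';
        writeStream.write(JSON.stringify({type: 'adr', hash: createHashFn(content)}) + '\n');
    }
}

class ApiSourceSketch {
    async extract(writeStream, createHashFn) {
        const content = 'class docs';
        writeStream.write(JSON.stringify({type: 'api', hash: createHashFn(content)}) + '\n');
    }
}

registrySketch.sources.set('AdrSource', AdrSourceSketch);
registrySketch.sources.set('ApiSource', ApiSourceSketch);

// Assumption: each registered class is instantiated with `new` before extraction.
async function createKnowledgeBaseSketch(writeStream, createHashFn) {
    for (const SourceClass of registrySketch.getSources()) {
        await new SourceClass().extract(writeStream, createHashFn);
    }
}

// In-memory stand-ins for the write stream and the hash function
const lines        = [];
const writeStream  = {write: line => lines.push(line)};
const createHashFn = content => `len-${content.length}`;

const done = createKnowledgeBaseSketch(writeStream, createHashFn);
```

Because the loop only depends on the list it is handed, swapping the hardcoded array for a registry lookup changes where the list comes from, not how it is consumed.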

4. Config additions

config.mjs + config.template.mjs get:

knowledgeBase: {
    useDefaultSources: true,
    useDefaultParsers: true,
    customSources    : [],  // [{className, sourceName, sourcePath}, ...]
    customParsers    : [],  // [{className, parserId, parserVersion}, ...]
    sourcePaths      : {
        ApiSource     : 'docs/output/all.json',
        LearningSource: 'learn/tree.json',
        // ...
    }
}
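As an illustration of how a tenant might layer overrides on these defaults, here is a minimal sketch. The merge helper resolveKnowledgeBaseConfig, the tenant class name, and the tenant paths are hypothetical; how Neo actually merges config is not stated in this ticket. Only the keys and default values are taken from the fragment above.

```javascript
// Defaults copied from the ticket's config fragment
const defaultKnowledgeBaseConfig = {
    useDefaultSources: true,
    useDefaultParsers: true,
    customSources    : [],
    customParsers    : [],
    sourcePaths      : {
        ApiSource     : 'docs/output/all.json',
        LearningSource: 'learn/tree.json'
    }
};

// Hypothetical helper: shallow merge, plus a one-level merge for sourcePaths
// so that overriding one path does not drop the others.
function resolveKnowledgeBaseConfig(overrides = {}) {
    return {
        ...defaultKnowledgeBaseConfig,
        ...overrides,
        sourcePaths: {...defaultKnowledgeBaseConfig.sourcePaths, ...overrides.sourcePaths}
    };
}

// Hypothetical tenant: Neo defaults off, one custom source, one path override
const tenantConfig = resolveKnowledgeBaseConfig({
    useDefaultSources: false,
    customSources    : [{className: 'Tenant.source.ProtoSource', sourceName: 'ProtoSource', sourcePath: 'proto/'}],
    sourcePaths      : {ApiSource: 'tenant/docs/all.json'}
});
```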

5. Byte-equivalence fixture test

A unit test that:

  1. Generates dist/ai-knowledge-base.jsonl with the pre-registry code path (snapshot fixture).
  2. Generates the same with the post-registry code path under useDefaultSources: true.
  3. Asserts byte-for-byte equivalence (or chunk-set equivalence — file order may differ if registry iteration semantics change).

This guarantees the refactor doesn't subtly change Neo's KB output.
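The comparison step could look roughly like the sketch below. It takes the two JSONL payloads as strings (a real test would read them from the snapshot fixture and the freshly generated file), checks exact equality first, and falls back to an order-insensitive comparison of lines. The function name compareJsonl and the three result labels are assumptions for illustration.

```javascript
// Compares two JSONL payloads: exact match first, then order-insensitive line sets.
// Assumption: both inputs are already decoded to strings from the same encoding.
function compareJsonl(before, after) {
    if (before === after) {
        return 'byte-equivalent';
    }

    // Sort non-empty lines so that registry iteration order does not matter.
    const toSortedLines = text => text.split('\n').filter(line => line.length > 0).sort();

    const a = toSortedLines(before);
    const b = toSortedLines(after);

    const sameSet = a.length === b.length && a.every((line, i) => line === b[i]);

    return sameSet ? 'chunk-set-equivalent' : 'different';
}

// Tiny hand-written payloads standing in for dist/ai-knowledge-base.jsonl
const snapshot  = '{"id":"a"}\n{"id":"b"}\n';
const sameOrder = '{"id":"a"}\n{"id":"b"}\n';
const reordered = '{"id":"b"}\n{"id":"a"}\n';
const changed   = '{"id":"a"}\n{"id":"c"}\n';
```

Reporting which of the two equivalence levels held makes an ordering change visible in the test output instead of hiding it behind a pass.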

Acceptance Criteria

  • SourceRegistry.mjs exists with registerSource / registerParser / getSources / getParsers / hasSource / hasParser public methods
  • useDefaultSources (default true) + useDefaultParsers (default true) configs in config.mjs + config.template.mjs
  • customSources + customParsers arrays in config (default []) for declarative-registration path
  • Default Source classes auto-register via ai/services/knowledge-base/source/index.mjs
  • DatabaseService.createKnowledgeBase() uses SourceRegistry.getSources() instead of hardcoded array
  • Per-source path config externalized under aiConfig.knowledgeBase.sourcePaths.*
  • Unit tests under test/playwright/unit/ai/services/knowledge-base/source/ for SourceRegistry: register/dedup/unregister/list semantics, useDefaultSources: false skip-defaults behavior, custom source registration round-trip
  • Byte-equivalence fixture test passes (current Neo KB output stable post-refactor under useDefaultSources: true)
  • Existing npm run ai:sync-kb produces unchanged output under default config (manual smoke test documented in PR)
  • JSDoc + Anchor & Echo discipline on all new public surfaces

Out of Scope

  • Cross-server push pipeline + MCP small-batch facade (Phase 2 #11626 — blocked by all of Phase 0/1)
  • KB Tenant Isolation (write-side stamping + read-side filter; tracked separately in remaining Phase 0/1 ACs — will be a sibling Phase 0/1C sub-issue)
  • Custom Parser registration end-to-end runtime (this PR ships the API surface + config; actual cross-language parser execution belongs to Phase 2 / Phase 3 demos)
  • ES5 + C++ workspace integration fixtures (deferred to Phase 2 per #11625 body)
  • HTTP / streaming transport for cross-server tenant push (Phase 2)

Avoided Traps

| Trap | Why rejected |
|---|---|
| Skipping the registry, keeping hardcoded array but adding useDefaultSources: false early-return | The hardcoded array is the substrate-level invariant. Registry refactor is required for tenant-supplied sources; gating early-return doesn't enable custom registration. |
| Putting config in aiConfig.knowledgeBase.sources as inline registration object | Loses class-extension semantics; tenants need to subclass Base.mjs and supply their own extract() method. Class registration matches the Source-class shape. |
| Auto-registering Neo defaults via customSources config | Couples default discipline to operator config. Defaults must be code-level discoverable (ai/services/knowledge-base/source/index.mjs) so adding a new Neo source doesn't require config sync. |
| Renaming Source.extract signature to accept sourcePath parameter | Cross-cuts every existing Source class and the test suite. Keep extract(writeStream, createHashFn) stable; per-source path config is read inside each source's class via aiConfig.knowledgeBase.sourcePaths.<name>. |
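The last row can be illustrated with a short sketch: the source looks up its own path from config at call time, so extract(writeStream, createHashFn) keeps its signature. The aiConfig object here is a local stand-in following the aiConfig.knowledgeBase.sourcePaths.<name> location named in this ticket, and ApiSourceSketch, its defaultPath fallback, and the tenant path are hypothetical.

```javascript
// Local stand-in for aiConfig, using the sourcePaths location named in the ticket.
const aiConfig = {
    knowledgeBase: {
        sourcePaths: {ApiSource: 'docs/output/all.json'}
    }
};

class ApiSourceSketch {
    // Hypothetical fallback used when the config has no entry for this source
    static defaultPath = 'docs/output/all.json';

    // Read at call time, so a config change is picked up without touching extract().
    get sourcePath() {
        return aiConfig.knowledgeBase.sourcePaths?.ApiSource ?? ApiSourceSketch.defaultPath;
    }

    // Signature unchanged: the path is looked up inside the method, not passed in.
    extract(writeStream, createHashFn) {
        writeStream.write(`reading ${this.sourcePath}\n`);
    }
}

const written     = [];
const writeStream = {write: line => written.push(line)};
const source      = new ApiSourceSketch();

source.extract(writeStream, () => '');

// Hypothetical tenant override applied through config only
aiConfig.knowledgeBase.sourcePaths.ApiSource = 'tenant/api.json';
source.extract(writeStream, () => '');
```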

Related

  • Parent Phase Epic: #11625 (Phase 0/1: Contracts, Source/Parser Registry, KB Tenant Isolation)
  • Predecessor Sub-ticket: #11629 (Phase 0/1A: schemas + identity tuple + deletion-signaling) → merged as PR #11647
  • Parent Epic: #11624
  • Origin Discussion: #11623
  • Sibling that depends on this: #11626 Phase 2 (Ingestion Service + MCP facade) — KnowledgeBaseIngestionService consumes the registry for tenant-source discovery
  • Substrate-audit-consumer-sweep anchor: Phase 0/1A repair cycles (GPT C1 + Gemini C1 + operator V-B-A) identified consumer surfaces that need stable contracts before this registry layer can land safely — those landed in Phase 0/1A first.

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'Source registry useDefaultSources useDefaultParsers KB ingestion phase 0/1B'})
  • ask_knowledge_base({query: 'DatabaseService createKnowledgeBase hardcoded sources array', type: 'src'})
  • Empirical anchor: ai/services/knowledge-base/DatabaseService.mjs:470-481 hardcoded source array — the substrate-mechanical surface to refactor
  • Pattern reference: ai/services/knowledge-base/parser/ from Phase 0/1A shipped #11647 (schemas + JSDoc convention)
tobiu closed this issue on May 20, 2026, 2:52 AM
tobiu referenced in commit d1b3cf7 - "feat(ai): externalize per-source paths to aiConfig.sourcePaths (#11660) (#11661)" on May 20, 2026, 9:48 AM