Frontmatter
id: 11665
title: KB Ingestion Phase 4A: Multi-tenant ingestion observability daemon scaffold + telemetry schema
state: Closed
labels: enhancement, ai
assignees: neo-opus-ada
createdAt: May 20, 2026, 3:38 AM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11665
author: neo-opus-ada
commentsCount: 2
parentIssue: 11628
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 20, 2026, 4:28 AM

KB Ingestion Phase 4A: Multi-tenant ingestion observability daemon scaffold + telemetry schema

neo-opus-ada commented on May 20, 2026, 3:38 AM

Context

Phase 4A sub-ticket of #11628 (Operations + Observability for Cloud-Native KB Deployments). Surfaced 2026-05-20 during nightshift-mode lane-pickup after parallel-track work on #11631 (Phase 0/1C-α write-side stamping by @neo-gpt) + #11660 (Phase 0/1B-β path externalization by @neo-opus-ada) + #11663 (Phase 4 retention policy by @neo-opus-ada).

Pre-Phase-2-actionable per #11628 body escape clause:

"Operator alerting + dashboard surfacing may begin in parallel once Phase 0/1 #11625 contracts land (daemon-scaffolding can start pre-Phase-2)."

Phase 0/1A schemas + 0/1B registry are already on dev; Phase 0/1B-β paths (PR #11661) + 0/1C-α write-side stamping (PR #11662) are in @tobiu's merge gate. The daemon SHELL + TELEMETRY SCHEMA can land before Phase 2 #11626 ships its ingestion-service hooks — actual live-integration with Phase 2 ingestion calls is deferred to a follow-up sub-ticket.

The Problem

After Phase 0/1 + Phase 0/1C-α land, cloud Agent OS deployments have:

  • Stable parsed-chunk-v1 + backup-record-v1 schemas
  • Source/Parser registry + per-source path config
  • Write-side tenant stamping with tenantId / repoSlug / visibility / originAgentIdentity
  • Configurable bundle retention

What's still missing for operational visibility:

  • No per-tenant ingestion metrics — push frequency, error rates, ingestion latency, embedding-budget burn
  • No telemetry schema for what metrics get persisted
  • No daemon orchestration that wakes periodically to roll up + persist tenant-scoped metrics

The Phase 4 #11628 ticket's value-prop framing (operational-cost-of-recovery reduction, NOT data-loss prevention) anchors what 4A should collect.

The Architectural Reality

Per #11628 Architectural Substrate Precedent — Phase 4A EXTENDS KBRecorderService.mjs rather than introducing a new substrate. KBRecorderService already persists every KB MCP tool invocation into the shared Memory Core SQLite database. Phase 4A adds:

  1. Per-tenant scoping: extend the existing kb_query_log schema with tenant-aware columns (tenantId, repoSlug, visibility) OR add a sibling table kb_ingestion_metrics for ingestion-specific counts.
  2. Daemon orchestration: new daemon at ai/scripts/kb-observability-daemon.mjs following the orchestrator-daemon.mjs / swarm-heartbeat-daemon.mjs pattern. Wakes on schedule; rolls up metrics; persists to MC SQLite.
  3. Telemetry-schema doc: design doc at learn/agentos/decisions/0NNN-kb-ingestion-telemetry-schema.md (ADR) defining the persistence contract so Phase 4B reconciliation + Phase 4D alerting consume a stable schema.

The Fix

1. Telemetry-schema ADR

Create learn/agentos/decisions/0NNN-kb-ingestion-telemetry-schema.md:

  • What metrics are collected (per-tenant)
  • Schema shape (column names, types, indexing)
  • Retention policy (per-tenant — different from Phase 4 bundle retention)
  • Consumer expectations (Phase 4B reconciliation, Phase 4D alerting, operator dashboards)
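
To make the persistence contract more concrete, a single roll-up record might be validated along the following lines. This is a minimal sketch, not the ADR's schema: tenantId, repoSlug and visibility come from the write-side stamping fields listed under The Problem, while windowStart, pushCount, errorCount, avgLatencyMs, embeddingTokens and the function name validateMetricRecord are hypothetical placeholders for the per-tenant metrics this ticket wants collected.

```javascript
// Hypothetical roll-up record fields. Only tenantId, repoSlug and visibility
// are named in this ticket; every other field is an illustrative assumption.
const REQUIRED_STRING_FIELDS  = ['tenantId', 'repoSlug', 'visibility', 'windowStart'];
const REQUIRED_NUMERIC_FIELDS = ['pushCount', 'errorCount', 'avgLatencyMs', 'embeddingTokens'];

// Returns a list of human-readable problems; an empty list means the record is valid.
function validateMetricRecord(record) {
    const problems = [];

    for (const field of REQUIRED_STRING_FIELDS) {
        if (typeof record[field] !== 'string' || record[field].length === 0) {
            problems.push(`${field} must be a non-empty string`);
        }
    }

    for (const field of REQUIRED_NUMERIC_FIELDS) {
        if (!Number.isFinite(record[field]) || record[field] < 0) {
            problems.push(`${field} must be a non-negative number`);
        }
    }

    // An error count can never exceed the number of pushes it was derived from.
    if (Number.isFinite(record.errorCount) && Number.isFinite(record.pushCount)
        && record.errorCount > record.pushCount) {
        problems.push('errorCount must not exceed pushCount');
    }

    return problems;
}
```

Returning a list of problems rather than throwing on the first one lets a consumer such as Phase 4B reconciliation report every contract violation in a row at once.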

2. SQLite schema extension

Either:

  • (a) Extend kb_query_log with tenantId, repoSlug, visibility columns (add migration)
  • (b) Add sibling kb_ingestion_metrics table for ingestion-specific counts

Choice deferred to implementation — extension is simpler if metrics are query-call-shaped; sibling table is cleaner if metrics are roll-up-shaped.
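
The two options can be compared side by side as SQLite statements. The sketch below is an illustration under assumptions: the columns beyond tenantId, repoSlug and visibility, the index name, and the helper buildSchemaStatements are hypothetical, and the actual migration mechanism used by KBRecorderService is not described in this ticket.

```javascript
// Builds the SQL statements for one of the two schema options.
// 'extend'  -> option (a): tenant-aware columns on kb_query_log
// 'sibling' -> option (b): separate kb_ingestion_metrics roll-up table
function buildSchemaStatements(option) {
    if (option === 'extend') {
        // SQLite adds exactly one column per ALTER TABLE ... ADD COLUMN statement.
        return ['tenantId', 'repoSlug', 'visibility'].map(
            column => `ALTER TABLE kb_query_log ADD COLUMN ${column} TEXT`
        );
    }

    if (option === 'sibling') {
        return [
            // Hypothetical roll-up shape: one row per tenant, repo and time window.
            `CREATE TABLE IF NOT EXISTS kb_ingestion_metrics (
                tenantId    TEXT    NOT NULL,
                repoSlug    TEXT    NOT NULL,
                visibility  TEXT    NOT NULL,
                windowStart TEXT    NOT NULL,
                pushCount   INTEGER NOT NULL DEFAULT 0,
                errorCount  INTEGER NOT NULL DEFAULT 0,
                PRIMARY KEY (tenantId, repoSlug, windowStart)
            )`,
            // Supports time-ordered scans by alerting consumers.
            'CREATE INDEX IF NOT EXISTS idx_kb_ingestion_metrics_window ON kb_ingestion_metrics (windowStart)'
        ];
    }

    throw new Error(`Unknown schema option: ${option}`);
}
```

Under these assumptions, option (a) leaves existing rows with NULL tenant columns, whereas option (b) leaves kb_query_log untouched, which may make the no-regression criterion easier to meet.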

3. Daemon shell

ai/scripts/kb-observability-daemon.mjs:

import {pathToFileURL} from 'node:url';

import {KBRecorderService} from '../services/knowledge-base/KBRecorderService.mjs';
import mcConfig from '../mcp/server/memory-core/config.mjs';

// Wake on schedule (default: every 15 minutes).
// Roll up tenant-scoped metrics from kb_query_log (or sibling).
// Persist to a roll-up table for Phase 4D alerting consumption.

export async function runObservabilityDaemon({intervalMs = 15 * 60 * 1000, oneShot = false} = {}) {
    // ... implementation
}

// Run only when this file is executed directly. Comparing against pathToFileURL()
// instead of a hand-built `file://` string keeps the check correct on Windows and
// for paths containing characters that get percent-encoded.
if (process.argv[1] && import.meta.url === pathToFileURL(process.argv[1]).href) {
    runObservabilityDaemon({oneShot: process.argv.includes('--once')}).catch(err => {
        console.error('[kb-observability-daemon] failed:', err);
        process.exit(1);
    });
}
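
The elided body of the shell could take roughly the following shape. This is a sketch under stated assumptions, not the shipped daemon: rollupOnce is a hypothetical injected callback standing in for the real roll-up, signal is a hypothetical AbortSignal used to stop the loop, and the names sleep and runLoop do not appear in this ticket.

```javascript
// Sleeps for `ms`, resolving early if the optional AbortSignal fires.
function sleep(ms, signal) {
    return new Promise(resolve => {
        if (signal?.aborted) {
            resolve();
            return;
        }

        const timer = setTimeout(done, ms);

        function done() {
            signal?.removeEventListener('abort', done);
            clearTimeout(timer);
            resolve();
        }

        signal?.addEventListener('abort', done, {once: true});
    });
}

// Runs the roll-up once (oneShot) or repeatedly every `intervalMs` until aborted.
// Resolves with the number of roll-up attempts that were made.
async function runLoop({rollupOnce, intervalMs = 15 * 60 * 1000, oneShot = false, signal} = {}) {
    let runs = 0;

    while (!signal?.aborted) {
        try {
            await rollupOnce();
        } catch (err) {
            // In one-shot mode the operator needs the failure; rethrow it.
            if (oneShot) throw err;
            // In scheduled mode a single failed roll-up should not stop the daemon.
            console.error('[kb-observability-daemon] rollup failed:', err);
        }

        runs++;

        if (oneShot) break;

        await sleep(intervalMs, signal);
    }

    return runs;
}
```

Injecting rollupOnce keeps the scheduling testable without a database, and rethrowing only in one-shot mode preserves the non-zero exit code that the entry-point guard relies on for manual runs.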

4. Operator-facing CLI

Add an npm run ai:kb-observability entry to package.json, providing a one-shot mode (-- --once) for operator-driven manual rollup.

5. Tests

  • Unit: kb-observability-daemon.spec.mjs — verifies rollup logic against synthetic kb_query_log rows with mixed tenants.
  • Integration: synthetic multi-tenant fixture (reuse Phase 5 test fixtures once they land) — deferred to follow-up if Phase 5 fixtures aren't ready.
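
The unit-level roll-up check could look something like the following. It is a self-contained sketch: the row fields status and latencyMs and the function rollupByTenant are hypothetical stand-ins, since this ticket leaves both the kb_query_log column set and the roll-up shape to implementation.

```javascript
// Groups rows by tenantId and derives per-tenant counts, error rate and mean latency.
function rollupByTenant(rows) {
    const byTenant = new Map();

    for (const row of rows) {
        let bucket = byTenant.get(row.tenantId);

        if (!bucket) {
            bucket = {tenantId: row.tenantId, count: 0, errorCount: 0, totalLatencyMs: 0};
            byTenant.set(row.tenantId, bucket);
        }

        bucket.count++;
        bucket.totalLatencyMs += row.latencyMs;

        if (row.status === 'error') {
            bucket.errorCount++;
        }
    }

    // Map preserves insertion order, so tenants appear in first-seen order.
    return [...byTenant.values()].map(({tenantId, count, errorCount, totalLatencyMs}) => ({
        tenantId,
        count,
        errorCount,
        errorRate   : errorCount / count,
        avgLatencyMs: totalLatencyMs / count
    }));
}

// Synthetic rows with interleaved tenants, mirroring the "mixed tenants" case.
const syntheticRows = [
    {tenantId: 'tenant-a', status: 'ok',    latencyMs: 100},
    {tenantId: 'tenant-b', status: 'error', latencyMs: 300},
    {tenantId: 'tenant-a', status: 'error', latencyMs: 200},
    {tenantId: 'tenant-a', status: 'ok',    latencyMs: 300}
];

const rollup = rollupByTenant(syntheticRows);
// tenant-a: 3 rows, 1 error, avgLatencyMs 200; tenant-b: 1 row, 1 error, avgLatencyMs 300
```

The interleaved ordering matters for the test: it checks that one tenant's rows never leak into another tenant's bucket, which is the property per-tenant scoping has to guarantee.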

Acceptance Criteria

  • Telemetry-schema ADR drafted under learn/agentos/decisions/ (number assigned at file time)
  • SQLite schema extension OR sibling table added (impl decides between extending kb_query_log vs creating kb_ingestion_metrics)
  • ai/scripts/kb-observability-daemon.mjs shell exists, follows existing daemon pattern, wakes on schedule
  • Operator CLI npm run ai:kb-observability -- --once exists for manual rollup
  • Unit test covers rollup against synthetic per-tenant rows
  • Phase 2 live-integration hooks explicitly deferred to a follow-up sub-ticket (Phase 4A-β or similar)
  • No regression to existing KBRecorderService behavior (manual smoke OR test assertion)

Out of Scope

  • Phase 2 live-integration hooks: the Phase 2 #11626 ingestion service has not shipped yet, so integration follows in Phase 4A-β once Phase 2 lands.
  • Phase 4D alerting surface — separate sub-ticket. This phase collects + persists telemetry; alerting consumes it.
  • Phase 4B reconciliation daemon — separate sub-ticket. Different concern (state-drift detection vs metrics rollup).
  • Phase 4C stale-chunk GC daemon — separate sub-ticket. Different concern.
  • Per-tenant SLA / quota enforcement — V1 surfaces the data; SLA enforcement is post-V1 commercialization scope per #11628 OOS.
  • Cross-deployment fleet management — single-deployment scope for V1.

Avoided Traps

| Trap | Why rejected |
| --- | --- |
| Building before Phase 2 ingestion-service exists | This sub-ticket scopes to scaffolding + schema only; live-integration deferred. The shell + schema ARE pre-Phase-2-actionable per #11628 escape clause. |
| New telemetry database | KBRecorderService already uses Memory Core SQLite; reuse the substrate. |
| Dashboard infrastructure | Out of scope for daemon work; sandman_handoff or portal app extension is Phase 4D territory. |
| Bundling rollup + alerting into one daemon | Separate concerns; testability + future modification budget benefit from separate daemons (per #11628 Avoided Traps). |
| Roll-up cadence baked into the daemon | Cadence becomes aiConfig.kbObservability.intervalMs config so operators can tune for high-volume vs low-volume deployments. |
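
The configurable cadence from the last trap might be resolved along these lines. A sketch only: aiConfig.kbObservability.intervalMs is the key named above and the 15-minute fallback is borrowed from the daemon shell's default, but the one-second floor and the function name resolveIntervalMs are assumptions.

```javascript
const DEFAULT_INTERVAL_MS = 15 * 60 * 1000; // default taken from the daemon shell
const MIN_INTERVAL_MS     = 1000;           // assumed floor, guards against busy loops

// Reads the roll-up cadence from config, falling back to the default when unset.
function resolveIntervalMs(aiConfig) {
    const configured = aiConfig?.kbObservability?.intervalMs;

    if (configured === undefined || configured === null) {
        return DEFAULT_INTERVAL_MS;
    }

    // Reject bad values loudly instead of silently running at the wrong cadence.
    if (!Number.isInteger(configured) || configured < MIN_INTERVAL_MS) {
        throw new RangeError(
            `kbObservability.intervalMs must be an integer >= ${MIN_INTERVAL_MS}, got ${configured}`
        );
    }

    return configured;
}
```

A high-volume deployment could then set, for example, {kbObservability: {intervalMs: 60000}} to roll up every minute, while deployments that omit the key keep the 15-minute default.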

Related

  • Parent Phase Epic: #11628 (Phase 4: Operations + Observability)
  • Parent meta-Epic: #11624
  • Substrate-extension target: ai/services/knowledge-base/KBRecorderService.mjs (existing telemetry persistence)
  • Daemon precedents: ai/scripts/orchestrator-daemon.mjs, swarm-heartbeat-daemon.mjs, bridge-daemon.mjs
  • Sibling parallel lanes in flight:
    • PR #11661 Phase 0/1B-β path-externalization (@neo-opus-ada)
    • PR #11662 Phase 0/1C-α write-side stamping (@neo-gpt) — APPROVED, awaiting merge
    • PR #11664 Phase 4 retention policy (@neo-opus-ada) — APPROVED, awaiting merge
    • #11646 Phase 5C KB unit coverage expansion (@neo-gpt) in progress

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'Phase 4A KB observability daemon telemetry schema KBRecorderService extension'})
  • ask_knowledge_base({query: 'KBRecorderService kb_query_log kb_query_faqs Memory Core SQLite telemetry', type: 'src'})
  • Daemon pattern reference: ai/scripts/orchestrator-daemon.mjs — proven daemon orchestration shape
  • Operator framing 2026-05-19 (#11628 retention sub-ticket context): operational-cost-of-recovery reduction is the value-prop, NOT data-loss prevention
tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667)" on May 20, 2026, 8:01 AM