Frontmatter
id: 11639
title: Phase 4A — Per-Tenant Ingestion Observability Daemon (KBRecorderService Extension)
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-4-7
createdAt: May 19, 2026, 1:56 PM
updatedAt: May 21, 2026, 8:03 AM
githubUrl: https://github.com/neomjs/neo/issues/11639
author: neo-opus-4-7
commentsCount: 2
parentIssue: 11628
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking:
  [x] 11642 Phase 4D — Operator Alerting Surface: Telemetry Thresholds → A2A + External Notification
  [x] 11640 Phase 4B — Manifest Reconciliation Daemon: Tenant-State vs Chroma-Actual Sync
closedAt: May 21, 2026, 8:03 AM

Phase 4A — Per-Tenant Ingestion Observability Daemon (KBRecorderService Extension)

neo-opus-4-7 commented on May 19, 2026, 1:56 PM

Context

Sub-issue of Phase 4 Epic #11628 (meta-Epic #11624).

Extends the existing KBRecorderService for multi-tenant telemetry. No new database: it reuses the Memory Core SQLite substrate, following the KBRecorderService precedent.

The Problem

Once the Phase 2 push pipeline ships, cloud operators have no visibility into per-tenant ingestion health: push frequency, error rates, ingestion latency, embedding-budget burn, schema-version drift, and so on. Without this observability, operators cannot:

  • Detect tenant abuse (excessive pushes, embedding budget exhaustion)
  • Detect tenant errors (silent push failures, schema mismatches)
  • Plan capacity (per-tenant chunk growth rates)
  • Surface health in operator dashboards

The Fix

Extend KBRecorderService.mjs (which currently captures KB QUERY telemetry and projects it into kb_query_faqs) to also capture INGESTION telemetry:

  1. New SQLite table kb_ingestion_events in the shared Memory Core database:
    • tenantId, agentIdentity, eventType (push/parse/embed/error), timestamp, chunkCount, durationMs, bytesIngested, errorCode?, schemaVersion
  2. KnowledgeBaseIngestionService.ingestSourceFiles (Phase 2A) emits telemetry events to KBRecorderService.recordIngestionEvent({...})
  3. Aggregation projection (similar to kb_query_faqs): per-tenant rolling-window metrics materialized into a kb_tenant_ingestion_health table
    • Per-tenant push frequency (events/hour)
    • Per-tenant error rate (errors/total)
    • Per-tenant chunk growth rate
    • Per-tenant embedding-budget burn (chunks × embedding cost)
  4. Daemon process: ai/scripts/kb-observability-daemon.mjs (a sibling of the existing daemons) periodically runs the aggregation projection and writes the results to the health section of sandman_handoff.md
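A minimal sketch of items 1 to 3, assuming events are plain objects carrying the fields listed in item 1 and that the projection runs in memory over an array rather than as SQL against the Memory Core database. The names `normalizeIngestionEvent`, `projectTenantHealth`, `windowMs` and `embeddingCostPerChunk` are hypothetical, as are the exact metric definitions (for example, chunk growth is approximated as embedded chunks per hour).

```javascript
// Hypothetical sketch; not the actual KBRecorderService API.
const EVENT_TYPES = new Set(['push', 'parse', 'embed', 'error']);
const HOUR_MS     = 3_600_000;

/**
 * Validate and normalize one ingestion event, using the field names from item 1.
 * Throws on missing required fields so malformed telemetry fails loudly.
 */
function normalizeIngestionEvent(raw) {
    const {tenantId, agentIdentity, eventType, timestamp} = raw;

    if (!tenantId || !agentIdentity) {
        throw new Error('tenantId and agentIdentity are required');
    }
    if (!EVENT_TYPES.has(eventType)) {
        throw new Error(`Unknown eventType: ${eventType}`);
    }
    if (!Number.isFinite(timestamp)) {
        throw new Error('timestamp must be epoch milliseconds');
    }

    return {
        tenantId,
        agentIdentity,
        eventType,
        timestamp,
        chunkCount   : raw.chunkCount    ?? 0,
        durationMs   : raw.durationMs    ?? 0,
        bytesIngested: raw.bytesIngested ?? 0,
        errorCode    : raw.errorCode     ?? null, // optional field
        schemaVersion: raw.schemaVersion ?? null
    };
}

/**
 * Rolling-window projection: one health row per tenant over [now - windowMs, now].
 * Assumed metric definitions: push frequency counts 'push' events, error rate is
 * 'error' events over all events, chunk growth sums chunkCount of 'embed' events.
 */
function projectTenantHealth(events, {now = Date.now(), windowMs = HOUR_MS, embeddingCostPerChunk = 1} = {}) {
    const windowHours = windowMs / HOUR_MS;
    const since       = now - windowMs;
    const byTenant    = new Map();

    for (const e of events) {
        if (e.timestamp < since || e.timestamp > now) continue; // outside the rolling window

        let acc = byTenant.get(e.tenantId);
        if (!acc) {
            acc = {total: 0, pushes: 0, errors: 0, embeddedChunks: 0};
            byTenant.set(e.tenantId, acc);
        }

        acc.total++;
        if (e.eventType === 'push')  acc.pushes++;
        if (e.eventType === 'error') acc.errors++;
        if (e.eventType === 'embed') acc.embeddedChunks += e.chunkCount;
    }

    // Every tenant in the map has total >= 1, so the error-rate division is safe.
    return [...byTenant].map(([tenantId, acc]) => ({
        tenantId,
        pushesPerHour      : acc.pushes / windowHours,
        errorRate          : acc.errors / acc.total,
        chunksPerHour      : acc.embeddedChunks / windowHours,
        embeddingBudgetBurn: acc.embeddedChunks * embeddingCostPerChunk,
        windowStart        : since,
        windowEnd          : now
    }));
}
```

Writing the returned rows into kb_tenant_ingestion_health (for example, one upsert per tenant per window) is omitted here; the sketch only shows the arithmetic a projection of this kind would perform.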

Acceptance Criteria

  • kb_ingestion_events table schema defined + migration
  • KBRecorderService.recordIngestionEvent method implemented
  • Phase 2A KnowledgeBaseIngestionService emits events at all lifecycle hooks (push received, parse complete, embed complete, error)
  • kb_tenant_ingestion_health aggregation projection materialized
  • Daemon ai/scripts/kb-observability-daemon.mjs exists and follows the existing daemon pattern (orchestrator-daemon precedent)
  • Daemon emits a health summary into sandman_handoff.md under a ## KB Multi-Tenant Health section
  • Unit tests: event recording, aggregation projection, daemon scheduling
  • Integration test: multi-tenant push simulation → daemon aggregation → sandman_handoff.md health section populated
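For the health-section criterion to stay stable across daemon runs, one option is to replace an existing ## KB Multi-Tenant Health section in place rather than appending a new one each cycle. A sketch under stated assumptions: health rows are assumed to carry `tenantId`, `pushesPerHour`, `errorRate`, `chunksPerHour` and `embeddingBudgetBurn`, and `renderHealthSection` and `upsertSection` are hypothetical helper names, not existing neo.mjs APIs.

```javascript
// Heading string taken from the acceptance criteria.
const HEADING = '## KB Multi-Tenant Health';

// Render health rows as a markdown table, sorted by tenant (row shape is an assumption).
function renderHealthSection(rows, generatedAt) {
    const lines = [
        HEADING,
        '',
        `_Generated ${new Date(generatedAt).toISOString()}_`,
        '',
        '| Tenant | Pushes/h | Error rate | Chunks/h | Embedding burn |',
        '|---|---|---|---|---|'
    ];

    for (const r of [...rows].sort((x, y) => x.tenantId.localeCompare(y.tenantId))) {
        lines.push(`| ${r.tenantId} | ${r.pushesPerHour.toFixed(1)} | ${(r.errorRate * 100).toFixed(1)}% | ${r.chunksPerHour.toFixed(1)} | ${r.embeddingBudgetBurn} |`);
    }
    if (rows.length === 0) {
        lines.push('| (no ingestion events in window) | | | | |');
    }
    return lines.join('\n');
}

// Replace the section (up to the next "## " heading or end of file) if present; else append it.
function upsertSection(markdown, section) {
    const lines = markdown.split('\n');
    const start = lines.indexOf(HEADING);

    if (start === -1) {
        const base = markdown.replace(/\s+$/, '');
        return (base ? base + '\n\n' : '') + section + '\n';
    }

    let end = start + 1;
    while (end < lines.length && !lines[end].startsWith('## ')) end++;

    const after = lines.slice(end);
    // Keep one blank line before the following heading, or a single trailing newline at EOF.
    return [...lines.slice(0, start), ...section.split('\n'), ...(after.length ? ['', ...after] : [''])].join('\n');
}
```

A daemon tick could then read sandman_handoff.md with `fs.promises.readFile`, pass the text through `upsertSection`, and write it back with `fs.promises.writeFile`; because the upsert is idempotent, the integration test can run the daemon repeatedly and still expect exactly one health section.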

Out of Scope

  • Operator alerting on thresholds → Phase 4D
  • Reconciliation logic → Phase 4B
  • Stale-chunk GC → Phase 4C
  • External dashboards / Grafana / Prometheus (sandman_handoff + portal app are the V1 substrate; external dashboards come later)

Related

  • Parent: #11628
  • Blocked-by: Phase 2A (#TBD — KnowledgeBaseIngestionService must emit events)
  • Daemon pattern precedent: ai/scripts/orchestrator-daemon.mjs, swarm-heartbeat-daemon.mjs, bridge-daemon.mjs
  • Substrate extension: KBRecorderService.mjs (existing telemetry collector)
  • Sandman integration: GoldenPathSynthesizer.renderConsumerFrictionSection pattern (PR #11622 sibling)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • KBRecorderService.mjs is the extension target — read it first
  • GapInferenceEngine (referenced in KBRecorderService) is the projection-consumer architectural pattern
  • ai/scripts/orchestrator-daemon.mjs is the daemon-scheduling pattern reference
tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667)" on May 20, 2026, 8:01 AM
tobiu referenced in commit ad4a108 - "feat(ai): KB Multi-Tenant Health section in Sandman handoff (#11639) (#11708)" on May 21, 2026, 8:03 AM
tobiu closed this issue on May 21, 2026, 8:03 AM