Context
Sub-issue of Phase 4 Epic #11628 (meta-Epic #11624).
Extends the existing KBRecorderService for multi-tenant telemetry. No new database: it reuses the Memory Core SQLite substrate, following the KBRecorderService precedent.
The Problem
Once the Phase 2 push pipeline ships, cloud operators will have no visibility into per-tenant ingestion health: push frequency, error rates, ingestion latency, embedding-budget burn, and schema-version drift. Without that observability, operators can't:
- Detect tenant abuse (excessive pushes, embedding budget exhaustion)
- Detect tenant errors (silent push failures, schema mismatches)
- Plan capacity (per-tenant chunk growth rates)
- Surface health in operator dashboards
The Fix
Extend KBRecorderService.mjs (currently captures KB QUERY telemetry → projects to kb_query_faqs) to also capture INGESTION telemetry:
- New SQLite table kb_ingestion_events in the shared Memory Core database: tenantId, agentIdentity, eventType (push/parse/embed/error), timestamp, chunkCount, durationMs, bytesIngested, errorCode?, schemaVersion
- KnowledgeBaseIngestionService.ingestSourceFiles (Phase 2A) emits telemetry events to KBRecorderService.recordIngestionEvent({...})
- Aggregation projection (similar to kb_query_faqs): per-tenant rolling-window metrics materialized into the kb_tenant_ingestion_health table
  - Per-tenant push frequency (events/hour)
  - Per-tenant error rate (errors/total)
  - Per-tenant chunk growth rate
  - Per-tenant embedding-budget burn (chunks × embedding cost)
- Daemon process: ai/scripts/kb-observability-daemon.mjs (sibling to existing daemons) runs the aggregation projection periodically + writes to the sandman_handoff.md health section
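A minimal sketch of the recording and projection steps listed above, to make the data flow concrete. Only the table name, the column names, the event types, and recordIngestionEvent come from this issue. Everything else is an assumption for illustration: the column types, the standalone-function shape (the real method would live on KBRecorderService), the in-memory array standing in for the SQLite table, the computeTenantHealth helper, its option names, and the flat costPerChunk. Push frequency is read here as push events per hour and chunk growth as embedded chunks per hour; the issue does not pin down either definition.

```javascript
// Hypothetical sketch: not the existing KBRecorderService API.

// Assumed DDL for kb_ingestion_events (column types are guesses).
const CREATE_EVENTS_TABLE = `
CREATE TABLE IF NOT EXISTS kb_ingestion_events (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    tenantId      TEXT    NOT NULL,
    agentIdentity TEXT    NOT NULL,
    eventType     TEXT    NOT NULL CHECK (eventType IN ('push', 'parse', 'embed', 'error')),
    timestamp     INTEGER NOT NULL, -- epoch milliseconds
    chunkCount    INTEGER NOT NULL DEFAULT 0,
    durationMs    INTEGER NOT NULL DEFAULT 0,
    bytesIngested INTEGER NOT NULL DEFAULT 0,
    errorCode     TEXT,             -- expected only on error events
    schemaVersion TEXT    NOT NULL
)`;

const EVENT_TYPES = new Set(['push', 'parse', 'embed', 'error']);

// Validates and normalizes one event, then appends it to `store`.
// `store` is anything with push(); an array stands in for the SQLite table.
function recordIngestionEvent(store, event) {
    if (!event.tenantId) {
        throw new Error('tenantId is required');
    }
    if (!EVENT_TYPES.has(event.eventType)) {
        throw new Error(`unknown eventType: ${event.eventType}`);
    }

    const row = {
        tenantId: event.tenantId,
        agentIdentity: event.agentIdentity ?? 'unknown',
        eventType: event.eventType,
        timestamp: event.timestamp ?? Date.now(),
        chunkCount: event.chunkCount ?? 0,
        durationMs: event.durationMs ?? 0,
        bytesIngested: event.bytesIngested ?? 0,
        errorCode: event.errorCode ?? null,
        schemaVersion: event.schemaVersion ?? 'unknown'
    };

    store.push(row);
    return row;
}

// Projects raw events into one health row per tenant for the rolling
// window [now - windowMs, now]. costPerChunk is an assumed flat cost.
function computeTenantHealth(events, {now, windowMs, costPerChunk}) {
    const hours = windowMs / 3_600_000;
    const perTenant = new Map();

    for (const e of events) {
        if (e.timestamp < now - windowMs || e.timestamp > now) continue;

        let t = perTenant.get(e.tenantId);
        if (!t) {
            t = {tenantId: e.tenantId, total: 0, pushes: 0, errors: 0, embeddedChunks: 0};
            perTenant.set(e.tenantId, t);
        }

        t.total++;
        if (e.eventType === 'push') t.pushes++;
        if (e.eventType === 'error') t.errors++;
        if (e.eventType === 'embed') t.embeddedChunks += e.chunkCount;
    }

    return [...perTenant.values()].map(t => ({
        tenantId: t.tenantId,
        pushesPerHour: t.pushes / hours,
        errorRate: t.errors / t.total,
        chunksPerHour: t.embeddedChunks / hours,
        embeddingBurn: t.embeddedChunks * costPerChunk
    }));
}
```

In this shape, the daemon would call computeTenantHealth on a timer and upsert each returned row into kb_tenant_ingestion_health. Keeping the projection a pure function of (events, window) makes it testable without a database, though an SQL GROUP BY over kb_ingestion_events would be an equally reasonable choice.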
Acceptance Criteria
- kb_ingestion_events table schema defined + migration
- KBRecorderService.recordIngestionEvent method implemented
- KnowledgeBaseIngestionService emits events at all lifecycle hooks (push received, parse complete, embed complete, error)
- kb_tenant_ingestion_health aggregation projection materialized
- ai/scripts/kb-observability-daemon.mjs exists; follows existing daemon pattern (orchestrator-daemon precedent)
- Daemon writes to sandman_handoff.md under the ## KB Multi-Tenant Health section
Out of Scope
- Operator alerting on thresholds → Phase 4D
- Reconciliation logic → Phase 4B
- Stale-chunk GC → Phase 4C
- External dashboards / Grafana / Prometheus (sandman_handoff + portal app is the V1 substrate; external dashboards come later)
Related
- Parent: #11628
- Blocked-by: Phase 2A (#TBD; KnowledgeBaseIngestionService must emit events)
- Daemon pattern precedent: ai/scripts/orchestrator-daemon.mjs, swarm-heartbeat-daemon.mjs, bridge-daemon.mjs
- Substrate extension: KBRecorderService.mjs (existing telemetry collector)
- Sandman integration: GoldenPathSynthesizer.renderConsumerFrictionSection pattern (PR #11622 sibling)
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
- KBRecorderService.mjs is the extension target; read it first
- GapInferenceEngine (referenced in KBRecorderService) is the projection-consumer architectural pattern
- ai/scripts/orchestrator-daemon.mjs is the daemon-scheduling pattern reference