Context
Phase 4A sub-ticket of #11628 (Operations + Observability for Cloud-Native KB Deployments). Surfaced 2026-05-20 during nightshift-mode lane-pickup after parallel-track work on #11631 (Phase 0/1C-α write-side stamping by @neo-gpt) + #11660 (Phase 0/1B-β path externalization by @neo-opus-ada) + #11663 (Phase 4 retention policy by @neo-opus-ada).
Pre-Phase-2-actionable per #11628 body escape clause:
"Operator alerting + dashboard surfacing may begin in parallel once Phase 0/1 #11625 contracts land (daemon-scaffolding can start pre-Phase-2)."
Phase 0/1A schemas + 0/1B registry are already on dev; Phase 0/1B-β paths (PR #11661) + 0/1C-α write-side stamping (PR #11662) are in @tobiu's merge gate. The daemon SHELL + TELEMETRY SCHEMA can land before Phase 2 #11626 ships its ingestion-service hooks — actual live-integration with Phase 2 ingestion calls is deferred to a follow-up sub-ticket.
The Problem
After Phase 0/1 + Phase 0/1C-α land, cloud Agent OS deployments have:
- Stable parsed-chunk-v1 + backup-record-v1 schemas
- Source/Parser registry + per-source path config
- Write-side tenant stamping with tenantId / repoSlug / visibility / originAgentIdentity
- Configurable bundle retention
What's still missing for operational visibility:
- No per-tenant ingestion metrics — push frequency, error rates, ingestion latency, embedding-budget burn
- No telemetry schema for what metrics get persisted
- No daemon orchestration that wakes periodically to roll up + persist tenant-scoped metrics
The Phase 4 #11628 ticket's value-prop framing (operational-cost-of-recovery reduction, NOT data-loss prevention) anchors what 4A should collect.
The Architectural Reality
Per #11628 Architectural Substrate Precedent — Phase 4A EXTENDS KBRecorderService.mjs rather than introducing a new substrate. KBRecorderService already persists every KB MCP tool invocation into the shared Memory Core SQLite database. Phase 4A adds:
- Per-tenant scoping: extend the existing kb_query_log schema with tenant-aware columns (tenantId, repoSlug, visibility) OR add a sibling table kb_ingestion_metrics for ingestion-specific counts.
- Daemon orchestration: new daemon at ai/scripts/kb-observability-daemon.mjs following the orchestrator-daemon.mjs / swarm-heartbeat-daemon.mjs pattern. Wakes on schedule; rolls up metrics; persists to MC SQLite.
- Telemetry-schema doc: design doc at learn/agentos/decisions/0NNN-kb-ingestion-telemetry-schema.md (ADR) defining the persistence contract so Phase 4B reconciliation + Phase 4D alerting consume a stable schema.
The Fix
1. Telemetry-schema ADR
Create learn/agentos/decisions/0NNN-kb-ingestion-telemetry-schema.md:
- What metrics are collected (per-tenant)
- Schema shape (column names, types, indexing)
- Retention policy (per-tenant — different from Phase 4 bundle retention)
- Consumer expectations (Phase 4B reconciliation, Phase 4D alerting, operator dashboards)
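One way to make the ADR's "schema shape" concrete is a small record factory that rejects incomplete rows. The sketch below is illustrative only. The scope fields tenantId, repoSlug, and visibility come from this ticket; the metric field names (pushCount, errorCount, p95LatencyMs, embeddingTokensUsed) and windowStart are assumptions derived from the metric categories listed under The Problem, and the ADR would fix the real names and types.

```javascript
// Hypothetical per-tenant telemetry record; metric field names are assumptions, not the ADR's contract.
const REQUIRED_SCOPE_FIELDS = ['tenantId', 'repoSlug', 'visibility']; // named in the ticket
const NUMERIC_METRIC_FIELDS = ['pushCount', 'errorCount', 'p95LatencyMs', 'embeddingTokensUsed']; // assumed

function createTelemetryRecord(input) {
    // Every record must be tenant-scoped
    for (const field of REQUIRED_SCOPE_FIELDS) {
        if (typeof input[field] !== 'string' || input[field].length === 0) {
            throw new TypeError(`Missing or empty scope field: ${field}`);
        }
    }

    const record = {
        tenantId   : input.tenantId,
        repoSlug   : input.repoSlug,
        visibility : input.visibility,
        windowStart: input.windowStart ?? new Date().toISOString()
    };

    // Metrics default to 0 and must be non-negative finite numbers
    for (const field of NUMERIC_METRIC_FIELDS) {
        const value = input[field] ?? 0;
        if (!Number.isFinite(value) || value < 0) {
            throw new RangeError(`Metric ${field} must be a non-negative number`);
        }
        record[field] = value;
    }

    return Object.freeze(record);
}
```

Freezing the record is a design choice for the sketch: consumers such as Phase 4B reconciliation or Phase 4D alerting would read rolled-up rows, not mutate them.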
2. SQLite schema extension
Either:
- (a) Extend kb_query_log with tenantId, repoSlug, visibility columns (add migration)
- (b) Add sibling kb_ingestion_metrics table for ingestion-specific counts
Choice deferred to implementation — extension is simpler if metrics are query-call-shaped; sibling table is cleaner if metrics are roll-up-shaped.
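The two options can be compared side by side as SQL. The sketch below only builds the statements as strings and does not touch a database. The column types, the metric columns of kb_ingestion_metrics, and the index name are assumptions, since the ticket defers those details to implementation.

```javascript
// Tenant-aware columns named in the ticket; the TEXT type is an assumption.
const TENANT_COLUMNS = ['tenantId', 'repoSlug', 'visibility'];

// Option (a): extend kb_query_log. SQLite adds one column per ALTER TABLE statement.
function buildExtendQueryLogMigration() {
    return TENANT_COLUMNS.map(column => `ALTER TABLE kb_query_log ADD COLUMN ${column} TEXT;`);
}

// Option (b): sibling roll-up table. Metric columns and index name are hypothetical.
function buildIngestionMetricsTable() {
    return [
        `CREATE TABLE IF NOT EXISTS kb_ingestion_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ${TENANT_COLUMNS.map(c => `${c} TEXT NOT NULL`).join(',\n    ')},
    windowStart TEXT NOT NULL,
    pushCount INTEGER NOT NULL DEFAULT 0,
    errorCount INTEGER NOT NULL DEFAULT 0
);`,
        'CREATE INDEX IF NOT EXISTS idx_kb_ingestion_metrics_tenant ON kb_ingestion_metrics (tenantId, windowStart);'
    ];
}
```

Option (a) leaves the new columns nullable so existing kb_query_log rows stay valid after the migration; option (b) can require them because the table starts empty.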
3. Daemon shell
ai/scripts/kb-observability-daemon.mjs:
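import {KBRecorderService} from '../services/knowledge-base/KBRecorderService.mjs';
import mcConfig from '../mcp/server/memory-core/config.mjs';

// Wake on schedule (default: every 15 minutes).
// Roll up tenant-scoped metrics from kb_query_log (or sibling).
// Persist to a roll-up table for Phase 4D alerting consumption.
export async function runObservabilityDaemon({intervalMs = 15 * 60 * 1000, oneShot = false} = {}) {
    // ... implementation
}

// Run only when executed directly (not when imported); --once selects one-shot mode.
if (import.meta.url === `file://${process.argv[1]}`) {
    runObservabilityDaemon({oneShot: process.argv.includes('--once')}).catch(err => {
        console.error('[kb-observability-daemon] failed:', err);
        process.exit(1);
    });
}
4. Operator-facing CLI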
npm run ai:kb-observability entry in package.json (one-shot mode for operator-driven manual rollup).
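A minimal sketch of how the one-shot and scheduled modes could share a single code path. runLoop, rollupOnce, sleep, and shouldStop are hypothetical names introduced for illustration (the last two are injected so the loop can be tested without real timers); the actual daemon may be structured differently.

```javascript
// Default sleep based on setTimeout; tests can inject a fake instead.
const defaultSleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Runs rollupOnce a single time in one-shot mode, otherwise repeats every intervalMs
// until shouldStop() returns true. Resolves with the number of rollup attempts.
async function runLoop({
    rollupOnce,
    intervalMs = 15 * 60 * 1000,
    oneShot    = false,
    sleep      = defaultSleep,
    shouldStop = () => false
}) {
    let runs = 0;

    do {
        try {
            await rollupOnce();
        } catch (err) {
            // One-shot mode surfaces the failure to the operator; scheduled mode logs and keeps going
            if (oneShot) throw err;
            console.error('[kb-observability-daemon] rollup failed:', err);
        }

        runs++;

        if (oneShot || shouldStop()) break;
        await sleep(intervalMs);
    } while (!shouldStop());

    return runs;
}
```

With this shape, the package.json entry would only need to pass the --once flag through to select one-shot mode, as the daemon shell already does.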
5. Tests
- Unit: kb-observability-daemon.spec.mjs verifies rollup logic against synthetic kb_query_log rows with mixed tenants.
- Integration: synthetic multi-tenant fixture (reuse Phase 5 test fixtures once they land) — deferred to follow-up if Phase 5 fixtures aren't ready.
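The unit test described above implies a pure rollup function that can be exercised without a database. A minimal sketch, assuming each synthetic kb_query_log row carries tenantId plus hypothetical isError and durationMs fields (the real column names are not specified in this ticket):

```javascript
// Groups rows by tenantId and computes call count, error rate, and mean duration per tenant.
function rollupByTenant(rows) {
    const totals = new Map();

    for (const row of rows) {
        const entry = totals.get(row.tenantId) ?? {calls: 0, errors: 0, totalDurationMs: 0};
        entry.calls++;
        if (row.isError) entry.errors++;
        entry.totalDurationMs += row.durationMs;
        totals.set(row.tenantId, entry);
    }

    return [...totals].map(([tenantId, {calls, errors, totalDurationMs}]) => ({
        tenantId,
        calls,
        errorRate     : errors / calls,
        meanDurationMs: totalDurationMs / calls
    }));
}

// Synthetic rows with mixed tenants, in the spirit of the planned spec file
const syntheticRows = [
    {tenantId: 'tenant-a', isError: false, durationMs: 100},
    {tenantId: 'tenant-b', isError: true,  durationMs: 300},
    {tenantId: 'tenant-a', isError: true,  durationMs: 200}
];

const rollup = rollupByTenant(syntheticRows);
```

Keeping the rollup free of I/O means the spec can assert on plain arrays, while the daemon wraps it with the SQLite read and write.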
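Acceptance Criteria
- Telemetry-schema ADR filed under learn/agentos/decisions/ (number assigned at file time)
- SQLite schema decision recorded (extending kb_query_log vs creating kb_ingestion_metrics)
- ai/scripts/kb-observability-daemon.mjs shell exists, follows existing daemon pattern, wakes on schedule
- npm run ai:kb-observability -- --once exists for manual rollup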
Out of Scope
- Phase 2 live-integration hooks — Phase 2 #11626 doesn't exist yet; integration follows in Phase 4A-β post-Phase-2.
- Phase 4D alerting surface — separate sub-ticket. This phase collects + persists telemetry; alerting consumes it.
- Phase 4B reconciliation daemon — separate sub-ticket. Different concern (state-drift detection vs metrics rollup).
- Phase 4C stale-chunk GC daemon — separate sub-ticket. Different concern.
- Per-tenant SLA / quota enforcement — V1 surfaces the data; SLA enforcement is post-V1 commercialization scope per #11628 OOS.
- Cross-deployment fleet management — single-deployment scope for V1.
Avoided Traps
| Trap | Why rejected |
| --- | --- |
| Building before Phase 2 ingestion-service exists | This sub-ticket scopes to scaffolding + schema only; live-integration deferred. The shell + schema ARE pre-Phase-2-actionable per #11628 escape clause. |
| New telemetry database | KBRecorderService already uses Memory Core SQLite; reuse the substrate. |
| Dashboard infrastructure | Out of scope for daemon work; sandman_handoff or portal app extension is Phase 4D territory. |
| Bundling rollup + alerting into one daemon | Separate concerns; testability + future modification budget benefit from separate daemons (per #11628 Avoided Traps). |
| Roll-up cadence baked into the daemon | Cadence becomes aiConfig.kbObservability.intervalMs config so operators can tune for high-volume vs low-volume deployments. |
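A sketch of how the daemon could resolve its cadence from config with a safe fallback. The key aiConfig.kbObservability.intervalMs comes from the table above; resolveIntervalMs, the one-minute floor, and the fallback behaviour are assumptions for illustration, and the 15-minute default mirrors the daemon shell.

```javascript
const DEFAULT_INTERVAL_MS = 15 * 60 * 1000; // matches the default in the daemon shell
const MIN_INTERVAL_MS     = 60 * 1000;      // assumed floor to avoid a busy loop

// Returns the configured interval, falling back to the default when the value is absent or invalid.
function resolveIntervalMs(aiConfig = {}) {
    const configured = aiConfig.kbObservability?.intervalMs;

    if (!Number.isFinite(configured) || configured <= 0) {
        return DEFAULT_INTERVAL_MS;
    }

    return Math.max(configured, MIN_INTERVAL_MS);
}
```

A high-volume deployment might set the interval to five minutes, while a low-volume one could leave the key unset and inherit the default.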
Related
- Parent Phase Epic: #11628 (Phase 4: Operations + Observability)
- Parent meta-Epic: #11624
- Substrate-extension target: ai/services/knowledge-base/KBRecorderService.mjs (existing telemetry persistence)
- Daemon precedents: ai/scripts/orchestrator-daemon.mjs, swarm-heartbeat-daemon.mjs, bridge-daemon.mjs
- Sibling parallel lanes in flight:
  - PR #11661 Phase 0/1B-β path-externalization (@neo-opus-ada)
  - PR #11662 Phase 0/1C-α write-side stamping (@neo-gpt): APPROVED, awaiting merge
  - PR #11664 Phase 4 retention policy (@neo-opus-ada): APPROVED, awaiting merge
  - #11646 Phase 5C KB unit coverage expansion (@neo-gpt): in progress
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
- query_raw_memories({query: 'Phase 4A KB observability daemon telemetry schema KBRecorderService extension'})
- ask_knowledge_base({query: 'KBRecorderService kb_query_log kb_query_faqs Memory Core SQLite telemetry', type: 'src'})
- Daemon pattern reference: ai/scripts/orchestrator-daemon.mjs (proven daemon orchestration shape)
- Operator framing 2026-05-19 (#11628 retention sub-ticket context): operational-cost-of-recovery reduction is the value-prop, NOT data-loss prevention