Context
Phase 4 sub-Epic of meta-Epic #11624 (Cloud-Native KB Ingestion for External Workspaces). NOT originally in Discussion #11623; surfaced 2026-05-19 during post-graduation operator-directed sub-decomposition review. Operator-mention of "daemons etc." prompted explicit operational-substrate decomposition.
Blocked-by Phase 2 #11626 — most observability/reconciliation work needs the ingestion service + facades stable. Operator alerting + dashboard surfacing may begin in parallel once Phase 0/1 #11625 contracts land (daemon-scaffolding can start pre-Phase-2).
The Problem
Phase 2 ships the push pipeline; Phase 3 ships the operator-facing guide. What's missing for production cloud Agent OS deployments:
- No per-tenant ingestion observability (push frequency, error rates, ingestion latency, embedding-budget burn)
- No periodic state-reconciliation (catches missed tombstones; force-push detection beyond per-push payload)
- No stale-chunk garbage collection (orphans from source removal, parser swap, retention policy)
- No operator alerting surface (quota/error thresholds → A2A or external notification)
These concerns aren't covered by any existing phase. Without them, cloud operators have no visibility into the ingestion fleet's health + no recovery story for tenant state drift.
Architectural Substrate Precedent (V-B-A grounded)
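Neo already has daemon-pattern substrate this Epic builds on:
- ai/scripts/orchestrator-daemon.mjs: cross-daemon orchestration pattern
- ai/scripts/swarm-heartbeat-daemon.mjs: A2A liveness pattern
- ai/scripts/bridge-daemon.mjs: bridge pattern
- ai/services/knowledge-base/KBRecorderService.mjs: already captures KB query telemetry; per its @summary: "persists every KB MCP tool invocation into the shared Memory Core SQLite database, then projects repeated ask_knowledge_base / query_documents questions into kb_query_faqs". Daemon-adjacent; observability-substrate-ready.
- Neo.ai.daemons.services.GapInferenceEngine (referenced in KBRecorderService): the daemon namespace ai.daemons.services confirms the architectural pattern
New observability daemons EXTEND KBRecorderService (multi-tenant telemetry collection) rather than introduce a new substrate; reconciliation + GC daemons follow the orchestrator-daemon pattern.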
KB-as-Cache vs MC-as-Store (load-bearing invariant for Phase 4 daemons)
(Added 2026-05-19 per operator V-B-A on backup substrate symmetry framing post-PR-#11647 merge.)
A structural distinction between the two substrates this Epic operates on:
- Knowledge Base is a cache+index over external sources. Neo's KB content is derivable from the Neo repo via npm run ai:sync-kb (full re-sync regenerates all chunks). Phase 2 cross-tenant ingestion preserves this property: each tenant's content originates from the tenant's repo and is recoverable via tenant-side re-push (hook re-run OR npm run ai:ingest-tenant <tenantId> bulk facade). KB wipe is recoverable from external sources at any v-version. The asymmetry-collapse-at-v13 framing initially proposed was incorrect.
- Memory Core is a primary store. Conversations, agent-thoughts, and session-summaries are unique runtime artifacts with no external source-of-truth. MC wipe between backups IS amnesia — daily-daemon-driven JSONL bundles minimize the amnesia window to ≤24h but cannot eliminate it.
Phase 4 daemon value-prop reset (per the cache vs store distinction)
Phase 4B reconciliation daemon (#11640) value for KB is operational-cost-of-recovery reduction, NOT data-loss prevention. The daemon catches tenant-state drift WITHOUT requiring full re-sync orchestration across N tenants. For MC, the daemon's value is likewise operational-cost reduction: it catches missed delete signals without requiring restore-from-bundle.
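A minimal sketch of the drift check, assuming both the tenant-claimed state and the store-actual state can be reduced to a map from chunk ID to content hash. The function name reconcileTenant and the manifest shape are hypothetical, not existing Neo APIs.

```javascript
// Hypothetical sketch: compare tenant-claimed state against store-actual state.
// Both inputs are Map<chunkId, contentHash>; this shape is an assumption.
function reconcileTenant(claimed, actual) {
    const missing  = []; // claimed by the tenant, absent from the store
    const orphaned = []; // present in the store, no longer claimed (missed tombstone)
    const changed  = []; // present in both, hashes differ (e.g. after a force-push)

    for (const [id, hash] of claimed) {
        if (!actual.has(id)) {
            missing.push(id);
        } else if (actual.get(id) !== hash) {
            changed.push(id);
        }
    }

    for (const id of actual.keys()) {
        if (!claimed.has(id)) {
            orphaned.push(id);
        }
    }

    const inSync = missing.length + orphaned.length + changed.length === 0;

    return {missing, orphaned, changed, inSync};
}

// Example: one chunk never arrived, one is stale, one should have been deleted.
const claimed = new Map([['a', 'h1'], ['b', 'h2'], ['c', 'h3']]);
const actual  = new Map([['a', 'h1'], ['b', 'stale'], ['d', 'h4']]);

console.log(reconcileTenant(claimed, actual));
```

Only the three delta lists need a re-push or a delete, which is where the cost reduction over a full per-tenant re-sync would come from.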
Phase 4C GC daemon (#11641) value is symmetric across KB and MC — both benefit from automated stale-chunk cleanup. The cache/store distinction doesn't change GC semantics.
Phase 4D alerting (#11642) surfaces wipe / drift events for BOTH substrates, but the severity calculus differs: a KB wipe alert is "orchestrate N tenant re-syncs"; an MC wipe alert is "amnesia event — cannot fully recover post-last-backup". Per-substrate severity threshold configuration follows from this.
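One way the per-substrate severity calculus could be expressed is a plain lookup table. The sketch below is illustrative only: the severity levels, the drift rows, and the name classifyEvent are assumptions; only the two wipe descriptions come from the text above.

```javascript
// Hypothetical severity rules keyed by substrate, then by event type.
// Levels ('info' | 'warning' | 'critical') and the drift rows are assumptions.
const SEVERITY_RULES = {
    knowledgeBase: {
        wipe : {severity: 'warning',  action: 'orchestrate N tenant re-syncs'},
        drift: {severity: 'info',     action: 'run reconciliation'}
    },
    memoryCore: {
        wipe : {severity: 'critical', action: 'amnesia event: cannot fully recover post-last-backup'},
        drift: {severity: 'warning',  action: 'run reconciliation'}
    }
};

// Resolves an event to its severity and suggested operator action.
function classifyEvent(substrate, type) {
    const rule = SEVERITY_RULES[substrate]?.[type];

    if (!rule) {
        throw new Error(`No severity rule for ${substrate}/${type}`);
    }

    return {substrate, type, ...rule};
}

console.log(classifyEvent('knowledgeBase', 'wipe'));
console.log(classifyEvent('memoryCore', 'wipe'));
```

Keeping the table as data rather than branching logic would let the thresholds move into per-substrate configuration later without touching the classifier.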
Retention policy implications (per-substrate, not symmetric)
- KB JSONL bundle: lighter retention acceptable (weekly cadence; defrag pre-nuke 7d unchanged) — backup is cost-optimization for re-sync orchestration, not data-loss prevention
- MC JSONL bundle: daily cadence + 3-30 day retention (status quo) — backup IS data-loss prevention
This per-substrate retention asymmetry should be configurable via aiConfig.{knowledgeBase,memoryCore}.backupRetention.* — follow-up ticket scope (deferred to Phase 4 implementation; let implementer-hot-context shape).
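A possible shape for that configuration, as a sketch only. The key names below backupRetention (cadence, maxAgeDays) and the helper isBundleExpired are assumptions, and the KB maxAgeDays value is a placeholder; the final shape is left to the implementer.

```javascript
// Hypothetical config shape; only the aiConfig.{knowledgeBase,memoryCore}.backupRetention
// path comes from the ticket. Key names and the KB maxAgeDays value are placeholders.
const exampleAiConfig = {
    knowledgeBase: {backupRetention: {cadence: 'weekly', maxAgeDays: 14}},
    memoryCore   : {backupRetention: {cadence: 'daily',  maxAgeDays: 30}}
};

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when a bundle is older than the substrate's retention window.
function isBundleExpired(config, substrate, bundleCreatedAt, now = new Date()) {
    const {maxAgeDays} = config[substrate].backupRetention;

    return now.getTime() - bundleCreatedAt.getTime() > maxAgeDays * DAY_MS;
}

const now    = new Date('2026-06-01T00:00:00Z');
const bundle = new Date('2026-05-01T00:00:00Z'); // 31 days old

console.log(isBundleExpired(exampleAiConfig, 'knowledgeBase', bundle, now));
console.log(isBundleExpired(exampleAiConfig, 'memoryCore', bundle, now));
```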
The Fix
Four sub-tickets:
- Phase 4A: Per-tenant ingestion observability daemon (extends KBRecorderService with multi-tenant telemetry; persists per-tenant metrics to Memory Core SQLite; surfaces via existing portal app OR sandman_handoff.md health section)
- Phase 4B: Manifest reconciliation daemon (periodic tenant-claimed-state vs Chroma-actual-state reconciliation; catches missed tombstones; handles force-push edge cases)
- Phase 4C: Stale-chunk garbage collection daemon (detects chunks orphaned by source-config changes, parser swaps, retention-policy expiration)
- Phase 4D: Operator alerting surface (telemetry thresholds → A2A notification OR external channel; per-tenant quota tracking; error-rate alerts)
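To illustrate how 4A and 4D could stay decoupled, the sketch below aggregates per-tenant push metrics in one place and evaluates thresholds in a separate function that only sees the summary. Everything here is hypothetical (TenantMetrics, checkThresholds, the field names); a real daemon would persist to Memory Core SQLite rather than hold state in memory.

```javascript
// Hypothetical in-memory aggregator standing in for the 4A telemetry daemon.
class TenantMetrics {
    constructor() {
        this.byTenant = new Map();
    }

    // Records one push attempt for a tenant.
    recordPush(tenantId, {ok, latencyMs, embeddingTokens = 0}) {
        const m = this.byTenant.get(tenantId) ??
            {pushes: 0, errors: 0, totalLatencyMs: 0, embeddingTokens: 0};

        m.pushes += 1;
        if (!ok) {
            m.errors += 1;
        }
        m.totalLatencyMs  += latencyMs;
        m.embeddingTokens += embeddingTokens;

        this.byTenant.set(tenantId, m);
    }

    // Returns derived metrics for a tenant, or null if nothing was recorded.
    summary(tenantId) {
        const m = this.byTenant.get(tenantId);

        if (!m) {
            return null;
        }

        return {
            pushes         : m.pushes,
            errorRate      : m.errors / m.pushes,
            avgLatencyMs   : m.totalLatencyMs / m.pushes,
            embeddingTokens: m.embeddingTokens
        };
    }
}

// Hypothetical 4D-side rule check; consumes only the summary object.
function checkThresholds(summary, {maxErrorRate, maxEmbeddingTokens}) {
    const alerts = [];

    if (summary.errorRate > maxErrorRate) {
        alerts.push('error-rate');
    }
    if (summary.embeddingTokens > maxEmbeddingTokens) {
        alerts.push('embedding-budget');
    }

    return alerts;
}

const metrics = new TenantMetrics();
metrics.recordPush('tenant-a', {ok: true,  latencyMs: 100, embeddingTokens: 500});
metrics.recordPush('tenant-a', {ok: false, latencyMs: 300});

console.log(checkThresholds(metrics.summary('tenant-a'), {maxErrorRate: 0.2, maxEmbeddingTokens: 10000}));
```

Because checkThresholds takes a plain summary, it can be unit-tested and deployed independently of the collector.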
Acceptance Criteria
Cross-phase ACs:
- Daemons follow the ai/scripts/ daemon pattern (orchestrator-daemon precedent)
Out of Scope
- New dashboard infrastructure (extend portal app or sandman_handoff; don't build standalone)
- Per-tenant SLA / quota enforcement engine (this Epic surfaces the data; SLA enforcement is post-V1 commercialization scope)
- Cross-deployment fleet management (single-deployment scope for V1; multi-deployment is separate substrate)
- ML-driven anomaly detection (rule-based thresholds for V1; ML can be layered later)
Avoided Traps
| Trap | Why rejected |
| --- | --- |
| New telemetry database | KBRecorderService already uses Memory Core SQLite; reuse the substrate |
| Standalone dashboard infrastructure | Portal app + sandman_handoff exist; surface there rather than fork the operability story |
| Reconciliation as user-on-demand only | Force-push + missed-tombstone classes need PROACTIVE detection; on-demand reconciliation misses production-class failures |
| Mixing observability + alerting in one daemon | Concerns separate: telemetry collection ≠ threshold alerting; split into independent daemons for testability |
| Building before Phase 2 service exists | Observability daemon needs ingestion-service hooks; reconciliation needs push-pipeline state. Sequence: Phase 2 first, then Phase 4 (most subs blocked-by Phase 2) |
Related
- Parent meta-Epic: #11624
- Blocked-by: Phase 2 #11626 (ingestion service + facades must stabilize)
- Sibling phases: Phase 0/1 #11625, Phase 3 #11627
- Daemon precedents: orchestrator-daemon, swarm-heartbeat-daemon, bridge-daemon
- Telemetry substrate: KBRecorderService.mjs (extension target)
- Origin Discussion: #11623 (Phase 4 not in original §7 decomposition; surfaced post-graduation)
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
- query_raw_memories({query: 'Phase 4 KB ingestion observability daemon multi-tenant'})
- ask_knowledge_base({query: 'KBRecorderService GapInferenceEngine daemon pattern', type: 'src'})
- Existing daemons: ai/scripts/orchestrator-daemon.mjs, swarm-heartbeat-daemon.mjs, bridge-daemon.mjs are pattern references
- Operator framing 2026-05-19: "future sessions have amnesia ... better iron out subs now while context is hot" — applied to surfacing operational-substrate that the Discussion didn't decompose