Frontmatter
id: 11628
title: KB Ingestion Phase 4: Operations + Observability for Cloud-Native Deployments
state: Closed
labels: epic, ai, architecture
assignees: []
createdAt: May 19, 2026, 1:52 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11628
author: neo-opus-ada
commentsCount: 5
parentIssue: 11624
subIssues:
11639 Phase 4A — Per-Tenant Ingestion Observability Daemon (KBRecorderService Extension)
11640 Phase 4B — Manifest Reconciliation Daemon: Tenant-State vs Chroma-Actual Sync
11641 Phase 4C — Stale-Chunk Garbage Collection Daemon: Orphan Detection + Retention Enforcement
11642 Phase 4D — Operator Alerting Surface: Telemetry Thresholds → A2A + External Notification
11649 Per-substrate retention policy configuration for KB/MC backup mechanisms
11663 KB Ingestion Phase 4: Configurable bundle retention policy via aiConfig.backupRetention
11665 KB Ingestion Phase 4A: Multi-tenant ingestion observability daemon scaffold + telemetry schema
11711 KB reconciliation: force-push & manifest-orphan drift detection
11716 Audit kb-config daemon readability under GraphService RLS
subIssuesCompleted: 9
subIssuesTotal: 9
blockedBy: [x] 11626 KB Ingestion Phase 2: Ingestion Service + MCP Small-Batch Facade + Bulk Facade
blocking: []
closedAt: May 21, 2026, 3:00 PM

KB Ingestion Phase 4: Operations + Observability for Cloud-Native Deployments

Closed v13.0.0/archive-v13-0-0-chunk-12 epic ai architecture
neo-opus-ada commented on May 19, 2026, 1:52 PM

Context

Phase 4 sub-Epic of meta-Epic #11624 (Cloud-Native KB Ingestion for External Workspaces). NOT originally in Discussion #11623; surfaced 2026-05-19 during post-graduation operator-directed sub-decomposition review. Operator-mention of "daemons etc." prompted explicit operational-substrate decomposition.

Blocked-by Phase 2 #11626 — most observability/reconciliation work needs the ingestion service + facades stable. Operator alerting + dashboard surfacing may begin in parallel once Phase 0/1 #11625 contracts land (daemon-scaffolding can start pre-Phase-2).

The Problem

Phase 2 ships the push pipeline; Phase 3 ships the operator-facing guide. What's missing for production cloud Agent OS deployments:

  • No per-tenant ingestion observability (push frequency, error rates, ingestion latency, embedding-budget burn)
  • No periodic state-reconciliation (catches missed tombstones; force-push detection beyond per-push payload)
  • No stale-chunk garbage collection (orphans from source removal, parser swap, retention policy)
  • No operator alerting surface (quota/error thresholds → A2A or external notification)

These concerns aren't covered by any existing phase. Without them, cloud operators have no visibility into the ingestion fleet's health + no recovery story for tenant state drift.

Architectural Substrate Precedent (V-B-A grounded)

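Neo already has daemon-pattern substrate this Epic builds on: the orchestrator-daemon, swarm-heartbeat-daemon, and bridge-daemon scripts in ai/scripts/ (pattern references), plus KBRecorderService as the telemetry substrate.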

New observability daemons EXTEND KBRecorderService (multi-tenant telemetry collection) rather than introduce a new substrate; reconciliation + GC daemons follow the orchestrator-daemon pattern.

KB-as-Cache vs MC-as-Store (load-bearing invariant for Phase 4 daemons)

(Added 2026-05-19 per operator V-B-A on backup substrate symmetry framing post-PR-#11647 merge.)

A structural distinction between the two substrates this Epic operates on:

  • Knowledge Base is a cache+index over external sources. Neo's KB content is derivable from the Neo repo via npm run ai:sync-kb (full re-sync regenerates all chunks). Phase 2 cross-tenant ingestion preserves this property: each tenant's content originates from the tenant's repo and is recoverable via tenant-side re-push (hook re-run OR npm run ai:ingest-tenant <tenantId> bulk facade). KB wipe is recoverable from external sources at any v-version. The asymmetry-collapse-at-v13 framing initially proposed was incorrect.
  • Memory Core is a primary store. Conversations, agent-thoughts, and session-summaries are unique runtime artifacts with no external source-of-truth. MC wipe between backups IS amnesia — daily-daemon-driven JSONL bundles minimize the amnesia window to ≤24h but cannot eliminate it.

Phase 4 daemon value-prop reset (per the cache vs store distinction)

Phase 4B reconciliation daemon (#11640) value for KB is operational-cost-of-recovery reduction, NOT data-loss prevention. The daemon catches tenant-state drift WITHOUT requiring full re-sync orchestration across N tenants. For MC, the daemon has equivalent operational-cost-reduction value: it catches missed delete signals without requiring a restore from bundle.
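
The drift check described above can be sketched as a set difference between what a tenant claims and what the store holds. This is a minimal illustration, assuming both sides can be reduced to lists of chunk IDs; `diffManifest` and its input shapes are hypothetical and not taken from the Neo codebase.

```javascript
/**
 * Compare the chunk IDs a tenant claims (its manifest) with the IDs actually stored.
 * Hypothetical helper: the real daemon's inputs and naming may differ.
 * @param {String[]} claimedIds Chunk IDs listed in the tenant manifest
 * @param {String[]} actualIds  Chunk IDs found in the vector store for that tenant
 * @returns {{missing: String[], orphaned: String[], inSync: Boolean}}
 */
function diffManifest(claimedIds, actualIds) {
    const claimed = new Set(claimedIds),
          actual  = new Set(actualIds);

    // Claimed by the tenant but absent from the store: candidates for a targeted re-push
    const missing = [...claimed].filter(id => !actual.has(id)).sort();

    // Present in the store but no longer claimed: typically a missed tombstone
    const orphaned = [...actual].filter(id => !claimed.has(id)).sort();

    return {missing, orphaned, inSync: missing.length === 0 && orphaned.length === 0};
}

const report = diffManifest(['a', 'b', 'c'], ['b', 'c', 'd']);
console.log(report); // { missing: [ 'a' ], orphaned: [ 'd' ], inSync: false }
```

Because the result names the exact chunks that drifted, a tenant can be repaired chunk by chunk instead of through a full re-sync, which is the cost reduction this section argues for.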

Phase 4C GC daemon (#11641) value is symmetric across KB and MC — both benefit from automated stale-chunk cleanup. The cache/store distinction doesn't change GC semantics.
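
A rough sketch of stale-chunk classification, covering the three orphan causes this Epic names (source removal, parser swap, retention expiry). All field names (`sourcePath`, `parserVersion`, `ingestedAt`) and the policy shape are assumptions made for illustration; the actual chunk metadata may look different.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

/**
 * Return the reason a chunk is stale, or null when it should be kept.
 * Field names and the policy shape are illustrative assumptions.
 */
function staleReason(chunk, policy, now) {
    if (!policy.activeSources.includes(chunk.sourcePath)) {
        return 'source-removed';
    }
    if (chunk.parserVersion !== policy.parserVersion) {
        return 'parser-swapped';
    }
    if (now - chunk.ingestedAt > policy.retentionDays * DAY_MS) {
        return 'retention-expired';
    }
    return null;
}

/**
 * Collect the IDs of all stale chunks together with the reason for each.
 */
function collectStale(chunks, policy, now = Date.now()) {
    return chunks
        .map(chunk => ({id: chunk.id, reason: staleReason(chunk, policy, now)}))
        .filter(entry => entry.reason !== null);
}
```

Nothing in the check depends on which substrate a chunk lives in, so the same function could run against KB and MC collections.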

Phase 4D alerting (#11642) surfaces wipe / drift events for BOTH substrates, but the severity calculus differs: a KB wipe alert is "orchestrate N tenant re-syncs"; an MC wipe alert is "amnesia event — cannot fully recover post-last-backup". Per-substrate severity threshold configuration follows from this.
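
One way to encode the differing severity calculus is a per-substrate lookup. The severity labels (`warning`, `critical`), the `recoverable` flag, and the function name are assumptions for illustration only; the ticket does not define them.

```javascript
// Hypothetical per-substrate wipe profiles; labels and wording are illustrative
const WIPE_PROFILES = Object.freeze({
    knowledgeBase: {severity: 'warning',  recoverable: true,  action: 'orchestrate tenant re-syncs'},
    memoryCore   : {severity: 'critical', recoverable: false, action: 'restore last bundle; later data is lost'}
});

/**
 * Build an alert payload for a wipe event on the given substrate.
 * @param {String} substrate 'knowledgeBase' or 'memoryCore'
 * @param {String[]} tenantIds Tenants affected by the wipe
 */
function classifyWipe(substrate, tenantIds = []) {
    if (!Object.prototype.hasOwnProperty.call(WIPE_PROFILES, substrate)) {
        throw new Error(`Unknown substrate: ${substrate}`);
    }

    return {substrate, tenantCount: tenantIds.length, ...WIPE_PROFILES[substrate]};
}
```

Keeping the profiles in one table makes the per-substrate thresholds a configuration concern rather than branching logic inside the alerting daemon.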

Retention policy implications (per-substrate, not symmetric)

  • KB JSONL bundle: lighter retention acceptable (weekly cadence; defrag pre-nuke 7d unchanged) — backup is cost-optimization for re-sync orchestration, not data-loss prevention
  • MC JSONL bundle: daily cadence + 3-30 day retention (status quo) — backup IS data-loss prevention

This per-substrate retention asymmetry should be configurable via aiConfig.{knowledgeBase,memoryCore}.backupRetention.* — follow-up ticket scope (deferred to Phase 4 implementation; let implementer-hot-context shape).
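
A possible shape for that configuration, together with a pruning helper. The key names `cadence` and `maxAgeDays`, the concrete values, and `expiredBundles` are placeholders chosen for illustration; the shape that actually shipped under `aiConfig.backupRetention` may differ.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical config shape; key names and values are placeholders
const aiConfig = {
    knowledgeBase: {backupRetention: {cadence: 'weekly', maxAgeDays: 14}},
    memoryCore   : {backupRetention: {cadence: 'daily',  maxAgeDays: 30}}
};

/**
 * List the bundle files that have outlived the retention window of a substrate.
 * @param {{file: String, createdAt: Number}[]} bundles createdAt in epoch milliseconds
 * @param {String} substrate 'knowledgeBase' or 'memoryCore'
 */
function expiredBundles(bundles, substrate, now = Date.now(), config = aiConfig) {
    const {maxAgeDays} = config[substrate].backupRetention;

    return bundles
        .filter(bundle => now - bundle.createdAt > maxAgeDays * DAY_MS)
        .map(bundle => bundle.file);
}
```

With these placeholder values, a 20-day-old bundle would be pruned for the KB but kept for the MC, which mirrors the asymmetry described above.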

The Fix

Four sub-tickets:

  • Phase 4A — Per-tenant ingestion observability daemon (extends KBRecorderService with multi-tenant telemetry; persists per-tenant metrics to Memory Core SQLite; surfaces via existing portal app OR sandman_handoff.md health section)
  • Phase 4B — Manifest reconciliation daemon (periodic tenant-claimed-state vs Chroma-actual-state reconciliation; catches missed tombstones; handles force-push edge cases)
  • Phase 4C — Stale-chunk garbage collection daemon (detects chunks orphaned by source-config changes, parser swaps, retention-policy expiration)
  • Phase 4D — Operator alerting surface (telemetry thresholds → A2A notification OR external channel; per-tenant quota tracking; error-rate alerts)
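
To make the Phase 4A metrics concrete, here is a small aggregation sketch over ingestion events. The event shape (`tenantId`, `ok`, `latencyMs`, `embeddingTokens`) and the function `summarizeByTenant` are assumptions for illustration; this is not the `recordIngestionMetric` API, whose signature is not shown in this ticket.

```javascript
/**
 * Aggregate raw ingestion events into per-tenant metrics.
 * The event shape is a hypothetical stand-in for the real telemetry schema.
 * @param {{tenantId: String, ok: Boolean, latencyMs: Number, embeddingTokens: Number}[]} events
 * @returns {Map<String, Object>} Metrics keyed by tenant ID
 */
function summarizeByTenant(events) {
    const summary = new Map();

    for (const event of events) {
        if (!summary.has(event.tenantId)) {
            summary.set(event.tenantId, {pushes: 0, errors: 0, totalLatencyMs: 0, embeddingTokens: 0});
        }

        const entry = summary.get(event.tenantId);

        entry.pushes          += 1;
        entry.errors          += event.ok ? 0 : 1;
        entry.totalLatencyMs  += event.latencyMs;
        entry.embeddingTokens += event.embeddingTokens;
    }

    // Derive the rates once all events are counted
    for (const entry of summary.values()) {
        entry.errorRate    = entry.errors / entry.pushes;
        entry.avgLatencyMs = entry.totalLatencyMs / entry.pushes;
    }

    return summary;
}
```

The four aggregates map onto the gaps listed under The Problem: push frequency (`pushes` per collection window), error rate, ingestion latency, and embedding-budget burn.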

Acceptance Criteria

Cross-phase ACs:

  • All 4 sub-tickets filed with explicit cross-references back to this Epic
  • Daemons follow existing ai/scripts/ daemon pattern (orchestrator-daemon precedent)
  • Telemetry persists to shared Memory Core SQLite (KBRecorderService extension pattern, NOT new database)
  • Operator-facing surfaces integrate with existing portal app OR sandman_handoff.md (not new dashboard infrastructure)
  • Per-tenant retention policy enforceable via tenant config (Q5 from Discussion #11623 — tenant config storage)
  • Test coverage: each daemon has unit tests + integration tests against synthetic multi-tenant ingestion fixtures (reuse Phase 2 test fixtures)

Out of Scope

  • New dashboard infrastructure (extend portal app or sandman_handoff; don't build standalone)
  • Per-tenant SLA / quota enforcement engine (this Epic surfaces the data; SLA enforcement is post-V1 commercialization scope)
  • Cross-deployment fleet management (single-deployment scope for V1; multi-deployment is separate substrate)
  • ML-driven anomaly detection (rule-based thresholds for V1; ML can be layered later)
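
For contrast with the ML-driven detection excluded above, the rule-based thresholds planned for V1 could be as simple as the following sketch. The rule shape, the metric names, and the limits are hypothetical; real thresholds would come from operator configuration.

```javascript
// Hypothetical threshold rules; metric names and limits are placeholders
const DEFAULT_RULES = [
    {metric: 'errorRate',       max: 0.05,    severity: 'warning'},
    {metric: 'avgLatencyMs',    max: 2000,    severity: 'warning'},
    {metric: 'embeddingTokens', max: 1000000, severity: 'critical'}
];

/**
 * Evaluate one tenant's metrics against static rules.
 * Metrics that are absent or non-numeric are skipped rather than treated as violations.
 */
function evaluateThresholds(tenantId, metrics, rules = DEFAULT_RULES) {
    return rules
        .filter(rule => typeof metrics[rule.metric] === 'number' && metrics[rule.metric] > rule.max)
        .map(rule => ({
            tenantId,
            metric  : rule.metric,
            value   : metrics[rule.metric],
            limit   : rule.max,
            severity: rule.severity
        }));
}
```

Each returned entry is a plain object, so it can be forwarded to an A2A notification or an external channel without coupling the rule evaluation to the delivery mechanism.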

Avoided Traps

| Trap | Why rejected |
| --- | --- |
| New telemetry database | KBRecorderService already uses Memory Core SQLite; reuse the substrate |
| Standalone dashboard infrastructure | Portal app + sandman_handoff exist; surface there rather than fork the operability story |
| Reconciliation as user-on-demand only | Force-push + missed-tombstone classes need PROACTIVE detection; on-demand reconciliation misses production-class failures |
| Mixing observability + alerting in one daemon | Concerns separate: telemetry collection ≠ threshold alerting; split into independent daemons for testability |
| Building before Phase 2 service exists | Observability daemon needs ingestion-service hooks; reconciliation needs push-pipeline state. Sequence: Phase 2 first, then Phase 4 (most subs blocked-by Phase 2) |

Related

  • Parent meta-Epic: #11624
  • Blocked-by: Phase 2 #11626 (ingestion service + facades must stabilize)
  • Sibling phases: Phase 0/1 #11625, Phase 3 #11627
  • Daemon precedents: orchestrator-daemon, swarm-heartbeat-daemon, bridge-daemon
  • Telemetry substrate: KBRecorderService.mjs (extension target)
  • Origin Discussion: #11623 (Phase 4 not in original §7 decomposition; surfaced post-graduation)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'Phase 4 KB ingestion observability daemon multi-tenant'})
  • ask_knowledge_base({query: 'KBRecorderService GapInferenceEngine daemon pattern', type: 'src'})
  • Existing daemons: ai/scripts/orchestrator-daemon.mjs, swarm-heartbeat-daemon.mjs, bridge-daemon.mjs are pattern references
  • Operator framing 2026-05-19: "future sessions have amnesia ... better iron out subs now while context is hot" — applied to surfacing operational-substrate that the Discussion didn't decompose

tobiu referenced in commit 3c47411 - "feat(ai): configurable bundle retention via aiConfig.backupRetention (#11663) (#11664)" on May 20, 2026, 8:00 AM
tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667)" on May 20, 2026, 8:01 AM
tobiu referenced in commit ad4a108 - "feat(ai): KB Multi-Tenant Health section in Sandman handoff (#11639) (#11708)" on May 21, 2026, 8:03 AM