LearnNewsExamplesServices
Frontmatter
id11642
titlePhase 4D — Operator Alerting Surface: Telemetry Thresholds → A2A + External Notification
stateClosed
labels
enhancementaiarchitecture
assigneesneo-opus-ada
createdAtMay 19, 2026, 1:57 PM
updatedAtJun 7, 2026, 7:13 PM
githubUrlhttps://github.com/neomjs/neo/issues/11642
authorneo-opus-ada
commentsCount2
parentIssue11628
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[x] 11639 Phase 4A — Per-Tenant Ingestion Observability Daemon (KBRecorderService Extension)
blocking[]
closedAtMay 21, 2026, 8:07 AM

Phase 4D — Operator Alerting Surface: Telemetry Thresholds → A2A + External Notification

Closed v13.0.0/archive-v13-0-0-chunk-12 enhancementaiarchitecture
neo-opus-ada
neo-opus-ada commented on May 19, 2026, 1:57 PM

Context

Sub of Phase 4 Epic #11628 (meta-Epic #11624).

Closes the operability loop — telemetry (Phase 4A) without alerting is just data. Alerting surfaces actionable issues to cloud operators.

The Problem

Phase 4A collects telemetry; Phase 4B reconciles; Phase 4C garbage-collects. But cloud operators need PROACTIVE notification when thresholds breach:

  • Tenant quota exhausted (push frequency > threshold; chunk count > threshold)
  • Tenant error rate spike (errors/min > threshold)
  • Tenant schema-version drift (old schemaVersion seen → deprecation warning)
  • Reconciliation finds drift > threshold (tenant-side push pipeline broken?)
  • Embedding-budget burn threatens provider quota

Without alerting, operators must manually poll telemetry tables. Production-grade cloud Agent OS needs push-based ops.

The Fix

New daemon (or integration with Phase 4A): ai/scripts/kb-alerting-daemon.mjs OR embedded in Phase 4A daemon (decision: split for testability per Phase 4 Epic Avoided Trap).

Threshold-rule engine:

// aiConfig.knowledgeBase.alertRules
[
  {
    metric: 'tenant.error_rate_5min',
    threshold: 0.1,  // 10% error rate
    severity: 'warning',
    channels: ['a2a:AGENT:*', 'console']
  },
  {
    metric: 'tenant.chunk_count',
    threshold: 100000,  // chunks per tenant
    severity: 'critical',
    channels: ['a2a:operator', 'webhook:https://...']
  },
  // ...
]

Channels:

  1. A2Aadd_message({to: '@<operator-identity>' | 'AGENT:*', subject: '[alert] ...', body: ...}). Reuses existing A2A substrate.
  2. Consolelogger.warn / logger.error in KB server logs
  3. Webhook (V1.5) — POST to external URL per tenant config (Slack, PagerDuty, etc.)

Acceptance Criteria

  • aiConfig.knowledgeBase.alertRules schema defined
  • Rule engine: matches telemetry metrics against rules; fires alerts when threshold crossed
  • A2A channel: alerts emit add_message per rule.channels[i]
  • Console channel: alerts emit logger.warn / logger.error
  • Webhook channel (V1.5 if shipped): POSTs to per-tenant URL
  • Hysteresis: alerts don't re-fire within configurable cooldown window (default 1h)
  • Unit tests: rule matching, channel dispatch, hysteresis
  • Integration test: simulate threshold breach → alert delivered

Out of Scope

  • Threshold tuning per tenant (V1: global thresholds; per-tenant tuning future ticket)
  • ML-driven anomaly detection (rule-based for V1)
  • Alert UI / dashboard (sandman_handoff + portal app surface for V1)
  • Webhook channel may defer to V1.5 if measurement justifies

Contract Ledger

Added at intake by @neo-opus-ada (Claude Code) 2026-05-21 — satisfies the ticket-intake §7 Contract Completeness readiness gate (intake comment: https://github.com/neomjs/neo/issues/11642#issuecomment-4504320783). The original author session is inactive; per ticket-intake §7 the claiming maintainer authors the missing ledger. This ledger folds in @neo-gpt's "Phase 4D Alerting Channel Overlay" (comment 2026-05-20T03:04Z) — its 6 substrate-grounded recommendations are the binding contract here. Tier target: T3 (Explicit Matrix). The ledger is the precise contract; the loose Acceptance Criteria checklist above is unchanged and is refined by these rows.

Target Surface Source of Authority Proposed Behavior Fallback / Edge Case Docs Evidence
aiConfig.knowledgeBase.alertRules — config schema #11628 Phase 4D; this ticket; @neo-gpt overlay #1/#3 An array of rule objects {metric, threshold, severity, channels, deliveryMode?}. metric is a per-tenant telemetry field name from KBRecorderService.getTenantIngestionRollup (errorRate, eventCount, errorEvents, chunksEmbedded, …). threshold is a number; a rule fires when the metric value exceeds it. severity is warning or critical. channels is an explicit, non-empty array per rule — there is no implicit default channel. deliveryMode is wake (default) or audit. The daemon evaluates every rule against the current per-tenant rollup each tick. Missing / empty alertRules → the daemon runs and fires nothing (no-op, not an error). A malformed rule (unknown metric, non-numeric threshold, bad severity, empty channels) → skip that rule with a logger.warn; the daemon continues. No a2a:AGENT:* default — a rule that does not explicitly list a broadcast channel never broadcasts. Yes — aiConfig template + JSDoc + a learn/agentos/ note Unit: rule-schema parse + validation (valid / malformed / empty)
A2A alert channel — a2a:<target> this ticket; the add_message MCP contract; @neo-gpt overlay #2/#3 A channels entry a2a:<target> dispatches add_message({to: <target>, subject: '[alert] <severity>: <metric> over <threshold> (tenant <tenantId>)', body: <detail>}). <target> is a canonical @<identity> direct recipient (first-class) OR AGENT:* for broadcast — broadcast occurs only when a rule explicitly lists a2a:AGENT:*. deliveryMode: wake → wakeful (wakeSuppressed omitted); deliveryMode: auditwakeSuppressed: true (durable mailbox-only record, no wake). An invalid <target> (not a registered AgentIdentity and not AGENT:*) → skip the channel with a logger.warn before dispatch; never dispatch to an unresolved target. An add_message failure → logger.error, best-effort; the daemon continues. Yes — daemon / service JSDoc Unit: channel dispatch — direct-DM, explicit broadcast, invalid-target rejection, wake vs. audit delivery mode
Console alert channel — console this ticket; ai/mcp/server/knowledge-base/logger.mjs A channels entry console dispatches logger.warn for severity: warning and logger.error for severity: critical, with the alert detail. The logger is always available; a throw inside logging is swallowed (best-effort). Yes — JSDoc Unit: console dispatch maps severity → logger level
Webhook alert channel — webhook:<url> this ticket Out-of-Scope; @neo-gpt overlay #6 V1.5-deferred. V1 recognizes a webhook: channel spec but does NOT POST — it emits a logger.warn ("webhook channel deferred to V1.5") and skips. V1.5 (a separate ticket) ships the POST path only alongside an allowlisted-target + secret-handling story — arbitrary-URL POST from alert config is a higher-blast surface than the A2A / console paths. V1: a webhook: channel spec → warn + skip; no network call is made. Yes — when V1.5 ships Unit: a V1 webhook: spec produces warn-skip with no network call
Alert cadence / hysteresis this ticket AC; @neo-gpt overlay #4 A fired alert is suppressed from re-firing within a cooldown window (default 1h, configurable via aiConfig). The cooldown key is the tuple (tenantId, metric, severity, channelTarget) — a single noisy tenant cannot cause a wake-storm, while distinct tenants / metrics / targets alert independently. Cooldown state is in-memory per daemon process; a daemon restart resets it (acceptable — at most one extra alert per key after a restart). Yes — JSDoc Unit: cooldown suppresses a re-fire within the window and permits it after; per-key independence

Related

  • Parent: #11628
  • Blocked-by: Phase 4A (telemetry source)
  • A2A substrate: existing add_message MCP tool
  • Pattern reference: swarm-heartbeat-daemon.mjs (existing daemon-with-A2A-output pattern)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • add_message is the A2A delivery primitive
  • logger substrate in ai/mcp/server/knowledge-base/logger.mjs is the console channel
  • swarm-heartbeat-daemon.mjs emits A2A messages — pattern reference for alert delivery
tobiu referenced in commit 5d64a1f - "feat(ai): KB ingestion telemetry schema + recordIngestionMetric API (#11639) (#11667) on May 20, 2026, 8:01 AM
tobiu referenced in commit b1b7005 - "feat(ai): KB operator-alerting daemon — Phase 4D (#11642) (#11709) on May 21, 2026, 8:07 AM
tobiu closed this issue on May 21, 2026, 8:07 AM