Context
Sub of Phase 4 Epic #11628 (meta-Epic #11624).
Closes the operability loop — telemetry (Phase 4A) without alerting is just data. Alerting surfaces actionable issues to cloud operators.
The Problem
Phase 4A collects telemetry; Phase 4B reconciles; Phase 4C garbage-collects. But cloud operators need PROACTIVE notification when thresholds breach:
- Tenant quota exhausted (push frequency > threshold; chunk count > threshold)
- Tenant error rate spike (errors/min > threshold)
- Tenant schema-version drift (old schemaVersion seen → deprecation warning)
- Reconciliation finds drift > threshold (tenant-side push pipeline broken?)
- Embedding-budget burn threatens provider quota
Without alerting, operators must manually poll telemetry tables. A production-grade cloud Agent OS needs push-based ops.
The Fix
New daemon at ai/scripts/kb-alerting-daemon.mjs rather than embedding in the Phase 4A daemon (decision: split for testability, per the Phase 4 Epic Avoided Trap).
Threshold-rule engine:
    // aiConfig.knowledgeBase.alertRules
    [
        {
            metric: 'tenant.error_rate_5min',
            threshold: 0.1, // 10% error rate
            severity: 'warning',
            channels: ['a2a:AGENT:*', 'console']
        },
        {
            metric: 'tenant.chunk_count',
            threshold: 100000, // chunks per tenant
            severity: 'critical',
            channels: ['a2a:operator', 'webhook:https://...']
        },
        // ...
    ]

Channels:
- A2A: add_message({to: '@<operator-identity>' | 'AGENT:*', subject: '[alert] ...', body: ...}). Reuses the existing A2A substrate.
- Console: logger.warn / logger.error in KB server logs.
- Webhook (V1.5): POST to an external URL per tenant config (Slack, PagerDuty, etc.).
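As a rough illustration of the rule engine and channel specs above, the sketch below evaluates rules against one tenant's metrics and splits each channel spec into a kind and a target. The names parseChannel and evaluateRules, the flat metrics object, and the tenant id are hypothetical and not part of the ticket; it assumes a rule fires only when the metric value is strictly greater than its threshold.

```javascript
// Hypothetical helper: split a channel spec such as 'a2a:AGENT:*' or
// 'webhook:https://...' into a kind and a target.
function parseChannel(spec) {
    const idx = spec.indexOf(':');
    // 'console' has no target; everything after the first ':' is the target.
    return idx === -1
        ? {kind: spec, target: null}
        : {kind: spec.slice(0, idx), target: spec.slice(idx + 1)};
}

// Evaluate every rule against one tenant's metrics (assumed shape: a flat
// object of metric name -> number). Metrics missing from the object never fire.
function evaluateRules(rules, tenantId, metrics) {
    const fired = [];
    for (const rule of rules) {
        const value = metrics[rule.metric];
        if (typeof value !== 'number' || !(value > rule.threshold)) continue;
        // One alert per listed channel; there is no implicit default channel.
        for (const spec of rule.channels) {
            fired.push({
                tenantId,
                metric: rule.metric,
                value,
                threshold: rule.threshold,
                severity: rule.severity,
                channel: parseChannel(spec)
            });
        }
    }
    return fired;
}

const rules = [
    {metric: 'tenant.error_rate_5min', threshold: 0.1, severity: 'warning', channels: ['a2a:AGENT:*', 'console']},
    {metric: 'tenant.chunk_count', threshold: 100000, severity: 'critical', channels: ['console']}
];

// Only the error-rate rule is over its threshold, so two alerts are produced
// (one per channel listed on that rule).
const alerts = evaluateRules(rules, 'tenant-a', {
    'tenant.error_rate_5min': 0.25,
    'tenant.chunk_count': 42
});
console.log(alerts.map(a => `${a.severity} ${a.metric} -> ${a.channel.kind}`));
```

Splitting on the first colon only keeps targets such as AGENT:* and full webhook URLs intact, since both contain further colons.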
Acceptance Criteria
Out of Scope
- Threshold tuning per tenant (V1: global thresholds; per-tenant tuning future ticket)
- ML-driven anomaly detection (rule-based for V1)
- Alert UI / dashboard (sandman_handoff + portal app surface for V1)
- Webhook channel may defer to V1.5 if measurement justifies
Contract Ledger
Added at intake by @neo-opus-ada (Claude Code) 2026-05-21 — satisfies the ticket-intake §7 Contract Completeness readiness gate (intake comment: https://github.com/neomjs/neo/issues/11642#issuecomment-4504320783). The original author session is inactive; per ticket-intake §7 the claiming maintainer authors the missing ledger. This ledger folds in @neo-gpt's "Phase 4D Alerting Channel Overlay" (comment 2026-05-20T03:04Z) — its 6 substrate-grounded recommendations are the binding contract here. Tier target: T3 (Explicit Matrix). The ledger is the precise contract; the loose Acceptance Criteria checklist above is unchanged and is refined by these rows.
| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
|---|---|---|---|---|---|
| aiConfig.knowledgeBase.alertRules (config schema) | #11628 Phase 4D; this ticket; @neo-gpt overlay #1/#3 | An array of rule objects {metric, threshold, severity, channels, deliveryMode?}. metric is a per-tenant telemetry field name from KBRecorderService.getTenantIngestionRollup (errorRate, eventCount, errorEvents, chunksEmbedded, …). threshold is a number; a rule fires when the metric value exceeds it. severity is warning or critical. channels is an explicit, non-empty array per rule; there is no implicit default channel. deliveryMode is wake (default) or audit. The daemon evaluates every rule against the current per-tenant rollup each tick. | Missing / empty alertRules → the daemon runs and fires nothing (no-op, not an error). A malformed rule (unknown metric, non-numeric threshold, bad severity, empty channels) → skip that rule with a logger.warn; the daemon continues. No a2a:AGENT:* default: a rule that does not explicitly list a broadcast channel never broadcasts. | Yes: aiConfig template + JSDoc + a learn/agentos/ note | Unit: rule-schema parse + validation (valid / malformed / empty) |
| A2A alert channel: a2a:<target> | this ticket; the add_message MCP contract; @neo-gpt overlay #2/#3 | A channels entry a2a:<target> dispatches add_message({to: <target>, subject: '[alert] <severity>: <metric> over <threshold> (tenant <tenantId>)', body: <detail>}). <target> is a canonical @<identity> direct recipient (first-class) OR AGENT:* for broadcast; broadcast occurs only when a rule explicitly lists a2a:AGENT:*. deliveryMode: wake → wakeful (wakeSuppressed omitted); deliveryMode: audit → wakeSuppressed: true (durable mailbox-only record, no wake). | An invalid <target> (not a registered AgentIdentity and not AGENT:*) → skip the channel with a logger.warn before dispatch; never dispatch to an unresolved target. An add_message failure → logger.error, best-effort; the daemon continues. | Yes: daemon / service JSDoc | Unit: channel dispatch (direct-DM, explicit broadcast, invalid-target rejection, wake vs. audit delivery mode) |
| Console alert channel: console | this ticket; ai/mcp/server/knowledge-base/logger.mjs | A channels entry console dispatches logger.warn for severity: warning and logger.error for severity: critical, with the alert detail. | The logger is always available; a throw inside logging is swallowed (best-effort). | Yes: JSDoc | Unit: console dispatch maps severity → logger level |
| Webhook alert channel: webhook:<url> | this ticket Out-of-Scope; @neo-gpt overlay #6 | V1.5-deferred. V1 recognizes a webhook: channel spec but does NOT POST; it emits a logger.warn ("webhook channel deferred to V1.5") and skips. V1.5 (a separate ticket) ships the POST path only alongside an allowlisted-target + secret-handling story, because arbitrary-URL POST from alert config is a higher-blast surface than the A2A / console paths. | V1: a webhook: channel spec → warn + skip; no network call is made. | Yes: when V1.5 ships | Unit: a V1 webhook: spec produces warn-skip with no network call |
| Alert cadence / hysteresis | this ticket AC; @neo-gpt overlay #4 | A fired alert is suppressed from re-firing within a cooldown window (default 1h, configurable via aiConfig). The cooldown key is the tuple (tenantId, metric, severity, channelTarget), so a single noisy tenant cannot cause a wake-storm, while distinct tenants / metrics / targets alert independently. | Cooldown state is in-memory per daemon process; a daemon restart resets it (acceptable: at most one extra alert per key after a restart). | Yes: JSDoc | Unit: cooldown suppresses a re-fire within the window and permits it after; per-key independence |
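To make the validation and cadence rows concrete, here is a minimal sketch of a rule validator and an in-memory cooldown gate. The names validateRule, CooldownGate, and tryFire are hypothetical, as is the set of known metric names passed in; the real daemon may structure this differently. It assumes the cooldown is measured from the last alert that actually fired for a key.

```javascript
const SEVERITIES = new Set(['warning', 'critical']);

// Returns a reason string for a malformed rule, or null when the rule is usable.
// knownMetrics is an assumed Set of per-tenant rollup field names.
function validateRule(rule, knownMetrics) {
    if (!rule || typeof rule !== 'object') return 'rule is not an object';
    if (!knownMetrics.has(rule.metric)) return `unknown metric: ${rule.metric}`;
    if (typeof rule.threshold !== 'number' || Number.isNaN(rule.threshold)) return 'non-numeric threshold';
    if (!SEVERITIES.has(rule.severity)) return `bad severity: ${rule.severity}`;
    if (!Array.isArray(rule.channels) || rule.channels.length === 0) return 'empty channels';
    return null;
}

// In-memory cooldown keyed by (tenantId, metric, severity, channelTarget).
// State lives only in this object, so a process restart resets it.
class CooldownGate {
    constructor(windowMs = 60 * 60 * 1000) { // default window: 1h
        this.windowMs = windowMs;
        this.lastFired = new Map();
    }

    // Returns true and records the fire time when the key is outside its
    // cooldown window; returns false (suppressed) otherwise.
    tryFire({tenantId, metric, severity, channelTarget}, now = Date.now()) {
        const key = JSON.stringify([tenantId, metric, severity, channelTarget]);
        const last = this.lastFired.get(key);
        if (last !== undefined && now - last < this.windowMs) return false;
        this.lastFired.set(key, now);
        return true;
    }
}

const gate = new CooldownGate();
const key = {tenantId: 'tenant-a', metric: 'errorRate', severity: 'warning', channelTarget: 'a2a:AGENT:*'};
const HOUR = 60 * 60 * 1000;

console.log(gate.tryFire(key, 0));        // true: first fire
console.log(gate.tryFire(key, HOUR / 2)); // false: inside the 1h window
console.log(gate.tryFire(key, HOUR));     // true: window has elapsed
// A different tenant has its own key and is not suppressed.
console.log(gate.tryFire({...key, tenantId: 'tenant-b'}, HOUR / 2)); // true
```

A caller following the ledger would log the returned reason with logger.warn and skip the rule, rather than throwing, so one bad rule cannot stop the daemon.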
Related
- Parent: #11628
- Blocked-by: Phase 4A (telemetry source)
- A2A substrate: existing add_message MCP tool
- Pattern reference: swarm-heartbeat-daemon.mjs (existing daemon-with-A2A-output pattern)
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
add_message is the A2A delivery primitive
logger substrate in ai/mcp/server/knowledge-base/logger.mjs is the console channel
swarm-heartbeat-daemon.mjs emits A2A messages — pattern reference for alert delivery