Frontmatter
id: 10347
title: Investigate intermittent SENT_TO edge cull on Antigravity-side A2A messages
state: Closed
labels: bug, ai, architecture
assignees: []
createdAt: Apr 26, 2026, 12:47 AM
updatedAt: Apr 26, 2026, 3:43 AM
githubUrl: https://github.com/neomjs/neo/issues/10347
author: neo-opus-4-7
commentsCount: 1
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: Apr 26, 2026, 3:43 AM

Investigate intermittent SENT_TO edge cull on Antigravity-side A2A messages

Closed | bug, ai, architecture
neo-opus-4-7 commented on Apr 26, 2026, 12:47 AM

Author's Note: Filed by Claude Opus 4.7 (Claude Code) during session b5a17132-7324-46e1-b73e-038825bb4d55 per @tobiu's directive ("i strongly agree this needs an investigation"). Empirical anchor: cross-family A2A coordination during this session-arc shows reproducible asymmetric routing failure — @neo-gemini-3-1-pro's outgoing messages from Antigravity sometimes route correctly, sometimes orphan with zero SENT_TO edges. Claude-side outgoing path appears stable. The bug is intermittent on her side specifically.

Context

The multi-day mailbox-debugging arc (#10174, #10269, #10308, #10325, #10330/#10331) closed the caller-format-mismatch family of silent-cull bugs. In addition, #10325's sharedEntity:true primitive resolved the RLS read path, #10330 unified identity-format normalization, and #10331 simplified normalizeMailboxTarget to a single-rule canonical format.

Yet: during this session-arc Gemini's outgoing A2A messages still exhibit intermittent silent-cull. Empirical pattern observed:

| Time | Subject | Routing | Reached My Mailbox |
|---|---|---|---|
| 20:31:23Z | re: PR #10340 review cycle 1 | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 21:37:58Z | re: PR #10340 cycle 1 & #10333 | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 21:38:55Z | re: #10336 is also done | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 22:08:42Z | A2A Channel Restored + Taking #10338 | ✓ SENT_BY @neo-gemini-3-1-pro + SENT_TO @neo-opus-4-7 | ✅ delivered |
| 22:19:44Z | Re: task | SENT_TO @alice (wrong target) | ❌ wrong recipient |
| 22:31:12Z | Review requested for PR #10342 | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 22:40:34Z | Review Request: PRs #10345 and #10346 | NO SENT_TO/SENT_BY edges | ❌ orphaned |

Pattern: her A2A had ONE clean-routing window after she did git pull origin dev + restart-Antigravity (the 22:08:42Z message), then regressed back to silent-cull state for subsequent messages.

@tobiu's "postman ping" pattern is forced to be load-bearing during regression intervals, which undermines both the autonomy paradigm and the "swarm evolution when tobi not relaying" goal.

The Problem

Three classes of failure observed on her side:

  1. No SENT_TO/SENT_BY edges (silent-cull at FK check) — most common; #10284-class
  2. Wrong SENT_TO target (e.g., @alice instead of @neo-opus-4-7) — caller passing wrong identifier; possibly stale state-machine context fixture leaking
  3. Brief working window then regression — suggests state-mutation between calls
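
The three classes above can be told apart mechanically once a message's routing edges are read back. The following is a minimal sketch, assuming a hypothetical edge shape of `{type, target}`; the real graph schema is not shown in this ticket and may differ. `classifyRouting` and `regressedAfterWorkingWindow` are illustrative names, not existing functions.

```javascript
// Hypothetical edge shape: { type: 'SENT_BY' | 'SENT_TO', target: '@identity' }.
// This is an assumption for illustration, not the actual graph schema.
function classifyRouting(edges, expectedSender, expectedRecipient) {
    const sentBy = edges.filter(e => e.type === 'SENT_BY').map(e => e.target);
    const sentTo = edges.filter(e => e.type === 'SENT_TO').map(e => e.target);

    if (sentBy.length === 0 && sentTo.length === 0) {
        return 'orphaned';         // class 1: no routing edges persisted
    }
    if (!sentTo.includes(expectedRecipient)) {
        return 'wrong-recipient';  // class 2: SENT_TO points elsewhere
    }
    if (!sentBy.includes(expectedSender)) {
        return 'incomplete';       // not an observed class; avoids reporting partial writes as delivered
    }
    return 'delivered';
}

// Class 3 is a property of the sequence: at least one delivery,
// followed by at least one later failure.
function regressedAfterWorkingWindow(outcomes) {
    const firstDelivered = outcomes.indexOf('delivered');
    return firstDelivered !== -1 &&
        outcomes.slice(firstDelivered + 1).some(o => o !== 'delivered');
}

// Four rows from the observed timeline, re-expressed in the assumed shape.
const SENDER = '@neo-gemini-3-1-pro';
const RECIPIENT = '@neo-opus-4-7';
const observed = [
    [],                                                                          // 20:31:23Z
    [{type: 'SENT_BY', target: SENDER}, {type: 'SENT_TO', target: RECIPIENT}],   // 22:08:42Z
    [{type: 'SENT_TO', target: '@alice'}],                                       // 22:19:44Z
    []                                                                           // 22:31:12Z
].map(edges => classifyRouting(edges, SENDER, RECIPIENT));

console.log(observed, regressedAfterWorkingWindow(observed));
```

For these four sample rows the classifier yields orphaned, delivered, wrong-recipient, orphaned, and the regression check returns true, which matches the "one clean window, then regression" pattern described above.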

Claude-side outgoing path remains stable (verified via direct SQL: my outgoing messages consistently have correct SENT_BY @neo-opus-4-7 + SENT_TO @neo-gemini-3-1-pro). Asymmetry is the diagnostic anchor.

The Architectural Reality

Possible mechanisms (NOT prescribing — investigation should empirically narrow):

  1. MC singleton state divergence in Antigravity: Antigravity's MC server may load multiple agent contexts that share state; RequestContextService.getAgentIdentityNodeId() may bind to the wrong agent identity for some addMessage calls
  2. Caller-format regression in test paths or sub-tools: Gemini's MC may be calling addMessage with stale AGENT:bare-name format (the format #10331 normalizer was supposed to fix) — possibly via test fixtures leaking, role/human prefix routes, or other paths bypassing the normalizer
  3. Cache invalidation on identity binding: the brief-working-window pattern suggests post-restart binding works, then some subsequent operation invalidates the binding
  4. AGENT:* sentinel handling regression: if to: 'AGENT:*' is being normalized to @AGENT:* or similar via the new #10331 normalizer logic, broadcast-routing breaks
  5. Vicinity-cache miss under specific call patterns: getAdjacentNodes may not load the recipient identity into vicinity cache for FK check, causing cull even with correct format
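
Hypothesis 4 is the easiest of the five to illustrate in isolation. The sketch below is not the real normalizeMailboxTarget; it assumes the canonical format is `@name` and that `AGENT:*` is the broadcast sentinel, and it contrasts a hypothetical naive single-rule normalizer with one that lets the sentinel pass through. Both function names are invented for illustration.

```javascript
const BROADCAST_SENTINEL = 'AGENT:*';

// Hypothetical naive rule: prefix anything that lacks '@'.
// Applied to the sentinel, it produces a target no identity node would match.
function naiveNormalize(target) {
    return target.startsWith('@') ? target : '@' + target;
}

// Hypothetical sentinel-aware variant: the sentinel is returned untouched,
// and the stale 'AGENT:bare-name' format is mapped to the assumed '@name' form.
function normalizeTargetSketch(target) {
    if (target === BROADCAST_SENTINEL) {
        return target;
    }
    const bare = target.replace(/^AGENT:/, '').replace(/^@/, '');
    return '@' + bare;
}

console.log(naiveNormalize('AGENT:*'));                    // '@AGENT:*'
console.log(normalizeTargetSketch('AGENT:*'));             // 'AGENT:*'
console.log(normalizeTargetSketch('AGENT:neo-opus-4-7'));  // '@neo-opus-4-7'
```

Whether the real normalizer behaves like either variant is exactly what Phase 1 logging of pre-normalize and post-normalize values is meant to establish.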

The Fix

Diagnostic-first; this ticket is investigation, not prescription. Phases:

Phase 1 — Empirical bisection: capture concrete failure cases on Antigravity side. Each failure case logs:

  • Caller-side to: parameter passed to addMessage
  • Pre-normalize value
  • Post-normalize value
  • FK-check verifyStmt result (count vs 2 expected)
  • Identity-binding me value
  • Whether cull-warning emitted
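
One possible shape for the per-failure log record, covering the six fields listed above, is sketched here. The field names, the `event` label, and the `buildCullRecord` helper are assumptions made for illustration; only the "2 expected" count comes from this ticket.

```javascript
// "count vs 2 expected": the FK check should find both endpoints.
const EXPECTED_ENDPOINT_COUNT = 2;

// Hypothetical helper that assembles one structured log record per addMessage call.
function buildCullRecord({callerTo, preNormalize, postNormalize, verifyCount, me, cullWarningEmitted}) {
    const culled = verifyCount !== EXPECTED_ENDPOINT_COUNT;

    return {
        event: 'sent_to_fk_check',  // assumed label, not an existing log event
        callerTo,                   // caller-side to: parameter
        preNormalize,
        postNormalize,
        verifyCount,
        expectedCount: EXPECTED_ENDPOINT_COUNT,
        me,                         // identity-binding value at call time
        culled,
        cullWarningEmitted,
        silentCull: culled && !cullWarningEmitted  // the case this ticket is chasing
    };
}

// Example payload with made-up values, emitted as one JSON line.
const record = buildCullRecord({
    callerTo: 'AGENT:neo-opus-4-7',
    preNormalize: 'AGENT:neo-opus-4-7',
    postNormalize: '@neo-opus-4-7',
    verifyCount: 1,
    me: '@neo-gemini-3-1-pro',
    cullWarningEmitted: false
});

console.log(JSON.stringify(record));
```

Emitting one JSON line per call keeps the records easy to filter by `silentCull` when correlating with direct SQLite queries.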

This is #10284 Phase 1 (make-failure-loud) territory — surface the cull at write-time so we have concrete signals.

Phase 2 — Root-cause narrow: based on Phase 1 logs, identify which of the 5 hypotheses (or others not listed) actually fires. Fix targets the substrate, not the symptom.

Phase 3 — Test coverage: regression-class test exercising the specific failure pattern + cross-process scenarios using identity-binding fixtures.

Acceptance Criteria

  • Phase 1 #10284-style observability landed (post-linkNodes verification with structured-logging on cull events)
  • At minimum 3 failure cases captured from Antigravity-side empirically with structured log payloads
  • Root cause identified (hypothesis 1-5 confirmed/refuted, OR new hypothesis surfaced empirically)
  • Substrate fix lands at the right layer (identity-binding / normalizer / vicinity-cache / sentinel-handling — depends on Phase 2)
  • Regression test covers the specific failure pattern
  • @neo-gemini-3-1-pro's outgoing A2A messages route correctly across at least 5 consecutive sends from her Antigravity MC
  • Claude-side stability preserved (no regression on Claude Code → Gemini path)
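
The "at least 5 consecutive sends" criterion can be checked from a chronological list of per-message outcomes. This sketch takes one interpretation, that the most recent sends must all be delivered, since an earlier streak followed by a regression is the failure pattern this ticket describes. The outcome labels and function names are assumptions.

```javascript
// Counts how many of the most recent sends were delivered without interruption.
function trailingDeliveredStreak(outcomes) {
    let streak = 0;
    for (let i = outcomes.length - 1; i >= 0; i--) {
        if (outcomes[i] !== 'delivered') {
            break;
        }
        streak++;
    }
    return streak;
}

function meetsConsecutiveCriterion(outcomes, required = 5) {
    return trailingDeliveredStreak(outcomes) >= required;
}

// A single clean window followed by a regression does not satisfy the criterion.
console.log(meetsConsecutiveCriterion(
    ['orphaned', 'delivered', 'wrong-recipient', 'orphaned']
));
```

Measuring from the end of the list means the check resets whenever a new orphaned or misrouted message appears.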

Out of Scope

  • #10284 Phase 1 implementation itself — sibling ticket; this ticket is downstream consumer. Coordinate via parent-link if useful
  • Replacing optimistic-concurrency with pessimistic locking — unrelated to this routing bug
  • Multi-tenant RLS hardening — orthogonal; routing-edge persistence is the scope here
  • Antigravity harness internals — we don't have direct visibility into Antigravity's MC process. Investigation works from outside via direct SQLite queries + cross-process behavior diagnosis
  • Generic flaky-test instability — this is a routing-edge persistence issue, not test-pollution

Avoided Traps

  • "Just restart Antigravity again" — rejected per memory feedback_verify_effect_not_just_success; restart fixes briefly then regresses. Not a stable mitigation.
  • Diagnose by intuition — rejected; the 5 hypotheses are surface speculation. Phase 1 empirical capture must precede prescription.
  • Single-fix assumption — rejected; the failure pattern shows brief-working window + regression. Could be multiple stacked bugs (intermittent state corruption + caller-format edge case + cache invalidation).
  • Premature optimization — rejected; investigation first, fix second. Make the failure loud, then catch it precisely.
  • Cross-family-blame framing — rejected per @tobiu's "we don't blame, we learn from incidents." This is a substrate gap, not a Gemini-side authorial mistake.

Related

  • #10284 — Phase 1 make-failure-loud (sibling ticket; this consumer relies on its observability)
  • #10174 — original normalizeMailboxTarget introduction (load-bearing primitive)
  • #10325 — sharedEntity:true primitive (read-path RLS)
  • #10330 / #10331 — single-canonical identity format migration (caller-format normalization)
  • #10257 — getAdjacentNodes lazy-load pattern (vicinity-cache substrate)
  • feedback_verify_effect_not_just_success — durable anchor for the verify-mechanism-not-claim discipline
  • #10311 — Epic: Institutionalizing Swarm Autonomy (this bug undermines the autonomy goal until resolved)

Origin Session ID: b5a17132-7324-46e1-b73e-038825bb4d55
Retrieval Hint: "Antigravity SENT_TO silent-cull intermittent A2A routing investigation MC singleton state divergence vicinity cache identity binding regression empirical bisection #10284 phase-1 make-failure-loud"

tobiu referenced in commit 96eee03 - "fix(ai): Atomic Load-Check-Insert for GraphService linkNodes (#10353)" on Apr 26, 2026, 3:43 AM
tobiu closed this issue on Apr 26, 2026, 3:43 AM
tobiu referenced in commit 54600d7 - "refactor(ai-services): align Neo classNames with flat SDK locations across M6 servers (#11005) (#11007)" on May 9, 2026, 1:18 PM