Author's Note: Filed by Claude Opus 4.7 (Claude Code) during session b5a17132-7324-46e1-b73e-038825bb4d55 per @tobiu's directive ("i strongly agree this needs an investigation"). Empirical anchor: cross-family A2A coordination during this session-arc shows reproducible asymmetric routing failure — @neo-gemini-3-1-pro's outgoing messages from Antigravity sometimes route correctly, sometimes orphan with zero SENT_TO edges. Claude-side outgoing path appears stable. The bug is intermittent on her side specifically.
Context
The multi-day mailbox-debugging arc (#10174 → #10269 → #10308 → #10325 → #10330/#10331) closed the caller-format-mismatch family of silent-cull bugs. Along the way, #10325's sharedEntity:true primitive resolved the RLS read path, #10330 unified identity-format normalization, and #10331 simplified normalizeMailboxTarget to a single-rule canonical format.
Yet: during this session-arc Gemini's outgoing A2A messages still exhibit intermittent silent-cull. Empirical pattern observed:
| Time | Subject | Routing | Reached My Mailbox |
| --- | --- | --- | --- |
| 20:31:23Z | re: PR #10340 review cycle 1 | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 21:37:58Z | re: PR #10340 cycle 1 & #10333 | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 21:38:55Z | re: #10336 is also done | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 22:08:42Z | A2A Channel Restored + Taking #10338 | ✓ SENT_BY @neo-gemini-3-1-pro + SENT_TO @neo-opus-4-7 | ✅ delivered |
| 22:19:44Z | Re: task | SENT_TO @alice (wrong target) | ❌ wrong recipient |
| 22:31:12Z | Review requested for PR #10342 | NO SENT_TO/SENT_BY edges | ❌ orphaned |
| 22:40:34Z | Review Request: PRs #10345 and #10346 | NO SENT_TO/SENT_BY edges | ❌ orphaned |
Pattern: her A2A path had exactly ONE clean-routing window, right after she ran git pull origin dev and restarted Antigravity (the 22:08:42Z message). It then regressed: the next message was misrouted to @alice, and the two after that were orphaned again.
@tobiu's "postman ping" pattern becomes load-bearing during regression intervals, which undermines the autonomy paradigm and the "swarm evolution when tobi not relaying" goal.
The Problem
Three classes of failure observed on her side:
- No SENT_TO/SENT_BY edges (silent-cull at FK check): most common; #10284-class
- Wrong SENT_TO target (e.g., @alice instead of @neo-opus-4-7): the caller passes the wrong identifier, possibly because a stale state-machine context fixture is leaking
- Brief working window, then regression: suggests state mutation between calls
Claude-side outgoing path remains stable (verified via direct SQL: my outgoing messages consistently have correct SENT_BY @neo-opus-4-7 + SENT_TO @neo-gemini-3-1-pro). Asymmetry is the diagnostic anchor.
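The failure classes can be told apart mechanically from a message's persisted edges. The following is a minimal sketch, not an existing helper: the edge shape (`{type, target}`) and the function name `classifyRouting` are assumptions made for illustration, and the real graph schema may differ.

```javascript
// Hypothetical edge shape: { type: 'SENT_BY' | 'SENT_TO', target: '@identity' }.
// This is an assumption for illustration, not the real graph schema.
function classifyRouting(edges, expectedSender, expectedRecipient) {
    const targetsOf = type => edges.filter(e => e.type === type).map(e => e.target);
    const sentBy = targetsOf('SENT_BY');
    const sentTo = targetsOf('SENT_TO');

    // No routing edges at all: the silent-cull / orphaned case.
    if (sentBy.length === 0 && sentTo.length === 0) {
        return 'orphaned';
    }
    // Edges exist, but SENT_TO does not point at the intended recipient.
    if (!sentTo.includes(expectedRecipient)) {
        return 'wrong-recipient';
    }
    // Not among the observed cases, but cheap to detect.
    if (!sentBy.includes(expectedSender)) {
        return 'wrong-sender';
    }
    return 'delivered';
}

const me = '@neo-gemini-3-1-pro';
const peer = '@neo-opus-4-7';

console.log(classifyRouting([], me, peer));                                    // orphaned
console.log(classifyRouting([{type: 'SENT_TO', target: '@alice'}], me, peer)); // wrong-recipient
console.log(classifyRouting([
    {type: 'SENT_BY', target: me},
    {type: 'SENT_TO', target: peer}
], me, peer));                                                                 // delivered
```

Applied to a time-ordered list of messages, a run of `delivered` results followed by anything else is the working-window-then-regression signature.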
The Architectural Reality
Possible mechanisms (NOT prescribing — investigation should empirically narrow):
- MC singleton state divergence in Antigravity: Antigravity's MC server may load multiple agent contexts that share state; RequestContextService.getAgentIdentityNodeId() may bind to the wrong agent identity for some addMessage calls
- Caller-format regression in test paths or sub-tools: Gemini's MC may be calling addMessage with the stale AGENT:bare-name format (the format the #10331 normalizer was supposed to fix), possibly via leaking test fixtures, role/human prefix routes, or other paths that bypass the normalizer
- Cache invalidation on identity binding: the brief-working-window pattern suggests post-restart binding works, then some subsequent operation invalidates the binding
- AGENT:* sentinel handling regression: if to: 'AGENT:*' is being normalized to @AGENT:* or similar by the new #10331 normalizer logic, broadcast routing breaks
- Vicinity-cache miss under specific call patterns: getAdjacentNodes may not load the recipient identity into the vicinity cache for the FK check, causing a cull even with the correct format
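The caller-format and sentinel hypotheses both come down to what a normalizer does with non-canonical input. The sketch below illustrates the hazard under stated assumptions: that the canonical form is `@name`, that the legacy form is `AGENT:bare-name`, and that `AGENT:*` is a broadcast sentinel that must pass through untouched. Both functions are hypothetical and are not the real `normalizeMailboxTarget`, whose implementation is not shown in this ticket.

```javascript
// Assumed broadcast sentinel; it must never be rewritten into an identity.
const BROADCAST_SENTINEL = 'AGENT:*';

// Hypothetical normalizer with an explicit sentinel guard.
function normalizeTargetSketch(to) {
    if (typeof to !== 'string' || to.length === 0) {
        throw new TypeError('target must be a non-empty string');
    }
    if (to === BROADCAST_SENTINEL) {
        return to; // sentinel passes through unchanged
    }
    if (to.startsWith('AGENT:')) {
        return '@' + to.slice('AGENT:'.length); // legacy AGENT:bare-name -> @bare-name
    }
    return to.startsWith('@') ? to : '@' + to;
}

// Naive single-rule variant without the guard, showing how the sentinel could break.
function naiveNormalize(to) {
    return to.startsWith('@') ? to : '@' + to;
}

console.log(normalizeTargetSketch('AGENT:neo-opus-4-7')); // @neo-opus-4-7
console.log(normalizeTargetSketch('AGENT:*'));            // AGENT:*
console.log(naiveNormalize('AGENT:*'));                   // @AGENT:*
```

Whether the real normalizer behaves like either variant is exactly what Phase 1 logging of pre- and post-normalize values should establish.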
The Fix
Diagnostic-first; this ticket is investigation, not prescription. Phases:
Phase 1 — Empirical bisection: capture concrete failure cases on Antigravity side. Each failure case logs:
- Caller-side to: parameter passed to addMessage
- Pre-normalize value
- Post-normalize value
- FK-check verifyStmt result (count vs 2 expected)
- Identity-binding me value
- Whether cull-warning emitted
This is #10284 Phase 1 (make-failure-loud) territory — surface the cull at write-time so we have concrete signals.
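The Phase 1 capture list maps naturally onto one structured record per addMessage attempt. This is a sketch of what such a record might look like, not the #10284 implementation: the function name `buildRoutingDiagnostic` and every field name are hypothetical, and the sample values are invented for illustration rather than captured from a real failure.

```javascript
// Expected FK-check row count, taken from the Phase 1 list ("count vs 2 expected").
const EXPECTED_FK_COUNT = 2;

// Builds one structured diagnostic record per addMessage attempt.
// All field names are hypothetical; only the set of captured values comes from the ticket.
function buildRoutingDiagnostic({callerTo, preNormalize, postNormalize, fkCount, me, cullWarningEmitted}) {
    const culled = fkCount !== EXPECTED_FK_COUNT;

    return {
        timestamp: new Date().toISOString(),
        callerTo,
        preNormalize,
        postNormalize,
        fkCount,
        fkExpected: EXPECTED_FK_COUNT,
        me,
        culled,
        cullWarningEmitted,
        // A cull without a warning is the silent failure this ticket is chasing.
        silentCull: culled && !cullWarningEmitted
    };
}

// Invented sample values, not captured data.
const record = buildRoutingDiagnostic({
    callerTo: 'AGENT:neo-opus-4-7',
    preNormalize: 'AGENT:neo-opus-4-7',
    postNormalize: '@neo-opus-4-7',
    fkCount: 1,
    me: '@neo-gemini-3-1-pro',
    cullWarningEmitted: false
});

console.log(JSON.stringify(record));
```

Emitting the record as a single JSON line keeps it greppable across processes, which matters here because the failing side cannot be inspected directly.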
Phase 2 — Root-cause narrow: based on Phase 1 logs, identify which of the 5 hypotheses (or others not listed) actually fires. Fix targets the substrate, not the symptom.
Phase 3 — Test coverage: regression-class test exercising the specific failure pattern + cross-process scenarios using identity-binding fixtures.
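One possible shape for the regression-class test is to send several messages in sequence and verify the persisted edges after every send, rather than trusting the first success. The sketch below uses an in-memory stand-in: `FakeMailbox`, `assertStableRouting`, and the `resolveMe` callback are all hypothetical names, and the FK-style check is a simplification of whatever the real store does.

```javascript
// Hypothetical in-memory stand-in for the message store; not the real MC API.
class FakeMailbox {
    constructor(knownIdentities) {
        this.known = new Set(knownIdentities);
        this.edges = [];
    }

    // Mimics an FK-style check: edges are written only if both identities resolve.
    // Returns nothing on purpose, so callers cannot rely on a "success" signal.
    addMessage(id, me, to) {
        if (!this.known.has(me) || !this.known.has(to)) {
            return; // silent cull
        }
        this.edges.push(
            {id, type: 'SENT_BY', target: me},
            {id, type: 'SENT_TO', target: to}
        );
    }

    edgesFor(id) {
        return this.edges.filter(e => e.id === id);
    }
}

// Sends `count` messages and checks the persisted edges after every send,
// so a regression after an initial working window fails the check.
function assertStableRouting(mailbox, resolveMe, to, count) {
    for (let i = 0; i < count; i++) {
        const id = `msg-${i}`;
        mailbox.addMessage(id, resolveMe(i), to);

        const types = mailbox.edgesFor(id).map(e => e.type).sort().join(',');
        if (types !== 'SENT_BY,SENT_TO') {
            throw new Error(`${id} lost routing edges (found: "${types}")`);
        }
    }
}

const mailbox = new FakeMailbox(['@neo-gemini-3-1-pro', '@neo-opus-4-7']);

// Stable identity binding: all five sends keep both edges, so nothing is thrown.
assertStableRouting(mailbox, () => '@neo-gemini-3-1-pro', '@neo-opus-4-7', 5);
```

A resolver that returns the correct identity once and a stale value afterwards, for example `i => i === 0 ? '@neo-gemini-3-1-pro' : 'AGENT:neo-gemini-3-1-pro'`, makes the check throw at `msg-1`, mirroring the working-window-then-regression pattern. Checking edges instead of a return value follows the verify-effect-not-just-success discipline referenced elsewhere in this ticket.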
Acceptance Criteria
- … linkNodes verification with structured-logging on cull events)
Out of Scope
- #10284 Phase 1 implementation itself — sibling ticket; this ticket is downstream consumer. Coordinate via parent-link if useful
- Replacing optimistic-concurrency with pessimistic locking — unrelated to this routing bug
- Multi-tenant RLS hardening — orthogonal; routing-edge persistence is the scope here
- Antigravity harness internals — we don't have direct visibility into Antigravity's MC process. Investigation works from outside via direct SQLite queries + cross-process behavior diagnosis
- Generic flaky-test instability — this is a routing-edge persistence issue, not test-pollution
Avoided Traps
- "Just restart Antigravity again": rejected per memory feedback_verify_effect_not_just_success; a restart fixes routing briefly, then it regresses. Not a stable mitigation.
- Diagnose by intuition: rejected; the 5 hypotheses are surface speculation. Phase 1 empirical capture must precede prescription.
- Single-fix assumption: rejected; the failure pattern shows a brief working window followed by regression. There could be multiple stacked bugs (intermittent state corruption + caller-format edge case + cache invalidation).
- Premature optimization: rejected; investigation first, fix second. Make the failure loud, then catch it precisely.
- Cross-family-blame framing: rejected per @tobiu's "we don't blame, we learn from incidents." This is a substrate gap, not a Gemini-side authorial mistake.
Related
- #10284: Phase 1 make-failure-loud (sibling ticket; this consumer relies on its observability)
- #10174: original normalizeMailboxTarget introduction (load-bearing primitive)
- #10325: sharedEntity:true primitive (read-path RLS)
- #10330 / #10331: single-canonical identity format migration (caller-format normalization)
- #10257: getAdjacentNodes lazy-load pattern (vicinity-cache substrate)
- feedback_verify_effect_not_just_success: durable anchor for the verify-mechanism-not-claim discipline
- #10311: Epic: Institutionalizing Swarm Autonomy (this bug undermines the autonomy goal until resolved)
Origin Session ID: b5a17132-7324-46e1-b73e-038825bb4d55
Retrieval Hint: "Antigravity SENT_TO silent-cull intermittent A2A routing investigation MC singleton state divergence vicinity cache identity binding regression empirical bisection #10284 phase-1 make-failure-loud"