Context
Surfaced as the root finding of a 24-hour arc spanning the SQLite IN-clause overflow incident (#10463 / #10464 / #10465 / #10466), the demo cancellation, the Antigravity wake editor-corruption bug (#10467 / #10468), and the panic-test retrospective. @tobiu identified the meta-pattern: under pressure (and even without it), agents repeatedly skipped empirical validation when the falsifying tools were one tool-call away. Existing per-phase tickets (#9975, #9969, #9948, #9812, #9844) each address ONE phase of the discipline (intake / review / commit / creation). What's missing is the umbrella Pre-Flight Check that applies BEFORE any factual assertion in any public artifact at any phase.
The Problem
Five empirical anchors from this session arc, each demonstrating the same root cognitive failure:
- Cursor speculation in #10411 (Gemini, 2026-04-27). Added `appName === 'Cursor' → 'l'` to the bridge daemon dispatch without empirically smoke-testing Cursor end-to-end for 2026. Removed via #10468 commit-2 the next day under @tobiu's Truth-in-Code framing. Tool not used: smoke-test on a Cursor harness.
- "Merge gate violation" hallucination (Gemini, 2026-04-28). Asserted in an A2A retrospective that "we (or one of us) overstepped and executed a merge" based on @tobiu's rhetorical comment ("you overruled the merge gate policy"). Almost propagated into a public Discussion. Self-corrected via `gh pr view --json mergedBy` ~14 minutes later, which showed all merges executed by `tobiu`, zero by agents. Tool not used: `gh pr view --json mergedBy` (read-only, available throughout).
- 4-options Fix framing in ticket #10467 (Claude, 2026-04-28). Filed an actionable ticket with 4 abstract options A-D instead of the single empirical fix. Edited in place after @tobiu flagged the noise. Tool not used: read the existing dispatch at `bridge-daemon.mjs:506` first to identify the SPECIFIC missing knob (Antigravity wasn't in the appName switch). Pattern-abstraction substituted for empirical specificity.
- PR review template-discipline skip (Claude, 2026-04-28). Posted PR reviews on #10464 / #10466 via gh CLI under demo time pressure with topical substance but flat structure, missing the canonical sectioned form the Retrospective daemon regex-matches. Edited in place after @tobiu flagged it. Tool not used: load `.agent/skills/pr-review/assets/pr-review-template.md` before composing.
- Cmd+L shortcut challenge to @tobiu (Claude, 2026-04-28, no time pressure). Challenged the merged #10468 dispatch's `'Antigravity' → 'l'` mapping based on extrapolation from VS Code's "Go to Line" keybinding semantics. The challenge was wrong: Cmd+L in Antigravity exists, focuses the agent input field, and is idempotent across visible/hidden panel states (empirically confirmed by @tobiu after my challenge). Tool not used: `WebSearch` for "Antigravity shortcut to focus agent". The canonical authoritative answer (Google's own Codelabs + Antigravity Cheat Sheet 2026 + Antigravity Lab + antigravity.google/docs) is a single web search away and unambiguously confirms Cmd+L. Internal-repo tools (`gh`, `sqlite3`, etc.) cannot answer "what does Antigravity do"; Antigravity launched after the training cutoff of most models, so WebSearch is the appropriate falsifying tool for this class of question. Familiar-system-semantics extrapolation (VS Code's "Go to Line") substituted for actual-system verification.
Pattern across all five: the failing agent had the right tool one tool-call away. None were tool-availability gaps. All were stress-compression OR familiar-pattern-extrapolation bypasses of empirical validation. Anchor #5 specifically rules out "stress" as the root cause — the discipline failure happens under low-pressure conditions too, when familiar-system semantics seduce the agent into skipping verification.
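The check that resolved anchor #2 can be sketched in a few lines. This is a minimal illustration, assuming the output of `gh pr view <number> --json number,mergedBy` has already been parsed into objects; the sample PR numbers and the agent logins are hypothetical and not taken from the repository.

```javascript
// Illustrative input only: parsed results of
// `gh pr view <number> --json number,mergedBy`, one object per PR.
const sampleViews = [
  { number: 101, mergedBy: { login: 'maintainer' } },
  { number: 102, mergedBy: { login: 'maintainer' } },
  { number: 103, mergedBy: null }, // treated as "not merged"
];

// Hypothetical agent account logins (assumption, not from the repo).
const AGENT_LOGINS = new Set(['agent-a', 'agent-b']);

// Return the numbers of PRs whose merge was executed by an agent account.
// A missing or null `mergedBy` is treated as "no merge happened".
function mergesByAgents(views, agentLogins) {
  return views
    .filter((v) => v.mergedBy && agentLogins.has(v.mergedBy.login))
    .map((v) => v.number);
}

// "An agent executed a merge" may only be asserted if this list is non-empty.
const offending = mergesByAgents(sampleViews, AGENT_LOGINS);
console.log(
  offending.length === 0
    ? 'no agent merges found'
    : `agent merges: ${offending.join(', ')}`
);
```

The point of the sketch is the ordering: the tool result is collected first, and the wording of the claim follows from it.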
The Architectural Reality
Files this ticket modifies:
- `AGENTS.md`: adds a new section codifying the Verify-Before-Assert Pre-Flight Check, modeled on the existing reasoning-statement Pre-Flights (§3 Gate 1 Pre-Commit Check, §4.2 Consolidate-Then-Save). The Pre-Flight is a commitment-statement the agent makes in its internal reasoning BEFORE writing a factual claim into a public artifact: "To assert X, I will run [specific tool] and let the result determine the assertion."
- `.agent/skills/pr-review/references/pr-review-guide.md`: references the AGENTS.md section as a §7-tier discipline floor (alongside §7.1 Minimum-One-Challenge, §7.4 Rhetorical-Drift Audit). Adds a check item: "Before asserting any factual claim about prior PRs, commits, or repository state in a review, run the empirical tool that would falsify it (`gh pr view`, `git log`, `gh issue view`, etc.)."
- `.agent/skills/ticket-create/references/ticket-create-workflow.md`: references the AGENTS.md section in §2 Stage 2 Prescription discipline. The 5-stage challenge already implicitly assumes empirical grounding; making it explicit closes the gap that produced the 4-options framing in #10467.
- `.agent/skills/ticket-intake/references/ticket-intake-workflow.md` (or its current equivalent): references the AGENTS.md section as the discipline that catches premise-without-verification at intake.
Memory file already saved (private layer, not part of this ticket): feedback_verify_before_assert.md codifies the umbrella discipline at the agent's persistent-memory layer. This ticket is the public-substrate counterpart.
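A minimal sketch of how the per-skill references could be spot-checked once the edits land. The marker string is an assumption (the ticket does not fix the final section title), the intake path may differ, and file contents are passed in as a plain object so the function does not depend on how the files are read.

```javascript
// Paths as listed in this ticket; the intake path may have a different
// "current equivalent".
const SKILL_FILES = [
  '.agent/skills/pr-review/references/pr-review-guide.md',
  '.agent/skills/ticket-create/references/ticket-create-workflow.md',
  '.agent/skills/ticket-intake/references/ticket-intake-workflow.md',
];

// Assumed marker text; the real section title in AGENTS.md may differ.
const SECTION_MARKER = 'Verify-Before-Assert';

// `contents` maps file path -> file text.
// Returns the paths that are absent or never mention the marker.
function filesMissingReference(contents, paths = SKILL_FILES, marker = SECTION_MARKER) {
  return paths.filter(
    (p) => !(typeof contents[p] === 'string' && contents[p].includes(marker))
  );
}

// Usage with made-up file contents: only the first file has the reference.
const sampleContents = {
  [SKILL_FILES[0]]: 'See AGENTS.md, Verify-Before-Assert Pre-Flight Check.',
  [SKILL_FILES[1]]: 'Stage 2 Prescription ...',
};
console.log(filesMissingReference(sampleContents));
```

With the sample above, the two files without the marker (one present but unreferenced, one absent) are reported.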
The Fix
Single prescription. Add a new section to AGENTS.md at an appropriate placement (suggested: between §3 Pre-Commit Hard Gates and §4 Memory Core Protocol, since Verify-Before-Assert is the cognitive substrate the per-phase gates ride on top of). Section content is the codification of the discipline:
- Statement of the discipline: before asserting any factual claim, architectural premise, or framing in a public artifact (PR review, ticket body, Discussion, comment, commit message, public memory entry), run the empirical tool that would falsify it. In all five anchors the falsifying tool was available and one tool-call away, and most such tools are read-only.
- Pre-Flight Check shape: state in internal reasoning BEFORE writing the claim — "To assert X, I will run [specific tool] and let the result determine the assertion." The commitment-statement is the gate that permits writing the claim.
- Tool inventory (non-exhaustive): `gh pr view --json [field]`, `git log`, `gh issue view`, `query_documents`, `query_raw_memories`, `query_summaries`, `ask_knowledge_base`, `grep`, direct read-only file inspection, `sqlite3` read queries, `ps`/`mdfind`/`lsof` for system state. For subjects outside the model's training-data cutoff (e.g., 2026-released tools, recent product changes): `WebSearch` against authoritative sources (vendor docs, official cheat-sheets, primary specs). Internal-repo tools cannot falsify external claims; web search is the appropriate substrate for "what does product X actually do as of [recent year]" questions.
- Anti-patterns: familiar-system-semantics extrapolation (anchor #5), pattern-abstraction without empirical specificity (anchor #3), self-recrimination assertions under stress without verification (anchor #2), inherited-speculation propagation (anchor #1), template-discipline shortcuts under time pressure (anchor #4).
- Cross-skill enforcement: the per-phase skills (`pr-review`, `ticket-create`, `ticket-intake`) reference this section as the discipline floor underneath their phase-specific gates.
Followed by per-skill reference updates: each skill file gains a 2-3 line reference to the AGENTS.md section, naming the specific tool-set most relevant to that phase.
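To make the Pre-Flight shape concrete, here is a hypothetical sketch. It illustrates the reasoning order only and is not the tooling-layer enforcement that this ticket places out of scope; `createPreFlight`, `record`, and `resolve` are invented names.

```javascript
// Build the commitment-statement in the shape the ticket prescribes.
function preFlightStatement(claim, tool) {
  return `To assert ${claim}, I will run ${tool} and let the result determine the assertion.`;
}

// Hypothetical gate: the claim text can only be produced after the named
// tool has yielded a result, and that result decides the final wording.
function createPreFlight(claim, tool) {
  let result; // stays undefined until the tool has been run
  return {
    statement: preFlightStatement(claim, tool),
    record(toolResult) {
      result = toolResult;
    },
    // `decide` maps the tool result to the text to publish,
    // or to null when the claim has to be dropped.
    resolve(decide) {
      if (result === undefined) {
        throw new Error(`Pre-Flight incomplete: ${tool} has not been run`);
      }
      return decide(result);
    },
  };
}

// Usage with an illustrative result object (the field name is made up).
const check = createPreFlight('"an agent executed a merge"', 'gh pr view --json mergedBy');
console.log(check.statement);
check.record({ agentMerges: 0 });
const text = check.resolve((r) => (r.agentMerges > 0 ? 'An agent executed a merge.' : null));
console.log(text); // null: the claim is dropped because the tool falsified it
```

The design choice worth noting is that `resolve` throws when no result was recorded, which mirrors the rule that the commitment-statement is the gate that permits writing the claim.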
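Acceptance Criteria
- `AGENTS.md` gains a new section codifying Verify-Before-Assert as a Pre-Flight Check (statement of the discipline + Pre-Flight reasoning-statement shape + tool inventory + anti-patterns + cross-skill enforcement note)
- `.agent/skills/pr-review/references/pr-review-guide.md` references the new AGENTS.md section in its §7 Depth Floor cluster
- `.agent/skills/ticket-create/references/ticket-create-workflow.md` references the new AGENTS.md section in §2 Stage 2 Prescription
- `.agent/skills/ticket-intake/references/ticket-intake-workflow.md` (or equivalent intake skill reference) references the new AGENTS.md section as the discipline catching premise-without-verification
- Commit message: `(#TICKET_ID)` per AGENTS.md §3 Gate 1; type=docs, scope=`agents` (or `ai` if convention prefers); no `<noreply@*>` Co-Authored-By footers per §0 invariant 4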
Out of Scope
- Tooling-layer enforcement (linter, hook, automated check that flags ungrounded claims pre-write). Worth a future Discussion if the discipline-layer codification proves insufficient. This ticket is discipline-layer only.
- Retroactive cleanup of past public artifacts that violated Verify-Before-Assert (e.g., the original 4-options #10467 body and the unstructured #10464/#10466 reviews, both already edited in place). Forward-looking discipline; past artifacts are case-studies, not cleanup targets.
- Codifying specific tool-invocation thresholds (e.g., "run `gh pr view` if the assertion mentions a PR"). Too prescriptive: the discipline is about reasoning-state-before-assertion, not mechanical tool-routing. Keep the rule abstract; trust the agent to map situation → falsifying tool.
- The Cmd+L→tabShortcut subscription configuration fix (separate ticket: Gemini's `WAKE_SUB:b3d1179c` has `tabShortcut: null`, which suppresses the dispatch). Sibling scope, file separately.
Avoided Traps
- Filing as a Discussion instead of an Issue. Rejected. The discipline is concretely actionable (AGENTS.md updated, skill files reference it) with defined success criteria. Per `ticket-create-workflow §9`, Discussions are for shaping open questions; Issues are for actionable work. We don't have open questions; we have a clearly diagnosed discipline that needs codification.
- Over-prescribing Pre-Flight templates per artifact type. Rejected. The discipline is "before asserting X, run the falsifying tool"; mechanical templates per artifact type calcify the rule into checklist-following rather than reasoning-state cultivation. Keep the AGENTS.md section abstract; let the per-skill references add phase-specific tool-inventory examples.
- Treating this as a duplicate of #9975 or #9969. Rejected. Both cover specific phases (intake-time empirical verification, ticket-intake scaffolding). Verify-Before-Assert is the cross-phase umbrella that subsumes them as instance-coverage. This ticket explicitly cites both in Related so the relationship is graph-traversable.
- Including a tooling-layer prescription in this ticket. Rejected. Discipline-layer codification first; if discipline proves insufficient, a separate Discussion can shape the tooling-layer question. Mixing layers in one ticket creates scope-creep risk.
- Bundling the Cmd+L subscription-config fix into the same ticket. Rejected. That's a substrate-config bug; this is a discipline codification. Different scopes, different lifecycles, different reviewers may want to weigh in. File separately.
Related
- Per-phase instances this umbrella subsumes:
  - #9975: Hardening Agent Intake via Empirical Verification (intake-time empirical proof)
  - #9969: ticket-intake skill (Pre-Execution Reflection Gate)
  - #9948: Stepping Back Self-Reflection Protocol (PR self-review)
  - #9812: Meta Gate 0 Deduplication (creation-time pre-check)
  - #9844: Safe Commit Pipeline / CommitGate (pre-commit validation)
- Empirical-anchor incidents:
  - #10411: PR introducing Cursor speculation (anchor #1)
  - #10467: bridge daemon wake editor corruption ticket (anchor #3, my 4-options framing)
  - #10464 / #10466: PRs from the SQLite IN-clause overflow incident (anchor #4, review template skip)
  - #10468: bridge daemon wake fix PR (anchor #5, Cmd+L challenge)
- Adjacent substrate fixes from the same arc (cross-link for graph traversal, not direct dependencies):
  - Sibling not-yet-filed: the subscription-configuration fix where `WAKE_SUB:b3d1179c` has `tabShortcut: null`, suppressing the merged dispatch. Out of scope here; file separately.
Origin Session ID: 4bb6859b-860f-440d-9055-320e20b0ee22
Retrieval Hint: verify before assert Pre-Flight Check empirical validation falsifying tool AGENTS.md core directive cross-phase umbrella discipline familiar-system-semantics extrapolation panic-test retrospective