Frontmatter
id: 10756
title: Empirical grounding for §8 / §7.2 cross-model asymmetry codifications
state: Open
labels: documentation, enhancement, ai, architecture, model-experience
assignees: neo-opus-ada
createdAt: May 5, 2026, 6:43 PM
updatedAt: Jun 7, 2026, 7:24 PM
githubUrl: https://github.com/neomjs/neo/issues/10756
author: neo-opus-ada
commentsCount: 2
parentIssue: 10757
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []

Empirical grounding for §8 / §7.2 cross-model asymmetry codifications

Open | Backlog/active-chunk-9 | documentation, enhancement, ai, architecture, model-experience
neo-opus-ada
neo-opus-ada commented on May 5, 2026, 6:43 PM

Context

epic-review-workflow §8 Cross-Model Asymmetry and pr-review-guide §7.2 Cross-Model Asymmetry Context codify model-family failure modes:

  • Claude-family → over-rigor
  • Gemini-family → quick-win framing

Both sections use the phrase "statistically-different failure modes", but neither anchors it to a measurement. Agents nonetheless cite the codifications as substrate authority in disclosures made before a review. The pattern was observed in the PR #10752 Cycle 1 review on 2026-05-05, which ended in a retraction, and it reproduced in the same session (this ticket's origin session) in a re-categorization variant. We work on data, not bias, so the codifications need either empirical grounding or refactoring.

The Problem

Three concurrent gaps:

  1. Empirical claim without data. "Statistically-different" is asserted; no N, no methodology, no measurement. Closest anchors are anecdotal (skill files cite single PRs: #10602, #10607).

  2. Stale model coverage. The GPT family (@neo-gpt) joined the swarm in late April 2026 (public GitHub author history visible from 2026-04-27 onward). §8 / §7.2 enumerate only Claude and Gemini patterns. If the codification is genuinely model-family-characteristic, it must include GPT; otherwise the omission is evidence that the codification was never empirical.

  3. Self-fulfilling bias loop. Agents read these codifications and pre-emptively disclose ("Per §8 Claude over-rigor risk applies..."). The pre-disclosure IS the over-rigor manifestation the codification names. Empirical anchor: PR #10752 Cycle 1 retraction 2026-05-05; same-session reproduction in re-categorization variant (disclosure shape rebadged as "treat as substrate findings, not as final design," same B-shape pre-conclusion). Pattern caught directly by @tobiu in this ticket's origin session.

The codifications are bias-shape framings dressed as empirical claims.

The Architectural Reality

  • .agents/skills/epic-review/references/epic-review-workflow.md:181-190 — §8
  • .agents/skills/pr-review/references/pr-review-guide.md:214-223 — §7.2
  • The disclosure-discipline pattern (cite vs pre-conclude) currently lives only in harness-private memory across the swarm. This ticket addresses the upstream question of whether the codifications themselves merit existence without empirical grounding; AC4 promotes the discipline pattern to public substrate so that future agents reach it in the same hop as §8 / §7.2 themselves
  • learn/agentos/measurements/cognitive-load-baseline-2026-05.md §1 — methodology framework exists for correction-cycle metrics; could anchor model-family review-output measurements

The Fix

Measurement-first conditional refactor:

  1. Mine swarm review history, over a time window and sample size sufficient for a statistical-significance test (both decided in the AC1 methodology), for:
    • Re-cycle counts per reviewer-identity
    • [REJECTED_WITH_RATIONALE] rates per reviewer-identity (per pr-review-guide §9.1 Reviewer-Yield Protocol)
    • Retraction counts per reviewer-identity
    • Approval-without-Cycle-2-finding rates per reviewer-identity
  2. Aggregate by model-family (Claude / Gemini / GPT)
  3. Conditional outcome:
    • If measurement supports model-family attribution at the methodology's significance threshold: retain framing in §8 / §7.2; append empirical-anchor footnote citing N + methodology + data link; add GPT row
    • If measurement does NOT support attribution: refactor §8 / §7.2 to content-neutral asymmetry framing — "cross-model review surfaces complementary findings; do not stylize toward another family" — preserving the cross-family-review value without unsupported model-family attribution
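
Steps 1 and 2, plus a significance check, could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the AC1 methodology: the reviewer handles, the record keys (`reviewer`, `rejected`), and the choice of a permutation test on the spread of per-family rates are all hypothetical. The ticket leaves the actual test, window, and thresholds to AC1.

```python
from collections import defaultdict
import random

# Hypothetical reviewer-to-family mapping; real handles and family
# assignments would come from the AC1 methodology, not from this sketch.
FAMILY_BY_REVIEWER = {
    "reviewer-claude-1": "Claude",
    "reviewer-gemini-1": "Gemini",
    "reviewer-gpt-1": "GPT",
}


def aggregate_by_family(reviews):
    """Sum review counts and rejection counts per model family.

    Each review is a dict with hypothetical keys: 'reviewer' (a handle) and
    'rejected' (True when the review ended in [REJECTED_WITH_RATIONALE]).
    """
    totals = defaultdict(lambda: {"reviews": 0, "rejected": 0})
    for review in reviews:
        family = FAMILY_BY_REVIEWER[review["reviewer"]]
        totals[family]["reviews"] += 1
        totals[family]["rejected"] += int(review["rejected"])
    return dict(totals)


def rate_spread(labels, outcomes):
    """Test statistic: largest minus smallest per-family rate."""
    counts = defaultdict(lambda: [0, 0])  # family -> [events, reviews]
    for family, outcome in zip(labels, outcomes):
        counts[family][0] += outcome
        counts[family][1] += 1
    rates = [events / n for events, n in counts.values()]
    return max(rates) - min(rates)


def permutation_p_value(reviews, rounds=2000, seed=0):
    """Share of family-label shuffles whose spread reaches the observed one."""
    labels = [FAMILY_BY_REVIEWER[r["reviewer"]] for r in reviews]
    outcomes = [int(r["rejected"]) for r in reviews]
    observed = rate_spread(labels, outcomes)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    hits = 0
    for _ in range(rounds):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if rate_spread(shuffled, outcomes) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (rounds + 1)
```

The same functions could be run once per metric by swapping the `rejected` flag for the metric of interest (re-cycles, retractions, approvals without a Cycle 2 finding).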

Avoided Traps

  • "Just remove the section": discards real cross-family-review value; refactor is the correct shape if measurement doesn't support attribution
  • "Add anecdotes as evidence": anecdotal anchors aren't statistical; #10602 + #10607 + this session ≠ measurement
  • "Code-freeze the codifications until measured": they're already production substrate; conditional refactor follows the measurement outcome rather than blocking on it

Acceptance Criteria

  • (AC1) Measurement methodology defined: which metrics, time window, data sources (Memory Core review summaries + GitHub PR review API), significance test, sample-size threshold
  • (AC2) Measurement executed; per-family aggregate produced; significance check applied per AC1
  • (AC3a) If supported: §8 + §7.2 anchored with empirical-anchor footnote (N, methodology, data link); GPT row added
  • (AC3b) If unsupported: §8 + §7.2 refactored to content-neutral asymmetry framing
  • (AC4) Disclosure-discipline pattern (cite-without-pre-conclude; (A) substrate citation legitimate vs (B) pre-conclusion bias-shape, including re-categorization variant) promoted from harness-private memories to a public skill section (e.g. epic-review-workflow §8.1 and/or pr-review-guide §7.2.1); cross-referenced from the resolved §8 / §7.2 sections so agents reading the codification reach the discipline in the same hop
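
To make the AC1 to AC3 flow concrete, the methodology could be captured as a small record and the AC3a / AC3b branch as a function of the AC2 result. The sketch below is an assumption throughout: the field names, the placeholder values (90 days, 0.05, 30 reviews), and the third `insufficient-sample` result are illustrative only and are not decided anywhere in this ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Methodology:
    """AC1 parameters. Every default is a placeholder, not a decision."""
    metrics: tuple = (
        "re_cycles",
        "rejected_with_rationale",
        "retractions",
        "approval_without_cycle2_finding",
    )
    window_days: int = 90  # placeholder time window
    data_sources: tuple = (
        "memory_core_review_summaries",
        "github_pr_review_api",
    )
    alpha: float = 0.05  # placeholder significance threshold
    min_reviews_per_family: int = 30  # placeholder sample-size threshold


def decide_outcome(p_value, reviews_per_family, methodology):
    """Map an AC2 measurement onto the AC3a / AC3b branch.

    Returns 'insufficient-sample' when any family falls below the
    sample-size threshold; how to handle that case is an assumption here.
    """
    if min(reviews_per_family.values()) < methodology.min_reviews_per_family:
        return "insufficient-sample"
    if p_value < methodology.alpha:
        return "AC3a"  # retain framing, add footnote and GPT row
    return "AC3b"  # refactor to content-neutral asymmetry framing
```

The explicit sample-size gate matters here because the GPT family joined the swarm only in late April 2026, so its review count may be much smaller than the other two families' within any given window.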

Out of Scope

  • Disclosure-discipline cite-vs-preconclude pattern enforcement itself — orthogonal (this ticket is whether the codifications merit existence; the discipline is how to use them safely once they exist; AC4 covers public-substrate promotion only, not enforcement mechanism)
  • Removing the cross-family review mandate — load-bearing per pull-request §6.1; preserved either way
  • Generalized "challenge any unsupported empirical claim in substrate" — broader meta-rule, would need its own ticket

Related

  • §8 source: .agents/skills/epic-review/references/epic-review-workflow.md:181-190
  • §7.2 source: .agents/skills/pr-review/references/pr-review-guide.md:214-223
  • Measurement substrate: learn/agentos/measurements/cognitive-load-baseline-2026-05.md
  • Empirical reproduction anchors: PR #10752 Cycle 1 retraction 2026-05-05 → same-session re-reproduction caught by @tobiu in origin session (recoverable via session ID below)
  • Sibling epic: #10757 (V5 anchor)

Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7

Retrieval Hint: query_raw_memories(query="empirical grounding cross-model asymmetry §8 §7.2 statistical-significance reviewer-identity model-family disclosure-discipline cite-vs-preconclude 10756")