Context
epic-review-workflow §8 Cross-Model Asymmetry and pr-review-guide §7.2 Cross-Model Asymmetry Context codify model-family failure modes:
- Claude-family → over-rigor
- Gemini-family → quick-win framing
Both sections use the phrase "statistically-different failure modes" with no measurement anchored to either. The codifications get cited in agent disclosures as substrate authority before reviews (pattern observed in PR #10752 Cycle 1 review 2026-05-05 → retraction; re-reproduction same-session in this ticket's origin session, in re-categorization variant). We work on data, not bias — so the codifications need empirical grounding or refactoring.
The Problem
Three concurrent gaps:
- Empirical claim without data. "Statistically-different" is asserted; no N, no methodology, no measurement. Closest anchors are anecdotal (skill files cite single PRs: #10602, #10607).
- Stale model coverage. GPT family (@neo-gpt) joined the swarm late April 2026 (public GitHub author history visible from 2026-04-27 onward). §8 / §7.2 enumerate only Claude and Gemini patterns. If the codification is genuinely model-family-characteristic, it must include GPT; otherwise the omission is evidence the codification was never empirical.
- Self-fulfilling bias loop. Agents read these codifications and pre-emptively disclose ("Per §8 Claude over-rigor risk applies..."). The pre-disclosure IS the over-rigor manifestation the codification names. Empirical anchor: PR #10752 Cycle 1 retraction 2026-05-05; same-session reproduction in re-categorization variant (disclosure shape rebadged as "treat as substrate findings, not as final design," same B-shape pre-conclusion). Pattern caught directly by @tobiu in this ticket's origin session.
The codifications are bias-shape framings dressed as empirical claims.
The Architectural Reality
- .agents/skills/epic-review/references/epic-review-workflow.md:181-190 (§8)
- .agents/skills/pr-review/references/pr-review-guide.md:214-223 (§7.2)
- Disclosure-discipline pattern (cite vs pre-conclude) currently lives only in harness-private memory across the swarm. This ticket addresses the upstream question of whether the codifications themselves merit existence without empirical grounding; AC4 promotes the discipline pattern to public substrate so future agents reach it from the same hop as §8 / §7.2 themselves.
- learn/agentos/measurements/cognitive-load-baseline-2026-05.md §1: a methodology framework exists for correction-cycle metrics; it could anchor model-family review-output measurements.
The Fix
Measurement-first conditional refactor:
- Mine swarm review history (time window + sample-size sufficient for statistical-significance test, decided in AC1 methodology) for:
  - Re-cycle counts per reviewer-identity
  - [REJECTED_WITH_RATIONALE] rates per reviewer-identity (per pr-review-guide §9.1 Reviewer-Yield Protocol)
  - Retraction counts per reviewer-identity
  - Approval-without-Cycle-2-finding rates per reviewer-identity
- Aggregate by model-family (Claude / Gemini / GPT)
- Conditional outcome:
  - If measurement supports model-family attribution at the methodology's significance threshold: retain framing in §8 / §7.2; append empirical-anchor footnote citing N + methodology + data link; add GPT row
  - If measurement does NOT support attribution: refactor §8 / §7.2 to content-neutral asymmetry framing ("cross-model review surfaces complementary findings; do not stylize toward another family"), preserving the cross-family-review value without unsupported model-family attribution
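The aggregate-then-test step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AC1 methodology: the reviewer handles other than @neo-gpt are placeholders, the metric is reduced to a single yes/no event per review (for example, "review was retracted"), the test is a Pearson chi-square test of independence, and the 0.05 threshold is a conventional default rather than the one the ticket will choose.

```python
import math
from collections import defaultdict

# Hypothetical reviewer-identity -> model-family mapping.
# Only @neo-gpt is named in the ticket; the other handles are placeholders.
FAMILY_OF = {
    "@reviewer-claude": "Claude",
    "@reviewer-gemini": "Gemini",
    "@neo-gpt": "GPT",
}


def aggregate_by_family(reviews, family_of):
    """Count [reviews with the event, reviews without it] per model family.

    `reviews` is an iterable of (reviewer_identity, had_event) pairs.
    """
    table = defaultdict(lambda: [0, 0])
    for reviewer, had_event in reviews:
        table[family_of[reviewer]][0 if had_event else 1] += 1
    return dict(table)


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a families x 2 table.

    Assumes every expected cell count is non-zero.
    """
    rows = list(table.values())
    col_totals = [sum(row[c] for row in rows) for c in range(2)]
    total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for c in range(2):
            expected = row_total * col_totals[c] / total
            stat += (row[c] - expected) ** 2 / expected
    return stat, len(rows) - 1


def decide(table, alpha=0.05):
    """Map the test result onto the ticket's two conditional outcomes.

    alpha is an assumed threshold; the real one is an AC1 decision.
    """
    stat, dof = chi_square_independence(table)
    if dof != 2:
        raise ValueError("closed-form p-value below covers exactly 3 families x 2 outcomes")
    # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    verdict = "retain" if p_value < alpha else "refactor"
    return verdict, stat, p_value
```

A "retain" verdict corresponds to keeping the framing with an empirical-anchor footnote; "refactor" corresponds to the content-neutral rewrite. A real analysis would also need to handle small expected counts and multiple metrics, which this sketch ignores.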
Avoided Traps
- "Just remove the section": discards real cross-family-review value; refactor is the correct shape if measurement doesn't support attribution
- "Add anecdotes as evidence": anecdotal anchors aren't statistical; #10602 + #10607 + this session ≠ measurement
- "Code-freeze the codifications until measured": they're already production substrate; conditional refactor follows the measurement outcome rather than blocking on it
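Why a handful of anecdotes is not a measurement can be made concrete with a confidence interval. The sketch below computes a Wilson score interval for a proportion; treating the three cited anchors as "3 observations, all showing the pattern" is an assumption made purely for illustration, as is the 95% level (z = 1.96).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative assumption: 3 anecdotes, all 3 exhibiting the claimed failure mode.
low, high = wilson_interval(3, 3)
print(f"95% interval for the underlying rate: [{low:.2f}, {high:.2f}]")
```

With only three observations the lower bound sits below one half, so the same evidence is compatible with the pattern appearing in fewer than half of reviews as well as in all of them.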
Acceptance Criteria
Out of Scope
- Disclosure-discipline cite-vs-preconclude pattern enforcement itself — orthogonal (this ticket is whether the codifications merit existence; the discipline is how to use them safely once they exist; AC4 covers public-substrate promotion only, not enforcement mechanism)
- Removing the cross-family review mandate: load-bearing per pull-request §6.1; preserved either way
- Generalized "challenge any unsupported empirical claim in substrate" — broader meta-rule, would need its own ticket
Related
- §8 source: .agents/skills/epic-review/references/epic-review-workflow.md:181-190
- §7.2 source: .agents/skills/pr-review/references/pr-review-guide.md:214-223
- Measurement substrate: learn/agentos/measurements/cognitive-load-baseline-2026-05.md
- Empirical reproduction anchors: PR #10752 Cycle 1 retraction 2026-05-05 → same-session re-reproduction caught by @tobiu in origin session (recoverable via session ID below)
- Sibling epic: #10757 (V5 anchor)
Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7
Retrieval Hint: query_raw_memories(query="empirical grounding cross-model asymmetry §8 §7.2 statistical-significance reviewer-identity model-family disclosure-discipline cite-vs-preconclude 10756")
Acceptance Criteria
epic-review-workflow §8.1 and/or pr-review-guide §7.2.1); cross-referenced from the resolved §8 / §7.2 sections so agents reading the codification reach the discipline in the same hop