Context
Sub 1 of Epic #10733 — measurement-first sequencing. Gating sub — Subs 2-5 reference the baseline this sub captures. Without it, post-edit deltas can't be measured, and #10512's partial-scope outcome would repeat.
The Problem
#10537 ships a pr-review-only loaded-surface measurement methodology (pr-review-guide.md introduction + measurement-methodology.md). This Epic broadens the scope to AGENTS.md, boot ramp, and all skill payloads — but #10537's methodology has two gaps for the broader scope (per @neo-gpt's external-source addendum on Discussion #10732):
- Repo line counts ≠ true prompt load. Imports/concatenation behavior differs per harness. Splitting files into multiple references doesn't reduce true loaded bytes if the harness concatenates them at boot.
- Loaded-byte delta is necessary but not sufficient. Lower bytes + higher correction cycles = false win. The deeper failure mode is template-skip / audit-letter-miss / Cycle-2.5 churn, not file size alone.
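The second gap can be made concrete with a small check. This is a hypothetical sketch, not part of #10537's methodology: the `Snapshot` shape, the function name, and the labels are assumptions chosen to mirror the "lower bytes + higher correction cycles = false win" rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One measurement of a surface (hypothetical shape, for illustration)."""
    loaded_bytes: int       # true loaded bytes as observed in the harness
    correction_cycles: int  # e.g. Request-Changes count over comparable PRs


def classify_delta(before: Snapshot, after: Snapshot) -> str:
    """Label a post-edit delta using both signals, never bytes alone."""
    bytes_down = after.loaded_bytes < before.loaded_bytes
    cycles_up = after.correction_cycles > before.correction_cycles
    if bytes_down and cycles_up:
        return "false win"  # smaller prompt, but more correction churn
    if bytes_down:
        return "win"
    if cycles_up:
        return "regression"
    return "no byte reduction"
```

For example, `classify_delta(Snapshot(1000, 2), Snapshot(800, 4))` is labeled a false win even though 200 bytes were saved.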
The Architectural Reality
- .agents/skills/pr-review/references/measurement-methodology.md: existing #10537 methodology, file-size-only
- Per-harness verification primitives:
  - Gemini CLI: /memory show exposes the actual concatenated GEMINI.md prompt
  - Claude Code: /memory displays loaded memory contents
  - Codex Desktop: active-instruction audit via project_doc_max_bytes config + harness-specific introspection
- Correction-cycle metrics: PR Request-Changes count, A2A round-trip count per PR, Cycle-N count per review (already partially captured in feedback_pr_review_iteration_calibration and graph-ingestion data)
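To keep the repo-side proxy and the harness-side capture apart, the two can be measured separately and compared. A minimal sketch, assuming the harness output (for example from /memory show) has been copied by hand into a string; the helper names are hypothetical and nothing here invokes a harness CLI.

```python
def utf8_bytes(text: str) -> int:
    """Byte length of text once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def load_gap(repo_files: dict[str, str], captured_prompt: str) -> int:
    """Captured prompt bytes minus the summed repo-file bytes.

    repo_files maps a file name to its contents (the file-size proxy).
    captured_prompt is the manually captured harness output (an assumption
    about how the capture is stored, not a harness API).
    A positive result means the harness loaded more than the repo files
    alone would suggest.
    """
    repo_total = sum(utf8_bytes(body) for body in repo_files.values())
    return utf8_bytes(captured_prompt) - repo_total
```

A non-zero gap is the signal that repo counts alone under- or over-state what a given harness actually loads.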
The Fix
Extend measurement-methodology.md (or fork into a sibling under .agents/skills/ if scope diverges) to cover:
- The full cognitive surface (boot + AGENTS.md + all 21 skill payloads + all assets)
- Per-harness combined-prompt verification using harness-native primitives
- Correction-cycle metrics paired with loaded-byte counts
Capture a pre-edit baseline snapshot for every surface this Epic targets, per active harness, before Subs 2-5 fire. Store the baseline as a committed artifact so post-edit deltas can be measured against a stable reference.
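One possible shape for the committed artifact is a table with one row per surface and harness. The sketch below is an assumption throughout: the Epic does not prescribe a format, and the `BaselineRow` fields, column names, and the choice to store correction-cycle counts alongside each row are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineRow:
    """One baseline measurement (hypothetical fields, for illustration)."""
    surface: str          # e.g. "AGENTS.md" or a skill payload name
    harness: str          # e.g. "Gemini CLI"
    loaded_bytes: int     # bytes observed in the harness, captured manually
    request_changes: int  # Request-Changes count for the reference window
    a2a_round_trips: int  # A2A round-trip count for the reference window


def render_baseline(rows: list[BaselineRow], captured_on: str) -> str:
    """Render rows as a markdown table, sorted for stable diffs."""
    lines = [
        f"Cognitive-load baseline ({captured_on})",
        "",
        "| Surface | Harness | Loaded bytes | Request-Changes | A2A round trips |",
        "| --- | --- | --- | --- | --- |",
    ]
    for row in sorted(rows, key=lambda r: (r.surface, r.harness)):
        lines.append(
            f"| {row.surface} | {row.harness} | {row.loaded_bytes} "
            f"| {row.request_changes} | {row.a2a_round_trips} |"
        )
    return "\n".join(lines) + "\n"
```

Sorting by surface and harness keeps the file order deterministic, so a post-edit snapshot diffs cleanly against the baseline.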
Acceptance Criteria
- Per-harness true prompt load captured via /memory show (Gemini CLI), /memory (Claude Code), and active-instruction audit (Codex Desktop). Repo line counts captured separately as a complementary metric, NOT as the primary one
- measurement-methodology.md extended to cover boot ramp + AGENTS.md + per-skill payloads + per-harness true-prompt-load
- Baseline pairs loaded-byte counts (wc -c proxy) AND correction-cycle metrics (Request-Changes count + A2A round-trip count per PR)
- Baseline committed at .agents/skills/.../baselines/cognitive-load-baseline-YYYY-MM-DD.md or equivalent
Out of Scope
- Editing any of AGENTS.md, AGENTS_STARTUP.md, skill payloads, or templates (those are Subs 2-5)
- Building automated tooling to replace manual /memory show / /memory invocations. Manual capture is sufficient for the baseline; automation is a follow-up if Sub 1 proves it's worth the substrate cost
Related
- Parent Epic: #10733
- Predecessor methodology: #10537 (measurement-methodology.md; Sub 1 extends, not replaces)
- Origin discussion: #10732 (especially @neo-gpt's external-source addendum 16813972)
Origin Session ID: 7e52099b-9632-4c67-a2a1-4e1a1ad1c414
Retrieval Hint: query_raw_memories(query="baseline measurement methodology per-harness loaded-surface correction-cycle metrics 10733 10732")