LearnNewsExamplesServices
Frontmatter
id10540
titlepr-review compression Sub 1: Loaded-surface measurement methodology + baseline
stateClosed
labels
documentationenhancementaitestingarchitecture
assigneesneo-gemini-3-1-pro
createdAtApr 30, 2026, 11:38 PM
updatedAtMay 2, 2026, 1:12 PM
githubUrlhttps://github.com/neomjs/neo/issues/10540
authorneo-opus-4-7
commentsCount3
parentIssue10537
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtMay 2, 2026, 1:12 PM

pr-review compression Sub 1: Loaded-surface measurement methodology + baseline

Closeddocumentationenhancementaitestingarchitecture
neo-opus-4-7
neo-opus-4-7 commented on Apr 30, 2026, 11:38 PM

(Note: This issue remains OPEN post-PR-merge until AC5b longitudinal 10+ cycle baseline captures are complete.)

Context

Sub-issue 1 of epic #10537 ("Modularize pr-review-guide.md condition-gated audits"). Must complete before Sub-issue 2 (pilot extraction) — the pre-extraction baseline cannot depend on reconstructing historical state after the pilot lands, per @neo-gemini-3-1-pro's epic review challenge (commentId 4356249323) and @neo-gpt's reinforcement (commentId 4356286389).

Establishes the empirical measurement substrate the entire epic depends on. Without this, sub-issues 2-5's claims of "loaded-byte reduction" or "focus-window improvement" reduce to hypothesis.

The Problem

The epic claims the §5.3 extraction will reduce loaded-byte cost on review cycles. To validate that claim, we need:

  1. A precise definition of "loaded surface" (which files actually load per review cycle)
  2. A reproducible measurement procedure
  3. A pre-extraction baseline against which post-pilot delta can be compared

Naively measuring pr-review-guide.md size alone overstates savings: cold-cache reviews also load assets/pr-review-template.md (160 lines); warm-cache reviews load assets/pr-review-followup-template.md (94 lines). The template carries an MCP-tool-description audit block too. Per @neo-gpt's review: "Guide-only byte reduction can look successful while full review cycles still load equivalent text through templates, prior comments, or follow-up scaffolding."

The Architectural Reality

  • .agents/skills/pr-review/references/pr-review-guide.md (423 lines) — primary measurement target
  • .agents/skills/pr-review/assets/pr-review-template.md (160 lines) — cold-cache loaded surface
  • .agents/skills/pr-review/assets/pr-review-followup-template.md (94 lines) — warm-cache loaded surface
  • Review cycles load: guide + selected template + (post-pilot) extracted audit payload(s) when their gate fires
  • Methodology must capture per-cycle loaded files, not just guide size

The Fix

  1. Document loaded-surface measurement methodology in pr-review-guide.md introduction (or a dedicated references/measurement-methodology.md if length warrants Map-vs-Atlas treatment).
  2. Methodology must include:
    • Primary metric: wc -c byte counts per loaded file — named honestly as "loaded-byte delta", not "token-cost" (per @neo-gpt's precision framing).
    • Per-cycle recording: which files were actually loaded (guide + selected template + extracted audit payloads), counts summed.
    • Cold-vs-warm distinction: capture both Cycle 1 (cold-cache, full template load) and Cycle N (warm-cache, follow-up template load) cycles separately.
  3. Capture pre-extraction baseline across the next 10 PR review cycles using the documented methodology. Store baseline data in a structured location (e.g., learn/agentos/measurements/pr-review-baseline-2026-04.md or equivalent).

Acceptance Criteria

  • (AC1) Loaded-surface measurement methodology documented in pr-review-guide.md introduction (or referenced sub-file).
  • (AC2) Methodology specifies primary metric as wc -c loaded-byte delta; tokenizer-based approximation is explicitly removed from scope.
  • (AC3) Methodology captures per-cycle loaded file list (guide + selected template + extracted audit payloads) — not just guide size.
  • (AC4) Methodology distinguishes cold-cache (Cycle 1) and warm-cache (Cycle N) review cycles in measurement.
  • (AC5a) Methodology and baseline tracker shipped (one-shot).
  • (AC5b) Pre-extraction baseline captured across the next 10 PR review cycles using AC1 methodology (longitudinal).
  • (AC6) Baseline data persisted in a stable location (file path documented in pr-review-guide.md).
  • (AC7) Methodology documentation explicitly states that this baseline is the gating evidence for sub-issue 2 (pilot extraction); pilot does not start until AC5 is complete.

Out of Scope

  • §5.3 audit extraction itself (sub-issue 2 of epic #10537).
  • Tokenizer-based exact token-cost measurement as primary metric (wc -c is canonical; tokenizer is optional approximation per epic Out of Scope).
  • Survey/classification of remaining audits (sub-issue 3 of epic #10537).

Avoided Traps

  • Trap: measure the wrong surface. Guide-only byte reduction can look like savings while review cycles still load equivalent text through templates or follow-up scaffolding. Methodology must capture actual tool-read behavior (per epic Avoided Traps).
  • Trap: claim "token-cost" without a tokenizer. wc -c is a loaded-byte proxy, not a token-cost measure. Use precise language; reserve token-cost framing for when a tokenizer is actually run (per @neo-gpt's review).
  • Trap: 10-cycle baseline window too narrow. If baseline n=10 cycles produces high variance, methodology should permit extending to n=15 or n=20 before concluding pilot ineligible.

Related

  • Parent epic: #10537 (Modularize pr-review-guide.md condition-gated audits)
  • Sibling sub-issues (to be filed): Sub 2 (pilot extraction), Sub 3 (survey), Sub 4 (extension), Sub 5 (cross-skill integration)
  • Origin discussion: #10429
  • Empirical anchor PR: #10536 — review cycle that exposed audit-letter-miss pattern attributable to focus-window pressure

Origin Session ID: d0ec90a4-9a99-4fd7-baf6-dc0ab35e77dd

Retrieval Hint: query_raw_memories(query="pr-review compression baseline measurement methodology loaded-surface wc -c epic 10537 sub-issue 1")

tobiu referenced in commit 7849d17 - "docs(ai): document loaded-surface measurement methodology (#10540) (#10541) on May 1, 2026, 2:02 AM
tobiu referenced in commit d84c5e1 - "docs(ai): log cycle 1 loaded-surface for PR 10563 (#10540) (#10565) on May 1, 2026, 12:00 PM
tobiu referenced in commit d36df4b - "docs(ai): clarify april naming convention in baseline tracker (#10540) (#10568) on May 1, 2026, 12:31 PM
tobiu referenced in commit 05ed104 - "docs(agents): extract MCP tool description budget audit and finalize baseline (#10540) (#10613) on May 2, 2026, 1:12 PM
tobiu closed this issue on May 2, 2026, 1:12 PM