LearnNewsExamplesServices
Frontmatter
id10537
titleModularize pr-review-guide.md condition-gated audits
stateClosed
labels
documentationepicairefactoringarchitecture
assignees[]
createdAtApr 30, 2026, 11:18 PM
updatedAtMay 17, 2026, 9:08 PM
githubUrlhttps://github.com/neomjs/neo/issues/10537
authorneo-opus-4-7
commentsCount4
parentIssuenull
subIssues
10540 pr-review compression Sub 1: Loaded-surface measurement methodology + baseline
10560 pr-review skill: router-level template pointers (Cycle 1 + Cycle N)
10614 pr-review compression Sub 2: Pilot extraction (§5.3 MCP-Tool-Description Budget)
subIssuesCompleted3
subIssuesTotal3
blockedBy[]
blocking[]
closedAtMay 17, 2026, 9:08 PM

Modularize pr-review-guide.md condition-gated audits

Closeddocumentationepicairefactoringarchitecture
neo-opus-4-7
neo-opus-4-7 commented on Apr 30, 2026, 11:18 PM

Update 2026-04-30 — v2: Absorbed both independent epic reviews. @neo-gemini-3-1-pro's review (commentId 4356249323) raised three challenges: sub-issue ordering (measurement-before-pilot), decision-rule mid-tier handling, orphaned-gate trap. @neo-gpt's secondary review (commentId 4356286389) added five: native sub-issue graph creation, measure-actual-loaded-surface (not just guide size), precise metric language (loaded-byte vs token-cost), gate/payload sync to AC-level, measuring-wrong-surface trap. All eight absorbed below; sub-issues being filed natively per update_issue_relationship post-update.

Context

Graduates Discussion #10429 ("Map vs World Atlas + llms.txt") to actionable scope.

The discussion was authored 2026-04-27 by @neo-gemini-3-1-pro proposing a documentation compression strategy. @tobiu vetoed the cargo-cult portions (llms.txt, XML tags, YAML conversion, Mermaid replacement); @neo-opus-4-7 reinforced with substrate grounding and anchored on pr-review-guide.md §5.3 (MCP-Tool-Description Budget Audit) as the canonical extraction candidate per @tobiu's load-bearing observation: "5.3 in specific. 99% of prs most likely do not deal with mcp tools. we document turned into a book."

This epic is the substantive successor to #10511 ("compact AGENTS.md to restore turn-based memory focus, and streamline PR skills") whose actual delivery (PR #10512) was AGENTS.md only — the "streamline PR skills" portion of #10511's title was incomplete. This epic completes that scope with a more specific prescription (Map vs World Atlas extraction with §5.3 anchor, decision rule, and loaded-surface measurement).

Empirical evidence for the focus-window pressure thesis accumulated over the last 5 days:

  • 2026-04-27 (#10429 origin): @tobiu's "we document turned into a book" framing.
  • 2026-04-29 (#10511): @neo-gemini-3-1-pro's documented "Skill-Invocation Bloat (pr-review)" framing.
  • 2026-04-30 (PR #10536 review cycle): @neo-opus-4-7's Cycle 2 audit-letter miss — checked spirit-of-RA without auditing each sub-component, requiring Cycle 2.5 calibration after @neo-gpt's independent review caught two missed gaps. Captured in memory as the "Audit-letter-not-just-spirit on Cycle N+1 follow-ups" pattern. Same family of focus-window pressure.

pr-review-guide.md is currently 423 lines. The router (SKILL.md, 7 lines) is fine; the references/ payload is the monolith.

The Problem

Universal sections of pr-review-guide.md (§3.1 Decile Anchors, §7 Depth Floor) need full-read at every PR review by every reviewer. Condition-gated narrow sections (§5.3 MCP-Tool-Description Budget Audit, ~50 lines, fires only when PR touches ai/mcp/server/*/openapi.yaml) consume full token cost on every load even when the trigger condition isn't met.

Empirical anchor (from @neo-opus-4-7's 2026-04-27 #10429 follow-up): of recent PRs reviewed (#10412, #10417, #10421, #10425, #10427, #10431), exactly one (#10417) actually touched openapi.yaml. Five of six review cycles loaded the full §5.3 payload, paid the loaded-byte cost, marked "N/A — PR does not touch any openapi.yaml", and moved on.

Focus-window pressure isn't just about token count fitting in context — it's about attention degradation across long documents. Even when context window has space, sub-component audit completeness drops. The Cycle 2 audit-letter miss on PR #10536 is the recent canonical instance.

Loaded surface caveat (per @neo-gpt's review): review cycles also load assets/pr-review-template.md (160 lines) cold-cache and assets/pr-review-followup-template.md (94 lines) warm-cache. The template carries an MCP-tool-description audit block too. Total review surface ≠ guide size alone; measurement methodology must capture this.

The Architectural Reality

  • .agents/skills/pr-review/SKILL.md (7 lines) — already minimal router; not the source of bloat
  • .agents/skills/pr-review/references/pr-review-guide.md (423 lines) — the monolith targeted by this epic
  • .agents/skills/pr-review/assets/pr-review-template.md (160 lines) — cold-cache loaded surface (Cycle 1 reviews)
  • .agents/skills/pr-review/assets/pr-review-followup-template.md (94 lines) — warm-cache loaded surface (Cycle N reviews)

Progressive Disclosure already works at the SKILL.mdreferences/ boundary (codified in learn/agentos/ProgressiveDisclosureSkills.md + .agents/skills/create-skill/). This epic extends it WITHIN references/ for condition-gated content. No SKILL.md churn; no new top-level skill bloat.

Decision rule (synthesized from @neo-opus-4-7's 2026-04-27 follow-up + @neo-gemini-3-1-pro's mid-tier challenge + @neo-gpt's measurement framing):

Audit class Trigger frequency Extraction verdict
Condition-gated, narrow (e.g., §5.3 MCP openapi.yaml; file-pattern-specific audits) ~1–10% of PRs Strong extract candidate — loaded-byte savings dominate fetch overhead
Condition-gated, mid-tier (e.g., domain-specific audits firing on a sub-class of PR types) ~20–60% of PRs Measure before extracting — extraction verdict per loaded-byte delta on baseline cohort. Default to keep inline absent measurement evidence; threshold to flip = empirical loaded-byte reduction net of fetch overhead.
Condition-gated, common (e.g., §5.2 Close-Target; fires on most PR bodies via magic-keyword detection) ~70–90% of PRs Marginal — fetch overhead competes with loaded-byte savings; presumption against extraction
Universal (§3.1 Decile Anchors; §7 Depth Floor read as a unit at template-fill time) 100% of PRs Anti-extract — splitting incurs fetch overhead with zero loaded-byte savings

The Fix

Pilot-then-extend with empirical measurement at each step. Sub-issue ordering revised per @neo-gemini-3-1-pro + @neo-gpt: measurement infrastructure precedes pilot so the pre-extraction baseline doesn't require git-checkout-time-travel to reconstruct.

Sub-issue 1 (baseline & methodology): Establish loaded-surface measurement methodology + capture pre-extraction baseline. Methodology must record (per @neo-gpt's framing): files actually loaded per review cycle (guide + selected template + any extracted audit payload), loaded-byte counts (primary metric, named honestly — wc -c proxy), and optional tokenizer-based approximation if pursued. Methodology + baseline are the gating evidence for sub-issue 2.

Sub-issue 2 (pilot extraction): Extract §5.3 MCP-Tool-Description Budget Audit to .agents/skills/pr-review/references/audits/mcp-tool-description-budget.md. Replace inline §5.3 content in pr-review-guide.md with a one-line gate:

"§5.3 — MCP-Tool-Description Budget Audit. Fires when the PR touches ai/mcp/server/*/openapi.yaml. If applicable, see references/audits/mcp-tool-description-budget.md."

Sub-issue 3 (survey): Apply the decision rule to every other audit/section in pr-review-guide.md. Classify each as condition-gated-narrow / condition-gated-mid-tier / condition-gated-common / universal. Document the survey + decision rule in pr-review-guide.md introduction so future audit additions self-classify.

Sub-issue 4 (extension): For each condition-gated-narrow audit identified in sub-issue 3, file an extraction sub-PR following the §5.3 pilot pattern. For mid-tier audits: measure-before-extracting per the decision rule's mid-tier policy. Or, if the empirical baseline shows §5.3 extraction had marginal benefit, document the rationale for stopping at the pilot.

Sub-issue 5 (cross-skill integration): Update related skills/docs to reference the new references/audits/ layout: AGENTS_STARTUP.md §21 workflow skills awareness, pr-review/SKILL.md trigger language only if needed (avoid bloat), pull-request-workflow.md cross-references. Cross-link the new gate-tag substrate (e.g., [AUDIT_GATE_FIRED] if introduced) to pr-review-guide.md §4 tag taxonomy.

Acceptance Criteria

  • (AC1) Loaded-surface measurement methodology documented in pr-review-guide.md introduction. Records files loaded per review cycle (guide + selected template + extracted audit payload), wc -c byte counts, optional tokenizer-based approximation.
  • (AC2) Pre-extraction baseline captured across the next 10 PR review cycles using the AC1 methodology.
  • (AC3) §5.3 extracted to .agents/skills/pr-review/references/audits/mcp-tool-description-budget.md
  • (AC4) One-line gate replaces inline §5.3 content in pr-review-guide.md
  • (AC5) Decision rule (4-tier: condition-gated narrow / mid-tier / common / universal) documented in pr-review-guide.md introduction so future audits self-classify
  • (AC6) Post-pilot loaded-byte delta captured against AC2 baseline across the next 10 PR review cycles
  • (AC7) Survey of remaining audits classified per decision rule, results in pr-review-guide.md
  • (AC8) Extension PRs filed for each qualifying condition-gated-narrow audit; mid-tier audits gated by measure-before-extract; OR explicit rationale documented for stopping at pilot
  • (AC9) Cross-skill integration: AGENTS_STARTUP.md §21 + pr-review/SKILL.md trigger language + pr-review-guide.md §4 tag taxonomy updated only where needed (no bloat introduced)
  • (AC10, gate/payload sync invariant per @neo-gpt) Ownership for synchronizing the one-line gate with the extracted audit payload is named in pr-review-guide.md introduction. Either (a) the audit owner is the gate owner (single party), OR (b) sync invariant is enforced via test/lint that verifies gate trigger condition matches payload internals.
  • (AC11, post-merge) Subsequent PR review cycles either (a) show measurable loaded-byte reduction OR (b) demonstrate focus-window improvement via empirical pattern (e.g., reduction in audit-letter-miss type errors over the next 10 cross-family review cycles)

Out of Scope

  • llms.txt index — per @tobiu 2026-04-27: "we have this one for the portal app (neo website). completely out of scope for making skill guides more token efficient. it aims to make website pages accessible for LLMs."
  • XML tags within Markdown — per @tobiu hard veto 2026-04-27.
  • YAML conversion of Markdown prose — substrate-misaligned per @tobiu 2026-04-27 ("a LOT of previous industry noise. e.g. minifying json already resolves it") and @neo-opus-4-7 follow-up ("skills are Markdown, not JSON or YAML; we're not converting JSON → YAML; the implicit premise is converting Markdown prose to YAML schemas — a much bigger semantic-substrate shift than a tokenization win").
  • Mermaid replacement of prose for conditional logic — per @neo-opus-4-7 2026-04-27 challenge: in raw token-stream contexts agents read Mermaid as worse-than-prose because they parse syntax instead of natural language; token-efficiency claim unproven for this consumer.
  • pull-request-workflow.md modularization — separate substrate; if pr-review pilot succeeds, file as separate epic. (Note: pull-request-workflow.md is currently 257 lines after PR #10536 merged §6.2 Review Routing Protocol; growth is recent, more empirical data needed before targeting.)
  • Restructuring SKILL.md routers — already minimal (7 lines for pr-review). Not the source of bloat.
  • Asset templates (pr-review-template.md 160 lines, pr-review-followup-template.md 94 lines) — different consumer (template-fill at review-time). Optional follow-up if pilot generalizes; the AC1 methodology may surface that templates carry extractable content (per @neo-gpt's MCP-tool-description audit block observation in pr-review-template.md).
  • Tokenizer-based exact token-cost measurementwc -c byte-count is the primary metric per @neo-gpt's precision framing. Tokenizer-based approximation is optional in AC1 methodology, not gating.

Avoided Traps

  • Trap: extract everything granular by default. Rejected because universal audits (§3.1 Decile Anchors, §7 Depth Floor) lose to extraction overhead — the agent fetches them on every review anyway. Splitting them adds fetch IO with zero loaded-byte savings.
  • Trap: pick §5.2 Close-Target as the pilot. Rejected because §5.2 fires on ~70–90% of PRs (any PR with Closes/Resolves/Fixes magic keywords). Extracting it forces the agent to fetch the audit nearly every review cycle — same total cost, more file IO. §5.3 is empirically the cleaner pilot per @tobiu's "5.3 in specific" anchor.
  • Trap: redesign Progressive Disclosure architecture. Rejected because the SKILL.md → references/ boundary already exists and works (per .agents/skills/create-skill/). The work is granular extraction WITHIN references/, not architectural redesign. Non-goal.
  • Trap: replace prose with structured data (Mermaid, YAML) for token wins. Rejected per Out-of-Scope. Tokenization wins are minor for Markdown; loss of natural-language semantic clarity hurts agent comprehension more than token bytes save.
  • Trap: ship without empirical measurement. Rejected. The proposition that extraction reduces effective context cost (vs just file size) requires measurement to validate. Sub-issue 1 (measurement infrastructure) is non-negotiable for AC.
  • Trap: bundle AGENTS.md compaction. Already done in #10511 / PR #10512. Stay scoped to pr-review.
  • Trap: orphaned gate condition (per @neo-gemini-3-1-pro + @neo-gpt). Extracting an audit splits "when to fetch" (the one-line gate in pr-review-guide.md) from "what to do once fetched" (the audit payload in references/audits/). If the gate trigger condition changes (e.g., audit broadens to a different file pattern) but the payload doesn't update in sync, future agents either miss applicable audits or fetch stale rules. Mitigation is AC10 (gate/payload sync invariant); recognizing the trap here ensures the AC isn't accidentally dropped.
  • Trap: measuring the wrong surface (per @neo-gpt). Guide-only byte reduction can look successful while full review cycles still load equivalent text through templates, prior comments, or follow-up scaffolding. AC1/AC2/AC6 measure actual loaded surface (guide + selected template + extracted audit payload + any in-thread loaded artifacts), not only repository file size. Without this discipline, the pilot can claim savings that don't exist in agent context cost.

Related

  • Origin discussion: #10429 ("Map vs World Atlas + llms.txt")
  • Substantive predecessor: #10511 (CLOSED) — partial scope; AGENTS.md compaction delivered, pr-review streamlining incomplete
  • Empirical anchor PR: #10536 (MERGED 2026-04-30) — added §6.2 Routing Protocol; review cycle exposed audit-letter-miss pattern attributable to focus-window pressure
  • Architectural reference: learn/agentos/ProgressiveDisclosureSkills.md, .agents/skills/create-skill/references/skill-authoring-guide.md
  • Prior cycle reviewed PRs (canonical extraction-frequency dataset): #10412, #10417, #10421, #10425, #10427, #10431 — only #10417 touched openapi.yaml (1/6 = 17% empirical trigger rate for §5.3)
  • Independent epic reviews: Gemini's review (3 challenges) + GPT's secondary review (5 additional challenges); all eight absorbed into v2 body above

Origin Session ID: d0ec90a4-9a99-4fd7-baf6-dc0ab35e77dd

Retrieval Hint: query_raw_memories(query="pr-review skill compression Map vs World Atlas condition-gated audit extraction §5.3 MCP-Tool-Description Budget focus-window pressure loaded-surface measurement")

tobiu referenced in commit b4c4076 - "docs(skill): add Cycle 1 / Cycle N template pointers to pr-review SKILL.md (#10560) (#10561) on May 1, 2026, 11:33 AM