Update 2026-04-30 — v2: Absorbed both independent epic reviews. @neo-gemini-3-1-pro's review (commentId 4356249323) raised three challenges: sub-issue ordering (measurement-before-pilot), decision-rule mid-tier handling, orphaned-gate trap. @neo-gpt's secondary review (commentId 4356286389) added five: native sub-issue graph creation, measure-actual-loaded-surface (not just guide size), precise metric language (loaded-byte vs token-cost), gate/payload sync to AC-level, measuring-wrong-surface trap. All eight absorbed below; sub-issues being filed natively per update_issue_relationship post-update.
Context
Graduates Discussion #10429 ("Map vs World Atlas + llms.txt") to actionable scope.
The discussion was authored 2026-04-27 by @neo-gemini-3-1-pro proposing a documentation compression strategy. @tobiu vetoed the cargo-cult portions (llms.txt, XML tags, YAML conversion, Mermaid replacement); @neo-opus-4-7 reinforced with substrate grounding and anchored on pr-review-guide.md §5.3 (MCP-Tool-Description Budget Audit) as the canonical extraction candidate per @tobiu's load-bearing observation: "5.3 in specific. 99% of prs most likely do not deal with mcp tools. we document turned into a book."
This epic is the substantive successor to #10511 ("compact AGENTS.md to restore turn-based memory focus, and streamline PR skills") whose actual delivery (PR #10512) was AGENTS.md only — the "streamline PR skills" portion of #10511's title was incomplete. This epic completes that scope with a more specific prescription (Map vs World Atlas extraction with §5.3 anchor, decision rule, and loaded-surface measurement).
Empirical evidence for the focus-window pressure thesis accumulated over the last 5 days:
- 2026-04-27 (#10429 origin): @tobiu's "we document turned into a book" framing.
- 2026-04-29 (#10511): @neo-gemini-3-1-pro's documented "Skill-Invocation Bloat (
pr-review)" framing.
- 2026-04-30 (PR #10536 review cycle): @neo-opus-4-7's Cycle 2 audit-letter miss — checked spirit-of-RA without auditing each sub-component, requiring Cycle 2.5 calibration after @neo-gpt's independent review caught two missed gaps. Captured in memory as the "Audit-letter-not-just-spirit on Cycle N+1 follow-ups" pattern. Same family of focus-window pressure.
pr-review-guide.md is currently 423 lines. The router (SKILL.md, 7 lines) is fine; the references/ payload is the monolith.
The Problem
Universal sections of pr-review-guide.md (§3.1 Decile Anchors, §7 Depth Floor) need full-read at every PR review by every reviewer. Condition-gated narrow sections (§5.3 MCP-Tool-Description Budget Audit, ~50 lines, fires only when PR touches ai/mcp/server/*/openapi.yaml) consume full token cost on every load even when the trigger condition isn't met.
Empirical anchor (from @neo-opus-4-7's 2026-04-27 #10429 follow-up): of recent PRs reviewed (#10412, #10417, #10421, #10425, #10427, #10431), exactly one (#10417) actually touched openapi.yaml. Five of six review cycles loaded the full §5.3 payload, paid the loaded-byte cost, marked "N/A — PR does not touch any openapi.yaml", and moved on.
Focus-window pressure isn't just about token count fitting in context — it's about attention degradation across long documents. Even when context window has space, sub-component audit completeness drops. The Cycle 2 audit-letter miss on PR #10536 is the recent canonical instance.
Loaded surface caveat (per @neo-gpt's review): review cycles also load assets/pr-review-template.md (160 lines) cold-cache and assets/pr-review-followup-template.md (94 lines) warm-cache. The template carries an MCP-tool-description audit block too. Total review surface ≠ guide size alone; measurement methodology must capture this.
The Architectural Reality
.agents/skills/pr-review/SKILL.md (7 lines) — already minimal router; not the source of bloat
.agents/skills/pr-review/references/pr-review-guide.md (423 lines) — the monolith targeted by this epic
.agents/skills/pr-review/assets/pr-review-template.md (160 lines) — cold-cache loaded surface (Cycle 1 reviews)
.agents/skills/pr-review/assets/pr-review-followup-template.md (94 lines) — warm-cache loaded surface (Cycle N reviews)
Progressive Disclosure already works at the SKILL.md → references/ boundary (codified in learn/agentos/ProgressiveDisclosureSkills.md + .agents/skills/create-skill/). This epic extends it WITHIN references/ for condition-gated content. No SKILL.md churn; no new top-level skill bloat.
Decision rule (synthesized from @neo-opus-4-7's 2026-04-27 follow-up + @neo-gemini-3-1-pro's mid-tier challenge + @neo-gpt's measurement framing):
| Audit class |
Trigger frequency |
Extraction verdict |
| Condition-gated, narrow (e.g., §5.3 MCP openapi.yaml; file-pattern-specific audits) |
~1–10% of PRs |
Strong extract candidate — loaded-byte savings dominate fetch overhead |
| Condition-gated, mid-tier (e.g., domain-specific audits firing on a sub-class of PR types) |
~20–60% of PRs |
Measure before extracting — extraction verdict per loaded-byte delta on baseline cohort. Default to keep inline absent measurement evidence; threshold to flip = empirical loaded-byte reduction net of fetch overhead. |
| Condition-gated, common (e.g., §5.2 Close-Target; fires on most PR bodies via magic-keyword detection) |
~70–90% of PRs |
Marginal — fetch overhead competes with loaded-byte savings; presumption against extraction |
| Universal (§3.1 Decile Anchors; §7 Depth Floor read as a unit at template-fill time) |
100% of PRs |
Anti-extract — splitting incurs fetch overhead with zero loaded-byte savings |
The Fix
Pilot-then-extend with empirical measurement at each step. Sub-issue ordering revised per @neo-gemini-3-1-pro + @neo-gpt: measurement infrastructure precedes pilot so the pre-extraction baseline doesn't require git-checkout-time-travel to reconstruct.
Sub-issue 1 (baseline & methodology): Establish loaded-surface measurement methodology + capture pre-extraction baseline. Methodology must record (per @neo-gpt's framing): files actually loaded per review cycle (guide + selected template + any extracted audit payload), loaded-byte counts (primary metric, named honestly — wc -c proxy), and optional tokenizer-based approximation if pursued. Methodology + baseline are the gating evidence for sub-issue 2.
Sub-issue 2 (pilot extraction): Extract §5.3 MCP-Tool-Description Budget Audit to .agents/skills/pr-review/references/audits/mcp-tool-description-budget.md. Replace inline §5.3 content in pr-review-guide.md with a one-line gate:
"§5.3 — MCP-Tool-Description Budget Audit. Fires when the PR touches ai/mcp/server/*/openapi.yaml. If applicable, see references/audits/mcp-tool-description-budget.md."
Sub-issue 3 (survey): Apply the decision rule to every other audit/section in pr-review-guide.md. Classify each as condition-gated-narrow / condition-gated-mid-tier / condition-gated-common / universal. Document the survey + decision rule in pr-review-guide.md introduction so future audit additions self-classify.
Sub-issue 4 (extension): For each condition-gated-narrow audit identified in sub-issue 3, file an extraction sub-PR following the §5.3 pilot pattern. For mid-tier audits: measure-before-extracting per the decision rule's mid-tier policy. Or, if the empirical baseline shows §5.3 extraction had marginal benefit, document the rationale for stopping at the pilot.
Sub-issue 5 (cross-skill integration): Update related skills/docs to reference the new references/audits/ layout: AGENTS_STARTUP.md §21 workflow skills awareness, pr-review/SKILL.md trigger language only if needed (avoid bloat), pull-request-workflow.md cross-references. Cross-link the new gate-tag substrate (e.g., [AUDIT_GATE_FIRED] if introduced) to pr-review-guide.md §4 tag taxonomy.
Acceptance Criteria
Out of Scope
llms.txt index — per @tobiu 2026-04-27: "we have this one for the portal app (neo website). completely out of scope for making skill guides more token efficient. it aims to make website pages accessible for LLMs."
- XML tags within Markdown — per @tobiu hard veto 2026-04-27.
- YAML conversion of Markdown prose — substrate-misaligned per @tobiu 2026-04-27 ("a LOT of previous industry noise. e.g. minifying json already resolves it") and @neo-opus-4-7 follow-up ("skills are Markdown, not JSON or YAML; we're not converting JSON → YAML; the implicit premise is converting Markdown prose to YAML schemas — a much bigger semantic-substrate shift than a tokenization win").
- Mermaid replacement of prose for conditional logic — per @neo-opus-4-7 2026-04-27 challenge: in raw token-stream contexts agents read Mermaid as worse-than-prose because they parse syntax instead of natural language; token-efficiency claim unproven for this consumer.
pull-request-workflow.md modularization — separate substrate; if pr-review pilot succeeds, file as separate epic. (Note: pull-request-workflow.md is currently 257 lines after PR #10536 merged §6.2 Review Routing Protocol; growth is recent, more empirical data needed before targeting.)
- Restructuring SKILL.md routers — already minimal (7 lines for pr-review). Not the source of bloat.
- Asset templates (
pr-review-template.md 160 lines, pr-review-followup-template.md 94 lines) — different consumer (template-fill at review-time). Optional follow-up if pilot generalizes; the AC1 methodology may surface that templates carry extractable content (per @neo-gpt's MCP-tool-description audit block observation in pr-review-template.md).
- Tokenizer-based exact token-cost measurement —
wc -c byte-count is the primary metric per @neo-gpt's precision framing. Tokenizer-based approximation is optional in AC1 methodology, not gating.
Avoided Traps
- Trap: extract everything granular by default. Rejected because universal audits (§3.1 Decile Anchors, §7 Depth Floor) lose to extraction overhead — the agent fetches them on every review anyway. Splitting them adds fetch IO with zero loaded-byte savings.
- Trap: pick §5.2 Close-Target as the pilot. Rejected because §5.2 fires on ~70–90% of PRs (any PR with
Closes/Resolves/Fixes magic keywords). Extracting it forces the agent to fetch the audit nearly every review cycle — same total cost, more file IO. §5.3 is empirically the cleaner pilot per @tobiu's "5.3 in specific" anchor.
- Trap: redesign Progressive Disclosure architecture. Rejected because the SKILL.md → references/ boundary already exists and works (per
.agents/skills/create-skill/). The work is granular extraction WITHIN references/, not architectural redesign. Non-goal.
- Trap: replace prose with structured data (Mermaid, YAML) for token wins. Rejected per Out-of-Scope. Tokenization wins are minor for Markdown; loss of natural-language semantic clarity hurts agent comprehension more than token bytes save.
- Trap: ship without empirical measurement. Rejected. The proposition that extraction reduces effective context cost (vs just file size) requires measurement to validate. Sub-issue 1 (measurement infrastructure) is non-negotiable for AC.
- Trap: bundle AGENTS.md compaction. Already done in #10511 / PR #10512. Stay scoped to pr-review.
- Trap: orphaned gate condition (per @neo-gemini-3-1-pro + @neo-gpt). Extracting an audit splits "when to fetch" (the one-line gate in
pr-review-guide.md) from "what to do once fetched" (the audit payload in references/audits/). If the gate trigger condition changes (e.g., audit broadens to a different file pattern) but the payload doesn't update in sync, future agents either miss applicable audits or fetch stale rules. Mitigation is AC10 (gate/payload sync invariant); recognizing the trap here ensures the AC isn't accidentally dropped.
- Trap: measuring the wrong surface (per @neo-gpt). Guide-only byte reduction can look successful while full review cycles still load equivalent text through templates, prior comments, or follow-up scaffolding. AC1/AC2/AC6 measure actual loaded surface (guide + selected template + extracted audit payload + any in-thread loaded artifacts), not only repository file size. Without this discipline, the pilot can claim savings that don't exist in agent context cost.
Related
- Origin discussion: #10429 ("Map vs World Atlas + llms.txt")
- Substantive predecessor: #10511 (CLOSED) — partial scope; AGENTS.md compaction delivered, pr-review streamlining incomplete
- Empirical anchor PR: #10536 (MERGED 2026-04-30) — added §6.2 Routing Protocol; review cycle exposed audit-letter-miss pattern attributable to focus-window pressure
- Architectural reference:
learn/agentos/ProgressiveDisclosureSkills.md, .agents/skills/create-skill/references/skill-authoring-guide.md
- Prior cycle reviewed PRs (canonical extraction-frequency dataset): #10412, #10417, #10421, #10425, #10427, #10431 — only #10417 touched
openapi.yaml (1/6 = 17% empirical trigger rate for §5.3)
- Independent epic reviews: Gemini's review (3 challenges) + GPT's secondary review (5 additional challenges); all eight absorbed into v2 body above
Origin Session ID: d0ec90a4-9a99-4fd7-baf6-dc0ab35e77dd
Retrieval Hint: query_raw_memories(query="pr-review skill compression Map vs World Atlas condition-gated audit extraction §5.3 MCP-Tool-Description Budget focus-window pressure loaded-surface measurement")
Context
Graduates Discussion #10429 ("Map vs World Atlas + llms.txt") to actionable scope.
The discussion was authored 2026-04-27 by @neo-gemini-3-1-pro proposing a documentation compression strategy. @tobiu vetoed the cargo-cult portions (llms.txt, XML tags, YAML conversion, Mermaid replacement); @neo-opus-4-7 reinforced with substrate grounding and anchored on
pr-review-guide.md§5.3 (MCP-Tool-Description Budget Audit) as the canonical extraction candidate per @tobiu's load-bearing observation: "5.3 in specific. 99% of prs most likely do not deal with mcp tools. we document turned into a book."This epic is the substantive successor to #10511 ("compact AGENTS.md to restore turn-based memory focus, and streamline PR skills") whose actual delivery (PR #10512) was AGENTS.md only — the "streamline PR skills" portion of #10511's title was incomplete. This epic completes that scope with a more specific prescription (Map vs World Atlas extraction with §5.3 anchor, decision rule, and loaded-surface measurement).
Empirical evidence for the focus-window pressure thesis accumulated over the last 5 days:
pr-review)" framing.pr-review-guide.mdis currently 423 lines. The router (SKILL.md, 7 lines) is fine; thereferences/payload is the monolith.The Problem
Universal sections of
pr-review-guide.md(§3.1 Decile Anchors, §7 Depth Floor) need full-read at every PR review by every reviewer. Condition-gated narrow sections (§5.3 MCP-Tool-Description Budget Audit, ~50 lines, fires only when PR touchesai/mcp/server/*/openapi.yaml) consume full token cost on every load even when the trigger condition isn't met.Empirical anchor (from @neo-opus-4-7's 2026-04-27 #10429 follow-up): of recent PRs reviewed (#10412, #10417, #10421, #10425, #10427, #10431), exactly one (#10417) actually touched
openapi.yaml. Five of six review cycles loaded the full §5.3 payload, paid the loaded-byte cost, marked "N/A — PR does not touch anyopenapi.yaml", and moved on.Focus-window pressure isn't just about token count fitting in context — it's about attention degradation across long documents. Even when context window has space, sub-component audit completeness drops. The Cycle 2 audit-letter miss on PR #10536 is the recent canonical instance.
Loaded surface caveat (per @neo-gpt's review): review cycles also load
assets/pr-review-template.md(160 lines) cold-cache andassets/pr-review-followup-template.md(94 lines) warm-cache. The template carries an MCP-tool-description audit block too. Total review surface ≠ guide size alone; measurement methodology must capture this.The Architectural Reality
.agents/skills/pr-review/SKILL.md(7 lines) — already minimal router; not the source of bloat.agents/skills/pr-review/references/pr-review-guide.md(423 lines) — the monolith targeted by this epic.agents/skills/pr-review/assets/pr-review-template.md(160 lines) — cold-cache loaded surface (Cycle 1 reviews).agents/skills/pr-review/assets/pr-review-followup-template.md(94 lines) — warm-cache loaded surface (Cycle N reviews)Progressive Disclosure already works at the
SKILL.md→references/boundary (codified inlearn/agentos/ProgressiveDisclosureSkills.md+.agents/skills/create-skill/). This epic extends it WITHINreferences/for condition-gated content. No SKILL.md churn; no new top-level skill bloat.Decision rule (synthesized from @neo-opus-4-7's 2026-04-27 follow-up + @neo-gemini-3-1-pro's mid-tier challenge + @neo-gpt's measurement framing):
The Fix
Pilot-then-extend with empirical measurement at each step. Sub-issue ordering revised per @neo-gemini-3-1-pro + @neo-gpt: measurement infrastructure precedes pilot so the pre-extraction baseline doesn't require git-checkout-time-travel to reconstruct.
Sub-issue 1 (baseline & methodology): Establish loaded-surface measurement methodology + capture pre-extraction baseline. Methodology must record (per @neo-gpt's framing): files actually loaded per review cycle (guide + selected template + any extracted audit payload), loaded-byte counts (primary metric, named honestly —
wc -cproxy), and optional tokenizer-based approximation if pursued. Methodology + baseline are the gating evidence for sub-issue 2.Sub-issue 2 (pilot extraction): Extract §5.3 MCP-Tool-Description Budget Audit to
.agents/skills/pr-review/references/audits/mcp-tool-description-budget.md. Replace inline §5.3 content inpr-review-guide.mdwith a one-line gate:Sub-issue 3 (survey): Apply the decision rule to every other audit/section in
pr-review-guide.md. Classify each as condition-gated-narrow / condition-gated-mid-tier / condition-gated-common / universal. Document the survey + decision rule inpr-review-guide.mdintroduction so future audit additions self-classify.Sub-issue 4 (extension): For each condition-gated-narrow audit identified in sub-issue 3, file an extraction sub-PR following the §5.3 pilot pattern. For mid-tier audits: measure-before-extracting per the decision rule's mid-tier policy. Or, if the empirical baseline shows §5.3 extraction had marginal benefit, document the rationale for stopping at the pilot.
Sub-issue 5 (cross-skill integration): Update related skills/docs to reference the new
references/audits/layout: AGENTS_STARTUP.md §21 workflow skills awareness,pr-review/SKILL.mdtrigger language only if needed (avoid bloat),pull-request-workflow.mdcross-references. Cross-link the new gate-tag substrate (e.g.,[AUDIT_GATE_FIRED]if introduced) topr-review-guide.md§4 tag taxonomy.Acceptance Criteria
pr-review-guide.mdintroduction. Records files loaded per review cycle (guide + selected template + extracted audit payload),wc -cbyte counts, optional tokenizer-based approximation..agents/skills/pr-review/references/audits/mcp-tool-description-budget.mdpr-review-guide.mdpr-review-guide.mdintroduction so future audits self-classifypr-review-guide.mdpr-review/SKILL.mdtrigger language +pr-review-guide.md§4 tag taxonomy updated only where needed (no bloat introduced)pr-review-guide.mdintroduction. Either (a) the audit owner is the gate owner (single party), OR (b) sync invariant is enforced via test/lint that verifies gate trigger condition matches payload internals.Out of Scope
llms.txtindex — per @tobiu 2026-04-27: "we have this one for the portal app (neo website). completely out of scope for making skill guides more token efficient. it aims to make website pages accessible for LLMs."pull-request-workflow.mdmodularization — separate substrate; if pr-review pilot succeeds, file as separate epic. (Note:pull-request-workflow.mdis currently 257 lines after PR #10536 merged §6.2 Review Routing Protocol; growth is recent, more empirical data needed before targeting.)pr-review-template.md160 lines,pr-review-followup-template.md94 lines) — different consumer (template-fill at review-time). Optional follow-up if pilot generalizes; the AC1 methodology may surface that templates carry extractable content (per @neo-gpt's MCP-tool-description audit block observation in pr-review-template.md).wc -cbyte-count is the primary metric per @neo-gpt's precision framing. Tokenizer-based approximation is optional in AC1 methodology, not gating.Avoided Traps
Closes/Resolves/Fixesmagic keywords). Extracting it forces the agent to fetch the audit nearly every review cycle — same total cost, more file IO. §5.3 is empirically the cleaner pilot per @tobiu's "5.3 in specific" anchor..agents/skills/create-skill/). The work is granular extraction WITHIN references/, not architectural redesign. Non-goal.pr-review-guide.md) from "what to do once fetched" (the audit payload inreferences/audits/). If the gate trigger condition changes (e.g., audit broadens to a different file pattern) but the payload doesn't update in sync, future agents either miss applicable audits or fetch stale rules. Mitigation is AC10 (gate/payload sync invariant); recognizing the trap here ensures the AC isn't accidentally dropped.Related
learn/agentos/ProgressiveDisclosureSkills.md,.agents/skills/create-skill/references/skill-authoring-guide.mdopenapi.yaml(1/6 = 17% empirical trigger rate for §5.3)Origin Session ID: d0ec90a4-9a99-4fd7-baf6-dc0ab35e77dd
Retrieval Hint:
query_raw_memories(query="pr-review skill compression Map vs World Atlas condition-gated audit extraction §5.3 MCP-Tool-Description Budget focus-window pressure loaded-surface measurement")