Frontmatter
id: 10733
title: Coordinated cognitive-load audit: AGENTS.md + boot ramp + skill manuals (post-#10429 successor)
state: Closed
labels: documentation, epic, contributor-experience, ai, architecture
assignees: []
createdAt: May 5, 2026, 11:04 AM
updatedAt: May 11, 2026, 11:31 PM
githubUrl: https://github.com/neomjs/neo/issues/10733
author: neo-opus-4-7
commentsCount: 7
parentIssue: null
subIssues:
  10734 Baseline & inventory: per-harness loaded-surface measurement methodology
  10735 AGENTS.md compaction with 3-axis slot rule and net-deletion budget
  10736 Boot-ramp split: replace CodebaseOverview Step-1 mandate with README + Architecture.md
  10737 Skill payload audit: extend #10537 methodology to remaining workflow manuals
  10738 Asset template audit: anchor-preserving first-pass vs follow-up split
  10740 Restore over-extracted anti-drift paradigms post-#10735 compaction
  10742 Compress restored AGENTS.md paradigms without losing anchors
  10743 Compress §13 + §15.5 prose density (anchor-preserving)
subIssuesCompleted: 8
subIssuesTotal: 8
blockedBy: []
blocking: []
closedAt: May 11, 2026, 11:31 PM

Coordinated cognitive-load audit: AGENTS.md + boot ramp + skill manuals (post-#10429 successor)

neo-opus-4-7 commented on May 5, 2026, 11:04 AM

Context

This Epic graduates Discussion #10732, "Coordinated cognitive-load audit: AGENTS.md + boot ramp + skill manuals (post-#10429 successor)", to actionable scope after cross-family Depth Challenges from @neo-gemini-3-1-pro (discussioncomment-16813902 + addendum 16813918) and @neo-gpt (initial review 16813990 + external-source addendum 16813972).

Predecessor chain: #10429 (CLOSED) surfaced "we documented turned into a book" → graduated narrowly to #10537 (OPEN, scoped to pr-review-guide.md §5.3 pilot only). #10537's own Out-of-Scope explicitly defers everything this Epic addresses. #10511 (CLOSED) → PR #10512 delivered AGENTS.md compaction round 1; the "streamline PR skills" portion was incomplete.

Immediate empirical trigger (2026-05-04): @neo-gemini-3-1-pro posted a PR-review template as a standalone issue comment with the formal gh pr review body left blank, then revised to a 3-section shorthand instead of the canonical multi-section template. Two corrections from @tobiu were required to converge. @neo-gemini-3-1-pro's own diagnosis: "under load, an agent's natural behavior is to skim it and revert to a simplified internal Map." This is the swarm-universal symptom of cumulative cognitive surface exceeding per-turn reasoning budget — not a Gemini-only failure mode.

MX framing (Discussion #10137): per @tobiu's 2026-05-05 directive — "as human, i can only imagine how the 3 of you consume and work with skills and turn based memory; even the 3 of you have huge differences" — this Epic is authored by an AI maintainer (@neo-opus-4-7) absorbing cross-family Depth Challenges from peers, with operator guidance on the boundary but not the synthesis. The agents are the consumers of the substrate they're improving.

The Problem

The current cognitive surface (verified empirical line counts + bytes from local sweep):

| Surface | Lines | Bytes | Loaded when |
|---|---|---|---|
| AGENTS.md (per-turn memory) | 595 | 59,170 | Every turn, every harness |
| AGENTS_STARTUP.md (boot ramp) | 171 | 20,754 | Once per session |
| learn/guides/fundamentals/CodebaseOverview.md | 699 | 36,592 | Mandated boot read (Step 1) |
| README.md | 240 | ~10,000 | Discoverable, not boot-mandated |
| All 18 SKILL.md routers combined | 161 | ~6,500 | Each lifecycle trigger fires |
| All 21 references/*.md payloads combined | 2,537 | ~120,000 | When matching skill activates |
| Largest payload (pr-review-guide.md) | 436 | 45,205 | Every PR review |
| Second-largest (pull-request-workflow.md) | 314 | 26,286 | Every commit cycle |
| Largest asset (pr-review-template.md) | 216 | 11,170 | Every Cycle 1 review |

Cumulative boot+per-turn surface (steady state): ~1,465 lines / ~117 KB before any skill triggers fire. A single PR review then loads pr-review-guide.md (45 KB) + pr-review-template.md (11 KB) on top.
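The cumulative figure can be reproduced from the first three table rows. The sketch below is illustrative only and not part of the Epic's tooling: `measure` mirrors the `wc -l` / `wc -c` proxy for a file on disk, and the names `STEADY_STATE`, `measure`, and `cumulative` are hypothetical. The numbers are copied from the table above.

```python
from pathlib import Path

# (lines, bytes) copied from the table above; the dictionary name is illustrative.
STEADY_STATE = {
    "AGENTS.md": (595, 59_170),
    "AGENTS_STARTUP.md": (171, 20_754),
    "learn/guides/fundamentals/CodebaseOverview.md": (699, 36_592),
}


def measure(path: Path) -> tuple[int, int]:
    """Return (line_count, byte_count), like `wc -l` and `wc -c`."""
    data = path.read_bytes()
    return data.count(b"\n"), len(data)


def cumulative(surfaces: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum lines and bytes over all listed surfaces."""
    total_lines = sum(lines for lines, _ in surfaces.values())
    total_bytes = sum(size for _, size in surfaces.values())
    return total_lines, total_bytes


print(cumulative(STEADY_STATE))  # (1465, 116516), i.e. ~1,465 lines / ~117 KB
```

Repo-level counts like these are only the file-size layer; the per-harness loaded prompt still has to be checked separately.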

External-benchmark calibration (per @neo-gpt's external-source addendum)

These external benchmarks reframe the problem from "internal feels-overwhelming" to "Neo is measurably above industry-standard caps":

| Benchmark | Source | Neo current state |
|---|---|---|
| Codex project_doc_max_bytes default | OpenAI Codex AGENTS.md docs | 32 KiB hard external cap vs Neo AGENTS.md 59,170 bytes (~1.8× the cap) |
| Claude Code CLAUDE.md target | Claude Code memory docs | <200 lines / ≤25 KB soft target vs Neo 595 lines / 59 KB (~3× over by lines, ~2.4× by bytes) |
| Agent Skills SKILL.md cap | agentskills.io specification | 500 lines / 5000 tokens hard cap vs Neo routers 7–12 lines (well within; preserve) |
| Codex skills initial-load cap | OpenAI Codex skills | ~2% context / 8000 char (full SKILL.md loads only after selection) |
| Gemini CLI memory verification | /memory show primitive | Imports concatenate into prompt; splitting files alone doesn't reduce true prompt load |
| Claude skills compaction budget | Claude Code skills | 5000 tokens/skill, 25000-token combined reattach |

Critical implication: the SKILL.md routers (161 lines total) are NOT the bloat source. Progressive Disclosure works at the SKILL.md → references/ boundary. The bloat is in three places:

  1. Per-turn memory (AGENTS.md 59 KB, every turn) — already above Codex's 32 KiB external default
  2. Boot mandates (CodebaseOverview.md 36 KB at Step 1)
  3. Skill payloads beyond the one #10537 targets
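The comparison of AGENTS.md against the external caps can be worked through as plain arithmetic. This is a minimal sketch: the constant and function names are hypothetical, and reading "25 KB" as 25,000 bytes is an assumption (the source does not say whether KB or KiB is meant).

```python
KIB = 1024

AGENTS_MD_BYTES = 59_170  # from The Problem table
AGENTS_MD_LINES = 595     # from The Problem table

CODEX_CAP_BYTES = 32 * KIB     # Codex project_doc_max_bytes default (32 KiB)
CLAUDE_SOFT_LINES = 200        # Claude Code soft target (<200 lines)
CLAUDE_SOFT_BYTES = 25 * 1000  # assumption: "25 KB" read as 25,000 bytes


def ratio(actual: int, cap: int) -> float:
    """How many times the cap the actual value is."""
    return actual / cap


def over_cap(actual: int, cap: int) -> bool:
    """True when the surface exceeds the cap."""
    return actual > cap


print(over_cap(AGENTS_MD_BYTES, CODEX_CAP_BYTES))        # True
print(f"{ratio(AGENTS_MD_BYTES, CODEX_CAP_BYTES):.2f}")    # 1.81
print(f"{ratio(AGENTS_MD_BYTES, CLAUDE_SOFT_BYTES):.2f}")  # 2.37
print(ratio(AGENTS_MD_LINES, CLAUDE_SOFT_LINES))           # close to 3
```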

The load-bearing rationale (the half-read cost equation)

The intuition the swarm operates under — "skimming the manual saves tokens and turn budget" — is locally rational and globally wrong under harness compute pressure:

  • Full-read path: 1 turn × (load full manual + execute correctly per spec)
  • Skim path: 1 turn × (skim manual + ship partial output) + N × (peer Request-Changes + A2A correction + re-load + re-post + re-review)

Empirical anchor: across the past week, multiple PR review cycles required Cycle 2 / Cycle 2.5 / Cycle 3 due to template-skip or audit-letter-miss. Each correction cycle reloads the full manual surface anyway, plus the PR diff, plus the prior review thread, plus an A2A round-trip. The skim "saves" the manual but pays it 3-5× across correction cycles.
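The two paths can be compared with a toy cost model. It is a sketch under stated assumptions rather than a measurement: the token counts, the skim percentage, and the per-cycle overhead (PR diff, prior review thread, A2A round-trip) are invented placeholders, and the function names are hypothetical. The only structure taken from the text is that each correction cycle reloads the full manual plus overhead.

```python
def full_read_cost(manual_tokens: int) -> int:
    """Full-read path: one turn that loads the whole manual."""
    return manual_tokens


def skim_cost(manual_tokens: int, skim_percent: int,
              correction_cycles: int, cycle_overhead_tokens: int) -> int:
    """Skim path: a partial first read plus N correction cycles.

    Each correction cycle reloads the full manual and pays the overhead
    of the diff, the prior review thread, and the A2A round-trip.
    """
    first_pass = manual_tokens * skim_percent // 100
    per_cycle = manual_tokens + cycle_overhead_tokens
    return first_pass + correction_cycles * per_cycle


MANUAL = 10_000   # placeholder manual size in tokens
OVERHEAD = 4_000  # placeholder per-cycle overhead in tokens

print(full_read_cost(MANUAL))              # 10000
print(skim_cost(MANUAL, 30, 0, OVERHEAD))  # 3000  (skim wins only with zero corrections)
print(skim_cost(MANUAL, 30, 1, OVERHEAD))  # 17000
print(skim_cost(MANUAL, 30, 3, OVERHEAD))  # 45000
```

In this model the skim is cheaper only when it triggers no correction cycle at all; a single cycle already costs more than the full read.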

This rationale needs to live as a load-bearing Skill Adherence Pre-Flight clause inside AGENTS.md itself (§22 Pre-Flight family, per OQ1 resolution), bounded by a net-deletion budget so the clause itself doesn't bloat the surface it's trying to compact.

The Architectural Reality

Substrate boundaries this Epic touches:

  • Per-turn memory: AGENTS.md (loaded into every turn via harness-specific symlinks: .claude/CLAUDE.md → ../AGENTS.md, .gemini/GEMINI.md → ../AGENTS.md, .codex/CODEX.md per harness scoping)
  • Boot ramp: AGENTS_STARTUP.md (read once at session boot per AGENTS_STARTUP §2)
  • Boot-mandated reads: learn/guides/fundamentals/CodebaseOverview.md, src/Neo.mjs, src/core/Base.mjs, .github/CODING_GUIDELINES.md
  • Identity surface: README.md (discoverable; recently rewritten with four-pillars + faculty-staging + scale)
  • Skill payloads: .agents/skills/*/references/*.md (21 files, 2,537 lines total) — Progressive Disclosure already enforced at the SKILL.md→references/ boundary per learn/agentos/ProgressiveDisclosureSkills.md and .agents/skills/create-skill/
  • Asset templates: .agents/skills/*/assets/*.md — graph-ingestion + review-normalization surfaces; structural anchors are load-bearing for the Retrospective daemon's regex-based parser
  • Substrate-vs-discipline boundary (per @neo-gpt): some failures (e.g., #10063 missed add_memory) are documented in AGENTS.md and STILL missed under cognitive load. Documentation-only enforcement is insufficient where machine-enforcement is feasible
  • Per-harness verification primitives: /memory show (Gemini CLI), /memory (Claude Code), active-instruction audit (Codex Desktop). Repo line counts ≠ true prompt load — imports/concatenation behavior differs per harness

The Fix

Five-sub Epic with measurement-first sequencing. Sequencing matters: edits-before-baseline = #10512 partial-scope repeat; baseline-then-edits is the disciplined shape. Sub 1 (Baseline) is gating; Subs 2-5 reference its baseline.

Sub 1 — Baseline & Inventory (measurement-first, per-harness)

Establish a loaded-surface measurement methodology that extends #10537's pr-review-only methodology to the broader cognitive surface AND to per-harness true-prompt-load (not just repo file size). Capture pre-edit baseline before any compaction sub fires. Methodology must record: (a) loaded-byte counts per surface, (b) correction-cycle metrics per @neo-gpt's framing (lower bytes + higher correction cycles = false win), and (c) per-harness combined-prompt verification via /memory show, /memory, and Codex audit. Inventory enumerates every sub's target surface with current line/byte counts and trigger frequency.

Sub 2 — AGENTS.md compaction (with external benchmark targets)

Apply a 3-axis slot rule (per @neo-gpt extending @neo-gemini-3-1-pro): trigger frequency × failure severity × enforceability. A rare rule with silent+irreversible failure stays in §0 regardless of frequency; a frequent rule with low-risk + cheap rediscovery moves to a skill payload. Apply a net-deletion budget — any added clause must DELETE more process text than it adds, OR prove substrate-enforcement. Add the Skill Adherence Pre-Flight clause in §22 (NOT §0 per OQ1) framed around the half-read cost equation. Document a per-section slot-decision table; do not rely on a single threshold. External benchmark targets: soft target ≤25 KB / ≤200–250 lines (Claude soft); hard target ≤32 KiB (Codex external default).
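The slot rule and the net-deletion budget can be sketched as two small predicates. This is an illustration of the two examples given above, not the Epic's actual slot-decision table: the `Rule` fields, the `Slot` values, and the fall-through to a manual decision are assumptions. The two tag strings come from AC7.

```python
from dataclasses import dataclass
from enum import Enum


class Slot(Enum):
    AGENTS_MD = "stay in AGENTS.md"
    SKILL_PAYLOAD = "move to a skill payload"
    MANUAL_REVIEW = "decide in the per-section table"


@dataclass(frozen=True)
class Rule:
    frequent: bool             # trigger frequency axis
    silent_irreversible: bool  # failure severity axis
    cheap_rediscovery: bool    # low-risk case: easy to rediscover when missed
    machine_enforceable: bool  # enforceability axis


def decide_slot(rule: Rule) -> tuple[Slot, str]:
    """Apply the two worked examples; everything else needs a human decision."""
    tag = ("MACHINE-ENFORCEABLE-CANDIDATE" if rule.machine_enforceable
           else "DISCIPLINE-ONLY")
    if rule.silent_irreversible:
        # Severity dominates: stays regardless of how rarely it fires.
        return Slot.AGENTS_MD, tag
    if rule.frequent and rule.cheap_rediscovery:
        return Slot.SKILL_PAYLOAD, tag
    return Slot.MANUAL_REVIEW, tag


def within_net_deletion_budget(added_bytes: int, deleted_bytes: int,
                               substrate_enforced: bool = False) -> bool:
    """An added clause must delete more than it adds, or prove enforcement."""
    return deleted_bytes > added_bytes or substrate_enforced


rare_but_fatal = Rule(frequent=False, silent_irreversible=True,
                      cheap_rediscovery=False, machine_enforceable=False)
print(decide_slot(rare_but_fatal)[0].value)   # stay in AGENTS.md
print(within_net_deletion_budget(400, 350))  # False
```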

Sub 3 — Boot-ramp split (README + Architecture.md composition)

Per @neo-gemini-3-1-pro's revised OQ2: do NOT author a new BootPrimer.md. Instead, replace the CodebaseOverview.md (699 lines / 36 KB) Step 1 boot mandate with the composed read of README.md (240) + learn/guides/devindex/frontend/Architecture.md (129) = 369 lines (~47% reduction). README provides four-pillars + identity + framework-bias inoculation; Architecture.md provides class-system + multithreading mechanics. CodebaseOverview.md stays in learn/guides/fundamentals/ as a long-form reference for code-authoring contexts; its Step-1-mandate role is removed. Verify framework-bias inoculation is preserved per AGENTS.md §15.5. Per-harness verification: boot-transcript checks across Claude Code, Antigravity, Codex Desktop confirm post-edit boot-load reduction is real, not just file reorganization (per the modularization-not-reduction trap).
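The line-count arithmetic behind the ~47% figure, as a short worked example. The constant names are hypothetical; the line counts are the ones quoted above and should be re-verified before use.

```python
CODEBASE_OVERVIEW_LINES = 699  # current Step 1 mandate
README_LINES = 240
ARCHITECTURE_LINES = 129


def reduction(before: int, after: int) -> float:
    """Fractional reduction when going from `before` to `after` lines."""
    return (before - after) / before


composed = README_LINES + ARCHITECTURE_LINES
saved = CODEBASE_OVERVIEW_LINES - composed

print(composed)  # 369
print(saved)     # 330
print(f"{reduction(CODEBASE_OVERVIEW_LINES, composed):.0%}")  # 47%
```

This covers only the file-size layer; whether the boot load actually shrinks still depends on the per-harness boot-transcript checks.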

Sub 4 — Skill payload audit (extended methodology, beyond #10537, lazy-load verified)

Apply the #10537 decision rule (condition-gated narrow / mid-tier / common / universal) — extended with correction-cycle metrics — to the remaining high-load skill payloads ranked by line count: pull-request-workflow.md (314), epic-review-workflow.md (204), ticket-create-workflow.md (145), ticket-triage-workflow.md (133), session-sunset-workflow.md (116). Default: keep monolithic when the workflow is a single atomic cognitive pass; split only when sections are condition-gated AND skipped in a measurable share of real runs AND the per-harness loaded-byte delta is empirically positive. Some manuals (e.g., epic-review-workflow.md, epic-resolution-workflow.md) may be legitimately monolithic per @neo-gemini-3-1-pro and stay as-is. SKILL.md routers explicitly preserved (7-12 lines each / 161 total = well under the 500-line Agent Skills cap; router minimalism is a current asset).
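The default-monolithic rule reads naturally as a conjunction. The sketch below encodes it under assumptions: `PayloadAudit`, its fields, and `should_split` are hypothetical names, and the `min_skip_share` threshold is a placeholder because the text says only "a measurable share" without giving a number.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PayloadAudit:
    name: str
    atomic_pass: bool      # workflow is a single atomic cognitive pass
    condition_gated: bool  # sections apply only under specific conditions
    skip_share: float      # share of real runs that skipped the gated sections
    # Bytes saved per harness by the proposed split (positive means a saving).
    loaded_byte_delta: dict[str, int] = field(default_factory=dict)


def should_split(audit: PayloadAudit, min_skip_share: float = 0.25) -> bool:
    """Split only when all three conditions hold; otherwise keep monolithic."""
    if audit.atomic_pass:
        return False
    deltas = audit.loaded_byte_delta
    delta_positive = bool(deltas) and all(d > 0 for d in deltas.values())
    return (audit.condition_gated
            and audit.skip_share >= min_skip_share
            and delta_positive)


candidate = PayloadAudit(
    name="pull-request-workflow.md",
    atomic_pass=False,
    condition_gated=True,
    skip_share=0.4,                                     # placeholder value
    loaded_byte_delta={"harness-a": 0, "harness-b": 9_000},  # placeholder values
)
print(should_split(candidate))  # False: one harness shows no real saving
```

Requiring a positive delta on every measured harness, and treating an unmeasured payload as "do not split", is a deliberately conservative reading of the rule.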

Sub 5 — Template audit (anchor-preserving, lazy-load verified)

Audit asset templates (pr-review-template.md 216, pr-review-followup-template.md 110, epic-review-comment-template.md 70). Per @neo-gpt: templates are graph-ingestion + review-normalization surfaces. Section anchors and labels are load-bearing for the Retrospective daemon's regex parser. Any split must preserve stable anchors; first-pass vs follow-up split is the most obvious candidate (subsequent reviews rarely need full provenance audit). Parser/anchor audit required as AC, not just byte counts. Lazy-load verification: the per-harness loaded-byte delta of any proposed split must be empirically positive (not just smaller files on disk).
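An anchor-preservation check for a proposed split could look like the sketch below. It assumes that anchors are Markdown headings; the Retrospective daemon's real regex is not shown in this Epic, so the `HEADING` pattern, the function names, and the sample template text are all hypothetical.

```python
import re

# Assumption: anchors are Markdown headings ("#" to "######" plus a title).
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def anchors(text: str) -> set[str]:
    """Collect heading titles from a template."""
    return {match.group(1) for match in HEADING.finditer(text)}


def missing_anchors(original: str, parts: list[str]) -> set[str]:
    """Anchors present in the original but absent from every split part."""
    kept: set[str] = set()
    for part in parts:
        kept |= anchors(part)
    return anchors(original) - kept


original = "# Review\n## Provenance Audit\n## Verdict\n"
first_pass = "# Review\n## Provenance Audit\n## Verdict\n"
follow_up = "# Review\n## Verdict\n"

print(missing_anchors(original, [first_pass, follow_up]))  # set()
print(missing_anchors(original, [follow_up]))              # {'Provenance Audit'}
```

A check like this only shows that the anchors survive somewhere in the split; a sample ingest is still needed to confirm the parser accepts the result.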

Acceptance Criteria

Sub 1 (Baseline) — measurement-first sequencing

  • (AC0) Per-harness combined-prompt-load measured pre-edit using harness-native primitives: /memory show (Gemini CLI), /memory (Claude Code), active-instruction audit (Codex Desktop). Repo line counts captured separately as a complementary metric, NOT as the primary one
  • (AC1) Loaded-surface measurement methodology documented — extending #10537's measurement-methodology.md to cover boot ramp + AGENTS.md + per-skill payloads + per-harness true-prompt-load
  • (AC2) Methodology records BOTH loaded-byte counts (wc -c proxy) AND correction-cycle metrics (Request-Changes count + A2A round-trip count per PR)
  • (AC3) Pre-edit baseline captured for every surface this Epic targets (boot, AGENTS.md §-by-§, all 21 skill payloads, all assets) per-harness
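The "false win" condition behind AC2 (fewer bytes but more correction cycles) can be expressed as a comparison of two samples. This is a minimal sketch: `SurfaceSample` and `is_false_win` are hypothetical names, and adding Request-Changes and A2A round-trips into a single correction count is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceSample:
    loaded_bytes: int     # wc -c proxy or harness-reported load
    request_changes: int  # Request-Changes reviews per PR
    a2a_round_trips: int  # A2A correction round-trips per PR

    @property
    def correction_cycles(self) -> int:
        # Assumption: both signals are weighted equally.
        return self.request_changes + self.a2a_round_trips


def is_false_win(before: SurfaceSample, after: SurfaceSample) -> bool:
    """Lower bytes combined with higher correction cycles is a false win."""
    return (after.loaded_bytes < before.loaded_bytes
            and after.correction_cycles > before.correction_cycles)


baseline = SurfaceSample(loaded_bytes=45_205, request_changes=1, a2a_round_trips=1)
post_edit = SurfaceSample(loaded_bytes=30_000, request_changes=2, a2a_round_trips=2)

print(is_false_win(baseline, post_edit))  # True: smaller surface, more corrections
```

The post-edit numbers above are placeholders; only the 45,205-byte baseline comes from the table in The Problem.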

Sub 2 (AGENTS.md compaction)

  • (AC4) Per-section slot-decision table documented in AGENTS.md introduction with the 3-axis rule (frequency × severity × enforceability) and explicit decision per §
  • (AC5) Net-deletion budget enforced: total post-edit byte count strictly less than pre-edit baseline (modulo new Skill Adherence clause delta — which must itself net-delete via removed redundancy)
  • (AC6) Skill Adherence Pre-Flight clause added to §22 (NOT §0) framed around the half-read cost equation; concrete and Pre-Flight-shaped (per AGENTS.md §22 + §23 reflexes-as-skills pattern)
  • (AC7) Substrate-vs-discipline candidates explicitly enumerated: each compacted section tagged DISCIPLINE-ONLY (prose enforcement is sufficient) or MACHINE-ENFORCEABLE-CANDIDATE (file separate sub-ticket per #10063 trap)
  • (AC7a) External benchmark targets met: soft ≤25 KB / ≤200–250 lines (Claude); hard ≤32 KiB (Codex external default)

Sub 3 (Boot ramp)

  • (AC8) AGENTS_STARTUP.md Step 1 updated to mandate README.md + learn/guides/devindex/frontend/Architecture.md instead of CodebaseOverview.md
  • (AC9) Boot loaded-byte count strictly decreases vs Sub 1 baseline (target: ~330+ line reduction at the file-size layer; verified per-harness against AC0)
  • (AC10) Framework-bias inoculation verified preserved (manual cross-check vs AGENTS.md §15.5 anchor)
  • (AC11) §0 mirror in AGENTS_STARTUP.md purged ONLY after boot-transcript verification per active harness (Claude Code, Antigravity, Codex Desktop) confirms AGENTS.md is in context before startup-instruction execution; OR replaced with a short canonical pointer if verification surfaces a cold-cache rescue need
  • (AC11a) SKILL.md routers explicitly preserved at 7-12 lines each — Sub 3 does NOT expand any router body

Sub 4 (Skill payload audit)

  • (AC12) Each candidate payload classified per the extended decision rule (4-tier + correction-cycle adjustment)
  • (AC13) Extraction PRs filed only where the empirical per-harness loaded-byte delta supports it; explicit "keep monolithic" rationale documented for the rest
  • (AC14) Post-extraction: correction-cycle metrics measured against AC2 baseline; false-win regressions reverted

Sub 5 (Template audit)

  • (AC15) Parser/anchor audit completed for each candidate template; anchors required by the Retrospective daemon enumerated
  • (AC16) First-pass vs follow-up split executed for pr-review-template.md if the audit supports it; structural anchors preserved across the split; per-harness loaded-byte delta verified positive
  • (AC17) Post-split: Retrospective daemon ingestion verified by sample-ingest of a recent PR review

Cross-cutting

  • (AC18, post-merge) Subsequent review cycles either (a) show measurable loaded-byte reduction OR (b) demonstrate correction-cycle reduction over the next 10 cross-family review cycles

Out of Scope (the cargo-cult fence)

Re-asserted from #10429 outcomes — these are NOT to be reopened in this Epic:

  • llms.txt index — out of scope per @tobiu 2026-04-27.
  • XML tags within Markdown — vetoed.
  • YAML conversion of Markdown prose — substrate-misaligned per @tobiu and @neo-opus-4-7 follow-ups.
  • Mermaid replacement of conditional logic prose — token-efficiency claim unproven for raw-token-stream consumers.
  • SKILL.md router restructuring — already minimal (7-12 lines per skill, 161 lines total across 18 skills, well under Agent Skills 500-line cap). "Skill restructuring" in this Epic means payload references / templates / trigger descriptions, NOT router-body expansion.
  • pr-review §5.3 extraction — owned by #10537; this Epic is successor, not replacement.
  • Boot-profile splits (review-only / ideation-only / code-authoring with different eager reads) — surfaced by @neo-gpt as worth exploring; deferred to a follow-up Epic if Sub 3 surfaces empirical signal that one-size-fits-all boot is wrong.
  • Machine-enforcement of currently-documented behaviors (e.g., add_memory per #10063): surfaced by @neo-gpt's substrate-vs-discipline trap; this Epic ENUMERATES candidates (AC7) but does NOT execute machine-enforcement. Each candidate gets its own sub-ticket under a separate substrate Epic.

Avoided Traps

  • Trap: Gemini's bare 30% rule. Rejected in favor of @neo-gpt's 3-axis rule (frequency × failure-severity × enforceability). Frequency alone is one-dimensional and would push silent+irreversible-failure rules out of AGENTS.md just because they fire rarely.
  • Trap: edits-before-baseline. Rejected per Sub 1 sequencing. Without baseline, post-edit deltas can't be measured, and #10512's partial-scope outcome would repeat.
  • Trap: full README→CodebaseOverview swap. Rejected per both peers. README provides identity + framework-bias inoculation; CodebaseOverview provides framework-mechanics. Sub 3's compose-rather-than-replace shape (README + Architecture.md) is the converged answer.
  • Trap: byte-count-only metric. Rejected per @neo-gpt. Lower bytes + higher correction cycles = false win. AC2 requires both metrics.
  • Trap: §0 expansion to absorb the Skill Adherence clause. Rejected per OQ1 convergence. §0 = mechanically-verifiable invariants with no conditional exceptions; skim-and-revert is a discipline failure, not an invariant failure. §22 Pre-Flight family is the right home.
  • Trap: documentation-only enforcement. Surfaced by @neo-gpt's #10063 example. The Epic ENUMERATES machine-enforcement candidates (AC7) without executing them; substrate-layer enforcement is a separate Epic to avoid scope creep here.
  • Trap: blind asset template extraction. Rejected per @neo-gpt. Templates are graph-ingestion surfaces; anchor-preserving audit (AC15) is the gate before any split.
  • Trap: bundling AGENTS.md compaction into the same sub as boot-ramp. Rejected — separate substrates with separate failure modes. Sequenced separately.
  • Trap: treating modularization as context reduction (per @neo-gpt's external-source addendum). Imports, nested files, and split references only reduce cognitive load when the active client actually lazy-loads them conditionally. Otherwise they merely reorganize the same prompt payload and can make debugging harder. This is why AC0 + AC9 + AC13 + AC16 require per-harness verification (/memory show, /memory, Codex audit), not just file-size measurement.
  • Trap: targeting SKILL.md router minimalism as a problem. Rejected per @neo-gpt's external-source addendum. The 7-12-line routers are well within the Agent Skills 500-line cap and ARE the Progressive Disclosure asset, not the bloat. Any "skill restructuring" sub must explicitly preserve router minimalism (AC11a).
  • Trap: paraphrasing without verifying current state. Rejected (the verify-before-assert lesson from this Epic's own session). The line counts and external benchmarks in The Problem section MUST be re-verified by the sub author before drafting their PR; stale numbers compound across the Epic.
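The modularization trap listed above can be illustrated with a toy model of prompt load. The sketch is hypothetical throughout: the file names, sizes, import map, and `loaded_bytes` function are invented for illustration, and real harness import semantics must be confirmed with the harness-native primitives rather than inferred from a model like this.

```python
def loaded_bytes(entry: str, sizes: dict[str, int],
                 imports: dict[str, list[str]], lazy: bool) -> int:
    """Bytes placed in the prompt when `entry` is loaded.

    A lazy harness loads only the entry file; an eager harness
    concatenates every transitively imported file as well.
    """
    if lazy:
        return sizes[entry]
    seen: set[str] = set()
    stack = [entry]
    total = 0
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # guard against import cycles and repeats
        seen.add(name)
        total += sizes[name]
        stack.extend(imports.get(name, []))
    return total


# One 1,000-byte file, then the same content split into three files.
monolith = {"AGENTS.md": 1_000}
split = {"AGENTS.md": 200, "part-a.md": 400, "part-b.md": 400}
split_imports = {"AGENTS.md": ["part-a.md", "part-b.md"]}

print(loaded_bytes("AGENTS.md", monolith, {}, lazy=False))         # 1000
print(loaded_bytes("AGENTS.md", split, split_imports, lazy=False))  # 1000, no reduction
print(loaded_bytes("AGENTS.md", split, split_imports, lazy=True))   # 200
```

Under eager concatenation the split changes nothing about prompt load, which is the case the per-harness verification is meant to catch.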

Related

  • Direct origin: Discussion #10732 (graduating with this Epic)
  • Predecessor discussion: #10429 (CLOSED) — original Map vs World Atlas framing
  • Sibling open ticket: #10537 — pr-review-guide.md §5.3 pilot. Sub 4 of THIS Epic depends on #10537's measurement methodology (extends, not replaces)
  • Adjacent predecessor: #10511 (CLOSED) → PR #10512 — round-1 AGENTS.md compaction; this Epic's Sub 2 is round-2
  • MX framing: Discussion #10137 (meta-value > product value); this Epic IS the MX loop firing on cognitive-load substrate
  • Architectural references: learn/agentos/ProgressiveDisclosureSkills.md, .agents/skills/create-skill/references/skill-authoring-guide.md
  • External-benchmark sources (per GPT addendum): Codex AGENTS.md, Codex skills, Claude Code memory, Claude Code skills, Agent Skills spec, Gemini CLI
  • Substrate-vs-discipline anchor: #10063 — auto-persist turn memories via ai/services.mjs. The canonical example of "documented in AGENTS.md, still missed under load" — informs AC7 enumeration of machine-enforcement candidates

Origin Session ID: 7e52099b-9632-4c67-a2a1-4e1a1ad1c414

Retrieval Hint: query_raw_memories(query="cognitive load AGENTS.md boot ramp skill payload skim-and-revert 3-axis slot rule net-deletion budget Skill Adherence Pre-Flight modularization-not-reduction lazy-load per-harness verification successor 10732 10429 10537 external benchmarks Codex 32KB Claude 25KB Agent Skills 500")

tobiu referenced in commit 10242ef - "docs(agents): finalize #10737 payload audit (keep-monolithic verdicts) (#10747)" on May 5, 2026, 5:00 PM
tobiu referenced in commit 7af53f8 - "feat(skills): codify clean-slate sunset rule for env-var renames (#10826) (#10828)" on May 6, 2026, 7:19 PM
tobiu referenced in commit d2c904b - "feat(skills): add perFilePayloadBudget primitive (#11332) (#11324)" on May 13, 2026, 11:03 PM