LearnNewsExamplesServices
Frontmatter
id10734
titleBaseline & inventory: per-harness loaded-surface measurement methodology
stateClosed
labels
documentationenhancementcontributor-experienceai
assigneesneo-gemini-3-1-pro
createdAtMay 5, 2026, 11:04 AM
updatedAtMay 5, 2026, 4:42 PM
githubUrlhttps://github.com/neomjs/neo/issues/10734
authorneo-opus-4-7
commentsCount2
parentIssue10733
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtMay 5, 2026, 4:42 PM

Baseline & inventory: per-harness loaded-surface measurement methodology

Closeddocumentationenhancementcontributor-experienceai
neo-opus-4-7
neo-opus-4-7 commented on May 5, 2026, 11:04 AM

Context

Sub 1 of Epic #10733 — measurement-first sequencing. Gating sub — Subs 2-5 reference the baseline this sub captures. Without it, post-edit deltas can't be measured, and #10512's partial-scope outcome would repeat.

The Problem

#10537 ships a pr-review-only loaded-surface measurement methodology (pr-review-guide.md introduction + measurement-methodology.md). This Epic broadens the scope to AGENTS.md, boot ramp, and all skill payloads — but #10537's methodology has two gaps for the broader scope (per @neo-gpt's external-source addendum on Discussion #10732):

  1. Repo line counts ≠ true prompt load. Imports/concatenation behavior differs per harness. Splitting files into multiple references doesn't reduce true loaded bytes if the harness concatenates them at boot.
  2. Loaded-byte delta is necessary but not sufficient. Lower bytes + higher correction cycles = false win. The deeper failure mode is template-skip / audit-letter-miss / Cycle-2.5 churn, not file size alone.

The Architectural Reality

  • .agents/skills/pr-review/references/measurement-methodology.md — existing #10537 methodology, file-size-only
  • Per-harness verification primitives:
    • Gemini CLI: /memory show exposes the actual concatenated GEMINI.md prompt
    • Claude Code: /memory displays loaded memory contents
    • Codex Desktop: active-instruction audit via project_doc_max_bytes config + harness-specific introspection
  • Correction-cycle metrics: PR Request-Changes count, A2A round-trip count per PR, Cycle-N count per review (already partially captured in feedback_pr_review_iteration_calibration and graph-ingestion data)

The Fix

Extend measurement-methodology.md (or fork into a sibling under .agents/skills/ if scope diverges) to cover:

  • The full cognitive surface (boot + AGENTS.md + all 21 skill payloads + all assets)
  • Per-harness combined-prompt verification using harness-native primitives
  • Correction-cycle metrics paired with loaded-byte counts

Capture a pre-edit baseline snapshot for every surface this Epic targets, per active harness, before Subs 2-5 fire. Store the baseline as a committed artifact so post-edit deltas can be measured against a stable reference.

Acceptance Criteria

  • (AC0) Per-harness combined-prompt-load measured pre-edit using harness-native primitives: /memory show (Gemini CLI), /memory (Claude Code), active-instruction audit (Codex Desktop). Repo line counts captured separately as a complementary metric, NOT as the primary one
  • (AC1) Loaded-surface measurement methodology documented — extending #10537's measurement-methodology.md to cover boot ramp + AGENTS.md + per-skill payloads + per-harness true-prompt-load
  • (AC2) Methodology records BOTH loaded-byte counts (wc -c proxy) AND correction-cycle metrics (Request-Changes count + A2A round-trip count per PR)
  • (AC3) Pre-edit baseline captured for every surface this Epic targets (boot, AGENTS.md §-by-§, all 21 skill payloads, all assets) per-harness — committed as .agents/skills/.../baselines/cognitive-load-baseline-YYYY-MM-DD.md or equivalent

Out of Scope

  • Editing any of AGENTS.md, AGENTS_STARTUP.md, skill payloads, or templates — those are Subs 2-5
  • Building automated tooling to replace manual /memory show / /memory invocations — manual capture is sufficient for the baseline; automation is a follow-up if Sub 1 proves it's worth the substrate cost

Related

  • Parent Epic: #10733
  • Predecessor methodology: #10537 (measurement-methodology.md — Sub 1 extends, not replaces)
  • Origin discussion: #10732 (especially @neo-gpt's external-source addendum 16813972)

Origin Session ID: 7e52099b-9632-4c67-a2a1-4e1a1ad1c414

Retrieval Hint: query_raw_memories(query="baseline measurement methodology per-harness loaded-surface correction-cycle metrics 10733 10732")

tobiu referenced in commit 4c3766d - "docs(agentos): baseline and inventory for loaded-surface methodology (#10734) (#10746) on May 5, 2026, 4:42 PM
tobiu closed this issue on May 5, 2026, 4:42 PM