LearnNewsExamplesServices
Frontmatter
id10775
titleMechanical Pre-Flight: enforce evaluation-metrics block in pr-review comments
stateClosed
labels
documentationenhancementaimodel-experience
assignees[]
createdAtMay 5, 2026, 9:16 PM
updatedAtMay 31, 2026, 8:30 AM
githubUrlhttps://github.com/neomjs/neo/issues/10775
authorneo-opus-ada
commentsCount1
parentIssue10757
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtMay 31, 2026, 8:30 AM

Mechanical Pre-Flight: enforce evaluation-metrics block in pr-review comments

Closed Backlog/active-chunk-9 documentationenhancementaimodel-experience
neo-opus-ada
neo-opus-ada commented on May 5, 2026, 9:16 PM

Context

Empirical anchor from session 2026-05-05 (workflow-discipline-over-velocity calibration with @tobiu). My PR #10768 review (PRR_kwDODSospM78Iqlj) shipped with all pr-review-guide audit sections (Strategic-Fit Step-Back, Depth Floor, Source-of-Authority, Cross-Skill Integration, Test-Execution Audit, etc.) — but omitted the §3.1 decile-anchor evaluation metrics block ([ARCH_ALIGNMENT] / [CONTENT_COMPLETENESS] / [EXECUTION_QUALITY] / [PRODUCTIVITY] / [IMPACT] / [COMPLEXITY] / [EFFORT_PROFILE]).

@tobiu caught the workflow lapse at merge-time: "hint, evaluation metrics are missing inside your review." I posted a supplemental block post-merge for graph-completeness, but the lapse-then-cleanup pattern is the failure mode — discipline-only gating doesn't reliably fire when agents compress the review template under self-imposed time pressure.

The Problem

The pr-review-guide §3.1 decile-anchor rubric is load-bearing for graph-ingestionIssueIngestor and downstream cross-cycle measurability depend on the metrics being machine-parseable. A review without the metrics block:

  • Has no [ARCH_ALIGNMENT] / [EXECUTION_QUALITY] / etc. tags for the Retrospective daemon's regex parser
  • Provides no per-cycle measurability for cross-family review-quality trends
  • Cannot feed into the §8/§7.2 cross-model asymmetry empirical-grounding measurement (#10756 methodology)

Codifying the discipline in the skill manual is necessary but not sufficient — the workflow-discipline-over-velocity lapse pattern (feedback_workflow_discipline_over_velocity.md) shows agents under pressure skip the metrics block even when the codification exists.

The Architectural Reality

  • .agents/skills/pr-review/references/pr-review-guide.md §3.1 — the existing decile-anchor rubric codification
  • .agents/skills/pr-review/assets/pr-review-template.md — the asset template that should include the metrics block as a required section
  • .agents/skills/pr-review/references/audits/ — directory of audit sub-payloads; potentially the right home for a new "evaluation-metrics-pre-flight" audit
  • ai/daemons/services/IssueIngestor.mjs:327 — the regex parser that picks up [KB_GAP] / [TOOLING_GAP] / [RETROSPECTIVE] tags; the metrics tags follow the same shape and the parser could be extended to surface metric-completeness gaps

The Fix

Option A (lightest): Update pr-review-template.md to include the evaluation-metrics block as a required section with explicit empty-placeholders:

<h3 class="neo-h3" data-record-id="6">📊 Evaluation Metrics</h3>

- **`[ARCH_ALIGNMENT]`**: __ — _justification_
- **`[CONTENT_COMPLETENESS]`**: __ — _justification_
- **`[EXECUTION_QUALITY]`**: __ — _justification_
- **`[PRODUCTIVITY]`**: __ — _justification_
- **`[IMPACT]`**: __ — _justification_
- **`[COMPLEXITY]`**: __ — _justification_
- **`[EFFORT_PROFILE]`**: __ (Quick Win / Mid-effort / Major / N/A)

The placeholder pattern forces the reviewer to either fill in or explicitly mark N/A — empty __ left in the comment is visually obvious as a workflow lapse.

Option B (mechanical gate): Pre-Flight in pull-request-workflow (or a new pr-review-workflow Pre-Flight section) that grep-checks the review-comment body for the 7 metric tags before allowing manage_issue_comment invocation; halt if any are absent without explicit N/A justification.

Option C (graph-side enforcement): Extend IssueIngestor.mjs:327 regex to ALSO capture the metric tags; emit a [METRICS_GAP] graph-ingestion tag for reviews missing metrics so the gap is visible in graph queries even if the review shipped.

Recommendation: ship Option A first (lowest cost, immediate fix), file Option B + Option C as scope-extension follow-ups if Option A doesn't reduce the lapse rate enough.

Acceptance Criteria

  • (AC1) pr-review-template.md updated with required Evaluation Metrics section + empty-placeholder pattern (per Option A)
  • (AC2) pr-review-guide §3.1 cross-references the template requirement explicitly (so reviewers know the template's metric block is required, not optional)
  • (AC3) Anti-pattern table entry in pr-review-guide flagging "review-comment shipped without evaluation-metrics block" as the failure mode
  • (AC4) Empirical validation: next 5 cross-family reviews after merge include metrics block (sampled; manual spot-check)

Out of Scope

  • Option B (mechanical Pre-Flight gate) — file as scope-extension if Option A insufficient
  • Option C (graph-side [METRICS_GAP] tagging) — file as scope-extension if observability gap persists
  • Standardizing the per-metric scoring rubric (already in pr-review-guide §3.1)
  • Metrics-driven review-quality measurement (separate scope, ties into #10756 §8/§7.2 empirical grounding)

Avoided Traps

  • Mechanical-gate-first: would block legitimate brief comments (e.g., approval acks). Template-shape with placeholders is lighter; mechanical gate is appropriate only if discipline alone proves insufficient.
  • Removing the discipline from the skill manual: the codification is necessary even when mechanically enforced, for cross-family agent training on the rubric semantics.

Related

  • Empirical anchor: PR #10768 review (mine, missing metrics block); @tobiu calibration 2026-05-05; supplemental block posted post-merge for graph-completeness
  • Sibling substrate: #10164 (cross-PR file-collision Pre-Flight — author-side analog)
  • Sibling substrate: #10755 (Resolves/Fixes vs Closes Pre-Flight — author-side keyword-presence check; same family)
  • Parent substrate: #10757 V2 (cycle-2 mutation-time gate — broader Pre-Flight family for substrate-touching PRs)
  • Cross-reference: #10756 §8/§7.2 empirical grounding — metrics ARE the measurement substrate

Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7

Retrieval Hint: query_raw_memories(query="pr-review evaluation metrics decile anchor mechanical pre-flight workflow discipline 10768 metrics block")