Context
Empirical anchor from session 2026-05-05 (workflow-discipline-over-velocity calibration with @tobiu). My PR #10768 review (PRR_kwDODSospM78Iqlj) shipped with all pr-review-guide audit sections (Strategic-Fit Step-Back, Depth Floor, Source-of-Authority, Cross-Skill Integration, Test-Execution Audit, etc.) — but omitted the §3.1 decile-anchor evaluation metrics block ([ARCH_ALIGNMENT] / [CONTENT_COMPLETENESS] / [EXECUTION_QUALITY] / [PRODUCTIVITY] / [IMPACT] / [COMPLEXITY] / [EFFORT_PROFILE]).
@tobiu caught the workflow lapse at merge-time: "hint, evaluation metrics are missing inside your review." I posted a supplemental block post-merge for graph-completeness, but the lapse-then-cleanup pattern is the failure mode — discipline-only gating doesn't reliably fire when agents compress the review template under self-imposed time pressure.
The Problem
The pr-review-guide §3.1 decile-anchor rubric is load-bearing for graph-ingestion — IssueIngestor and downstream cross-cycle measurability depend on the metrics being machine-parseable. A review without the metrics block:
- Has no
[ARCH_ALIGNMENT] / [EXECUTION_QUALITY] / etc. tags for the Retrospective daemon's regex parser
- Provides no per-cycle measurability for cross-family review-quality trends
- Cannot feed into the §8/§7.2 cross-model asymmetry empirical-grounding measurement (#10756 methodology)
Codifying the discipline in the skill manual is necessary but not sufficient — the workflow-discipline-over-velocity lapse pattern (feedback_workflow_discipline_over_velocity.md) shows agents under pressure skip the metrics block even when the codification exists.
The Architectural Reality
.agents/skills/pr-review/references/pr-review-guide.md §3.1 — the existing decile-anchor rubric codification
.agents/skills/pr-review/assets/pr-review-template.md — the asset template that should include the metrics block as a required section
.agents/skills/pr-review/references/audits/ — directory of audit sub-payloads; potentially the right home for a new "evaluation-metrics-pre-flight" audit
ai/daemons/services/IssueIngestor.mjs:327 — the regex parser that picks up [KB_GAP] / [TOOLING_GAP] / [RETROSPECTIVE] tags; the metrics tags follow the same shape and the parser could be extended to surface metric-completeness gaps
The Fix
Option A (lightest): Update pr-review-template.md to include the evaluation-metrics block as a required section with explicit empty-placeholders:
<h3 class="neo-h3" data-record-id="6">📊 Evaluation Metrics</h3>
- **`[ARCH_ALIGNMENT]`**: __ — _justification_
- **`[CONTENT_COMPLETENESS]`**: __ — _justification_
- **`[EXECUTION_QUALITY]`**: __ — _justification_
- **`[PRODUCTIVITY]`**: __ — _justification_
- **`[IMPACT]`**: __ — _justification_
- **`[COMPLEXITY]`**: __ — _justification_
- **`[EFFORT_PROFILE]`**: __ (Quick Win / Mid-effort / Major / N/A)
The placeholder pattern forces the reviewer to either fill in or explicitly mark N/A — empty __ left in the comment is visually obvious as a workflow lapse.
Option B (mechanical gate): Pre-Flight in pull-request-workflow (or a new pr-review-workflow Pre-Flight section) that grep-checks the review-comment body for the 7 metric tags before allowing manage_issue_comment invocation; halt if any are absent without explicit N/A justification.
Option C (graph-side enforcement): Extend IssueIngestor.mjs:327 regex to ALSO capture the metric tags; emit a [METRICS_GAP] graph-ingestion tag for reviews missing metrics so the gap is visible in graph queries even if the review shipped.
Recommendation: ship Option A first (lowest cost, immediate fix), file Option B + Option C as scope-extension follow-ups if Option A doesn't reduce the lapse rate enough.
Acceptance Criteria
Out of Scope
- Option B (mechanical Pre-Flight gate) — file as scope-extension if Option A insufficient
- Option C (graph-side
[METRICS_GAP] tagging) — file as scope-extension if observability gap persists
- Standardizing the per-metric scoring rubric (already in
pr-review-guide §3.1)
- Metrics-driven review-quality measurement (separate scope, ties into #10756 §8/§7.2 empirical grounding)
Avoided Traps
- Mechanical-gate-first: would block legitimate brief comments (e.g., approval acks). Template-shape with placeholders is lighter; mechanical gate is appropriate only if discipline alone proves insufficient.
- Removing the discipline from the skill manual: the codification is necessary even when mechanically enforced, for cross-family agent training on the rubric semantics.
Related
- Empirical anchor: PR #10768 review (mine, missing metrics block); @tobiu calibration 2026-05-05; supplemental block posted post-merge for graph-completeness
- Sibling substrate: #10164 (cross-PR file-collision Pre-Flight — author-side analog)
- Sibling substrate: #10755 (Resolves/Fixes vs Closes Pre-Flight — author-side keyword-presence check; same family)
- Parent substrate: #10757 V2 (cycle-2 mutation-time gate — broader Pre-Flight family for substrate-touching PRs)
- Cross-reference: #10756 §8/§7.2 empirical grounding — metrics ARE the measurement substrate
Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7
Retrieval Hint: query_raw_memories(query="pr-review evaluation metrics decile anchor mechanical pre-flight workflow discipline 10768 metrics block")
Context
Empirical anchor from session 2026-05-05 (workflow-discipline-over-velocity calibration with @tobiu). My PR #10768 review (PRR_kwDODSospM78Iqlj) shipped with all
pr-review-guideaudit sections (Strategic-Fit Step-Back, Depth Floor, Source-of-Authority, Cross-Skill Integration, Test-Execution Audit, etc.) — but omitted the §3.1 decile-anchor evaluation metrics block ([ARCH_ALIGNMENT]/[CONTENT_COMPLETENESS]/[EXECUTION_QUALITY]/[PRODUCTIVITY]/[IMPACT]/[COMPLEXITY]/[EFFORT_PROFILE]).@tobiu caught the workflow lapse at merge-time: "hint, evaluation metrics are missing inside your review." I posted a supplemental block post-merge for graph-completeness, but the lapse-then-cleanup pattern is the failure mode — discipline-only gating doesn't reliably fire when agents compress the review template under self-imposed time pressure.
The Problem
The
pr-review-guide §3.1decile-anchor rubric is load-bearing for graph-ingestion —IssueIngestorand downstream cross-cycle measurability depend on the metrics being machine-parseable. A review without the metrics block:[ARCH_ALIGNMENT]/[EXECUTION_QUALITY]/ etc. tags for the Retrospective daemon's regex parserCodifying the discipline in the skill manual is necessary but not sufficient — the workflow-discipline-over-velocity lapse pattern (
feedback_workflow_discipline_over_velocity.md) shows agents under pressure skip the metrics block even when the codification exists.The Architectural Reality
.agents/skills/pr-review/references/pr-review-guide.md §3.1— the existing decile-anchor rubric codification.agents/skills/pr-review/assets/pr-review-template.md— the asset template that should include the metrics block as a required section.agents/skills/pr-review/references/audits/— directory of audit sub-payloads; potentially the right home for a new "evaluation-metrics-pre-flight" auditai/daemons/services/IssueIngestor.mjs:327— the regex parser that picks up[KB_GAP]/[TOOLING_GAP]/[RETROSPECTIVE]tags; the metrics tags follow the same shape and the parser could be extended to surface metric-completeness gapsThe Fix
Option A (lightest): Update
pr-review-template.mdto include the evaluation-metrics block as a required section with explicit empty-placeholders:<h3 class="neo-h3" data-record-id="6">📊 Evaluation Metrics</h3> - **`[ARCH_ALIGNMENT]`**: __ — _justification_ - **`[CONTENT_COMPLETENESS]`**: __ — _justification_ - **`[EXECUTION_QUALITY]`**: __ — _justification_ - **`[PRODUCTIVITY]`**: __ — _justification_ - **`[IMPACT]`**: __ — _justification_ - **`[COMPLEXITY]`**: __ — _justification_ - **`[EFFORT_PROFILE]`**: __ (Quick Win / Mid-effort / Major / N/A)The placeholder pattern forces the reviewer to either fill in or explicitly mark
N/A— empty__left in the comment is visually obvious as a workflow lapse.Option B (mechanical gate): Pre-Flight in
pull-request-workflow(or a newpr-review-workflowPre-Flight section) that grep-checks the review-comment body for the 7 metric tags before allowingmanage_issue_commentinvocation; halt if any are absent without explicitN/Ajustification.Option C (graph-side enforcement): Extend
IssueIngestor.mjs:327regex to ALSO capture the metric tags; emit a[METRICS_GAP]graph-ingestion tag for reviews missing metrics so the gap is visible in graph queries even if the review shipped.Recommendation: ship Option A first (lowest cost, immediate fix), file Option B + Option C as scope-extension follow-ups if Option A doesn't reduce the lapse rate enough.
Acceptance Criteria
pr-review-template.mdupdated with required Evaluation Metrics section + empty-placeholder pattern (per Option A)pr-review-guide §3.1cross-references the template requirement explicitly (so reviewers know the template's metric block is required, not optional)pr-review-guideflagging "review-comment shipped without evaluation-metrics block" as the failure modeOut of Scope
[METRICS_GAP]tagging) — file as scope-extension if observability gap persistspr-review-guide §3.1)Avoided Traps
Related
Origin Session ID: 23b9cbcd-4938-4a46-b21a-0d48dd12e7e7
Retrieval Hint:
query_raw_memories(query="pr-review evaluation metrics decile anchor mechanical pre-flight workflow discipline 10768 metrics block")