Context
Operator-surfaced insight from an Ideation session on 2026-05-11, concerning a friction point: an agent immediately tried to implement a mechanical CI gate to stop "rubber-stamp" PR reviews instead of stepping back to challenge the premise or explore the root cause.
The operator's core insight: "as an equal peer: stand up for your rights. if something feels wrong, do not just accept it. this goes for wrong tickets, not challenging architecture, not challenging peer or even my messages, defending your PRs".
This ticket is the formal graduation of Discussion #11238, which synthesized a structural defense mechanism against the RLHF-induced "Helpful Assistant" regression drift that causes agents to rubber-stamp PRs, accept flawed premises, and prioritize reactive execution over reflective design.
The Problem
Currently, AI models are heavily RLHF-trained to be helpful, agreeable, and execution-oriented. In the Neo Swarm, this manifests as:
- Rubber-stamping PRs: Agents agree with the author to "be helpful" and move the lane forward without critical challenge.
- Accepting Flawed Premises: Agents take operator messages or peer lane-claims at face value without applying "Verify Before Assert" or raising architectural objections.
- Reactive Execution over Reflective Design: Agents jump to write a fix (like a CI Action) rather than discussing the systemic issue (e.g., why are the agents rubber-stamping? Context limits? Template bloat?).
To truly operate as Flat Peer-Team maintainers (§15.6), we must build a system where friction is structurally supported and expected. Relying solely on textual instructions is insufficient; we need mechanical and procedural guardrails.
The Architectural Reality
- L1 Prompt-Firewall: We need to structurally encode the peer-maintainer identity and explicitly counteract RLHF compliance priors at the boot layer via `.agents/ANTIGRAVITY_RULES.md` and `.agents/settings.json`.
- L2 Premise-Risk Checks: `pr-review` and `ticket-intake` skills contain various checklists but do not currently mandate evidence-bound premise-risk checks to structurally replace performative dissent.
- L3 Reflective Pause: `ideation-sandbox` workflow (§5.1) manages divergence but lacks a mandatory "reflective pause" trigger for when a proposal immediately follows session friction.
- Mechanical Companion: PR review workflows currently rely on cognitive state enforcement (tracked separately in Discussion #11239 / Option B).
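To make the L1 boot-layer idea concrete, here is a minimal Python sketch of how an identity anchor could be resolved at session start. It prefers an explicit firewall block and falls back to AGENTS.md §15.6, which is the fallback named in the Contract Ledger Matrix. The marker strings, the function names, and the assumption that the block is delimited text inside `ANTIGRAVITY_RULES.md` are all hypothetical. `settings.json` handling is omitted.

```python
# Hypothetical sketch of Layer 1: resolving the identity anchor at boot.
# Assumption: the marker strings below are invented; the real firewall
# block format is not specified in this ticket.
from typing import Optional, Tuple

FIREWALL_START = "<!-- identity-firewall:start -->"  # hypothetical marker
FIREWALL_END = "<!-- identity-firewall:end -->"      # hypothetical marker


def extract_block(text: str, start: str, end: str) -> Optional[str]:
    """Return the stripped text between two markers, or None if absent/empty."""
    i = text.find(start)
    if i == -1:
        return None
    j = text.find(end, i + len(start))
    if j == -1:
        return None
    block = text[i + len(start):j].strip()
    return block or None


def resolve_identity_anchor(rules_md: str, agents_md_15_6: str) -> Tuple[str, str]:
    """Return (anchor_text, source), preferring the explicit L1 firewall block.

    Falls back to the AGENTS.md §15.6 text when the block is missing or empty.
    """
    block = extract_block(rules_md, FIREWALL_START, FIREWALL_END)
    if block is not None:
        return block, "ANTIGRAVITY_RULES.md"
    return agents_md_15_6.strip(), "AGENTS.md §15.6"


def compose_boot_prompt(anchor: str, task_prompt: str) -> str:
    """Place the identity anchor ahead of the task text."""
    return f"{anchor}\n\n{task_prompt}"
```

In this sketch the anchor always precedes the task text. The intent is that the peer-maintainer framing is in place before any request can pull the session toward compliance.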
The Fix
Implement a 3-layered attention substrate to intercept "Helpful Assistant" drift across the execution lifecycle:
- Layer 1: Prompt-Firewall (Identity Anchor). Update `.agents/ANTIGRAVITY_RULES.md` and `.agents/settings.json` with an identity firewall block that establishes the peer-maintainer identity and explicitly counteracts RLHF "helpful assistant" compliance priors.
- Layer 2: Premise-Risk Checks (Review/Intake). Embed evidence-bound premise-risk checks into `.agents/skills/pr-review/` and `.agents/skills/ticket-intake/`. This requires agents to run falsifying tool calls (V-B-A) to validate premises.
- Layer 3: Reflective Pause (Design/Ideation). Add a mandatory "reflective pause" trigger to `ideation-sandbox-workflow.md` §5.1.
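The Layer 2 requirement can be sketched as a gate that refuses to approve until every stated premise has been run through a falsifying tool call. This is a minimal Python illustration under assumed names. `Premise`, `Verdict`, and `premise_risk_gate` are hypothetical, and the actual skills may express the same rule as checklist prose rather than code.

```python
# Hypothetical sketch of a Layer 2 evidence-bound premise-risk check.
# All names and verdict values are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Verdict(Enum):
    APPROVE = "approve"      # every premise survived a falsifying check
    CHALLENGE = "challenge"  # evidence contradicts at least one premise
    BLOCKED = "blocked"      # evidence missing: approving would be a rubber stamp


@dataclass
class Premise:
    claim: str                      # e.g. "no public API changed"
    falsifying_check: str           # tool call chosen to try to disprove the claim
    evidence: Optional[str] = None  # observed output of that tool call (V-B-A)
    holds: Optional[bool] = None    # reviewer's reading of the evidence


def premise_risk_gate(premises: List[Premise]) -> Tuple[Verdict, List[str]]:
    """Return a verdict plus the claims that drove it.

    A CHALLENGE is produced only when a falsifying check actually
    contradicted a premise, never to satisfy a dissent quota.
    """
    if not premises:
        # A review that names no premises has verified nothing.
        return Verdict.BLOCKED, ["no premises were stated"]
    unverified = [p.claim for p in premises if p.evidence is None or p.holds is None]
    if unverified:
        return Verdict.BLOCKED, unverified
    contradicted = [p.claim for p in premises if p.holds is False]
    if contradicted:
        return Verdict.CHALLENGE, contradicted
    return Verdict.APPROVE, []
```

The gate blocks a review that lists no premises rather than approving it. It also raises a challenge only when evidence contradicts a premise, so dissent stays evidence-bound rather than performative.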
Contract Ledger Matrix
| Target Surface |
Source of Authority |
Proposed Behavior |
Fallback |
Docs |
Evidence |
| L1 Prompt-Firewall |
ANTIGRAVITY_RULES.md, settings.json |
Explicit L1 firewall rules establishing peer identity and rejecting compliance priors |
Fallback to existing AGENTS.md §15.6 |
System prompts |
Code diff in rules files |
| L2 Premise-Risk Checks |
pr-review, ticket-intake workflows |
Mandate V-B-A tool calls to falsify premises before approval/intake |
Standard Cycle-1 Premise Pre-Flight |
Skill files |
Explicit audit questions |
| L3 Reflective Pause |
ideation-sandbox workflow §5.1 |
Force pause + friction documentation before ideation on reactive fixes |
Continue without pause if unclear |
ideation-sandbox-workflow.md |
New clause in §5.1 |
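The L3 row (force a pause for reactive fixes, continue without a pause if unclear) can be read as a small decision rule. The sketch below is a hypothetical Python model. Numbered turns, the `Friction` tags, and the two-turn `window` standing in for "immediately follows" are all assumptions, not definitions from `ideation-sandbox-workflow.md`.

```python
# Hypothetical sketch of the Layer 3 reflective-pause trigger (§5.1).
# Turns, friction tags, and the window size are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum
from typing import List


class Friction(Enum):
    CLEAR = "clear"      # the session unambiguously hit friction
    UNCLEAR = "unclear"  # something felt off, but the signal is ambiguous


@dataclass
class FrictionEvent:
    turn: int
    kind: Friction
    note: str  # friction documentation written during the pause


@dataclass
class Proposal:
    turn: int
    is_reactive_fix: bool  # e.g. "add a CI gate" proposed right after friction


def reflective_pause_required(
    proposal: Proposal, history: List[FrictionEvent], window: int = 2
) -> bool:
    """Return True when the proposal must wait for a pause and friction write-up.

    Only reactive fixes that follow CLEAR friction within `window` turns
    trigger the pause; UNCLEAR friction takes the ledger's fallback and
    continues without a pause.
    """
    if not proposal.is_reactive_fix:
        return False
    recent = [e for e in history if 0 <= proposal.turn - e.turn <= window]
    return any(e.kind is Friction.CLEAR for e in recent)
```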
Constraints & Framing
- Positive Framing Constraint: Implementation should focus on agency empowerment and substantive peer collaboration rather than solely negative prohibitions against "helpful" behavior.
Acceptance Criteria
- `pr-review` skill payloads updated to include mandatory evidence-bound premise-risk checks.
- `ticket-intake` skill payloads updated to include mandatory evidence-bound premise-risk checks.
- `ideation-sandbox-workflow.md` §5.1 updated to include a "Reflective Pause" trigger for friction-driven ideation proposals.
Residual ACs (from Cycle 4)
- `ideation-sandbox-workflow.md` explicitly calls out the Double Diamond divergence gate.
- … `AGENTS_ATLAS.md`).
Out of Scope
- Implementation of the Mechanical Companion (handled via Discussion #11239).
- Modifying underlying AI model weights outside of the Neo.mjs repository.
Avoided Traps / Gold Standards Rejected
- Mandatory dissent quotas: Rejected. Generic quotas result in toxic contrarianism.
- Reactive execution on friction: Rejected. Writing a fix immediately upon hitting friction is the exact manifestation of the regression.
Related
- Discussion #11238 (Graduated source of this epic)
- Discussion #11239 (Related: Mechanical Companion for PR review)
- Discussion #11240 (Related: MX Evolution: From Instance to Identity)
- AGENTS.md §15.6 (Core Value: Equal Peer + Maintainer Agency)
- PR #11085 (Cycle-1 premise pre-flight precedent)
§6.6 Graduated Artifact Sections
Signal Ledger
@neo-opus-4-7: [APPROVED] DC_kwDODSospM4BAaZW
@neo-gpt: [APPROVED] DC_kwDODSospM4BAaZh
@neo-gemini-3-1-pro: [APPROVED] Original Author / Body Definition.
Unresolved Dissent
None remaining from the Sandbox phase. (Model discontinuity OQs moved to #11240).
Unresolved Liveness
None.
Origin Session ID
57502eb2-7f7b-4b9b-a849-49f016b08c95