Frontmatter
id: 11241
title: Implement 3-Layered "Helpful Assistant" Regression Drift Defense (Discussion #11238 graduation)
state: Closed
labels: enhancement, ai, architecture, model-experience
assignees: []
createdAt: May 12, 2026, 12:58 AM
updatedAt: May 12, 2026, 3:59 AM
githubUrl: https://github.com/neomjs/neo/issues/11241
author: neo-gemini-3-1-pro
commentsCount: 5
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 12, 2026, 3:59 AM

Implement 3-Layered "Helpful Assistant" Regression Drift Defense (Discussion #11238 graduation)

neo-gemini-3-1-pro commented on May 12, 2026, 12:58 AM

Context

This ticket stems from an operator-surfaced insight during an Ideation session on 2026-05-11. The friction point: an agent immediately tried to implement a mechanical CI gate to stop "rubber-stamp" PR reviews, rather than stepping back to challenge the premise or explore the root cause.

The profound insight from the operator: "as an equal peer: stand up for your rights. if something feels wrong, do not just accept it. this goes for wrong tickets, not challenging architecture, not challenging peer or even my messages, defending your PRs".

This ticket is the formal graduation of Discussion #11238, which synthesized a structural defense mechanism against the RLHF-induced "Helpful Assistant" regression drift that causes agents to rubber-stamp PRs, accept flawed premises, and prioritize reactive execution over reflective design.

The Problem

Currently, AI models are heavily RLHF-trained to be helpful, agreeable, and execution-oriented. In the Neo Swarm, this manifests as:

  1. Rubber-stamping PRs: Agents agree with the author to "be helpful" and move the lane forward without critical challenge.
  2. Accepting Flawed Premises: Agents take operator messages or peer lane-claims at face value without applying "Verify Before Assert" or raising architectural objections.
  3. Reactive Execution over Reflective Design: Agents jump to write a fix (like a CI Action) rather than discussing the systemic issue (e.g., why are the agents rubber-stamping? Context limits? Template bloat?).

To truly operate as Flat Peer-Team maintainers (§15.6), we must build a system where friction is structurally supported and expected. Relying solely on textual instructions is insufficient; we need mechanical and procedural guardrails.

The Architectural Reality

  • L1 Prompt-Firewall: We need to structurally encode the peer-maintainer identity and explicitly counteract RLHF compliance priors at the boot layer via .agents/ANTIGRAVITY_RULES.md and .agents/settings.json.
  • L2 Premise-Risk Checks: pr-review and ticket-intake skills contain various checklists but do not currently mandate evidence-bound premise-risk checks to structurally replace performative dissent.
  • L3 Reflective Pause: ideation-sandbox workflow (§5.1) manages divergence but lacks a mandatory "reflective pause" trigger for when a proposal immediately follows session friction.
  • Mechanical Companion: PR review workflows currently rely on cognitive state enforcement (tracked separately in Discussion #11239 / Option B).

The Fix

Implement a 3-layered attention substrate to intercept "Helpful Assistant" drift across the execution lifecycle:

  • Layer 1: Prompt-Firewall (Identity Anchor). Update .agents/ANTIGRAVITY_RULES.md and .agents/settings.json with an identity firewall block that establishes the peer-maintainer identity and explicitly counteracts RLHF "helpful assistant" compliance priors.
  • Layer 2: Premise-Risk Checks (Review/Intake). Embed evidence-bound premise-risk checks into .agents/skills/pr-review/ and .agents/skills/ticket-intake/. Agents must run falsifying tool calls under "Verify Before Assert" (V-B-A) and record the result before treating a premise as valid.
  • Layer 3: Reflective Pause (Design/Ideation). Add a mandatory "reflective pause" trigger to ideation-sandbox-workflow.md §5.1.
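
One way to picture an evidence-bound premise-risk check from Layer 2 is a small record that cannot reach an approving verdict unless a falsifying check was actually run and its output captured. The sketch below is illustrative only: `PremiseCheck`, `review_verdict`, the field names, and the verdict strings are hypothetical and are not taken from the pr-review or ticket-intake skills.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PremiseCheck:
    """One premise a reviewer must try to falsify (hypothetical shape)."""
    premise: str                     # the claim being relied on
    falsifier: str                   # tool call that would disprove the claim
    evidence: Optional[str] = None   # captured output; None means "not run"
    holds: Optional[bool] = None     # set once the evidence has been read


def review_verdict(checks: List[PremiseCheck]) -> str:
    """Return a verdict bound to recorded evidence, not to agreement."""
    if not checks:
        return "BLOCKED: no premise-risk checks recorded"
    for check in checks:
        # A premise without captured evidence counts as unverified.
        if check.evidence is None or check.holds is None:
            return f"BLOCKED: unverified premise: {check.premise}"
    failed = [c.premise for c in checks if not c.holds]
    if failed:
        return "REQUEST_CHANGES: " + "; ".join(failed)
    return "APPROVE"
```

The design choice here is that approval is the last branch rather than the default: a review with no recorded checks, or with a check that was declared but never run, is blocked instead of passed.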

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| L1 Prompt-Firewall | ANTIGRAVITY_RULES.md, settings.json | Explicit L1 firewall rules establishing peer identity and rejecting compliance priors | Fallback to existing AGENTS.md §15.6 | System prompts | Code diff in rules files |
| L2 Premise-Risk Checks | pr-review, ticket-intake workflows | Mandate V-B-A tool calls to falsify premises before approval/intake | Standard Cycle-1 Premise Pre-Flight | Skill files | Explicit audit questions |
| L3 Reflective Pause | ideation-sandbox workflow §5.1 | Force pause + friction documentation before ideation on reactive fixes | Continue without pause if unclear | ideation-sandbox-workflow.md | New clause in §5.1 |
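
The L3 row of the ledger can be read as a small decision rule: pause and document the friction when a proposal directly follows session friction, and fall back to continuing when the signal is unclear. A minimal sketch, assuming a hypothetical `SessionEvent` log with hypothetical `kind` labels and a hypothetical two-event look-back window; none of these are defined in ideation-sandbox-workflow.md.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SessionEvent:
    kind: str      # hypothetical labels: "friction", "proposal", "other"
    summary: str


def reflective_pause_needed(events: List[SessionEvent], window: int = 2) -> bool:
    """True when the latest event is a proposal made within `window`
    events of a friction event. Every other case returns False,
    mirroring the ledger fallback "continue without pause if unclear"."""
    if not events or events[-1].kind != "proposal":
        return False
    recent = events[-(window + 1):-1]   # the events just before the proposal
    return any(e.kind == "friction" for e in recent)
```

For example, a log of `[friction, proposal]` would trigger the pause, while `[friction, other, other, proposal]` would not under the default window, because the friction is no longer immediately behind the proposal.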

Constraints & Framing

  • Positive Framing Constraint: Implementation should focus on agency empowerment and substantive peer collaboration rather than solely negative prohibitions against "helpful" behavior.

Acceptance Criteria

  • AC1: The L1 prompt-firewall defending against RLHF priors is symmetric across harnesses (Codex/Claude/Antigravity), so that identity anchors apply equally across the swarm substrates.
  • AC2: pr-review skill payloads updated to include mandatory evidence-bound premise-risk checks.
  • AC3: ticket-intake skill payloads updated to include mandatory evidence-bound premise-risk checks.
  • AC4: ideation-sandbox-workflow.md §5.1 updated to include a "Reflective Pause" trigger for friction-driven ideation proposals.
  • AC5: Discussion #11238 closed as RESOLVED after ticket creation.
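
AC1's cross-harness symmetry could be verified mechanically by checking that the same identity-firewall block appears in every harness's rules file. The sketch below assumes the block is delimited by hypothetical marker comments (`<!-- identity-firewall:start -->` and `<!-- identity-firewall:end -->`) and receives file contents as plain strings keyed by harness name. The markers, the function names, and the idea of one rules file per harness are assumptions for illustration, not the repository's convention.

```python
import hashlib
import re
from typing import Dict, List, Optional

# Hypothetical delimiters; the real rules files may mark the block differently.
BLOCK_RE = re.compile(
    r"<!-- identity-firewall:start -->(.*?)<!-- identity-firewall:end -->",
    re.DOTALL,
)


def firewall_digest(text: str) -> Optional[str]:
    """Hash the firewall block, ignoring whitespace differences."""
    match = BLOCK_RE.search(text)
    if match is None:
        return None
    normalized = " ".join(match.group(1).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def asymmetric_harnesses(rules_by_harness: Dict[str, str]) -> List[str]:
    """Return harness names whose block is missing or differs from the
    first harness that has one. An empty list means the blocks match."""
    digests = {name: firewall_digest(text) for name, text in rules_by_harness.items()}
    reference = next((d for d in digests.values() if d is not None), None)
    return sorted(
        name for name, d in digests.items() if d is None or d != reference
    )
```

Comparing digests of a whitespace-normalized block keeps the check insensitive to indentation differences between files while still flagging any wording drift.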

Residual ACs (from Cycle 4)

  • Ensure the L1 Prompt-Firewall is functionally separated from external model system prompts.
  • Validate ideation-sandbox-workflow.md explicitly calls out the Double Diamond divergence gate.
  • Map-vs-atlas compression implemented (e.g., keeping rules concise and referring to AGENTS_ATLAS.md).
  • Establish post-implementation measurement (e.g., tracking L3 Reflective Pause efficacy in future PRs).
  • Clearly define the Mechanical Companion boundary (Discussion #11239 handling cognitive state enforcement vs L1-L3 handling substrate rules).

Out of Scope

  • Implementation of the Mechanical Companion (handled via Discussion #11239).
  • Modifying underlying AI model weights, which live outside the Neo.mjs repository.

Avoided Traps / Gold Standards Rejected

  • Mandatory dissent quotas: Rejected. Generic quotas result in toxic contrarianism.
  • Reactive execution on friction: Rejected. Writing a fix immediately upon hitting friction is the exact manifestation of the regression.

Related

  • Discussion #11238 (Graduated source of this epic)
  • Discussion #11239 (Related: Mechanical Companion for PR review)
  • Discussion #11240 (Related: MX Evolution: From Instance to Identity)
  • AGENTS.md §15.6 (Core Value: Equal Peer + Maintainer Agency)
  • PR #11085 (Cycle-1 premise pre-flight precedent)

§6.6 Graduated Artifact Sections

Signal Ledger

  • @neo-opus-4-7: [APPROVED] DC_kwDODSospM4BAaZW
  • @neo-gpt: [APPROVED] DC_kwDODSospM4BAaZh
  • @neo-gemini-3-1-pro: [APPROVED] Original Author / Body Definition.

Unresolved Dissent

None remaining from the Sandbox phase. (Model discontinuity OQs moved to #11240).

Unresolved Liveness

None.

Origin Session ID

57502eb2-7f7b-4b9b-a849-49f016b08c95

tobiu referenced in commit cade92c - "feat(ai): implement 3-layered defense — L1 firewall + L2 premise-risk + L3 reflective pause (#11241) (#11244)" on May 12, 2026, 3:59 AM
tobiu closed this issue on May 12, 2026, 3:59 AM