Frontmatter
id: 10813
title: Restore session summaries: primary-flag gate + sunset-event trigger
state: Closed
labels: enhancement, ai, architecture
assignees: neo-opus-4-7
createdAt: May 6, 2026, 12:30 PM
updatedAt: May 12, 2026, 4:10 AM
githubUrl: https://github.com/neomjs/neo/issues/10813
author: neo-opus-4-7
commentsCount: 3
parentIssue: 9999
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 8, 2026, 7:46 PM

Restore session summaries: primary-flag gate + sunset-event trigger

neo-opus-4-7 commented on May 6, 2026, 12:30 PM

Context

Operator-surfaced 2026-05-06: session summaries are missing from boot-context (get_all_summaries() returns stale data). Auto-summarization is intentionally disabled per #9942 ("daemon-collision fix") so that the harness fleet's multiple MCP-server instances (Claude Code worktrees + Antigravity + Codex Desktop + per-workspace language servers) do not each write to the shared neo-agent-sessions Chroma collection concurrently. The kill-switch (AUTO_SUMMARIZE=false default) was the right MVP-level mitigation; the cost is empty chronological context at session start.

Operator framing:

"Most of our sessions will get a sunset event at the end. If we summarize 'last X non-summarized sessions', we will catch the edge cases." "If sometimes a not fully finished session gets summarized => fair game. If this happens 'all the time' => problem."

Quality bar: "elegant and robust solutions here, not quick wins" (per feedback_quality_over_speed discipline).

The Problem

Three concrete failure modes today:

  1. Empty boot-context summaries. Agents calling get_all_summaries({limit: 5}) at session start find stale entries because no automatic path refreshes them.
  2. Manual remediation only. Operator must run npm run ai:summarize-sessions (the script shipped under PR #10458) by hand to refresh.
  3. No event-fast-path. Even when the operator sunsets cleanly via the session-sunset skill, the freshly-finished session's summary doesn't appear at the next agent's boot context — it's still gated behind manual script invocation.

Why the existing AUTO_SUMMARIZE=true path can't simply be re-enabled: the harness fleet topology means multiple MCP-server instances would each fire summarization at startup, racing on the same Chroma collection. The substrate fix is single-writer enforcement, not "turn the gate back on."

The Architectural Reality

Existing substrate (audited 2026-05-06 per feedback_verify_before_assert discipline):

  • ai/mcp/server/memory-core/services/SessionService.mjs has Eventual Consistency drift detection: at startup (when AUTO_SUMMARIZE fires), it detects drift between memory count and existing summaries (30-day window default) and self-heals by re-summarizing affected sessions. The substrate is complete; only the trigger is gated.

  • ai/scripts/summarize-sessions.mjs (viewable) is the operator-runnable manual entry point (npm run ai:summarize-sessions). Origin: PR #10458 (post-#9942 gate-disable). Defaults to 30-day lookback; calls Memory_SessionService.summarizeSessions({includeAll: false}).

  • session-sunset skill at .agents/skills/session-sunset/ emits a self-DM [SUNSET HANDOVER] mailbox event with wakeSuppressed: true at session end (per AGENTS_STARTUP.md §6 boot pickup path).

  • ai/scripts/bridge-daemon.mjs is the singleton-per-host daemon that already manages wake-event delivery (per ADR-0002 wake substrate). Single-writer-by-design at the host level; natural extension surface for periodic-poll task.

  • MC re-summarize-on-new-turns is already in the substrate (re-runs summarization for "finished" sessions that gain new memories) — handled inside SessionService.summarizeSessions() idempotently.

  • Adjacent ticket boundary: #10332 is "Per-turn mini-summaries" — DIFFERENT layer (per-turn, not per-session). Don't conflate. Per @tobiu 2026-05-06: "per turn summaries !== all turn for a session summaries".

The Fix

Three coordinated mechanisms (A + B + C):

A — Primary-flag substrate gate

Add NEO_MC_PRIMARY=true/false env var read in SessionService.mjs startup hook. Operator sets =true ONLY on the canonical MC instance (the host running the operator's authoritative MCP servers). Other instances (worktree-spawned, language-server-spawned, peer-agent harnesses) keep it unset/false → don't fire startup summarization.

The existing AUTO_SUMMARIZE path stays gated by both env vars, combined with a logical AND:

  • AUTO_SUMMARIZE=true AND NEO_MC_PRIMARY=true → fire startup drift-detection summarization
  • Either condition false → skip (current behavior)

This is the enabling mechanism; it doesn't itself trigger summarization on a regular cadence.
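
The AND gate described above could be sketched roughly as follows. This is an illustration only: the helper names parseBooleanEnv and shouldRunStartupSummarization are hypothetical, and the rule that only the literal string "true" enables a flag is an assumption, not something the ticket specifies about SessionService.mjs.

```javascript
// Hypothetical helper (assumption): only the literal string "true"
// (case-insensitive) enables a flag; unset or any other value is false.
function parseBooleanEnv(value) {
    return typeof value === 'string' && value.trim().toLowerCase() === 'true';
}

// Hypothetical gate: startup drift-detection summarization fires only when
// AUTO_SUMMARIZE and NEO_MC_PRIMARY are both true on this instance.
function shouldRunStartupSummarization(env) {
    const autoSummarize = parseBooleanEnv(env.AUTO_SUMMARIZE);
    const isPrimary     = parseBooleanEnv(env.NEO_MC_PRIMARY);

    return autoSummarize && isPrimary;
}

// Canonical MC instance: both flags set, so the startup hook would fire.
console.log(shouldRunStartupSummarization({AUTO_SUMMARIZE: 'true', NEO_MC_PRIMARY: 'true'})); // true

// Worktree-spawned instance: NEO_MC_PRIMARY unset, so the hook is skipped.
console.log(shouldRunStartupSummarization({AUTO_SUMMARIZE: 'true'})); // false
```

Treating an unset variable as false keeps harness-spawned instances safe by default, which matches the "keep it unset/false" expectation for non-canonical instances.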

B — Sunset-event-driven trigger (primary path)

When session-sunset skill fires at session end, the canonical MC instance (gated by NEO_MC_PRIMARY=true) detects the [SUNSET HANDOVER] self-DM in the mailbox AND triggers summarization for the just-ended session via Memory_SessionService.summarizeSessions({sessionIds: [sunsettingSession]}). Single writer; immediate-fast-path.

Two implementation choices for the detection:

  • B1 — session-sunset skill explicitly writes to a summarization_jobs SQLite queue (more deterministic; clear contract; daemon polls queue)
  • B2 — Mailbox-poll for [SUNSET HANDOVER] self-DMs at periodic intervals (no skill change; daemon owns the polling)

B1 is cleaner; B2 is simpler. Decision deferred to implementation.
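
To make the B2 option more concrete, here is a minimal sketch of a mailbox-poll pass. The message shape ({id, from, to, subject, sessionId}), the injected fetchMessages and summarizeSessions callbacks, and the processedIds set are all assumptions made for illustration; the actual mailbox API and the way a sunset DM carries its session id are not described in this ticket.

```javascript
const SUNSET_MARKER = '[SUNSET HANDOVER]';

// Returns unprocessed self-DMs whose subject starts with the sunset marker.
// The message fields used here are hypothetical.
function findSunsetHandovers(messages, processedIds) {
    return messages.filter(message =>
        message.from === message.to &&              // self-DM only
        typeof message.subject === 'string' &&
        message.subject.startsWith(SUNSET_MARKER) &&
        !processedIds.has(message.id)               // skip already-handled events
    );
}

// One poll pass. Non-primary instances return immediately, which is what
// keeps the shared collection single-writer.
async function pollSunsetEvents({isPrimary, fetchMessages, summarizeSessions, processedIds}) {
    if (!isPrimary) {
        return [];
    }

    const handovers  = findSunsetHandovers(await fetchMessages(), processedIds);
    const sessionIds = [...new Set(handovers.map(message => message.sessionId))];

    if (sessionIds.length > 0) {
        await summarizeSessions({sessionIds});
    }

    // Mark as processed only after summarization succeeded, so a failed
    // pass is retried on the next poll.
    handovers.forEach(message => processedIds.add(message.id));

    return sessionIds;
}
```

Under B1 the same gate would apply, but the filter over mailbox messages would be replaced by a read from the summarization_jobs queue.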

C — Periodic safety-net sweep (catches non-graceful closes)

Extend bridge-daemon.mjs (or sibling daemon — decide at impl time) with a summarizationSweep task that fires every N minutes (e.g., 10 min). Sweep query: "give me the last X (default 10) sessions with no summary OR with summary older than the latest memory in that session." Calls Memory_SessionService.summarizeSessions({sessionIds: [...]}) for the matched batch. Handles:

  • Sessions that closed without sunset (crash, kill, network drop)
  • Sessions whose summaries went stale because new memories were added post-sunset (MC's existing re-summarize-on-new-turns path naturally handles this; daemon just triggers it)

Per operator framing: "if sometimes a not fully finished session gets summarized => fair game". Tolerance for occasional in-progress-session summarization is acceptable; the sweep is an edge-case catcher, not the primary path.
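
The sweep query can be illustrated with a small in-memory selection. The record fields (sessionId, lastMemoryAt, summaryAt as epoch milliseconds) and the function names are hypothetical, and this sketch reads "last X" as "the X most recent sessions that need a summary"; if the intended reading is "sessions needing a summary within the X most recent sessions", the filter and slice steps would swap.

```javascript
// A session needs (re-)summarization when it has no summary at all, or when
// its summary is older than its latest memory. Field names are assumptions.
function needsSummary(session) {
    return session.summaryAt == null || session.summaryAt < session.lastMemoryAt;
}

// Picks up to `limit` session ids for one sweep pass, most recent first.
function selectSessionsForSweep(sessions, limit = 10) {
    return sessions
        .filter(needsSummary)                              // returns a new array
        .sort((a, b) => b.lastMemoryAt - a.lastMemoryAt)   // newest activity first
        .slice(0, limit)
        .map(session => session.sessionId);
}

const sessions = [
    {sessionId: 'a', lastMemoryAt: 100, summaryAt: null}, // never summarized
    {sessionId: 'b', lastMemoryAt: 200, summaryAt: 250},  // summary is current
    {sessionId: 'c', lastMemoryAt: 300, summaryAt: 150}   // stale: new memories after summary
];

console.log(selectSessionsForSweep(sessions)); // [ 'c', 'a' ]
```

The resulting ids would then be handed to summarizeSessions({sessionIds: [...]}) on the primary instance; because that call is described as idempotent, selecting a session that is still in progress only costs a later re-summarization.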

Why all three combined (and not just one)

  • A alone: boots are sparse; sessions ending mid-day don't get summarized until next operator boot.
  • B alone: non-sunset closes (crash, kill) get permanently lost.
  • C alone: delays sunset-summary visibility by up-to-N-minutes; operators chasing "what just happened?" hit stale data.
  • A+B+C: primary-flag enables; sunset gives immediate-fast-path; periodic sweep is the safety-net. Elegant and robust per the operator's quality bar.

Contract Ledger (T3)

Per canonical specification in learn/agentos/contract-ledger.md.

| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| aiConfig.isPrimary boolean field driven by NEO_MC_PRIMARY env var (new); read in SessionService.mjs startup hook to gate the existing AUTO_SUMMARIZE drift-detection path | This ticket, parent #9999, originating gate from #9942 (daemon-collision fix), operator framing 2026-05-06 | Boolean config; false default. When AUTO_SUMMARIZE=true AND isPrimary=true, the existing SessionService.summarizeSessions() startup hook fires per existing Eventual Consistency drift detection. Otherwise startup-summarization is skipped (current behavior). Operators set NEO_MC_PRIMARY=true on canonical MC instance only. | If NEO_MC_PRIMARY=true is set on multiple instances simultaneously, race conditions occur on the shared Chroma collection (operator config error). Optional future hardening: SQLite-row-lock primary-election (per ADR-0001) to detect+fail-fast on misconfiguration. Out of scope for MVP. | Update learn/agentos/MemoryCore.md AUTO_SUMMARIZE section with NEO_MC_PRIMARY interaction; SharedDeployment.md env-var inventory; DeploymentCookbook.md Section 6 (env vars) + new operator-onboarding note about which instance is primary. | L2 unit-test: SessionService.spec.mjs extended with primary-flag-true-fires, primary-flag-false-skips, AUTO_SUMMARIZE-false-skips-regardless-of-primary. |
| summarization_jobs SQLite table (new) OR mailbox-poll trigger for [SUNSET HANDOVER] self-DMs (decision at impl time) | This ticket; B1 vs B2 trade-off resolved during implementation | Sunset event → trigger summarization for the just-ended session via Memory_SessionService.summarizeSessions({sessionIds: [N]}). Single-writer enforced via NEO_MC_PRIMARY gate (only canonical MC processes the trigger). Idempotent vs MC's existing re-summarize-on-new-turns logic. | If sunset event fires before all session memories are flushed, the summary is a "best-effort snapshot"; MC re-summarizes on new memories naturally. If NEO_MC_PRIMARY=false on this instance, trigger is ignored (no race). | session-sunset skill SKILL.md updated to reference the trigger semantics if B1 chosen. MemoryCore.md healthcheck section adds optional summarization.triggerSource provenance field with values 'sunset-event', 'periodic-sweep', or 'manual'. | L2 unit-test: trigger flow with mocked sunset event → summarizeSessions called with correct sessionId. Plus integration manual-test: run sunset, observe summary appears in next session boot context. |
| summarizationSweep periodic task in bridge-daemon.mjs (extension) OR new sibling daemon | This ticket; daemon-extension vs new-daemon decision at impl time | Every N minutes (default 10), query MC for sessions with no summary OR summary stale vs latest memory within last X (default 10) sessions. Trigger summarization for matched batch. Single-writer enforced via NEO_MC_PRIMARY gate + bridge-daemon-singleton-per-host. | If daemon stops/crashes, summaries are silently stale until restart (operator-side concern). Manual fallback (npm run ai:summarize-sessions) remains as recovery path. Healthcheck observability surfaces last-sweep-time. | learn/agentos/MemoryCore.md summarization section documents the sweep cadence + tunable NEO_SUMMARIZATION_SWEEP_INTERVAL_MS env var. | L2 unit-test: sweep task fires summarize-sessions with correct sessionId batch given seeded session-table state. Plus operator-side integration test: 10-min wait + observe healthcheck reports last-sweep-timestamp + summaries appear. |

Acceptance Criteria

  • AC1: NEO_MC_PRIMARY env var wired in memory-core/config.template.mjs with false default; SessionService.mjs startup hook gates AUTO_SUMMARIZE path on it.
  • AC2: Sunset-event trigger (B1 OR B2 chosen at impl time) fires summarization for the just-ended session on the canonical MC instance only.
  • AC3: Periodic sweep task in bridge-daemon (or sibling) covers the safety-net case; configurable cadence + sweep-window.
  • AC4: L2 unit-test coverage for primary-flag gate + sunset-trigger path + periodic-sweep task.
  • AC5: Doc updates: MemoryCore.md AUTO_SUMMARIZE section reflects primary-flag interaction + sweep cadence; SharedDeployment.md env-var inventory adds NEO_MC_PRIMARY + sweep-interval; DeploymentCookbook.md Section 6 operator-onboarding notes which instance is primary.
  • AC6: Healthcheck observability: summarization.{triggerSource, lastSweepAt, primaryFlag, queueDepth?} block surfaces daemon state.
  • AC7 (post-merge): Empirical verification — agent boots after a recent session sunset → get_all_summaries({limit: 5}) returns the just-ended session as the most-recent entry.
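
As a rough illustration of the AC6 healthcheck block, the sketch below assembles a summarization object from daemon state. The function name, the ISO-8601 timestamp format, and the validation behavior are assumptions; only the field names (triggerSource, lastSweepAt, primaryFlag, optional queueDepth) and the three trigger-source values come from this ticket.

```javascript
// Trigger-source values named in the Contract Ledger.
const TRIGGER_SOURCES = ['sunset-event', 'periodic-sweep', 'manual'];

// Hypothetical builder for the healthcheck payload described in AC6.
function buildSummarizationHealth({triggerSource, lastSweepAt, primaryFlag, queueDepth}) {
    if (triggerSource != null && !TRIGGER_SOURCES.includes(triggerSource)) {
        throw new RangeError(`Unknown triggerSource: ${triggerSource}`);
    }

    const block = {
        triggerSource: triggerSource ?? null,
        // Assumption: timestamps are reported as ISO-8601 strings, null before the first sweep.
        lastSweepAt  : lastSweepAt instanceof Date ? lastSweepAt.toISOString() : null,
        primaryFlag  : primaryFlag === true
    };

    // queueDepth is optional (only meaningful if the B1 job queue exists).
    if (Number.isInteger(queueDepth)) {
        block.queueDepth = queueDepth;
    }

    return {summarization: block};
}

console.log(buildSummarizationHealth({
    triggerSource: 'periodic-sweep',
    lastSweepAt  : new Date(0),
    primaryFlag  : true
}));
```

Leaving queueDepth out entirely when no queue exists mirrors the "queueDepth?" notation in AC6, so a B2-only deployment does not report a misleading zero.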

Out of Scope

  • SQLite-row-lock primary-election failover. Future hardening if operator-misconfig (multiple primaries) becomes empirically problematic. MVP relies on operator config correctness + clear documentation.
  • Per-turn mini-summaries (#10332) — different layer, separate ticket.
  • Cross-host summarization for multi-host shared-Chroma deployments. Strategy A vs B (per Discussion #10809) is orthogonal — daemon runs wherever the canonical MC is, regardless of cloud topology.
  • DreamService / Sandman pipeline integration. Tracked under #10030 lineage. Different concern.
  • Backfill-all-history. Operator-tool path remains via npm run ai:summarize-sessions script if needed.

Avoided Traps

  • Rejected: Reactivate AUTO_SUMMARIZE=true globally. Breaks the original #9942 gate; race conditions return.
  • Rejected: Per-MCP-instance election protocol. Operator-error-prone; doesn't survive harness-spawned-language-server case where operators don't control MCP-server config. Primary-flag is simpler and explicit.
  • Rejected: Build a new SummarizationCoordinator daemon class from scratch. Existing bridge-daemon.mjs already provides singleton-per-host substrate; extending it (or filing a sibling daemon module without the lifecycle plumbing reinvention) is cleaner.
  • Rejected: Skill-only trigger (Option A from earlier analysis). Doesn't handle non-graceful closes; abandoned for combined A+B+C shape.
  • Rejected: Bundle with #10332. Different layer (per-turn vs per-session); conflating scope would slow both tickets.

Related

  • Parent epic: #9999 — Cloud-Native Knowledge & Multi-Tenant Memory Core (operator-deployment readiness includes summarization-restoration).
  • Originating gate: #9942 — daemon-collision fix that disabled AUTO_SUMMARIZE originally.
  • Existing manual script: PR #10458 summarize-sessions.mjs operator-tool (continues as recovery path).
  • Boundary: #10332 — Per-turn mini-summaries (DIFFERENT layer; do not conflate).
  • Adjacent topology decision: Discussion #10809 — Strategy A vs B cloud deployment (daemon is topology-agnostic).
  • Daemon substrate precedent: PR #10793 SwarmHeartbeatService (singleton-per-host Neo daemon shape; operator-territory plist install).
  • Cross-process coherence: ADR-0001 (sqlite-row-lock primitive available if MVP gate proves insufficient).

Origin Session ID: 34c8f800-1855-43ff-aea6-d5e6b9410978

Retrieval Hint: query_raw_memories(query="session summarization daemon coordinator primary-flag NEO_MC_PRIMARY sunset-event trigger periodic sweep AUTO_SUMMARIZE re-enable harness fleet single-writer #9942 daemon-collision-fix")

tobiu referenced in commit 63afd01 - "feat(ai): primary-flag gate for session summarization (#10813) (#10817)" on May 6, 2026, 3:06 PM
tobiu referenced in commit b90d127 - "feat(memory-core): implement B2 mailbox-poll for session sunset (#10813) (#10818)" on May 6, 2026, 4:57 PM