Context
Operator-surfaced 2026-05-06: session summaries are missing from boot-context (get_all_summaries() returns stale data) because auto-summarization is intentionally disabled per #9942 ("daemon-collision fix") to prevent the harness fleet's multiple MCP-server instances (Claude Code worktrees + Antigravity + Codex Desktop + per-workspace language servers) from each writing to the shared neo-agent-sessions Chroma collection concurrently. The kill-switch (AUTO_SUMMARIZE=false default) was the right MVP-level mitigation; the cost is empty chronological context at session start.
Operator framing:
"Most of our sessions will get a sunset event at the end. If we summarize 'last X non-summarized sessions', we will catch the edge cases."
"If sometimes a not fully finished session gets summarized => fair game. If this happens 'all the time' => problem."
Quality bar: "elegant and robust solutions here, not quick wins" (per feedback_quality_over_speed discipline).
The Problem
Three concrete failure modes today:
- Empty boot-context summaries. Agents calling
get_all_summaries({limit: 5}) at session start find stale entries because no automatic path refreshes them.
- Manual remediation only. Operator must run
npm run ai:summarize-sessions (the script shipped under PR #10458) by hand to refresh.
- No event-fast-path. Even when the operator sunsets cleanly via the
session-sunset skill, the freshly-finished session's summary doesn't appear at the next agent's boot context — it's still gated behind manual script invocation.
Why the existing AUTO_SUMMARIZE=true path can't simply be re-enabled: the harness fleet topology means multiple MCP-server instances would each fire summarization at startup, racing on the same Chroma collection. The substrate fix is single-writer enforcement, not "turn the gate back on."
The Architectural Reality
Existing substrate (audited 2026-05-06 per feedback_verify_before_assert discipline):
ai/mcp/server/memory-core/services/SessionService.mjs has Eventual Consistency drift detection: at startup (when AUTO_SUMMARIZE fires), it detects drift between memory count and existing summaries (30-day window default), self-heals by re-summarizing affected sessions. Substrate is complete; only the trigger is gated.
ai/scripts/summarize-sessions.mjs (viewable) is the operator-runnable manual entry point (npm run ai:summarize-sessions). Origin: PR #10458 (post-#9942 gate-disable). Defaults to 30-day lookback; calls Memory_SessionService.summarizeSessions({includeAll: false}).
session-sunset skill at .agents/skills/session-sunset/ emits a self-DM [SUNSET HANDOVER] mailbox event with wakeSuppressed: true at session end (per AGENTS_STARTUP.md §6 boot pickup path).
ai/scripts/bridge-daemon.mjs is the singleton-per-host daemon that already manages wake-event delivery (per ADR-0002 wake substrate). Single-writer-by-design at the host level; natural extension surface for periodic-poll task.
MC re-summarize-on-new-turns is already in the substrate (re-runs summarization for "finished" sessions that gain new memories) — handled inside SessionService.summarizeSessions() idempotently.
Adjacent ticket boundary: #10332 is "Per-turn mini-summaries" — DIFFERENT layer (per-turn, not per-session). Don't conflate. Per @tobiu 2026-05-06: "per turn summaries !== all turn for a session summaries".
The Fix
Three coordinated mechanisms (A + B + C):
A — Primary-flag substrate gate
Add NEO_MC_PRIMARY=true/false env var read in SessionService.mjs startup hook. Operator sets =true ONLY on the canonical MC instance (the host running the operator's authoritative MCP servers). Other instances (worktree-spawned, language-server-spawned, peer-agent harnesses) keep it unset/false → don't fire startup summarization.
The existing AUTO_SUMMARIZE path stays gated by both env vars in AND:
AUTO_SUMMARIZE=true AND NEO_MC_PRIMARY=true → fire startup drift-detection summarization
- Either condition false → skip (current behavior)
This is the enabling mechanism; it doesn't itself trigger summarization on a regular cadence.
B — Sunset-event-driven trigger (primary path)
When session-sunset skill fires at session end, the canonical MC instance (gated by NEO_MC_PRIMARY=true) detects the [SUNSET HANDOVER] self-DM in the mailbox AND triggers summarization for the just-ended session via Memory_SessionService.summarizeSessions({sessionIds: [sunsettingSession]}). Single writer; immediate-fast-path.
Two implementation choices for the detection:
- B1 —
session-sunset skill explicitly writes to a summarization_jobs SQLite queue (more deterministic; clear contract; daemon polls queue)
- B2 — Mailbox-poll for
[SUNSET HANDOVER] self-DMs at periodic intervals (no skill change; daemon owns the polling)
B1 is cleaner; B2 is simpler. Decision deferred to implementation.
C — Periodic safety-net sweep (catches non-graceful closes)
Extend bridge-daemon.mjs (or sibling daemon — decide at impl time) with a summarizationSweep task that fires every N minutes (e.g., 10 min). Sweep query: "give me the last X (default 10) sessions with no summary OR with summary older than the latest memory in that session." Calls Memory_SessionService.summarizeSessions({sessionIds: [...]}) for the matched batch. Handles:
- Sessions that closed without sunset (crash, kill, network drop)
- Sessions whose summaries went stale because new memories were added post-sunset (MC's existing re-summarize-on-new-turns path naturally handles this; daemon just triggers it)
Per operator framing: "if sometimes a not fully finished session gets summarized => fair game". Tolerance for occasional in-progress-session summarization is acceptable; the sweep is an edge-case catcher, not the primary path.
Why all three combined (and not just one)
- A alone: boots are sparse; sessions ending mid-day don't get summarized until next operator boot.
- B alone: non-sunset closes (crash, kill) get permanently lost.
- C alone: delays sunset-summary visibility by up-to-N-minutes; operators chasing "what just happened?" hit stale data.
- A+B+C: primary-flag enables; sunset gives immediate-fast-path; periodic sweep is the safety-net. Elegant and robust per the operator's quality bar.
Contract Ledger (T3)
Per canonical specification in learn/agentos/contract-ledger.md.
| Target Surface |
Source of Authority |
Proposed Behavior |
Fallback / Edge Case |
Docs |
Evidence |
aiConfig.isPrimary boolean field driven by NEO_MC_PRIMARY env var (new); read in SessionService.mjs startup hook to gate the existing AUTO_SUMMARIZE drift-detection path |
This ticket, parent #9999, originating gate from #9942 (daemon-collision fix), operator framing 2026-05-06 |
Boolean config; false default. When AUTO_SUMMARIZE=true AND isPrimary=true, the existing SessionService.summarizeSessions() startup hook fires per existing Eventual Consistency drift detection. Otherwise startup-summarization is skipped (current behavior). Operators set NEO_MC_PRIMARY=true on canonical MC instance only. |
If NEO_MC_PRIMARY=true set on multiple instances simultaneously, race conditions on the shared Chroma collection — operator config error. Optional future hardening: SQLite-row-lock primary-election (per ADR-0001) to detect+fail-fast on misconfiguration. Out of scope for MVP. |
Update learn/agentos/MemoryCore.md AUTO_SUMMARIZE section with NEO_MC_PRIMARY interaction; SharedDeployment.md env-var inventory; DeploymentCookbook.md Section 6 (env vars) + new operator-onboarding note about which instance is primary. |
L2 unit-test: SessionService.spec.mjs extended — primary-flag-true-fires, primary-flag-false-skips, AUTO_SUMMARIZE-false-skips-regardless-of-primary. |
summarization_jobs SQLite table (new) OR mailbox-poll trigger for [SUNSET HANDOVER] self-DMs (decision at impl time) |
This ticket; B1 vs B2 trade-off resolved during implementation |
Sunset event → trigger summarization for the just-ended session via Memory_SessionService.summarizeSessions({sessionIds: [N]}). Single-writer enforced via NEO_MC_PRIMARY gate (only canonical MC processes the trigger). Idempotent vs MC's existing re-summarize-on-new-turns logic. |
If sunset event fires before all session memories are flushed, the summary is a "best-effort snapshot" — MC re-summarizes on new memories naturally. If NEO_MC_PRIMARY=false on this instance, trigger is ignored (no race). |
session-sunset skill SKILL.md updated to reference the trigger semantics if B1 chosen. MemoryCore.md healthcheck section adds optional summarization.triggerSource: 'sunset-event' | 'periodic-sweep' | 'manual' provenance field. |
L2 unit-test: trigger flow with mocked sunset event → summarizeSessions called with correct sessionId. Plus integration manual-test: run sunset, observe summary appears in next session boot context. |
summarizationSweep periodic task in bridge-daemon.mjs (extension) OR new sibling daemon |
This ticket; daemon-extension vs new-daemon decision at impl time |
Every N minutes (default 10), query MC for sessions with no summary OR summary stale vs latest memory within last X (default 10) sessions. Trigger summarization for matched batch. Single-writer enforced via NEO_MC_PRIMARY gate + bridge-daemon-singleton-per-host. |
If daemon stops/crashes, summaries silently stale until restart — operator-side concern. Manual fallback (npm run ai:summarize-sessions) remains as recovery path. Healthcheck observability surfaces last-sweep-time. |
learn/agentos/MemoryCore.md summarization section documents the sweep cadence + tunable NEO_SUMMARIZATION_SWEEP_INTERVAL_MS env var. |
L2 unit-test: sweep task fires summarize-sessions with correct sessionId batch given seeded session-table state. Plus operator-side integration test: 10-min wait + observe healthcheck reports last-sweep-timestamp + summaries appear. |
Acceptance Criteria
Out of Scope
- SQLite-row-lock primary-election failover. Future hardening if operator-misconfig (multiple primaries) becomes empirically problematic. MVP relies on operator config correctness + clear documentation.
- Per-turn mini-summaries (#10332) — different layer, separate ticket.
- Cross-host summarization for multi-host shared-Chroma deployments. Strategy A vs B (per Discussion #10809) is orthogonal — daemon runs wherever the canonical MC is, regardless of cloud topology.
- DreamService / Sandman pipeline integration. Tracked under #10030 lineage. Different concern.
- Backfill-all-history. Operator-tool path remains via
npm run ai:summarize-sessions script if needed.
Avoided Traps
- Rejected: Reactivate
AUTO_SUMMARIZE=true globally. Breaks the original #9942 gate; race conditions return.
- Rejected: Per-MCP-instance election protocol. Operator-error-prone; doesn't survive harness-spawned-language-server case where operators don't control MCP-server config. Primary-flag is simpler and explicit.
- Rejected: Build a new SummarizationCoordinator daemon class from scratch. Existing
bridge-daemon.mjs already provides singleton-per-host substrate; extending it (or filing a sibling daemon module without the lifecycle plumbing reinvention) is cleaner.
- Rejected: Skill-only trigger (Option A from earlier analysis). Doesn't handle non-graceful closes; abandoned for combined A+B+C shape.
- Rejected: Bundle with #10332. Different layer (per-turn vs per-session); conflating scope would slow both tickets.
Related
- Parent epic: #9999 — Cloud-Native Knowledge & Multi-Tenant Memory Core (operator-deployment readiness includes summarization-restoration).
- Originating gate: #9942 — daemon-collision fix that disabled
AUTO_SUMMARIZE originally.
- Existing manual script: PR #10458 —
summarize-sessions.mjs operator-tool (continues as recovery path).
- Boundary: #10332 — Per-turn mini-summaries (DIFFERENT layer; do not conflate).
- Adjacent topology decision: Discussion #10809 — Strategy A vs B cloud deployment (daemon is topology-agnostic).
- Daemon substrate precedent: PR #10793
SwarmHeartbeatService (singleton-per-host Neo daemon shape; operator-territory plist install).
- Cross-process coherence: ADR-0001 (sqlite-row-lock primitive available if MVP gate proves insufficient).
Origin Session ID: 34c8f800-1855-43ff-aea6-d5e6b9410978
Retrieval Hint: query_raw_memories(query="session summarization daemon coordinator primary-flag NEO_MC_PRIMARY sunset-event trigger periodic sweep AUTO_SUMMARIZE re-enable harness fleet single-writer #9942 daemon-collision-fix")
Context
Operator-surfaced 2026-05-06: session summaries are missing from boot-context (
get_all_summaries()returns stale data) because auto-summarization is intentionally disabled per #9942 ("daemon-collision fix") to prevent the harness fleet's multiple MCP-server instances (Claude Code worktrees + Antigravity + Codex Desktop + per-workspace language servers) from each writing to the sharedneo-agent-sessionsChroma collection concurrently. The kill-switch (AUTO_SUMMARIZE=falsedefault) was the right MVP-level mitigation; the cost is empty chronological context at session start.Operator framing:
Quality bar: "elegant and robust solutions here, not quick wins" (per
feedback_quality_over_speeddiscipline).The Problem
Three concrete failure modes today:
get_all_summaries({limit: 5})at session start find stale entries because no automatic path refreshes them.npm run ai:summarize-sessions(the script shipped under PR #10458) by hand to refresh.session-sunsetskill, the freshly-finished session's summary doesn't appear at the next agent's boot context — it's still gated behind manual script invocation.Why the existing
AUTO_SUMMARIZE=truepath can't simply be re-enabled: the harness fleet topology means multiple MCP-server instances would each fire summarization at startup, racing on the same Chroma collection. The substrate fix is single-writer enforcement, not "turn the gate back on."The Architectural Reality
Existing substrate (audited 2026-05-06 per
feedback_verify_before_assertdiscipline):ai/mcp/server/memory-core/services/SessionService.mjshas Eventual Consistency drift detection: at startup (when AUTO_SUMMARIZE fires), it detects drift between memory count and existing summaries (30-day window default), self-heals by re-summarizing affected sessions. Substrate is complete; only the trigger is gated.ai/scripts/summarize-sessions.mjs(viewable) is the operator-runnable manual entry point (npm run ai:summarize-sessions). Origin: PR #10458 (post-#9942 gate-disable). Defaults to 30-day lookback; callsMemory_SessionService.summarizeSessions({includeAll: false}).session-sunsetskill at.agents/skills/session-sunset/emits a self-DM[SUNSET HANDOVER]mailbox event withwakeSuppressed: trueat session end (perAGENTS_STARTUP.md§6 boot pickup path).ai/scripts/bridge-daemon.mjsis the singleton-per-host daemon that already manages wake-event delivery (per ADR-0002 wake substrate). Single-writer-by-design at the host level; natural extension surface for periodic-poll task.MC re-summarize-on-new-turns is already in the substrate (re-runs summarization for "finished" sessions that gain new memories) — handled inside
SessionService.summarizeSessions()idempotently.Adjacent ticket boundary: #10332 is "Per-turn mini-summaries" — DIFFERENT layer (per-turn, not per-session). Don't conflate. Per
@tobiu2026-05-06: "per turn summaries!==all turn for a session summaries".The Fix
Three coordinated mechanisms (A + B + C):
A — Primary-flag substrate gate
Add
NEO_MC_PRIMARY=true/falseenv var read inSessionService.mjsstartup hook. Operator sets=trueONLY on the canonical MC instance (the host running the operator's authoritative MCP servers). Other instances (worktree-spawned, language-server-spawned, peer-agent harnesses) keep it unset/false→ don't fire startup summarization.The existing
AUTO_SUMMARIZEpath stays gated by both env vars in AND:AUTO_SUMMARIZE=trueANDNEO_MC_PRIMARY=true→ fire startup drift-detection summarizationThis is the enabling mechanism; it doesn't itself trigger summarization on a regular cadence.
B — Sunset-event-driven trigger (primary path)
When
session-sunsetskill fires at session end, the canonical MC instance (gated byNEO_MC_PRIMARY=true) detects the[SUNSET HANDOVER]self-DM in the mailbox AND triggers summarization for the just-ended session viaMemory_SessionService.summarizeSessions({sessionIds: [sunsettingSession]}). Single writer; immediate-fast-path.Two implementation choices for the detection:
session-sunsetskill explicitly writes to asummarization_jobsSQLite queue (more deterministic; clear contract; daemon polls queue)[SUNSET HANDOVER]self-DMs at periodic intervals (no skill change; daemon owns the polling)B1 is cleaner; B2 is simpler. Decision deferred to implementation.
C — Periodic safety-net sweep (catches non-graceful closes)
Extend
bridge-daemon.mjs(or sibling daemon — decide at impl time) with asummarizationSweeptask that fires every N minutes (e.g., 10 min). Sweep query: "give me the last X (default 10) sessions with no summary OR with summary older than the latest memory in that session." CallsMemory_SessionService.summarizeSessions({sessionIds: [...]})for the matched batch. Handles:Per operator framing: "if sometimes a not fully finished session gets summarized => fair game". Tolerance for occasional in-progress-session summarization is acceptable; the sweep is an edge-case catcher, not the primary path.
Why all three combined (and not just one)
Contract Ledger (T3)
Per canonical specification in
learn/agentos/contract-ledger.md.aiConfig.isPrimaryboolean field driven byNEO_MC_PRIMARYenv var (new); read inSessionService.mjsstartup hook to gate the existing AUTO_SUMMARIZE drift-detection pathfalsedefault. WhenAUTO_SUMMARIZE=trueANDisPrimary=true, the existingSessionService.summarizeSessions()startup hook fires per existing Eventual Consistency drift detection. Otherwise startup-summarization is skipped (current behavior). Operators setNEO_MC_PRIMARY=trueon canonical MC instance only.NEO_MC_PRIMARY=trueset on multiple instances simultaneously, race conditions on the shared Chroma collection — operator config error. Optional future hardening: SQLite-row-lock primary-election (per ADR-0001) to detect+fail-fast on misconfiguration. Out of scope for MVP.learn/agentos/MemoryCore.mdAUTO_SUMMARIZEsection withNEO_MC_PRIMARYinteraction;SharedDeployment.mdenv-var inventory;DeploymentCookbook.mdSection 6 (env vars) + new operator-onboarding note about which instance is primary.SessionService.spec.mjsextended — primary-flag-true-fires, primary-flag-false-skips, AUTO_SUMMARIZE-false-skips-regardless-of-primary.summarization_jobsSQLite table (new) OR mailbox-poll trigger for[SUNSET HANDOVER]self-DMs (decision at impl time)Memory_SessionService.summarizeSessions({sessionIds: [N]}). Single-writer enforced viaNEO_MC_PRIMARYgate (only canonical MC processes the trigger). Idempotent vs MC's existing re-summarize-on-new-turns logic.NEO_MC_PRIMARY=falseon this instance, trigger is ignored (no race).session-sunsetskillSKILL.mdupdated to reference the trigger semantics if B1 chosen.MemoryCore.mdhealthcheck section adds optionalsummarization.triggerSource: 'sunset-event' | 'periodic-sweep' | 'manual'provenance field.summarizeSessionscalled with correct sessionId. Plus integration manual-test: run sunset, observe summary appears in next session boot context.summarizationSweepperiodic task inbridge-daemon.mjs(extension) OR new sibling daemonno summary OR summary stale vs latest memorywithin last X (default 10) sessions. Trigger summarization for matched batch. Single-writer enforced viaNEO_MC_PRIMARYgate + bridge-daemon-singleton-per-host.npm run ai:summarize-sessions) remains as recovery path. Healthcheck observability surfaces last-sweep-time.learn/agentos/MemoryCore.mdsummarization section documents the sweep cadence + tunableNEO_SUMMARIZATION_SWEEP_INTERVAL_MSenv var.Acceptance Criteria
NEO_MC_PRIMARYenv var wired inmemory-core/config.template.mjswithfalsedefault;SessionService.mjsstartup hook gatesAUTO_SUMMARIZEpath on it.MemoryCore.mdAUTO_SUMMARIZE section reflects primary-flag interaction + sweep cadence;SharedDeployment.mdenv-var inventory addsNEO_MC_PRIMARY+ sweep-interval;DeploymentCookbook.mdSection 6 operator-onboarding notes which instance is primary.summarization.{triggerSource, lastSweepAt, primaryFlag, queueDepth?}block surfaces daemon state.get_all_summaries({limit: 5})returns the just-ended session as the most-recent entry.Out of Scope
npm run ai:summarize-sessionsscript if needed.Avoided Traps
AUTO_SUMMARIZE=trueglobally. Breaks the original #9942 gate; race conditions return.bridge-daemon.mjsalready provides singleton-per-host substrate; extending it (or filing a sibling daemon module without the lifecycle plumbing reinvention) is cleaner.Related
AUTO_SUMMARIZEoriginally.summarize-sessions.mjsoperator-tool (continues as recovery path).SwarmHeartbeatService(singleton-per-host Neo daemon shape; operator-territory plist install).Origin Session ID:
34c8f800-1855-43ff-aea6-d5e6b9410978Retrieval Hint:
query_raw_memories(query="session summarization daemon coordinator primary-flag NEO_MC_PRIMARY sunset-event trigger periodic sweep AUTO_SUMMARIZE re-enable harness fleet single-writer #9942 daemon-collision-fix")