Context
Tonight (2026-05-07) the canonical Memory Core was wiped via an unguarded Playwright unit-test fixture (#10845 / PR #10868 in flight). The 634M backup-2026-05-06T22-51-56.579Z bundle from prior recovery work is now the safety floor. @tobiu's audit of that bundle vs the live .neo-ai-data/ reveals the bundle does not yet capture everything, and there is no canonical npm run ai:restore orchestrator inverting npm run ai:backup.
This ticket pairs two complementary deliverables:
- AC-A: enhance ai:backup to close substrate-coverage gaps and finish the retention TODO carried from #10129.
- AC-B: ship a new ai:restore as a bundle-aware orchestrator routed through the canonical SDK boundary, not the legacy importBackupToSQLite.mjs ad-hoc path.
Per @tobiu 2026-05-07 nightshift directive: substrate-elegant, not quick wins.
The Problem
A. The current bundle misses live state
| Path in .neo-ai-data/ | Size | In bundle? | Status |
| --- | --- | --- | --- |
| chroma/{kb,mc}/ | 2.9G | ❌ (logical JSONL only) | Restore = re-ingestion (slow path) |
| sqlite/memory-core-graph.sqlite | 433M | ❌ (logical JSONL only) | Re-derived from graph JSONL |
| sqlite/sent-to-cull.jsonl | 54K | ❌ | Sent-message archive, touched today |
| neo-sqlite/memory-core.sqlite | 329M | ❌ | Last write Apr 15; bootstrapWorktree.mjs:DATA_SUBDIRS_TO_LINK includes it; defragSQLiteDB.mjs:13 targets a different filename; legacy or active unclear |
| wake-daemon/{lastSyncId,inflight-*.txt,*.log} | <1M | ❌ | Operational state (separate concern) |
| chroma/chroma.sqlite3 (top-level) | 0B | ❌ | Currently empty |
B. No canonical restore orchestrator
- npm run ai:restore does not exist.
- buildScripts/ai/importBackupToSQLite.mjs is a hardcoded one-off (it targets a specific memory-backup-2026-04-07T16-21-32.985Z.jsonl). It writes via SQLiteVectorManager directly, bypassing the Memory_DatabaseService SDK boundary AND any future destructive-op guard from #10845.
- The manageDatabaseImport MCP service exists at ai/mcp/server/memory-core/services/DatabaseService.mjs; KB has an equivalent.
- Restore today requires manual JSONL stitching across subsystems with no integrity validation, no topology awareness, and no clobber protection.
C. ai:backup retention + observability gaps (#10129 closeout TODOs)
- The buildScripts/ai/backup.mjs:137 retention TODO was never implemented.
- No post-write integrity check (row-count parity).
- chromaUnified topology is untested in the backup path. Operator clarification 2026-05-07: the current state is federated (KB Chroma 8000 + MC Chroma 8001 + better-sqlite3 graph); chromaUnified=true is the destination, not the current state.
- copyJsonlSource silently records {copied: 0} for empty subsystems, which is too quiet.
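The "too quiet" case can be separated from the harmless one with a small classification step. The sketch below is illustrative only: `classifyJsonlSource` and `reportSource` are hypothetical names, not functions in backup.mjs, and the caller is assumed to pass `null` when the source directory does not exist and an array of file names otherwise.

```javascript
// Hypothetical helper (not part of backup.mjs): classify a JSONL source directory.
// `entries` is null when the directory is absent, else an array of file names.
function classifyJsonlSource(entries) {
    if (entries === null) {
        // Absent source is expected in fresh environments: stay quiet.
        return {state: 'absent', warn: false};
    }
    const jsonlCount = entries.filter(name => name.endsWith('.jsonl')).length;
    if (jsonlCount === 0) {
        // Directory exists but holds no JSONL: this is the case worth surfacing.
        return {state: 'empty', warn: true};
    }
    return {state: 'populated', warn: false, jsonlCount};
}

// Emits one warning per empty-but-present source; `log` is injectable for tests.
function reportSource(subsystem, entries, log = console.warn) {
    const result = classifyJsonlSource(entries);
    if (result.warn) {
        log(`[ai:backup] ${subsystem}: source dir exists but contains no JSONL; recording {copied: 0}`);
    }
    return result;
}
```

Keeping the classification pure (no filesystem access) makes the absent-versus-empty boundary easy to unit test.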
The Architectural Reality
- buildScripts/ai/backup.mjs:84-143 is the canonical orchestrator (per #10129 Phase 3 peer-arch). It routes through KB_DatabaseService.manageDatabaseBackup and Memory_DatabaseService.manageDatabaseBackup via the ai/services.mjs SDK boundary.
- Bundle layout: .neo-ai-data/backups/backup-<ISO-ts>/{kb,mc,graph,concepts,trajectories}/.
- ai/mcp/server/memory-core/config.mjs:225-228 defines engines.chroma (MC's own at 8001); engines.kb.chroma at 8000 is consulted only when chromaUnified=true.
- ai/mcp/server/memory-core/config.mjs:259-262 defines the collection names (neo-agent-memory, neo-agent-sessions, neo-native-graph).
- ai/mcp/server/memory-core/config.mjs:251-253 defines the storage paths (memory-core-graph.sqlite).
- ai/scripts/bootstrapWorktree.mjs:DATA_SUBDIRS_TO_LINK includes neo-sqlite: it is symlinked across worktrees but not actively written by any production code path (defragSQLiteDB.mjs:13 targets knowledge-graph.sqlite, not memory-core.sqlite).
- The #10844 daily-snapshot pipeline depends on this ticket (it runs ai:backup on a schedule; its restore runbook references this orchestrator).
- The #10845 destructive-op guard is the substrate that AC-B's --mode replace consumes.
The Fix
AC-A — Enhance buildScripts/ai/backup.mjs
- Decide neo-sqlite/memory-core.sqlite: investigate (last-write Apr 15, defrag-script-mismatch) and adopt one of: (a) include in bundle if active, (b) retire from .neo-ai-data/ + remove from bootstrapWorktree.mjs:DATA_SUBDIRS_TO_LINK if dead, (c) document as "intentionally legacy, do not back up" with rationale.
- Cover sqlite/sent-to-cull.jsonl in bundle, OR document explicitly as transient.
- Surface 0B-source warnings: copyJsonlSource returning {copied: 0} should emit a console warning when source dir exists but has no JSONL (vs source-absent which is OK in fresh envs).
- Post-write integrity check: row-count parity for KB + MC + graph; fail bundle on mismatch.
- chromaUnified topology smoke: bundle correct under federated AND unified topologies. Add a bundle-meta.json declaring {chromaUnified, kbChromaCoords, mcChromaCoords, timestamp, neoVersion, gitSha}.
- Retention policy: newest K=3 unconditionally + delete >N=7 days. Env-overridable (NEO_BACKUP_RETAIN_K, NEO_BACKUP_RETAIN_DAYS). Mirrors defragChromaDB.cleanOldBackups semantics at the bundle-directory level.
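The retention rule above (keep the newest K regardless of age, delete the rest only once they are older than N days) can be sketched as a pure selection function. This is a minimal illustration, not the backup.mjs implementation: `parseBundleTime`, `readRetentionConfig`, and `selectBundlesToDelete` are hypothetical names, and the directory-name pattern is inferred from the backup-2026-05-06T22-51-56.579Z example in this ticket. It makes no claim about how defragChromaDB.cleanOldBackups is written.

```javascript
// Assumed bundle dir name shape: backup-<ISO timestamp with ':' replaced by '-'>.
const BUNDLE_NAME = /^backup-(\d{4}-\d{2}-\d{2})T(\d{2})-(\d{2})-(\d{2})\.(\d{3})Z$/;
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns epoch milliseconds, or null for names that do not match the pattern.
function parseBundleTime(name) {
    const m = BUNDLE_NAME.exec(name);
    if (!m) return null;
    const ms = Date.parse(`${m[1]}T${m[2]}:${m[3]}:${m[4]}.${m[5]}Z`);
    return Number.isNaN(ms) ? null : ms;
}

// Reads the env overrides named in this ticket; invalid values fall back to defaults.
function readRetentionConfig(env) {
    const k = Number.parseInt(env.NEO_BACKUP_RETAIN_K ?? '', 10);
    const days = Number.parseInt(env.NEO_BACKUP_RETAIN_DAYS ?? '', 10);
    return {
        keepNewest: Number.isInteger(k) && k >= 0 ? k : 3,
        maxAgeDays: Number.isInteger(days) && days >= 0 ? days : 7
    };
}

// Pure selection: never touches the filesystem, so boundary cases are unit-testable.
function selectBundlesToDelete(names, {now = Date.now(), keepNewest = 3, maxAgeDays = 7} = {}) {
    const parsed = [];
    const skipped = []; // malformed names are reported, never deleted
    for (const name of names) {
        const time = parseBundleTime(name);
        if (time === null) skipped.push(name);
        else parsed.push({name, time});
    }
    parsed.sort((a, b) => b.time - a.time); // newest first
    const cutoff = now - maxAgeDays * DAY_MS;
    const toDelete = parsed
        .slice(keepNewest)                  // the newest K are always kept
        .filter(bundle => bundle.time < cutoff)
        .map(bundle => bundle.name);
    return {toDelete, skipped};
}
```

A caller would pass `readRetentionConfig(process.env)` merged with `{now}` and then remove only the returned `toDelete` directories, logging `skipped`.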
Out-of-scope for AC-A:
- wake-daemon/ operational state (separate concern: live-orchestration recovery).
- Physical Chroma data dir snapshots beyond JSONL exports (defrag-exclusive at dist/chromadb-backups/ per #10129 peer-architecture lockdown).
- Daily scheduled automation (covered by #10844).
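For the post-write integrity check in AC-A, row-count parity can be expressed as a comparison between the counts each subsystem reported at export time and the counts re-read from the written JSONL. The sketch below assumes the orchestrator can obtain both maps; `countJsonlRows`, `checkRowParity`, and `assertRowParity` are hypothetical names, and where the expected counts come from is left open.

```javascript
// Counts non-empty lines in a JSONL document held as a string.
function countJsonlRows(text) {
    return text.split('\n').filter(line => line.trim() !== '').length;
}

// Compares expected row counts (e.g. {kb, mc, graph}) against counts
// re-read from the written bundle. A missing subsystem counts as 0 rows.
function checkRowParity(expected, written) {
    const mismatches = [];
    for (const [subsystem, expectedRows] of Object.entries(expected)) {
        const actual = written[subsystem] ?? 0;
        if (actual !== expectedRows) {
            mismatches.push({subsystem, expected: expectedRows, actual});
        }
    }
    return {ok: mismatches.length === 0, mismatches};
}

// Fails the bundle on any mismatch, as AC-A requires.
function assertRowParity(expected, written) {
    const result = checkRowParity(expected, written);
    if (!result.ok) {
        const detail = result.mismatches
            .map(m => `${m.subsystem}: expected ${m.expected}, wrote ${m.actual}`)
            .join('; ');
        throw new Error(`[ai:backup] row-count parity failed (${detail})`);
    }
    return result;
}
```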
AC-B — New buildScripts/ai/restore.mjs + npm run ai:restore
- Bundle-aware orchestrator: npm run ai:restore -- <bundle-path> reads backup-<ts>/{kb,mc,graph,concepts,trajectories}/, invokes canonical SDK methods (KB_DatabaseService.manageDatabaseImport, Memory_DatabaseService.manageDatabaseImport). NEVER SQLiteVectorManager direct.
- Pre-flight integrity validation BEFORE any write: 5 subdirs present, JSONL parseable, row counts non-zero where expected, bundle-meta.json parsed if present.
- Topology compat check: read chromaUnified from current aiConfig + from bundle-meta.json; warn loudly if mismatched (refuse without --force-topology-mismatch).
- Default mode = merge: idempotent, safe under any target state. No destructive-op guard call needed (but topology + integrity preflight still fire).
- --mode replace: gated on #10845 destructive-op guard. Calls assertDestructiveTargetAllowed({operation, subsystem, mode, target, source, confirmation}) per @neo-gpt's interface design:
  - async typed throw, fail-closed when target classification unresolved.
  - Target descriptor includes collectionName, sqlite path, Chroma host/port/path, bundle path, repo root.
  - Operation/subsystem explicit (e.g. mc.chroma.memory.restore, mc.graph.replace, kb.chroma.replace, restore.replace).
  - Production bypass requires both NEO_ALLOW_PRODUCTION_DESTRUCTIVE=true env AND an explicit operator confirmation token (not just one ambient flag).
- Refuse non-empty target without --force (defense-in-depth above the substrate guard).
- Retire importBackupToSQLite.mjs: delete OR convert to a thin alias-script that delegates to ai:restore -- <path>. No parallel ad-hoc restore path remains.
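The pre-flight and topology steps in the list above could look roughly like the following. This is a sketch under stated assumptions: the bundle is modelled as an in-memory object (`subdirs` maps each subdir to `{fileName: jsonlText}`, `meta` is the parsed bundle-meta.json or `null`), `preflight` and `checkTopology` are hypothetical names, and the "row counts non-zero where expected" rule is left out because the ticket does not say which subsystems must be non-empty.

```javascript
const REQUIRED_SUBDIRS = ['kb', 'mc', 'graph', 'concepts', 'trajectories'];

// Compares the bundle's recorded topology with the current one.
// A missing meta file is treated as an older bundle: warn, then proceed topology-blind.
function checkTopology(currentUnified, bundleMeta, {forceTopologyMismatch = false} = {}) {
    if (!bundleMeta) {
        return {proceed: true, warning: 'bundle-meta.json missing: older bundle, topology-blind validation'};
    }
    if (bundleMeta.chromaUnified === currentUnified) {
        return {proceed: true, warning: null};
    }
    const warning = `topology mismatch: bundle chromaUnified=${bundleMeta.chromaUnified}, current=${currentUnified}`;
    return {proceed: forceTopologyMismatch, warning};
}

// Runs every check before any write and collects all errors instead of stopping at the first.
function preflight(bundle, currentUnified, flags = {}) {
    const errors = [];
    for (const dir of REQUIRED_SUBDIRS) {
        const files = bundle.subdirs[dir];
        if (files === undefined) {
            errors.push(`missing subdir: ${dir}`);
            continue;
        }
        for (const [file, text] of Object.entries(files)) {
            text.split('\n').forEach((line, index) => {
                if (line.trim() === '') return; // tolerate blank lines and trailing newline
                try {
                    JSON.parse(line);
                } catch {
                    errors.push(`${dir}/${file}:${index + 1} is not parseable JSON`);
                }
            });
        }
    }
    const topology = checkTopology(currentUnified, bundle.meta, flags);
    if (!topology.proceed) errors.push(topology.warning);
    return {ok: errors.length === 0, errors, warning: topology.warning};
}
```

The restore orchestrator would call `preflight` first and hand off to the SDK import methods only when `ok` is true, printing `warning` loudly in either case.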
Pre-#10845 fallback: if AC-B ships before #10845 lands, --mode replace MUST be either disabled (errors clearly: "replace mode unavailable until #10845 destructive-op guard ships"), OR call a stub-guard with the same assertDestructiveTargetAllowed(...) contract that is explicitly fail-closed for production-like targets — never a permissive stub.
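A fail-closed stub along the lines of this fallback might look like the sketch below. It is not the #10845 guard: the error class name, the `classifyTarget` helper, and the `sqlitePath` field on the target descriptor are assumptions made for illustration. Only the `assertDestructiveTargetAllowed({operation, subsystem, mode, target, source, confirmation})` call shape comes from this ticket. The stub is deliberately stricter than the planned guard: it has no production bypass at all, and anything it cannot positively classify as disposable is blocked.

```javascript
// Typed error so callers can distinguish a guard refusal from other failures.
class DestructiveOperationBlockedError extends Error {
    constructor(message, details) {
        super(message);
        this.name = 'DestructiveOperationBlockedError';
        this.details = details;
    }
}

// Path-based classification (no env vars). Assumption: the descriptor carries `sqlitePath`.
function classifyTarget(target) {
    if (!target || typeof target.sqlitePath !== 'string') return 'unresolved';
    if (target.sqlitePath === ':memory:') return 'disposable';
    if (target.sqlitePath.split(/[\\/]/).includes('.neo-ai-data')) return 'production-like';
    return 'unresolved'; // unknown paths are never assumed safe
}

// Same contract as the planned guard: async, typed throw, fail-closed.
// `source` and `confirmation` are accepted for contract parity but grant nothing here.
async function assertDestructiveTargetAllowed({operation, subsystem, mode, target, source, confirmation}) {
    const classification = classifyTarget(target);
    if (classification === 'disposable') {
        return {operation, classification};
    }
    throw new DestructiveOperationBlockedError(
        `${operation} blocked (target is ${classification}): replace mode unavailable until #10845 destructive-op guard ships`,
        {operation, subsystem, mode, classification}
    );
}
```

Because the stub ignores `confirmation` and any environment variable, swapping in the real guard later can only widen what is allowed, never narrow it.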
Contract Ledger Matrix
| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| npm run ai:restore -- <bundle> (new public surface) | This ticket | Restores 5 subsystems via canonical SDK; pre-flight validates bundle integrity + topology; default merge, --mode replace calls #10845 guard | Refuses on bundle integrity failure; topology mismatch w/o --force-topology-mismatch; non-empty target w/o --force | learn/agentos/MemoryCore.md runbook section + package.json:scripts.ai:restore JSDoc | Playwright unit tests for happy-path merge, integrity-failure refusal, topology mismatch refusal, force-flag bypass, replace-mode guard call |
| bundle-meta.json (new in-bundle descriptor) | AC-A enhancement to backup | Records {chromaUnified, kbChromaCoords, mcChromaCoords, timestamp, neoVersion, gitSha} at bundle creation | Missing meta = older bundle; restore warns + degrades to topology-blind validation | Inline JSDoc in backup.mjs + restore-runbook | Targeted spec: backup creates the file; restore reads it; mismatch refusal-and-force coverage |
| assertDestructiveTargetAllowed(...) consumption | #10845 substrate guard | Restore --mode replace calls per-operation; never bypasses | If guard absent at code-time: stub with same contract, fail-closed for production-like targets | #10845 docs + this ticket's restore runbook | Spec proves replace blocked under default .neo-ai-data/ paths and allowed under :memory: / disposable paths |
| Retention sweep at .neo-ai-data/backups/ (AC-A) | This ticket + #10129 retention TODO | Newest K=3 unconditionally + delete >N=7 days; env-overridable | Freshest successful backups never deleted; malformed dirs skipped + reported | Inline backup.mjs + operator runbook | Unit cov for retention selection + boundary cases |
| importBackupToSQLite.mjs retire | This ticket | Delete or alias-to-ai:restore | Keep history reference in commit; new artifact does NOT bypass SDK | Removal note in restore-runbook | git rm evidence in PR; verify no remaining direct callers via grep |
Acceptance Criteria
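AC-A (backup enhancements)
- neo-sqlite/memory-core.sqlite decision documented in PR body + reflected in bundle layout (include / retire / explicit-exclude).
- sqlite/sent-to-cull.jsonl either bundled or explicitly documented as transient.
- copyJsonlSource emits a console warning when source dir exists but has no JSONL.
- bundle-meta.json written at bundle creation: {chromaUnified, kbChromaCoords, mcChromaCoords, timestamp, neoVersion, gitSha}.
- Bundle correct under chromaUnified=false (federated, current) and chromaUnified=true (unified, future) topologies.
AC-B (new restore)
- npm run ai:restore -- <bundle> registered in package.json.
- buildScripts/ai/restore.mjs invokes canonical SDK methods exclusively (no SQLiteVectorManager direct).
- Pre-flight integrity validation runs before any write; bundle-meta.json parsed if present.
- Topology mismatch refuses without --force-topology-mismatch.
- --mode merge works idempotently.
- --mode replace calls assertDestructiveTargetAllowed(...) from #10845 guard (or fail-closed stub if guard not yet landed).
- Non-empty target refuses without --force.
- importBackupToSQLite.mjs retired (deleted or aliased to ai:restore).
- Operator runbook in learn/agentos/MemoryCore.md (new section: "Restore from atomic bundle").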
Out of Scope
- Wake-daemon operational state (bridge.log, lastSyncId, inflight-*.txt): separate concern, classify as live-orchestration recovery.
- Daily scheduled automation: covered by #10844.
- Cross-machine backup sync / cloud upload: local-disk only.
- Defrag pre-nuke physical-copy snapshots at dist/chromadb-backups/: peer architecture lockdown per #10129; NOT touched by this ticket.
- Substrate destructive-op guard implementation: that's #10845 (this ticket consumes it).
- Layer 1 stopgap test isolation: that's PR #10868 (Gemini's lane); my AC-B replace mode depends on Layer 1 landing first so unit tests don't re-wipe restored state.
- Canonical config.mjs vs template drift detection: that's PR #10868's scope (Gemini's mental model from #10863).
Avoided Traps
- Implementing restore via direct SQLiteVectorManager like the legacy one-off: bypasses canonical SDK and any future destructive-op guard. Restore must go through Memory_DatabaseService.manageDatabaseImport + KB_DatabaseService.manageDatabaseImport so the substrate guard can fire.
- Env-var-only trust boundary for destructive ops: UNIT_TEST_MODE is not a trust boundary, because npx playwright direct invocations bypass it. The guard is path-based / target-descriptor based, not env-based.
- Permissive stub if #10845 ships later: tempting to write a no-op stub guard so AC-B tests pass before #10845 lands. Forbidden — stub must be fail-closed for production-like targets, otherwise restore re-introduces the wipe vector this whole effort closes.
- Bundling backup + retention + automation in one ticket: retention is tightly coupled with backup itself (this ticket); automation is the scheduled invoker (#10844's scope). Don't conflate.
- Including wake-daemon state in substrate backup: per @neo-gpt's design feedback — operator/process state belongs to a live-orchestration recovery concern, not substrate backup. Keep substrate backup substrate-only.
- Silent neo-sqlite inclusion: 329M legacy artifact added without rationale would be a maintenance footgun. AC requires explicit decision.
Related
- #10129: atomic timestamped backup bundle (parent backup architecture, CLOSED)
- #10844: daily automated snapshot pipeline (depends on this ticket; runbook references this orchestrator)
- #10845: block destructive AI substrate ops on production paths (Layer 2 substrate guard; AC-B --mode replace consumes it)
- #10867 / PR #10868: Layer 1 immediate-stopgap (test-isolation, Gemini's lane); precondition for safely running AC-B unit tests
- #10691: Shared Deployment MVP (parent epic context)
- #10009, #10015, #10127: chromaUnified topology
- #10822: config substrate cleanup epic context
Origin Session ID: 78a3272e-847b-4799-ad6c-ce334464844c
Retrieval Hint: query_raw_memories(query="backup restore parity ai:restore unified bundle topology destructive guard")