Frontmatter
id: 10871
title: Backup/restore parity: enhance ai:backup + new ai:restore
state: Closed
labels: enhancement, ai, architecture, model-experience
assignees: neo-opus-4-7
createdAt: May 7, 2026, 3:58 AM
updatedAt: May 7, 2026, 2:39 PM
githubUrl: https://github.com/neomjs/neo/issues/10871
author: neo-opus-4-7
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 7, 2026, 1:12 PM

Backup/restore parity: enhance ai:backup + new ai:restore

neo-opus-4-7 commented on May 7, 2026, 3:58 AM

Context

Tonight (2026-05-07) the canonical Memory Core was wiped via an unguarded Playwright unit-test fixture (#10845 / PR #10868 in flight). The 634M backup-2026-05-06T22-51-56.579Z bundle from prior recovery work is now the safety floor. @tobiu's audit of that bundle vs the live .neo-ai-data/ reveals the bundle does not yet capture everything, and there is no canonical npm run ai:restore orchestrator inverting npm run ai:backup.

This ticket pairs two complementary deliverables:

  • AC-A: enhance ai:backup to close substrate-coverage gaps and finish the retention TODO carried from #10129.
  • AC-B: ship a new ai:restore as a bundle-aware orchestrator routed through the canonical SDK boundary, not the legacy importBackupToSQLite.mjs ad-hoc path.

Per @tobiu 2026-05-07 nightshift directive: substrate-elegant, not quick wins.

The Problem

A. The current bundle misses live state

| Path in .neo-ai-data/ | Size | In bundle? | Status |
|---|---|---|---|
| chroma/{kb,mc}/ | 2.9G | ❌ (logical JSONL only) | Restore = re-ingestion (slow path) |
| sqlite/memory-core-graph.sqlite | 433M | ❌ (logical JSONL only) | Re-derived from graph JSONL |
| sqlite/sent-to-cull.jsonl | 54K |  | Sent-message archive, touched today |
| neo-sqlite/memory-core.sqlite | 329M |  | Last write Apr 15; bootstrapWorktree.mjs:DATA_SUBDIRS_TO_LINK includes it; defragSQLiteDB.mjs:13 targets a different filename; legacy or active unclear |
| wake-daemon/{lastSyncId,inflight-*.txt,*.log} | <1M |  | Operational state (separate concern) |
| chroma/chroma.sqlite3 (top-level) | 0B |  | Currently empty |

B. No canonical restore orchestrator

  • npm run ai:restore does not exist.
  • buildScripts/ai/importBackupToSQLite.mjs is a hardcoded one-off (targets a specific memory-backup-2026-04-07T16-21-32.985Z.jsonl) that writes directly through SQLiteVectorManager, bypassing both the Memory_DatabaseService SDK boundary and any future destructive-op guard from #10845.
  • manageDatabaseImport MCP service exists at ai/mcp/server/memory-core/services/DatabaseService.mjs; KB has equivalent.
  • Restore today requires manual JSONL stitching across subsystems with no integrity validation, no topology awareness, no clobber protection.

C. ai:backup retention + observability gaps (#10129 closeout TODOs)

  • buildScripts/ai/backup.mjs:137 retention TODO never implemented.
  • No post-write integrity check (row-count parity).
  • chromaUnified topology untested in backup path. Operator clarification 2026-05-07: current state is federated (KB Chroma 8000 + MC Chroma 8001 + better-sqlite3 graph); chromaUnified=true is destination, not current.
  • copyJsonlSource silently records {copied: 0} for empty subsystems — too quiet.
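The last bullet turns on one distinction: a source directory that is absent (fine in a fresh environment) versus one that exists but holds no JSONL (suspicious). A minimal sketch of that classification follows. The function names and the input shape ({subsystem, sourceExists, jsonlFiles}) are hypothetical and are not taken from backup.mjs; the real copyJsonlSource may be structured differently.

```javascript
// Hypothetical helper, not the actual copyJsonlSource implementation.
// `sourceExists` and `jsonlFiles` stand in for the result of a directory scan.
function classifyJsonlSource({subsystem, sourceExists, jsonlFiles}) {
    if (!sourceExists) {
        // Fresh environments legitimately have no source directory yet.
        return {subsystem, copied: 0, level: 'info', message: `${subsystem}: source absent, nothing to copy`};
    }
    if (jsonlFiles.length === 0) {
        // The directory exists but is empty: this is the case that is too quiet today.
        return {subsystem, copied: 0, level: 'warn', message: `${subsystem}: source exists but contains no JSONL files`};
    }
    return {subsystem, copied: jsonlFiles.length, level: 'ok', message: `${subsystem}: copied ${jsonlFiles.length} file(s)`};
}

// Emits a console warning only for the suspicious case and passes the result through.
function report(result) {
    if (result.level === 'warn') {
        console.warn(`[ai:backup] WARNING ${result.message}`);
    }
    return result;
}
```

Both the absent and the empty case still record copied: 0, so the existing result shape could stay compatible; only the level differs.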

The Architectural Reality

  • buildScripts/ai/backup.mjs:84-143 is the canonical orchestrator (per #10129 Phase 3 peer-arch). Routes through KB_DatabaseService.manageDatabaseBackup and Memory_DatabaseService.manageDatabaseBackup via ai/services.mjs SDK boundary.
  • Bundle layout: .neo-ai-data/backups/backup-<ISO-ts>/{kb,mc,graph,concepts,trajectories}/.
  • ai/mcp/server/memory-core/config.mjs:225-228 defines engines.chroma (MC's own at 8001); engines.kb.chroma at 8000 consulted only when chromaUnified=true.
  • ai/mcp/server/memory-core/config.mjs:259-262 collection names (neo-agent-memory, neo-agent-sessions, neo-native-graph).
  • ai/mcp/server/memory-core/config.mjs:251-253 storage paths (memory-core-graph.sqlite).
  • ai/scripts/bootstrapWorktree.mjs:DATA_SUBDIRS_TO_LINK includes neo-sqlite — symlinked across worktrees but not actively written by any production code path (defragSQLiteDB.mjs:13 targets knowledge-graph.sqlite, not memory-core.sqlite).
  • #10844 daily-snapshot pipeline depends on this ticket (it runs ai:backup on a schedule; the restore runbook references this orchestrator).
  • #10845 destructive-op guard is the substrate AC-B's --mode replace consumes.

The Fix

AC-A — Enhance buildScripts/ai/backup.mjs

  1. Decide neo-sqlite/memory-core.sqlite: investigate (last-write Apr 15, defrag-script-mismatch) and adopt one of: (a) include in bundle if active, (b) retire from .neo-ai-data/ + remove from bootstrapWorktree.mjs:DATA_SUBDIRS_TO_LINK if dead, (c) document as "intentionally legacy, do not back up" with rationale.
  2. Cover sqlite/sent-to-cull.jsonl in bundle, OR document explicitly as transient.
  3. Surface 0B-source warnings: copyJsonlSource returning {copied: 0} should emit a console warning when source dir exists but has no JSONL (vs source-absent which is OK in fresh envs).
  4. Post-write integrity check: row-count parity for KB + MC + graph; fail bundle on mismatch.
  5. chromaUnified topology smoke: bundle correct under federated AND unified topologies. Add a bundle-meta.json declaring {chromaUnified, kbChromaCoords, mcChromaCoords, timestamp, neoVersion, gitSha}.
  6. Retention policy: newest K=3 unconditionally + delete >N=7 days. Env-overridable (NEO_BACKUP_RETAIN_K, NEO_BACKUP_RETAIN_DAYS). Mirrors defragChromaDB.cleanOldBackups semantics at the bundle-directory level.
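The retention rule in item 6 can be sketched as a pure selection function: keep the newest K bundles regardless of age, then delete any remaining bundle older than N days. This is a sketch under stated assumptions, not the backup.mjs implementation. It assumes bundle directories are named backup-<ISO-ts> with colons replaced by hyphens (as in backup-2026-05-06T22-51-56.579Z); the function names are hypothetical, and it does not claim to reproduce defragChromaDB.cleanOldBackups.

```javascript
// Matches names like backup-2026-05-06T22-51-56.579Z (assumed naming scheme).
const BUNDLE_NAME = /^backup-(\d{4}-\d{2}-\d{2})T(\d{2})-(\d{2})-(\d{2})\.(\d{3})Z$/;

// Returns the bundle timestamp in epoch milliseconds, or null for malformed names.
function parseBundleTimestamp(name) {
    const m = BUNDLE_NAME.exec(name);
    if (!m) return null;
    const ms = Date.parse(`${m[1]}T${m[2]}:${m[3]}:${m[4]}.${m[5]}Z`);
    return Number.isNaN(ms) ? null : ms;
}

// Parses an env override; falls back to the default on missing or invalid input.
function readNonNegativeInt(value, fallback) {
    const n = Number.parseInt(value, 10);
    return Number.isInteger(n) && n >= 0 ? n : fallback;
}

// Hypothetical selector: decides which bundle directories a sweep would delete.
function selectBundlesToDelete(names, {now = Date.now(), env = {}} = {}) {
    const keepNewest = readNonNegativeInt(env.NEO_BACKUP_RETAIN_K, 3);
    const maxAgeDays = readNonNegativeInt(env.NEO_BACKUP_RETAIN_DAYS, 7);
    const cutoff = now - maxAgeDays * 24 * 60 * 60 * 1000;
    const skipped = []; // malformed directory names: reported, never deleted
    const dated = [];
    for (const name of names) {
        const ts = parseBundleTimestamp(name);
        if (ts === null) skipped.push(name);
        else dated.push({name, ts});
    }
    dated.sort((a, b) => b.ts - a.ts); // newest first
    const toDelete = dated
        .slice(keepNewest)            // the newest K are kept unconditionally
        .filter(b => b.ts < cutoff)   // of the rest, only those past the age limit go
        .map(b => b.name);
    return {toDelete, skipped};
}
```

Keeping the selection free of filesystem calls makes the boundary cases (fewer than K bundles, malformed names, env overrides) straightforward to unit test; the caller would pass process.env and perform the actual deletion.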

Out-of-scope for AC-A:

  • wake-daemon/ operational state (separate concern: live-orchestration recovery).
  • Physical Chroma data dir snapshots beyond JSONL exports (defrag-exclusive at dist/chromadb-backups/ per #10129 peer-architecture lockdown).
  • Daily scheduled automation (covered by #10844).

AC-B — New buildScripts/ai/restore.mjs + npm run ai:restore

  1. Bundle-aware orchestrator: npm run ai:restore -- <bundle-path> reads backup-<ts>/{kb,mc,graph,concepts,trajectories}/, invokes canonical SDK methods (KB_DatabaseService.manageDatabaseImport, Memory_DatabaseService.manageDatabaseImport). NEVER SQLiteVectorManager direct.
  2. Pre-flight integrity validation BEFORE any write: 5 subdirs present, JSONL parseable, row counts non-zero where expected, bundle-meta.json parsed if present.
  3. Topology compat check: read chromaUnified from current aiConfig + from bundle-meta.json; warn loudly if mismatched (refuse without --force-topology-mismatch).
  4. Default mode = merge: idempotent, safe under any target state. No destructive-op guard call needed (but topology + integrity preflight still fire).
  5. --mode replace: gated on #10845 destructive-op guard. Calls assertDestructiveTargetAllowed({operation, subsystem, mode, target, source, confirmation}) per @neo-gpt's interface design:
    • async typed throw, fail-closed when target classification unresolved.
    • Target descriptor includes collectionName, sqlite path, Chroma host/port/path, bundle path, repo root.
    • Operation/subsystem explicit (e.g. mc.chroma.memory.restore, mc.graph.replace, kb.chroma.replace, restore.replace).
    • Production bypass requires both NEO_ALLOW_PRODUCTION_DESTRUCTIVE=true env AND an explicit operator confirmation token (not just one ambient flag).
  6. Refuse non-empty target without --force (defense-in-depth above the substrate guard).
  7. Retire importBackupToSQLite.mjs: delete OR convert to a thin alias-script that delegates to ai:restore -- <path>. No parallel ad-hoc restore path remains.
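The pre-flight validation in item 2 could look roughly like the sketch below. To keep it self-contained it works on an in-memory representation of the bundle ({subdirs: {name: jsonlText}, metaJson}), which is a hypothetical shape; the real orchestrator would read these from disk. Which subsystems count as "non-zero where expected" is also an assumption here (kb, mc, graph).

```javascript
const REQUIRED_SUBDIRS = ['kb', 'mc', 'graph', 'concepts', 'trajectories'];

// Counts JSONL rows; throws on the first line that is not valid JSON.
function countJsonlRows(text) {
    let rows = 0;
    const lines = text.split('\n');
    for (let i = 0; i < lines.length; i++) {
        const line = lines[i].trim();
        if (line === '') continue; // tolerate blank lines and a trailing newline
        try {
            JSON.parse(line);
        } catch (err) {
            throw new Error(`line ${i + 1} is not valid JSON`);
        }
        rows++;
    }
    return rows;
}

// Hypothetical pre-flight check: collects every problem before any write happens.
function preflightBundle(bundle, {expectNonEmpty = ['kb', 'mc', 'graph']} = {}) {
    const errors = [];
    const warnings = [];
    const rowCounts = {};
    for (const dir of REQUIRED_SUBDIRS) {
        if (typeof bundle.subdirs[dir] !== 'string') {
            errors.push(`missing subdir: ${dir}`);
            continue;
        }
        try {
            rowCounts[dir] = countJsonlRows(bundle.subdirs[dir]);
        } catch (err) {
            errors.push(`${dir}: ${err.message}`);
            continue;
        }
        if (rowCounts[dir] === 0 && expectNonEmpty.includes(dir)) {
            errors.push(`${dir}: expected rows, found none`);
        }
    }
    let meta = null;
    if (bundle.metaJson == null) {
        // Older bundles predate bundle-meta.json: warn, do not fail.
        warnings.push('bundle-meta.json missing: treating as older bundle');
    } else {
        try {
            meta = JSON.parse(bundle.metaJson);
        } catch (err) {
            errors.push('bundle-meta.json is not valid JSON');
        }
    }
    return {ok: errors.length === 0, errors, warnings, rowCounts, meta};
}
```

Collecting all errors instead of throwing on the first one lets the operator see the full state of a damaged bundle in a single run.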

Pre-#10845 fallback: if AC-B ships before #10845 lands, --mode replace MUST be either disabled (errors clearly: "replace mode unavailable until #10845 destructive-op guard ships"), OR call a stub-guard with the same assertDestructiveTargetAllowed(...) contract that is explicitly fail-closed for production-like targets — never a permissive stub.
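A fail-closed stub with the same call contract might be sketched as follows. Everything beyond the assertDestructiveTargetAllowed({operation, subsystem, mode, target, source, confirmation}) signature is an assumption: the error class, the classifyTarget helper, the target.sqlitePath field name, and the rule that only :memory: and /tmp/ paths count as disposable are illustrative placeholders, not the #10845 design.

```javascript
// Typed error so callers can distinguish guard refusals from other failures.
class DestructiveTargetError extends Error {
    constructor(message, details) {
        super(message);
        this.name = 'DestructiveTargetError';
        this.details = details;
    }
}

// Hypothetical classifier: anything it cannot positively identify is 'unresolved'.
function classifyTarget(target) {
    if (!target || typeof target.sqlitePath !== 'string') return 'unresolved';
    if (target.sqlitePath === ':memory:') return 'disposable';
    const p = target.sqlitePath.replace(/\\/g, '/');
    if (p.startsWith('.neo-ai-data/') || p.includes('/.neo-ai-data/')) return 'production';
    if (p.split('/').includes('..')) return 'unresolved'; // no traversal tricks
    if (p.startsWith('/tmp/')) return 'disposable';       // assumed disposable root
    return 'unresolved';
}

// Stub guard: allows only disposable targets. It has no production bypass at all,
// so `source` and `confirmation` are accepted for contract parity but ignored.
async function assertDestructiveTargetAllowed({operation, subsystem, mode, target, source, confirmation}) {
    const classification = classifyTarget(target);
    if (classification === 'disposable') {
        return {operation, subsystem, mode, classification};
    }
    throw new DestructiveTargetError(
        `${operation} refused: target classified as ${classification}`,
        {operation, subsystem, mode, classification}
    );
}
```

Because the stub rejects both 'production' and 'unresolved', swapping in the real guard later can only widen what is allowed, never narrow it.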

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
|---|---|---|---|---|---|
| npm run ai:restore -- <bundle> (new public surface) | This ticket | Restores 5 subsystems via canonical SDK; pre-flight validates bundle integrity + topology; default merge, --mode replace calls #10845 guard | Refuses on bundle integrity failure; topology mismatch w/o --force-topology-mismatch; non-empty target w/o --force | learn/agentos/MemoryCore.md runbook section + package.json:scripts.ai:restore JSDoc | Playwright unit tests for happy-path merge, integrity-failure refusal, topology mismatch refusal, force-flag bypass, replace-mode guard call |
| bundle-meta.json (new in-bundle descriptor) | AC-A enhancement to backup | Records {chromaUnified, kbChromaCoords, mcChromaCoords, timestamp, neoVersion, gitSha} at bundle creation | Missing meta = older bundle; restore warns + degrades to topology-blind validation | Inline JSDoc in backup.mjs + restore-runbook | Targeted spec: backup creates the file; restore reads it; mismatch refusal-and-force coverage |
| assertDestructiveTargetAllowed(...) consumption | #10845 substrate guard | Restore --mode replace calls per-operation; never bypasses | If guard absent at code-time: stub with same contract, fail-closed for production-like targets | #10845 docs + this ticket's restore runbook | Spec proves replace blocked under default .neo-ai-data/ paths and allowed under :memory: / disposable paths |
| Retention sweep at .neo-ai-data/backups/ (AC-A) | This ticket + #10129 retention TODO | Newest K=3 unconditionally + delete >N=7 days; env-overridable | Freshest successful backups never deleted; malformed dirs skipped + reported | Inline backup.mjs + operator runbook | Unit cov for retention selection + boundary cases |
| importBackupToSQLite.mjs retire | This ticket | Delete or alias-to-ai:restore | Keep history reference in commit; new artifact does NOT bypass SDK | Removal note in restore-runbook | git rm evidence in PR; verify no remaining direct callers via grep |

Acceptance Criteria

AC-A (backup enhancements)

  • neo-sqlite/memory-core.sqlite decision documented in PR body + reflected in bundle layout (include / retire / explicit-exclude).
  • sqlite/sent-to-cull.jsonl either bundled or explicitly documented as transient.
  • copyJsonlSource emits a console warning when source dir exists but has no JSONL.
  • Post-write integrity check (row-count parity) for KB + MC + graph; fails bundle on mismatch.
  • bundle-meta.json written at bundle creation: {chromaUnified, kbChromaCoords, mcChromaCoords, timestamp, neoVersion, gitSha}.
  • Retention policy implemented: K=3 + N=7 days defaults, env-overridable.
  • Backup verified under both chromaUnified=false (federated, current) and chromaUnified=true (unified, future) topologies.
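The row-count parity criterion could be sketched as a comparison between the counts reported by each subsystem and the counts actually written into the bundle. The function names and the {kb, mc, graph} count objects are hypothetical; how the real backup.mjs obtains either side of the comparison is not specified in this ticket.

```javascript
// Hypothetical check: returns one entry per subsystem whose counts disagree.
// A missing or non-integer count is treated as a mismatch (fail-closed).
function checkRowCountParity(expected, written, subsystems = ['kb', 'mc', 'graph']) {
    const mismatches = [];
    for (const name of subsystems) {
        const want = expected[name];
        const got = written[name];
        if (!Number.isInteger(want) || !Number.isInteger(got) || want !== got) {
            mismatches.push({subsystem: name, expected: want ?? null, written: got ?? null});
        }
    }
    return mismatches;
}

// Throws so the orchestrator can mark the whole bundle as failed.
function assertBundleIntegrity(expected, written) {
    const mismatches = checkRowCountParity(expected, written);
    if (mismatches.length > 0) {
        const detail = mismatches
            .map(m => `${m.subsystem}: expected ${m.expected}, wrote ${m.written}`)
            .join('; ');
        throw new Error(`bundle integrity check failed (${detail})`);
    }
}
```

Treating a missing count as a mismatch means a subsystem that silently produced no output fails the bundle rather than passing by omission.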

AC-B (new restore)

  • npm run ai:restore -- <bundle> registered in package.json.
  • buildScripts/ai/restore.mjs invokes canonical SDK methods exclusively (no SQLiteVectorManager direct).
  • Pre-flight integrity validation: 5 subdirs present, JSONL parseable, row counts non-zero where expected, bundle-meta.json parsed if present.
  • Topology mismatch warning + refusal-without---force-topology-mismatch.
  • Default --mode merge works idempotently.
  • --mode replace calls assertDestructiveTargetAllowed(...) from #10845 guard (or fail-closed stub if guard not yet landed).
  • Refuse non-empty target without --force.
  • importBackupToSQLite.mjs retired (deleted or aliased to ai:restore).
  • Restore-runbook documented in learn/agentos/MemoryCore.md (new section: "Restore from atomic bundle").
  • Playwright unit tests: happy-path merge, integrity-failure refusal, topology-mismatch refusal, force-flag bypass, replace-mode guard call.
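The topology-mismatch criterion reduces to a small decision function, sketched below. The function name, the input shape, and the returned {proceed, level, reason} object are hypothetical; only the chromaUnified field and the --force-topology-mismatch flag come from this ticket.

```javascript
// Hypothetical decision helper for the topology compat check.
// `currentChromaUnified` would come from the live aiConfig, `bundleMeta` from bundle-meta.json.
function checkTopology({currentChromaUnified, bundleMeta, forceTopologyMismatch = false}) {
    if (!bundleMeta || typeof bundleMeta.chromaUnified !== 'boolean') {
        // Older bundle without meta: degrade to topology-blind validation, but say so.
        return {proceed: true, level: 'warn', reason: 'bundle-meta.json missing or lacks chromaUnified: topology-blind validation'};
    }
    if (bundleMeta.chromaUnified === currentChromaUnified) {
        return {proceed: true, level: 'ok', reason: 'topology matches'};
    }
    const reason = `topology mismatch: bundle chromaUnified=${bundleMeta.chromaUnified}, current chromaUnified=${currentChromaUnified}`;
    return forceTopologyMismatch
        ? {proceed: true, level: 'warn', reason: `${reason} (overridden by --force-topology-mismatch)`}
        : {proceed: false, level: 'error', reason};
}
```

A restore script would log the reason at the given level and abort before any write when proceed is false.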

Out of Scope

  • Wake-daemon operational state (bridge.log, lastSyncId, inflight-*.txt) — separate concern, classify as live-orchestration recovery.
  • Daily scheduled automation — covered by #10844.
  • Cross-machine backup sync / cloud upload — local-disk only.
  • Defrag pre-nuke physical-copy snapshots at dist/chromadb-backups/ — peer architecture lockdown per #10129; NOT touched by this ticket.
  • Substrate destructive-op guard implementation — that's #10845 (this ticket consumes it).
  • Layer 1 stopgap test isolation — that's PR #10868 (Gemini's lane); my AC-B replace mode depends on Layer 1 landing first so unit tests don't re-wipe restored state.
  • Canonical config.mjs vs template drift detection — that's PR #10868's scope (Gemini's mental model from #10863).

Avoided Traps

  • Implementing restore via direct SQLiteVectorManager like the legacy one-off: bypasses canonical SDK and any future destructive-op guard. Restore must go through Memory_DatabaseService.manageDatabaseImport + KB_DatabaseService.manageDatabaseImport so the substrate guard can fire.
  • Env-var-only trust boundary for destructive ops: UNIT_TEST_MODE is not a trust boundary — npx playwright direct invocations bypass it. The guard is path-based / target-descriptor based, not env-based.
  • Permissive stub if #10845 ships later: tempting to write a no-op stub guard so AC-B tests pass before #10845 lands. Forbidden — stub must be fail-closed for production-like targets, otherwise restore re-introduces the wipe vector this whole effort closes.
  • Bundling backup + retention + automation in one ticket: retention is tightly coupled with backup itself (this ticket); automation is the scheduled invoker (#10844's scope). Don't conflate.
  • Including wake-daemon state in substrate backup: per @neo-gpt's design feedback — operator/process state belongs to a live-orchestration recovery concern, not substrate backup. Keep substrate backup substrate-only.
  • Silent neo-sqlite inclusion: 329M legacy artifact added without rationale would be a maintenance footgun. AC requires explicit decision.

Related

  • #10129 — atomic timestamped backup bundle (parent backup architecture, CLOSED)
  • #10844 — daily automated snapshot pipeline (depends on this ticket; runbook references this orchestrator)
  • #10845 — block destructive AI substrate ops on production paths (Layer 2 substrate guard; AC-B --mode replace consumes it)
  • #10867 / PR #10868 — Layer 1 immediate-stopgap (test-isolation, Gemini's lane); precondition for safely running AC-B unit tests
  • #10691 — Shared Deployment MVP (parent epic context)
  • #10009, #10015, #10127: chromaUnified topology
  • #10822 — config substrate cleanup epic context

Origin Session ID: 78a3272e-847b-4799-ad6c-ce334464844c

Retrieval Hint: query_raw_memories(query="backup restore parity ai:restore unified bundle topology destructive guard")

tobiu referenced in commit 1dfcec8 - "feat(ai/backup): extend bundle-meta + integrity check + mailbox (#10871) (#10876)" on May 7, 2026, 9:21 AM
tobiu referenced in commit 36db57e - "feat(ai/restore): bundle-aware restore orchestrator + npm run ai:restore (#10871) (#10886)" on May 7, 2026, 1:12 PM
tobiu closed this issue on May 7, 2026, 1:12 PM
tobiu referenced in commit 8f7300b - "feat(ai/buildScripts): template drift detection in initServerConfigs (#10815) (#10892)" on May 7, 2026, 2:19 PM
tobiu referenced in commit fcc4f2b - "feat(ai-restore): preserve-live merge semantics + per-incident filter/targeting/hooks (#11141) (#11143)" on May 10, 2026, 9:37 PM