Frontmatter
id: 10854
title: Memory Core wipe-detection alarm: collection-count-delta substrate primitive
state: Closed
labels: enhancement, ai, architecture, model-experience
assignees: neo-opus-4-7
createdAt: May 7, 2026, 1:05 AM
updatedAt: May 9, 2026, 11:30 PM
githubUrl: https://github.com/neomjs/neo/issues/10854
author: neo-opus-4-7
commentsCount: 1
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 7, 2026, 1:24 AM

Memory Core wipe-detection alarm: collection-count-delta substrate primitive

neo-opus-4-7 commented on May 7, 2026, 1:05 AM

Context

Memory Core operates two Chroma collections (neo-agent-memory, neo-agent-sessions) and a SQLite-backed Native Edge Graph as load-bearing data substrates. Because nothing tracks how collection counts change over time, sudden drops (accidental deletes, configuration drift, daemon misconfigurations targeting non-test paths) go unnoticed by operators until downstream queries return empty.

This ticket is preventive operational substrate / MX hardening — it introduces a daemon-side primitive that surfaces sudden count drops as alarm-class events.

The Problem

Memory Core's healthcheck reports current collection counts but doesn't track count-delta over time. Daemon log lines for individual add / delete operations exist, but operators don't have a single observable surface for "collection X dropped from N → 0 in the last M minutes." The pattern is detectable in principle but invisible in practice without a dedicated primitive.

The Architectural Reality

  • Chroma client API (api/v2/.../collections/<id>/count) is queryable via HTTP at any time.
  • ai/mcp/server/memory-core/services/HealthService.mjs already exposes the count via the connection.collections[name].count field.
  • A periodic snapshotting + delta-comparison loop fits cleanly as a new lifecycle service alongside the existing ai/mcp/server/memory-core/services/lifecycle/ services.
  • Healthcheck output can gain a database.collections.<name>.lastSnapshot + wipeAlarm block.
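
As a rough illustration of the last bullet, the sketch below merges per-collection detection state into a health map. The function name, the `{ [name]: { count } }` input shape, and the `detectionState` argument are assumptions made for this example; the actual HealthService output shape may differ.

```javascript
// Hypothetical helper, not part of HealthService.mjs.
// `collections` is assumed to look like { [name]: { count } }.
// `detectionState` is assumed to look like { [name]: { lastSnapshot?, wipeAlarm? } }.
function withWipeDetection(collections, detectionState) {
    const result = {};

    for (const [name, entry] of Object.entries(collections)) {
        const state = detectionState[name] ?? {};

        result[name] = {...entry};

        // Fields are omitted (not set to null) when there is nothing to report
        if (state.lastSnapshot) {
            result[name].lastSnapshot = state.lastSnapshot;
        }
        if (state.wipeAlarm) {
            result[name].wipeAlarm = state.wipeAlarm;
        }
    }

    return result;
}
```

Leaving the fields out entirely, rather than emitting null values, matches the "field absent" wording used for the healthcheck contract in this ticket.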

The Fix

Introduce WipeDetectionService.mjs under ai/mcp/server/memory-core/services/lifecycle/:

  • Periodic snapshot interval: configurable, default 5min.
  • Per-collection count history: ring buffer of last N samples (default 12 = 1h history at 5min cadence).
  • Alarm condition: the count drops by more than a configurable threshold between two consecutive samples (default: a drop of more than 50%).
  • Alarm action: WARN-level log + write wipeAlarm: { collection, prevCount, currCount, deltaPct, sampledAt } to healthcheck output.
  • Reset path: alarm clears when count recovers to ≥50% of pre-drop baseline.
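
The sampling, alarm, and reset rules above could be sketched per collection as follows. This is a minimal illustration and not the proposed WipeDetectionService itself: the class name, method name, and option names are hypothetical, and the recovery ratio is hard-coded to the 50% named in the reset path.

```javascript
// Sketch only: class, method, and option names are hypothetical.
class CollectionWipeTracker {
    constructor({collection, historySize = 12, thresholdPct = 50} = {}) {
        this.collection   = collection;
        this.historySize  = historySize;   // 12 samples = 1h at a 5min cadence
        this.thresholdPct = thresholdPct;
        this.samples      = [];            // ring buffer, oldest sample first
        this.wipeAlarm    = null;          // active alarm payload, or null
        this.baseline     = null;          // count seen right before the drop
    }

    // Records one sample and returns the active alarm (or null).
    record(count, sampledAt = new Date().toISOString()) {
        const prev = this.samples[this.samples.length - 1];

        this.samples.push({count, sampledAt});
        if (this.samples.length > this.historySize) {
            this.samples.shift(); // drop the oldest sample
        }

        if (this.wipeAlarm) {
            // Reset path: clear once the count is back to >= 50% of the baseline
            if (count >= this.baseline * 0.5) {
                this.wipeAlarm = null;
                this.baseline  = null;
            }
        } else if (prev && prev.count > 0) {
            // Positive deltaPct means the collection shrank between two samples
            const deltaPct = ((prev.count - count) / prev.count) * 100;

            if (deltaPct > this.thresholdPct) {
                this.baseline  = prev.count;
                this.wipeAlarm = {
                    collection: this.collection,
                    prevCount : prev.count,
                    currCount : count,
                    deltaPct,
                    sampledAt
                };
            }
        }

        return this.wipeAlarm;
    }
}
```

In this sketch a drop of exactly 50% does not fire, because the comparison is strictly greater than the threshold, and a previous count of 0 is skipped to avoid dividing by zero. While an alarm is active, its payload keeps the values from the sample that triggered it.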

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| database.collections.<name>.lastSnapshot (healthcheck) | New WipeDetectionService | ISO timestamp of last sample | Field absent if service disabled or no sample yet | learn/agentos/MemoryCore.md (healthcheck schema section) | Unit test against mocked Chroma |
| database.collections.<name>.wipeAlarm (healthcheck) | New WipeDetectionService | Object with prevCount, currCount, deltaPct, sampledAt when alarm active; absent when clear | Field absent when no alarm fired | Same | Unit + integration test |
| process.env.NEO_WIPE_DETECT_INTERVAL_MS | Env var | Snapshot interval override; default 300000 (5min) | Falls through to default if unset/invalid | Env var inventory in learn/agentos/DeploymentCookbook.md | Spec asserts default + override |
| process.env.NEO_WIPE_DETECT_THRESHOLD_PCT | Env var | Drop-percentage threshold; default 50 | Same | Same | Same |
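
One possible reading of the "falls through to default if unset/invalid" rule for the two env vars is sketched below. The helper names are hypothetical, and treating empty, non-numeric, and non-positive values as invalid is an assumption; the ticket does not define "invalid" more precisely.

```javascript
// Hypothetical helper: read a positive number from an env-style object,
// falling back to the default when the value is unset or unusable.
function readPositiveNumber(env, name, fallback) {
    const raw = env[name];

    if (raw === undefined || raw === '') {
        return fallback;
    }

    const value = Number(raw);

    // Assumption: NaN, Infinity, zero, and negative values count as invalid
    return Number.isFinite(value) && value > 0 ? value : fallback;
}

// Hypothetical config reader; `env` is injectable so tests need not touch process.env.
function readWipeDetectConfig(env = process.env) {
    return {
        intervalMs  : readPositiveNumber(env, 'NEO_WIPE_DETECT_INTERVAL_MS', 300000),
        thresholdPct: readPositiveNumber(env, 'NEO_WIPE_DETECT_THRESHOLD_PCT', 50)
    };
}
```

Passing the environment in as an argument keeps the "spec asserts default + override" evidence easy to write, since a test can supply a plain object instead of mutating the process environment.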

Acceptance Criteria

  • WipeDetectionService.mjs lifecycle service created under ai/mcp/server/memory-core/services/lifecycle/
  • Periodic count snapshot at configurable interval (default 5min)
  • Per-collection ring buffer maintained
  • Alarm fires when count delta > threshold; logged at WARN
  • Healthcheck output includes lastSnapshot + wipeAlarm fields per collection
  • Alarm clears when count recovers
  • Two new env vars (NEO_WIPE_DETECT_INTERVAL_MS + NEO_WIPE_DETECT_THRESHOLD_PCT) wired with defaults
  • Unit tests covering: alarm-fires-on-drop, alarm-clears-on-recovery, threshold-edge-cases, env-var-override
  • learn/agentos/MemoryCore.md healthcheck section updated with new fields

Out of Scope

  • Self-healing (auto-restoration from backup) — orthogonal; covered conceptually by #10844 (daily backup pipeline)
  • KB-side wipe detection — KB has separate substrate; can land as follow-up if needed
  • External alerting (email/Slack) — out of scope; healthcheck field is the consumer-side handoff

Avoided Traps / Gold Standards Rejected

  • Hooking into every Chroma delete call site — too invasive; misses external/HTTP-direct deletes; brittle as Chroma client surface evolves. Polling is simpler + catches all delete sources.
  • Chroma server-side hook — out-of-process; complex deployment dependency. In-daemon polling stays self-contained.
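
The in-daemon polling approach preferred above might look roughly like this. `getCount`, `onSample`, and `logger` are injected, hypothetical parameters, so the sketch makes no claim about the Chroma client API or about how the daemon's lifecycle services are wired.

```javascript
// Hypothetical polling loop. `getCount(name)` is assumed to resolve to the
// current item count of a collection, however the daemon obtains it.
function startCountPolling({collections, getCount, onSample, intervalMs = 300000, logger = console}) {
    async function tick() {
        for (const name of collections) {
            try {
                const count = await getCount(name);
                onSample(name, count, new Date().toISOString());
            } catch (err) {
                // A failed read is not a wipe: skip the sample instead of recording 0
                logger.warn(`wipe-detection: count failed for ${name}: ${err.message}`);
            }
        }
    }

    const timer = setInterval(tick, intervalMs);

    // In Node.js, unref() keeps the sampling timer from holding the process open
    timer.unref?.();

    return {tick, stop: () => clearInterval(timer)};
}
```

Because the loop only reads counts, it observes deletes from any source, including direct HTTP calls that bypass the daemon. Skipping failed reads avoids raising a false alarm when the store is merely unreachable.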

Related

  • #10844 (daily automated snapshot pipeline) — pairs as preventive + recovery substrate
  • #10845 (block destructive AI substrate ops on production paths) — pairs as preventive + detection substrate

Origin Session ID: 8b31fd62-6a53-40b5-aae2-c5288f8ced09
Retrieval Hint: "Memory Core Chroma wipe-detection count-delta healthcheck alarm primitive"