Context
Memory Core operates two Chroma collections (neo-agent-memory, neo-agent-sessions) and a SQLite-backed Native Edge Graph as load-bearing data substrates. Without visibility into how collection counts change over time, sudden drops (accidental deletes, configuration drift, daemon misconfigurations targeting non-test paths) go unnoticed by operators until downstream queries return empty.
This ticket is preventive operational substrate / MX hardening — it introduces a daemon-side primitive that surfaces sudden count drops as alarm-class events.
The Problem
Memory Core's healthcheck reports current collection counts but doesn't track count-delta over time. Daemon log lines for individual add / delete operations exist, but operators don't have a single observable surface for "collection X dropped from N → 0 in the last M minutes." The pattern is detectable in principle but invisible in practice without a dedicated primitive.
The Architectural Reality
- Chroma client API (api/v2/.../collections/<id>/count) is queryable via HTTP at any time.
- ai/mcp/server/memory-core/services/HealthService.mjs already exposes count via the connection.collections[name].count field.
- A periodic snapshotting + delta-comparison loop fits cleanly as a new lifecycle service alongside the existing ai/mcp/server/memory-core/services/lifecycle/ services.
- Healthcheck output can gain a database.collections.<name>.lastSnapshot + wipeAlarm block.
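To make the proposed healthcheck block concrete, here is a minimal sketch of how a per-collection entry could be assembled. The function name buildCollectionHealth and the presence of a count field in the entry are assumptions for illustration, not the existing HealthService API; the sketch also assumes optional fields are omitted rather than set to null when there is nothing to report.

```javascript
// Hypothetical sketch: build one database.collections.<name> healthcheck entry.
// Function name and entry layout are illustrative assumptions.
function buildCollectionHealth(count, lastSnapshot, wipeAlarm) {
    const entry = {count};
    // ISO timestamp of the last sample; omitted if no sample was taken yet.
    if (lastSnapshot) entry.lastSnapshot = lastSnapshot;
    // Alarm details; omitted while no alarm is active.
    if (wipeAlarm) entry.wipeAlarm = wipeAlarm;
    return entry;
}
```

Omitting the fields keeps the existing healthcheck shape unchanged for consumers when the new service is disabled.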
The Fix
Introduce WipeDetectionService.mjs under ai/mcp/server/memory-core/services/lifecycle/:
- Periodic snapshot interval: configurable, default 5min.
- Per-collection count history: ring buffer of last N samples (default 12 = 1h history at 5min cadence).
- Alarm condition: count drop exceeding a configurable threshold (default: >50% drop between any 2 consecutive samples).
- Alarm action: WARN-level log + write wipeAlarm: { collection, prevCount, currCount, deltaPct, sampledAt } to healthcheck output.
- Reset path: alarm clears when count recovers to ≥50% of pre-drop baseline.
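The rules above can be sketched as a small in-memory detector. This is a minimal sketch under stated assumptions, not the actual service: the class name WipeDetector, its record method, and the option names are hypothetical, and the count is passed in by the caller rather than fetched from Chroma. Logging and the healthcheck write are left to the caller, and the alarm object is kept as it was when it fired until the reset condition is met.

```javascript
// Hypothetical sketch of the snapshot + delta-comparison logic.
// Class, method, and option names are illustrative, not the real service API.
class WipeDetector {
    constructor({historySize = 12, thresholdPct = 50} = {}) {
        this.historySize  = historySize;  // 12 samples = 1h at a 5min cadence
        this.thresholdPct = thresholdPct; // alarm on a drop larger than this
        this.history      = new Map();    // collection -> [{count, sampledAt}]
        this.alarms       = new Map();    // collection -> active wipeAlarm
    }

    // Record one sample; returns the active wipeAlarm object or null.
    record(collection, count, sampledAt = new Date().toISOString()) {
        const samples = this.history.get(collection) ?? [];
        const prev    = samples[samples.length - 1];
        const active  = this.alarms.get(collection);

        if (active) {
            // Reset path: clear once count recovers to >= 50% of the pre-drop baseline.
            if (count >= active.prevCount * 0.5) {
                this.alarms.delete(collection);
            }
        } else if (prev && prev.count > 0) {
            // Percentage drop between the two most recent consecutive samples.
            const deltaPct = (prev.count - count) * 100 / prev.count;
            if (deltaPct > this.thresholdPct) {
                this.alarms.set(collection, {
                    collection, prevCount: prev.count, currCount: count, deltaPct, sampledAt
                });
            }
        }

        // Ring buffer: keep only the newest historySize samples.
        samples.push({count, sampledAt});
        if (samples.length > this.historySize) samples.shift();
        this.history.set(collection, samples);

        return this.alarms.get(collection) ?? null;
    }
}
```

For example, feeding 1000 and then 400 for the same collection yields an alarm with deltaPct 60; a later sample of 500 clears it again.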
Contract Ledger Matrix
| Target Surface | Source of Authority | Proposed Behavior | Fallback / Edge Case | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| database.collections.<name>.lastSnapshot (healthcheck) | New WipeDetectionService | ISO timestamp of last sample | Field absent if service disabled or no sample yet | learn/agentos/MemoryCore.md (healthcheck schema section) | Unit test against mocked Chroma |
| database.collections.<name>.wipeAlarm (healthcheck) | New WipeDetectionService | Object with prevCount, currCount, deltaPct, sampledAt when alarm active; absent when clear | Field absent when no alarm fired | Same | Unit + integration test |
| process.env.NEO_WIPE_DETECT_INTERVAL_MS | Env var | Snapshot interval override; default 300000 (5min) | Falls through to default if unset/invalid | Env var inventory in learn/agentos/DeploymentCookbook.md | Spec asserts default + override |
| process.env.NEO_WIPE_DETECT_THRESHOLD_PCT | Env var | Drop-percentage threshold; default 50 | Same | Same | Same |
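The env var rows can be illustrated with a short parsing sketch. The helper names readPositiveNumber and loadWipeDetectConfig are hypothetical, and treating non-numeric, zero, and negative values as "invalid" is an assumption; the ticket only states that unset or invalid values fall through to the default.

```javascript
// Hypothetical helper: read a positive number from an env-style object,
// falling through to the default when the value is unset or invalid.
// "Invalid" here means non-numeric, zero, or negative (an assumption).
function readPositiveNumber(env, name, fallback) {
    const raw = env[name];
    if (raw === undefined || raw === '') return fallback;
    const value = Number(raw);
    return Number.isFinite(value) && value > 0 ? value : fallback;
}

// Defaults taken from the table: 300000 ms (5min) and a 50% drop threshold.
function loadWipeDetectConfig(env = process.env) {
    return {
        intervalMs  : readPositiveNumber(env, 'NEO_WIPE_DETECT_INTERVAL_MS', 300000),
        thresholdPct: readPositiveNumber(env, 'NEO_WIPE_DETECT_THRESHOLD_PCT', 50)
    };
}
```

Passing the env object as a parameter keeps the defaults and overrides easy to assert in a spec without mutating process.env.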
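Acceptance Criteria
- WipeDetectionService.mjs lifecycle service created under ai/mcp/server/memory-core/services/lifecycle/
- Healthcheck output exposes lastSnapshot + wipeAlarm fields per collection
- Env vars (NEO_WIPE_DETECT_INTERVAL_MS + NEO_WIPE_DETECT_THRESHOLD_PCT) wired with defaults
- learn/agentos/MemoryCore.md healthcheck section updated with new fields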
Out of Scope
- Self-healing (auto-restoration from backup) — orthogonal; covered conceptually by #10844 (daily backup pipeline)
- KB-side wipe detection — KB has separate substrate; can land as follow-up if needed
- External alerting (email/Slack) — out of scope; healthcheck field is the consumer-side handoff
Avoided Traps / Gold Standards Rejected
- Hooking into every Chroma delete call site: too invasive; misses external/HTTP-direct deletes; brittle as Chroma client surface evolves. Polling is simpler + catches all delete sources.
- Chroma server-side hook: out-of-process; complex deployment dependency. In-daemon polling stays self-contained.
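The in-daemon polling approach favored above might look roughly like the following sketch. All names (sampleAll, startPolling, getCount, onSample) are hypothetical; getCount is injected so the sketch makes no assumption about how the count is actually fetched. Skipping a sample when the count call fails, rather than recording 0, is a design assumption intended to avoid false alarms during a Chroma outage.

```javascript
// Hypothetical polling step: take one sample per collection.
// getCount(name) is an injected async function returning the current count.
async function sampleAll(collections, getCount, onSample) {
    for (const name of collections) {
        try {
            onSample(name, await getCount(name), new Date().toISOString());
        } catch (err) {
            // A failed count is skipped, not recorded as 0 (assumption).
            console.warn(`wipe-detect: count failed for ${name}: ${err.message}`);
        }
    }
}

// Start periodic sampling; returns a function that stops the loop.
function startPolling(collections, getCount, onSample, intervalMs = 300000) {
    const timer = setInterval(() => sampleAll(collections, getCount, onSample), intervalMs);
    timer.unref?.(); // do not keep the process alive just for sampling
    return () => clearInterval(timer);
}
```

Because every sample goes through the same count query, deletes issued outside the daemon are observed the same way as in-process ones.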
Related
- #10844 (daily automated snapshot pipeline) — pairs as preventive + recovery substrate
- #10845 (block destructive AI substrate ops on production paths) — pairs as preventive + detection substrate
Origin Session ID: 8b31fd62-6a53-40b5-aae2-c5288f8ced09
Retrieval Hint: "Memory Core Chroma wipe-detection count-delta healthcheck alarm primitive"