Context
Operator V-B-A surfaced 2026-05-19 during graph-backup analysis: the latest atomic-bundle (backup-2026-05-19T13-08-14.283Z/graph/) contains 3,168,310 graph elements vs. the ~50k baseline from ~10 days ago. Statistical sampling (awk 'NR%1000==0' then awk 'NR%2000==0') showed FILE nodes ~81.7% + DIRECTORY nodes ~16% + CONTAINS edges ~98.4% of the corpus.
Operator confirmation: "for the graph cleanup: .claude folder is a huge culprit indeed and needs to get ignored. same for .gemini, .codex, .idea and other unrelated folders. although .claude is the biggest one for sure."
The Problem
ai/services/memory-core/FileSystemIngestor.mjs:31 ignorePatterns_:
ignorePatterns_: ['node_modules', 'dist', '.git', '.DS_Store', 'build', '.env', '.neo-ai-data', 'docs/output', 'tmp', '.idea', '.gemini', '.agents', 'resources/images', 'resources/fonts']
Missing: .claude, .codex.
The Claude Code harness creates .claude/worktrees/<name>/ per agent session. Each worktree carries its own multi-GB node_modules AND its own .neo-ai-data (Chroma data + graph SQLite). The Codex harness presumably uses a similar pattern. Neither directory is source-of-truth code; both should be indexer-invisible.
The path-prefix matcher at line 112 (relativePath === pattern || relativePath.startsWith(pattern + '/')) currently DOES catch .claude/... once .claude is in the ignore list — adding the two missing patterns is sufficient for this bug.
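The behaviour of the matcher can be sketched as follows. The comparison is the line-112 expression quoted above; the `isIgnored` helper name, the shortened pattern list, and the assumption that `relativePath` is relative to the repository root are illustrative and may not match how FileSystemIngestor.mjs is actually structured.

```javascript
// Sketch only: `isIgnored` is a hypothetical helper name, not the real method.
// The comparison itself is the expression quoted from line 112.
const ignorePatterns = ['node_modules', '.neo-ai-data', '.idea', '.gemini'];

function isIgnored(relativePath, patterns) {
    return patterns.some(pattern =>
        relativePath === pattern || relativePath.startsWith(pattern + '/')
    );
}

// A root-level node_modules is caught ...
console.log(isIgnored('node_modules/lodash/index.js', ignorePatterns)); // true

// ... but the same directory nested inside a worktree is not, because the
// relative path starts with '.claude/', not 'node_modules/'.
const leak = '.claude/worktrees/demo/node_modules/lodash/index.js';
console.log(isIgnored(leak, ignorePatterns)); // false

// Listing the top-level harness directories covers everything below them.
console.log(isIgnored(leak, [...ignorePatterns, '.claude', '.codex'])); // true
```

This is also why the existing node_modules and .neo-ai-data entries did not stop the worktree copies from being indexed: a prefix match only applies at the start of the path.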
Statistical evidence
FILE node path-prefix distribution from awk 'NR%2000==0' sample (≈1584 records):
| Pattern (top-15) | Sample count | Extrapolated to full corpus |
|---|---|---|
| .claude/worktrees/<X>/node_modules (per worktree) | 11-14 each × ~25 worktrees visible | ~750k+ records |
| .claude/worktrees/<X>/.neo-ai-data (per worktree) | 8-11 each × ~25 worktrees visible | ~500k+ records |
| All other paths combined | <50 | <100k |
~98% of FILE/DIRECTORY/CONTAINS elements trace to .claude/worktrees/. The legitimate semantic graph (memories, sessions, identities, capabilities, concepts, tickets, discussions, agent-interactions, A2A messages, etc.) is ~50k-70k elements per operator's 10-day-ago baseline.
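The sampling and extrapolation behind these figures can be approximated with the sketch below. It mirrors `awk 'NR%2000==0'` on an in-memory array. The helper names and the sample paths are hypothetical, and the real analysis ran against the JSONL backup, whose record shape is not shown in this ticket.

```javascript
// Sketch only: hypothetical helpers that mimic the awk sampling and a prefix tally.
function sampleEveryNth(items, n) {
    // awk's NR is 1-based, so keep the items at positions n, 2n, 3n, ...
    return items.filter((_, index) => (index + 1) % n === 0);
}

function tallyByPrefix(paths, depth) {
    const counts = new Map();
    for (const path of paths) {
        const prefix = path.split('/').slice(0, depth).join('/');
        counts.set(prefix, (counts.get(prefix) ?? 0) + 1);
    }
    return counts;
}

// Each sampled record stands in for roughly n records of the full corpus.
const extrapolate = (sampleCount, n) => sampleCount * n;

// Consistency check against the numbers in this ticket.
console.log(Math.floor(3168310 / 2000)); // 1584
console.log(extrapolate(1584, 2000));    // 3168000

// Tiny synthetic corpus (made-up paths), sampled 1-in-2 for illustration.
const paths = [
    '.claude/worktrees/a/node_modules/x.js',
    'src/main.mjs',
    '.claude/worktrees/a/.neo-ai-data/db.sqlite',
    '.claude/worktrees/b/node_modules/y.js',
    'README.md',
    '.claude/worktrees/b/node_modules/z.js'
];
const tally = tallyByPrefix(sampleEveryNth(paths, 2), 2);
console.log(Object.fromEntries(tally));
```

A 1-in-2000 sample carries noticeable error on small buckets, so the extrapolated row values are best read as orders of magnitude rather than exact counts.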
The Architectural Reality
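This touches:
- ai/services/memory-core/FileSystemIngestor.mjs:31 (add .claude + .codex to ignorePatterns_)
- test/playwright/unit/ai/services/memory-core/FileSystemIngestor.spec.mjs (assert .claude/worktrees/<X>/... and .codex/... are excluded)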
The Fix
1. Add .claude and .codex to ignorePatterns_
Per operator framing, both harness directories are unrelated to source-of-truth. The path-prefix matcher catches all nested paths under these once the pattern is in the list.
2. Extend FileSystemIngestor.spec.mjs
Add mock filesystem entries under .claude/worktrees/test-worktree/node_modules/ and .codex/test-data/ and assert exclusion.
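The shape of such a test could look like the sketch below. It does not use the project's test runner or the real FileSystemIngestor. `walk`, `isIgnored`, and the in-memory `mockTree` are hypothetical stand-ins that only illustrate which paths the assertions should cover.

```javascript
// Sketch only: stand-ins for walkDirectory and its fixtures, not the real spec.
const ignorePatterns = ['node_modules', '.neo-ai-data', '.claude', '.codex'];

const isIgnored = relativePath => ignorePatterns.some(pattern =>
    relativePath === pattern || relativePath.startsWith(pattern + '/')
);

// Nested objects are directories, strings are file contents.
const mockTree = {
    '.claude': {worktrees: {'test-worktree': {node_modules: {'index.js': ''}}}},
    '.codex' : {'test-data': {'cache.json': ''}},
    src      : {'app.mjs': ''}
};

function walk(tree, prefix = '', visited = []) {
    for (const [name, child] of Object.entries(tree)) {
        const relativePath = prefix ? `${prefix}/${name}` : name;
        if (isIgnored(relativePath)) continue; // skip the whole subtree
        visited.push(relativePath);
        if (typeof child === 'object') walk(child, relativePath, visited);
    }
    return visited;
}

const visited = walk(mockTree);
console.log(visited); // [ 'src', 'src/app.mjs' ]

// The exclusion assertions the spec should make, expressed as plain checks.
const leaked = visited.filter(p => p.startsWith('.claude') || p.startsWith('.codex'));
console.log(leaked.length === 0); // true
```

In the real spec, the same two assertions (nothing under .claude/worktrees/test-worktree/node_modules/, nothing under .codex/test-data/) would be made against whatever the ingestor actually emits.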
3. Cleanup of existing 3.1M stale records — OUT OF SCOPE
This is a SEPARATE concern requiring backup-first + destructive operation. walkDirectory has no orphan-cleanup logic (only upserts; never deletes). After this fix merges:
- Future sync runs will NOT add new .claude/.codex FILE/DIRECTORY/CONTAINS records ✅
- Existing 3.1M stale records persist until explicitly cleaned ❌
Operator decides cleanup approach (mass-delete via SQLite DELETE statement, OR full nuke-and-rebuild of FILE/DIRECTORY graph slice). Tracked in follow-up.
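Acceptance Criteria
- FileSystemIngestor.mjs:31 ignorePatterns_ includes .claude and .codex
- .claude/worktrees/<X>/node_modules/<file> paths excluded
- .codex/<file> paths excluded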
Out of Scope
- Path-component (gitignore-style) matcher refactor: the current path-prefix matcher is sufficient because every leak path starts with .claude/ or .codex/. A path-component refactor is general robustness, not required by this bug. Could be a future ticket if other nested-pattern leaks emerge.
- Cleanup of existing 3.1M stale records — needs operator-orchestrated mass-delete OR full rebuild; backup-first required.
- Reconciliation daemon to auto-prune stale FILE/DIRECTORY/CONTAINS records — Phase 4B (#11640) reconciliation daemon scope.
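For contrast with the prefix matcher, a path-component matcher could look roughly like the sketch below. It is not part of this fix. `matchesAnySegment` is a hypothetical name, and real gitignore semantics (globs, negation, anchoring) are considerably richer than this.

```javascript
// Sketch only: hypothetical alternative, NOT the current implementation.
// A pattern matches when its segments appear contiguously anywhere in the path.
function matchesAnySegment(relativePath, patterns) {
    const segments = relativePath.split('/');
    return patterns.some(pattern => {
        const wanted = pattern.split('/');
        for (let start = 0; start + wanted.length <= segments.length; start++) {
            if (wanted.every((part, offset) => segments[start + offset] === part)) {
                return true;
            }
        }
        return false;
    });
}

// A nested node_modules is caught without listing its parent directory.
console.log(matchesAnySegment('.claude/worktrees/demo/node_modules/a.js', ['node_modules'])); // true

// Multi-segment patterns such as 'docs/output' still match as a unit.
console.log(matchesAnySegment('packages/site/docs/output/index.html', ['docs/output'])); // true
console.log(matchesAnySegment('docs/input/output.md', ['docs/output'])); // false
```

The trade-off is reach: a pattern like build or tmp would then exclude every directory with that name at any depth, which changes indexing behaviour well beyond the two harness directories and is why it is kept out of this bug fix.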
Avoided Traps
| Trap | Why rejected |
|---|---|
| Path-component matcher refactor in this PR | Out of bug-fix scope; surgical add is sufficient per operator framing + sample data (100% of bloat under .claude/worktrees/) |
| Bundling cleanup into this PR | Cleanup is destructive (mass-delete ~3.1M records); needs backup-first + operator orchestration. Code fix is reversible (revert if regression); cleanup is not. |
| Adding .husky / .github to ignore list | Both are source-of-truth (committed to repo; CI workflows + git hooks). Should remain indexed. |
Related
- Operator framing 2026-05-19: "for the graph cleanup: .claude folder is a huge culprit indeed and needs to get ignored. same for .gemini, .codex, .idea and other unrelated folders."
- Statistical evidence: graph-backup-2026-05-19T13-08-22.938Z.jsonl analysis (1.3 GB / 3,168,310 lines / 1/1000 + 1/2000 sampling)
- Related substrate: Phase 4B reconciliation daemon #11640 — will own the orphan-cleanup substrate going forward; this bug fixes the prevention layer
- Sibling cleanup follow-up: TBD ticket for mass-delete of existing 3.1M stale records (operator-orchestrated; backup-first)
Origin Session ID
7360e917-1733-4cdd-a6f3-5ac51c34b838
Handoff Retrieval Hints
query_raw_memories({query: 'FileSystemIngestor .claude ignorePatterns'})
- Operator V-B-A on 3.17M graph elements claim led to this discovery — anchor for substrate-evolution audit-completeness pattern
- Memory anchor: feedback_substrate_audit_consumer_sweep.md Category 4 (logical-extension scope-drift), the same family of audit-completeness failures