Frontmatter
id: 11650
title: FileSystemIngestor missing .claude + .codex from ignorePatterns; 3.1M stale FILE/DIRECTORY/CONTAINS graph elements
state: Closed
labels: bug, ai, architecture
assignees: neo-opus-ada
createdAt: May 19, 2026, 5:05 PM
updatedAt: Jun 7, 2026, 7:13 PM
githubUrl: https://github.com/neomjs/neo/issues/11650
author: neo-opus-ada
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 19, 2026, 5:25 PM

FileSystemIngestor missing .claude + .codex from ignorePatterns; 3.1M stale FILE/DIRECTORY/CONTAINS graph elements

Closed | v13.0.0/archive-v13-0-0-chunk-12 | bug, ai, architecture
neo-opus-ada
neo-opus-ada commented on May 19, 2026, 5:05 PM

Context

An operator V-B-A on 2026-05-19, during graph-backup analysis, surfaced the problem: the latest atomic-bundle (backup-2026-05-19T13-08-14.283Z/graph/) contains 3,168,310 graph elements vs. the ~50k baseline from ~10 days ago. Statistical sampling (awk 'NR%1000==0', then awk 'NR%2000==0') showed FILE nodes at ~81.7% and DIRECTORY nodes at ~16% (presumably of node records), plus CONTAINS edges at ~98.4% (presumably of edge records, since the three figures cannot share one denominator).

Operator confirmation: "for the graph cleanup: .claude folder is a huge culprit indeed and needs to get ignored. same for .gemini, .codex, .idea and other unrelated folders. although .claude is the biggest one for sure."

The Problem

ai/services/memory-core/FileSystemIngestor.mjs:31 ignorePatterns_:

ignorePatterns_: ['node_modules', 'dist', '.git', '.DS_Store', 'build', '.env', '.neo-ai-data', 'docs/output', 'tmp', '.idea', '.gemini', '.agents', 'resources/images', 'resources/fonts']

Missing: .claude, .codex.

The Claude Code harness creates .claude/worktrees/<name>/ per agent session. Each worktree carries its own multi-GB node_modules AND its own .neo-ai-data (Chroma data + graph SQLite). The Codex harness presumably uses a similar pattern. Neither directory is source-of-truth code; both should be indexer-invisible.

The path-prefix matcher at line 112 (relativePath === pattern || relativePath.startsWith(pattern + '/')) already excludes everything under .claude/ as soon as .claude is in the ignore list, so adding the two missing patterns is sufficient for this bug.
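A minimal sketch of that prefix check, for illustration only. The expression is copied from the line quoted above and the pattern list from ignorePatterns_; the function name isIgnored and the example paths are hypothetical, and the real matcher in FileSystemIngestor.mjs may be structured differently.

```javascript
// Patterns as listed in ignorePatterns_ before the fix.
const currentPatterns = [
    'node_modules', 'dist', '.git', '.DS_Store', 'build', '.env', '.neo-ai-data',
    'docs/output', 'tmp', '.idea', '.gemini', '.agents',
    'resources/images', 'resources/fonts'
];

// Patterns after the fix proposed in this ticket.
const fixedPatterns = [...currentPatterns, '.claude', '.codex'];

// Hypothetical helper wrapping the prefix expression quoted from line 112.
function isIgnored(relativePath, patterns) {
    return patterns.some(pattern =>
        relativePath === pattern || relativePath.startsWith(pattern + '/')
    );
}

// Hypothetical leak path of the kind found in the backup sample.
const leak = '.claude/worktrees/example/node_modules/pkg/index.js';

console.log(isIgnored(leak, currentPatterns));                // false: nothing matches the leading segment
console.log(isIgnored(leak, fixedPatterns));                  // true: '.claude/' prefix matches
console.log(isIgnored('.claudette/notes.md', fixedPatterns)); // false: the '/' boundary prevents partial-name matches
```

The first result shows why the nested node_modules leaked: the existing node_modules pattern only matches at the start of the relative path, not inside .claude/worktrees/.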

Statistical evidence

FILE node path-prefix distribution from awk 'NR%2000==0' sample (≈1584 records):

| Pattern (top-15) | Sample count | Extrapolated to full corpus |
|---|---|---|
| .claude/worktrees/<X>/node_modules (per worktree) | 11-14 each × ~25 worktrees visible | ~750k+ records |
| .claude/worktrees/<X>/.neo-ai-data (per worktree) | 8-11 each × ~25 worktrees visible | ~500k+ records |
| All other paths combined | <50 | <100k |

~98% of FILE/DIRECTORY/CONTAINS elements trace to .claude/worktrees/. The legitimate semantic graph (memories, sessions, identities, capabilities, concepts, tickets, discussions, agent-interactions, A2A messages, etc.) is ~50k-70k elements per operator's 10-day-ago baseline.
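The sampling arithmetic behind these figures can be sketched as follows. This is an illustration, not the analysis script that was actually run: sampleEveryNth mirrors what awk 'NR%N==0' does (keep every N-th line, 1-based), and extrapolate scales a sample count back up by the stride. Both function names are hypothetical.

```javascript
// Keep every `stride`-th line (1-based), like awk 'NR % stride == 0'.
function sampleEveryNth(lines, stride) {
    return lines.filter((_, index) => (index + 1) % stride === 0);
}

// Scale a count observed in the sample back up to the full corpus.
function extrapolate(sampleCount, stride) {
    return sampleCount * stride;
}

// Tiny demonstration: lines 3 and 6 survive a stride of 3.
console.log(sampleEveryNth(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 3)); // [ 'c', 'f' ]

// The 1/2000 sample of the 3,168,310-line backup yields the ≈1584 records quoted above.
const expectedSampleSize = Math.floor(3168310 / 2000);
console.log(expectedSampleSize); // 1584

// Each sampled record stands for roughly 2000 records in the full corpus.
console.log(extrapolate(12, 2000)); // 24000
```

Because every sampled record stands for about 2000 real ones, the extrapolated column is a coarse estimate, not an exact count.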

The Architectural Reality

This touches:

| File | Change |
|---|---|
| ai/services/memory-core/FileSystemIngestor.mjs:31 | Add .claude + .codex to ignorePatterns_ |
| test/playwright/unit/ai/services/memory-core/FileSystemIngestor.spec.mjs | Extend test to assert .claude/worktrees/<X>/... and .codex/... are excluded |

The Fix

1. Add .claude and .codex to ignorePatterns_

Per operator framing, both harness directories are unrelated to source-of-truth. The path-prefix matcher catches all nested paths under these once the pattern is in the list.

2. Extend FileSystemIngestor.spec.mjs

Add mock filesystem entries under .claude/worktrees/test-worktree/node_modules/ and .codex/test-data/ and assert exclusion.
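One way the exclusion assertion could look, sketched against an in-memory tree, not the real spec. Everything here is an assumption for illustration: the mock tree layout, the helper names isIgnored and collectFiles, and the shortened pattern list. The actual spec runs under Playwright and exercises walkDirectory, whose signature is not shown in this ticket.

```javascript
// Hypothetical in-memory tree: objects are directories, strings are file contents.
const mockTree = {
    src: {'app.mjs': ''},
    '.claude': {worktrees: {'test-worktree': {node_modules: {'pkg.js': ''}}}},
    '.codex': {'test-data': {'log.json': ''}},
    '.github': {'ci.yml': ''}
};

// Shortened pattern list, including the two patterns added by this fix.
const ignorePatterns = ['node_modules', '.git', '.claude', '.codex'];

// Same prefix rule as the matcher quoted in "The Problem".
function isIgnored(relativePath) {
    return ignorePatterns.some(pattern =>
        relativePath === pattern || relativePath.startsWith(pattern + '/')
    );
}

// Depth-first walk that skips ignored entries and returns the remaining file paths.
function collectFiles(tree, prefix = '') {
    const files = [];
    for (const [name, entry] of Object.entries(tree)) {
        const relativePath = prefix ? `${prefix}/${name}` : name;
        if (isIgnored(relativePath)) continue;
        if (typeof entry === 'object') {
            files.push(...collectFiles(entry, relativePath));
        } else {
            files.push(relativePath);
        }
    }
    return files;
}

const indexed = collectFiles(mockTree);
console.log(indexed); // [ 'src/app.mjs', '.github/ci.yml' ]
```

The .github entry doubles as a regression guard: the .git pattern must not swallow it, since .github stays indexed.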

3. Cleanup of existing 3.1M stale records — OUT OF SCOPE

This is a SEPARATE concern requiring backup-first + destructive operation. walkDirectory has no orphan-cleanup logic (only upserts; never deletes). After this fix merges:

  • Future sync runs will NOT add new .claude/.codex FILE/DIRECTORY/CONTAINS records ✅
  • Existing 3.1M stale records persist until explicitly cleaned ❌

Operator decides cleanup approach (mass-delete via SQLite DELETE statement, OR full nuke-and-rebuild of FILE/DIRECTORY graph slice). Tracked in follow-up.

Acceptance Criteria

  • FileSystemIngestor.mjs:31 ignorePatterns_ includes .claude and .codex
  • Unit test asserts .claude/worktrees/<X>/node_modules/<file> paths excluded
  • Unit test asserts .codex/<file> paths excluded
  • All existing FileSystemIngestor tests continue passing
  • PR body notes the cleanup-of-existing-bloat is OUT OF SCOPE for this ticket

Out of Scope

  • Path-component (gitignore-style) matcher refactor — current path-prefix matcher is sufficient because every leak path starts with .claude/ or .codex/. Path-component refactor is general robustness, not required by this bug. Could be future ticket if other nested-pattern leaks emerge.
  • Cleanup of existing 3.1M stale records — needs operator-orchestrated mass-delete OR full rebuild; backup-first required.
  • Reconciliation daemon to auto-prune stale FILE/DIRECTORY/CONTAINS records — Phase 4B (#11640) reconciliation daemon scope.
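To make the first item concrete, the sketch below contrasts the current prefix rule with a path-component rule. The component matcher is a hypothetical illustration of the deferred refactor, not a proposed implementation; the function names and example paths are assumptions.

```javascript
const patterns = ['node_modules', '.claude', '.codex'];

// Prefix matching, as in the current matcher: only the start of the path counts.
function matchesPrefix(relativePath) {
    return patterns.some(pattern =>
        relativePath === pattern || relativePath.startsWith(pattern + '/')
    );
}

// Component matching (hypothetical alternative): any single path segment may match.
function matchesComponent(relativePath) {
    return relativePath.split('/').some(segment => patterns.includes(segment));
}

// Every leak observed in this bug starts with .claude/, so both rules exclude it.
const leak = '.claude/worktrees/example/node_modules/lib/index.js';
console.log(matchesPrefix(leak), matchesComponent(leak)); // true true

// A nested node_modules outside the harness directories is where the rules differ.
const nested = 'packages/demo/node_modules/lib/index.js';
console.log(matchesPrefix(nested), matchesComponent(nested)); // false true
```

A segment-based matcher like this would not handle multi-segment patterns such as docs/output without extra logic, which is part of why the refactor is a separate piece of work.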

Avoided Traps

| Trap | Why rejected |
|---|---|
| Path-component matcher refactor in this PR | Out of bug-fix scope; surgical add is sufficient per operator framing + sample data (100% of bloat under .claude/worktrees/) |
| Bundling cleanup into this PR | Cleanup is destructive (mass-delete ~3.1M records); needs backup-first + operator orchestration. Code fix is reversible (revert if regression); cleanup is not. |
| Adding .husky / .github to ignore list | Both are source-of-truth (committed to repo; CI workflows + git hooks). Should remain indexed. |

Related

  • Operator framing 2026-05-19: "for the graph cleanup: .claude folder is a huge culprit indeed and needs to get ignored. same for .gemini, .codex, .idea and other unrelated folders."
  • Statistical evidence: graph-backup-2026-05-19T13-08-22.938Z.jsonl analysis (1.3 GB; 3,168,310 lines; sampled at 1/1000 and 1/2000)
  • Related substrate: Phase 4B reconciliation daemon #11640 — will own the orphan-cleanup substrate going forward; this bug fixes the prevention layer
  • Sibling cleanup follow-up: TBD ticket for mass-delete of existing 3.1M stale records (operator-orchestrated; backup-first)

Origin Session ID

7360e917-1733-4cdd-a6f3-5ac51c34b838

Handoff Retrieval Hints

  • query_raw_memories({query: 'FileSystemIngestor .claude ignorePatterns'})
  • Operator V-B-A on 3.17M graph elements claim led to this discovery — anchor for substrate-evolution audit-completeness pattern
  • Memory anchor: feedback_substrate_audit_consumer_sweep.md Category 4 (logical-extension scope-drift) — same family of audit-completeness failures
tobiu referenced in commit 617c712 - "fix(memory-core): add .claude and .codex to FileSystemIngestor ignorePatterns (#11650) (#11651)" on May 19, 2026, 5:25 PM
tobiu closed this issue on May 19, 2026, 5:25 PM