Context
During post-merge observation of PRs #10114 / #10115 (DevIndex rate-limit pressure reduction), counts on the live data files surfaced an invariant violation: users.jsonl has entries not present in tracker.json. The gap was flagged (initially estimated ~351) and verified empirically — 354 orphans confirmed by parsing both files.
The Updater service drains work from tracker.json as its processing queue. Entries in users.jsonl without a corresponding tracker record are invisible to the Updater — they will never be re-fetched or have their profile refreshed. This is a slow-burn data-staleness bug: 354 users frozen at their last-known state with no recovery path through the normal sync pipeline.
The Problem
Verified counts (2026-04-19 against the dev HEAD worktree copy of apps/devindex/resources/data/):
| File | Entries |
|---|---|
| users.jsonl | 50,000 (exactly at config.github.maxUsers cap) |
| tracker.json | 49,646 (0 pending, all active) |
| users.jsonl ∩ tracker.json | 49,646 |
| users in users.jsonl without tracker entry | 354 (invariant violation) |
| tracker entries without user record | 0 (this direction is clean) |
Reproducer (offline, reads the committed files):

    import json

    DATA_DIR = 'apps/devindex/resources/data'

    # Tracker keys are logins; compare case-insensitively.
    with open(f'{DATA_DIR}/tracker.json') as f:
        tracker_keys = {k.lower() for k in json.load(f)}

    users = []
    with open(f'{DATA_DIR}/users.jsonl') as f:
        for raw in f:
            # Tolerate array-style files: drop trailing commas, skip brackets.
            line = raw.strip().rstrip(',')
            if line and line not in ('[', ']'):
                users.append(json.loads(line))

    user_logins = {u['l'].lower() for u in users}
    orphans = user_logins - tracker_keys
    print(f'orphans: {len(orphans)}')
    # Sample: 06kellyjac, a-h-mzd, aabed, aaorris, aaronabuusama, adjazzzz, adomenech73, aeroastro, ...

Samples skew early-alphabet in the sorted output (may be coincidence; not enough evidence to claim a pattern).
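One way to test whether the early-alphabet skew is real rather than an artifact of printing a sorted list is to compare, per first character, the share of orphans against the share of all users. The following is a minimal sketch under stated assumptions: `prefix_skew` is a hypothetical helper, and its inputs are assumed to be the lowercase login sets built by the reproducer.

```python
from collections import Counter


def prefix_skew(orphans, all_logins):
    """Return {first_char: (orphan_share, population_share)}.

    Hypothetical helper: a first character whose orphan share sits far above
    its population share hints at a real pattern, not sort-order coincidence.
    """
    orphan_counts = Counter(login[0] for login in orphans if login)
    total_counts = Counter(login[0] for login in all_logins if login)
    n_orphans = sum(orphan_counts.values())
    n_total = sum(total_counts.values())
    if n_orphans == 0 or n_total == 0:
        return {}
    return {
        ch: (orphan_counts[ch] / n_orphans, total_counts[ch] / n_total)
        for ch in sorted(orphan_counts)
    }


# Toy data only; the real inputs are the sets built by the reproducer.
demo_all = {"aabed", "aeroastro", "bob", "carol", "zed"}
demo_orphans = {"aabed", "aeroastro"}
for ch, (o_share, p_share) in prefix_skew(demo_orphans, demo_all).items():
    print(f"{ch}: orphans {o_share:.0%} vs population {p_share:.0%}")
```

If the two shares track each other across characters, the skew in the sample above is most likely just the sort order.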
The Architectural Reality
Three services maintain the tracker/users invariant:
- Storage.updateUsers (apps/devindex/services/Storage.mjs:375): when pruning over the cap, removes the pruned logins from the tracker via updateTracker([{login, delete: true}, ...]) at line 417. Crucially, when adding new user records it does not call updateTracker at all; the caller is expected to have already added the tracker entry (typically because the Spider discovered the user).
- Updater.saveCheckpoint (per learn/guides/devindex/data-factory/Updater.md): documented as atomically synchronizing new enriched profiles to users.jsonl and updated timestamps to tracker.json. The actual code path involves two separate fs.writeFile calls via Storage, which are individually atomic (temp + rename) but not mutually transactional.
- Cleanup (per learn/guides/devindex/data-factory/DataHygiene.md): threshold-prunes users.jsonl and purges the pruned logins from tracker.json. Runs before major ops via Orchestrator invocation.
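To make "individually atomic but not mutually transactional" concrete, here is a small Python model of a two-file checkpoint. It is a sketch, not the Node implementation: `atomic_write_json` and `save_checkpoint` are hypothetical stand-ins, the users-then-tracker write order is an assumption that the investigation still has to confirm, and a JSON array replaces the JSONL file for brevity.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON via temp file + rename, so readers never see a torn file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)  # atomic when source and target share a filesystem


def save_checkpoint(data_dir, users, tracker, crash_between=False):
    """Hypothetical checkpoint: users file first, tracker file second.

    The write order is an assumption, not confirmed against Updater.mjs.
    """
    atomic_write_json(os.path.join(data_dir, "users.json"), users)
    if crash_between:
        raise RuntimeError("simulated kill between the two writes")
    atomic_write_json(os.path.join(data_dir, "tracker.json"), tracker)


def load(data_dir, name):
    with open(os.path.join(data_dir, name)) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as d:
    # Consistent starting state: one user, one tracker entry.
    save_checkpoint(d, [{"l": "alice"}], {"alice": "2026-04-01"})
    try:
        # Second checkpoint adds bob but dies before tracker.json is written.
        save_checkpoint(
            d,
            [{"l": "alice"}, {"l": "bob"}],
            {"alice": "2026-04-01", "bob": "2026-04-19"},
            crash_between=True,
        )
    except RuntimeError:
        pass
    users_on_disk = {u["l"] for u in load(d, "users.json")}
    tracker_on_disk = set(load(d, "tracker.json"))
    demo_orphans = users_on_disk - tracker_on_disk
    print(sorted(demo_orphans))  # ['bob']
```

Both files are valid JSON after the simulated crash, which is why per-file atomicity alone cannot protect a cross-file invariant.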
Candidate root-cause hypotheses (investigation must verify, NOT ratify):
1. Updater saveCheckpoint partial-flush race: Updater writes users.jsonl first, then tracker.json. If the process is interrupted between the writes (CI runner timeout, rate-limit kill, node crash), users.jsonl has the new record but tracker.json does not. The next run would not know to re-add the tracker entry for a user who is already in users.jsonl.
2. Rename recovery (Updater §2 Rename Problem): when a user renames on GitHub, the old login is deleted, the new login is fetched, and its data is merged into users.jsonl. If the new login's tracker entry isn't also inserted, the result is an orphan.
3. Manual / admin-script writes: any CI or maintenance script writing to users.jsonl without going through Storage.updateUsers plus an explicit updateTracker.
4. Historical migration residue: an earlier data-shape migration may have left orphans that Cleanup has never caught (Cleanup's threshold-pruning only touches below-threshold users; above-threshold orphans survive indefinitely).
5. Blocklist / opt-out partial delete: if a user is blocklisted after having a profile, Cleanup hard-deletes from both files. A partial delete in which the users.jsonl delete succeeded but the tracker delete failed would produce tracker-without-user (not what we observe). Conversely, a tracker delete that succeeded followed by a failed users.jsonl delete would produce user-without-tracker, which matches the observed gap. Worth checking the actual ordering in Cleanup.mjs.
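The two partial-delete outcomes in the last hypothesis are easy to mix up, so a toy model helps. This sketch uses in-memory sets in place of the two files; `partial_delete` is a hypothetical function and says nothing about how Cleanup.mjs is actually written.

```python
def partial_delete(users, tracker, login, order, fail_second=True):
    """Delete `login` from two in-memory stores in the given order.

    `order` is "users-first" or "tracker-first". When `fail_second` is true
    the second delete is skipped, modelling a failure after the first delete
    succeeded. Returns (user_without_tracker, tracker_without_user).
    """
    users, tracker = set(users), set(tracker)
    if order == "users-first":
        first, second = users, tracker
    else:
        first, second = tracker, users
    first.discard(login)
    if not fail_second:
        second.discard(login)
    return users - tracker, tracker - users


demo_users = {"alice", "bob"}
demo_tracker = {"alice", "bob"}
for order in ("tracker-first", "users-first"):
    uwt, twu = partial_delete(demo_users, demo_tracker, "bob", order)
    print(f"{order}: user-without-tracker={sorted(uwt)} tracker-without-user={sorted(twu)}")
```

In this model only the tracker-first ordering leaves a user record without a tracker entry, which is the direction observed in the data.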
The learn/guides/devindex/data-factory/DataHygiene.md guide describes an Allowlist Resurrection mechanism that injects missing tracker entries for allowlisted VIPs pre-pruning. If this mechanism ran for all users (not just allowlist), the invariant would self-heal. Currently scoped to allowlist only.
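Before widening that mechanism, it may help to see what a full-scope pass would touch. The sketch below is a dry run that only reports candidates and writes nothing. `resurrection_candidates` and its `excluded` parameter (standing in for the blocklist and opt-out lists) are hypothetical names, not part of the existing services.

```python
def resurrection_candidates(user_logins, tracker_keys, excluded=frozenset()):
    """Return logins a widened resurrection pass would re-add to the tracker.

    Dry run only: nothing is written. `excluded` is a hypothetical set of
    logins (blocklist, opt-out) that must stay out of the tracker.
    """
    users = {login.lower() for login in user_logins}
    tracked = {key.lower() for key in tracker_keys}
    skip = {login.lower() for login in excluded}
    return sorted(users - tracked - skip)


# Toy data: carol is an orphan, dave is an orphan who opted out.
print(resurrection_candidates(
    ["Alice", "bob", "carol", "dave"],
    ["alice", "bob"],
    excluded={"dave"},
))  # ['carol']
```

Comparing the candidate list with and without `excluded` shows how many orphans are legitimate removals rather than lost tracker entries.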
Investigation Plan
Execute in order, stopping when root cause is pinned:
1. Audit Storage.updateUsers callers. For every caller that adds a new user record, does it also call updateTracker for the add case? Specifically audit:
   - Updater.saveCheckpoint (new-record branch)
   - Updater.#handleRenameRecovery (rename branch)
   - Any other callers discovered by grep
2. Check Cleanup deletion ordering. Does it delete tracker-first-then-user or user-first-then-tracker? A partial failure under the former ordering (tracker entry removed, user record left behind) produces exactly our observed gap shape.
3. Trace a sample of the 354 orphans. Spot-check 3-5 logins in the sample against GitHub to see if they:
   - Look like likely renames (account exists under a different name)
   - Are below current threshold.tc
   - Appear in any recent Updater logs (if CI logs retained)
4. Extend Cleanup resurrection to the full users.jsonl set. If the missing-entry pattern matches a class the existing Allowlist Resurrection already handles, widening the scope is a candidate fix. Verify before implementing.
5. Document findings. File a follow-up fix ticket with the confirmed root cause. This ticket becomes the investigation reference.
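The caller audit and the orphan spot check can both be partly mechanized. This is a rough sketch under stated assumptions: `find_callers` and `pick_sample` are hypothetical helpers, and the regex is a plain text match, so it also reports the method definition itself and any mention in a comment.

```python
import random
import re
from pathlib import Path

# Matches "updateUsers(" as a whole word; a text match, not a parse.
CALL_RE = re.compile(r"\bupdateUsers\s*\(")


def find_callers(root, pattern=CALL_RE, suffix=".mjs"):
    """Yield (path, line_number, line) for every matching line under `root`."""
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        text = path.read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), 1):
            if pattern.search(line):
                yield str(path), number, line.strip()


def pick_sample(orphans, k=5, seed=0):
    """Pick a reproducible random sample so spot checks are not alphabet-biased."""
    pool = sorted(orphans)
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

For example, `list(find_callers('apps/devindex'))` lists candidate call sites to review by hand, and `pick_sample(orphans)` draws the logins to check against GitHub from the whole orphan set instead of the head of the sorted list.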
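Acceptance Criteria
- Storage.updateUsers callers audited; any new-record-without-tracker-add code path documented
- Cleanup.mjs deletion order verified against hypothesis #5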
Out of Scope
- Implementing the fix: investigation first, fix in a separate ticket once the cause is known
- Tracker file format change: tracker.json structure is not the issue
- config.github.maxUsers cap adjustment: orthogonal
- Backpressure Valve tuning (config.spider.maxPendingUsers): already working (0 pending)
Avoided Traps
- Writing a one-time repair script without root cause: fixes symptom, risks masking an ongoing bug that will re-create orphans after repair
- Broadening Allowlist Resurrection to all users blindly: might look like a fix but could paper over a legitimate delete path (e.g., opt-out) that should keep the user removed from tracker
- Dismissing as "only 354 out of 50,000": users.jsonl-vs-tracker.json is a correctness invariant — the ratio is orthogonal to whether it's broken
Related
- Surfaced during: #10113 / PR #10115 post-merge observation
- Architectural context: learn/guides/devindex/data-factory/Storage.md, Updater.md, DataHygiene.md
- Sibling services: apps/devindex/services/Storage.mjs, Updater.mjs, Cleanup.mjs, Spider.mjs
Origin Session ID
07f601dc-353a-44d2-a373-18da2a0d305a