Context
Follow-up from #10117 intake/investigation on 2026-05-17. The original #10117 snapshot reported 354 users.jsonl records absent from tracker.json on 2026-04-19. Fresh V-B-A on current dev data shows the invariant is still broken and has worsened.
Current counts from apps/devindex/resources/data/:
| Metric | Current value |
| --- | --- |
| users.jsonl rich-user records | 50,000 |
| tracker.json entries | 49,387 |
| Rich-user records absent from tracker | 614 |
| Tracker entries absent from rich users | 1 |
Recent bot data-sync history confirms this is not self-healing:
bc8c7f25d 2026-05-15T23:26Z users=50000 tracker=49394 orphans=606 trackerOnly=0
8383fae69 2026-05-16T04:08Z users=50000 tracker=49393 orphans=607 trackerOnly=0
907240408 2026-05-16T22:18Z users=50000 tracker=49387 orphans=614 trackerOnly=1
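The counts above come down to two set differences between the stores. A minimal sketch of that audit follows, assuming both files have already been reduced to plain login strings; `auditInvariant` and its return shape are hypothetical names for illustration, and the real record layouts of users.jsonl and tracker.json are not modelled here.

```javascript
// Sketch: offline audit of the users.jsonl / tracker.json invariant.
// Assumption: both inputs are arrays of login strings extracted elsewhere;
// the function name and the report shape are hypothetical.
function auditInvariant(userLogins, trackerLogins) {
    // GitHub logins are case-insensitive, so compare on a normalized key.
    const norm    = login => login.toLowerCase();
    const users   = new Set(userLogins.map(norm));
    const tracker = new Set(trackerLogins.map(norm));

    // Rich records the Updater can never schedule.
    const orphans = [...users].filter(login => !tracker.has(login));
    // Tracker entries with no rich record behind them.
    const trackerOnly = [...tracker].filter(login => !users.has(login));

    return {users: users.size, tracker: tracker.size, orphans, trackerOnly};
}

const report = auditInvariant(['alice', 'Bob', 'carol'], ['bob', 'dave']);

console.log(
    `users=${report.users} tracker=${report.tracker} ` +
    `orphans=${report.orphans.length} trackerOnly=${report.trackerOnly.length}`
);
// users=3 tracker=2 orphans=2 trackerOnly=1
```

The same identity holds for the sampled commits: orphans = users - tracker + trackerOnly, for example 50000 - 49387 + 1 = 614.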
The Problem
tracker.json is the Updater scheduling index. A rich users.jsonl record absent from tracker.json is invisible to refresh scheduling, so the profile can remain stale forever. The drift is currently growing slowly across hourly data-sync commits.
The investigation also found a concrete rename / username-reuse hazard:
- Rich record sample: 0xBigBoss has stored i: 95193764, tc: 17244, lu: 2026-05-08T15:18:19.485Z.
- Live REST GET /users/0xBigBoss now returns a different account, id: 283503690, created 2026-05-11.
- Live REST GET /user/95193764 resolves the stored user ID to alleneubank.
Therefore a naive cleanup that simply re-adds all rich-user logins into tracker.json would requeue stale logins and may overwrite an old identity with a new account that reused the login. The repair must be identity-aware.
A second V-B-A catch: apps/devindex/services/GitHub.mjs#getLoginByDatabaseId() currently uses GraphQL user(databaseId: ...), but GitHub GraphQL rejects that argument:
Field 'user' is missing required arguments: login
Field 'user' doesn't accept argument 'databaseId'
That means the archived rename-recovery intent from #9137 is not actually reliable today.
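Since GET /user/{id} did resolve the stored ID during the investigation, a REST-based resolver could look roughly like the sketch below. It is only an illustration: `resolveLoginById`, the options object, and the injected `fetchImpl` are hypothetical names, a global `fetch` (Node 18+) is assumed as the default, and treating HTTP 404 as "cannot be resolved" is an assumption about how a missing account is reported.

```javascript
// Sketch: resolve a stored integer user ID to the current login over REST.
// Assumptions: function name, options, and fetchImpl injection are hypothetical;
// only the GET /user/{id} path comes from the investigation.
async function resolveLoginById(id, {fetchImpl = fetch, token} = {}) {
    const headers = {Accept: 'application/vnd.github+json'};

    if (token) {
        headers.Authorization = `Bearer ${token}`;
    }

    const response = await fetchImpl(`https://api.github.com/user/${id}`, {headers});

    // Assumed "account is gone" signal: report null instead of throwing.
    if (response.status === 404) {
        return null;
    }

    // Rate limits, 5xx and similar must not be mistaken for a missing user.
    if (!response.ok) {
        throw new Error(`GitHub lookup for user ID ${id} failed with HTTP ${response.status}`);
    }

    const {login} = await response.json();

    return login;
}
```

Injecting `fetchImpl` keeps the success, missing-user, and transient-error paths testable against a mock without network access.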
The Architectural Reality
Relevant current surfaces:
- apps/devindex/services/Storage.mjs#updateUsers() writes rich records and prunes over-cap users, but does not add tracker entries for inserted rich records. It expects callers to own tracker updates.
- apps/devindex/services/Updater.mjs#saveCheckpoint() currently persists in this order: Storage.updateUsers(results), Storage.updateTracker(indexUpdates), Storage.deleteUsers(prunedLogins), then failed-list updates. Interruptions between these operations can leave one store updated without the other.
- apps/devindex/services/Updater.mjs rename recovery depends on GitHub.getLoginByDatabaseId() to map old stored user IDs to current logins.
- apps/devindex/services/GitHub.mjs#getLoginByDatabaseId() uses a GraphQL signature that is invalid against current GitHub GraphQL.
- apps/devindex/services/Cleanup.mjs only resurrects allowlisted users into tracker. It does not reconcile non-allowlisted rich-user records missing from tracker, and it must not do so blindly because of login reuse.
- .github/workflows/data-sync-pipeline.yml runs devindex:spider and devindex:update; the data-sync bot commits both users.jsonl and tracker.json hourly.
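To illustrate how the persistence order in saveCheckpoint() could produce an orphan, here is a toy simulation. It is not the real Updater code: the two Maps stand in for users.jsonl and tracker.json, the `interruptAfterUsers` flag is a hypothetical test hook, and the sketch shows one plausible drift mechanism rather than a confirmed root cause.

```javascript
// Sketch: a non-atomic checkpoint leaving a rich record without a tracker entry.
// Assumptions: the Maps, record shape, and interruptAfterUsers flag are hypothetical;
// only the "users first, tracker second" order mirrors the description above.
const users   = new Map();
const tracker = new Map();

function saveCheckpoint(results, {interruptAfterUsers = false} = {}) {
    // Step 1: rich records are written first.
    results.forEach(record => users.set(record.login, record));

    if (interruptAfterUsers) {
        throw new Error('simulated interruption between store writes');
    }

    // Step 2: tracker entries follow in a separate write.
    results.forEach(record => tracker.set(record.login, record.lu));
}

try {
    saveCheckpoint(
        [{login: 'example-user', lu: '2026-05-16T22:18:00.000Z'}],
        {interruptAfterUsers: true}
    );
} catch (error) {
    // A real process would stop here; the first write has already landed.
}

const orphans = [...users.keys()].filter(login => !tracker.has(login));

console.log(orphans); // [ 'example-user' ]
```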
The Fix
Implement a bounded, identity-aware repair and prevention path:
- Replace or fix GitHub.getLoginByDatabaseId() with a current GitHub API-compatible resolver for integer database IDs. If no direct GraphQL lookup exists, use a safe REST fallback (GET /user/{id}) or equivalent verified endpoint.
- Update rename recovery to use the fixed resolver and to avoid leaving stale old-login rich records behind when a user has renamed or when a login has been reused by a different account.
- Add a reconciliation step or script that audits rich-user records missing from tracker and repairs them safely:
  - If the stored user ID resolves to the same login, restore the tracker entry with the rich record's lu timestamp or a deliberate recheck timestamp.
  - If the stored user ID resolves to a different current login, treat it as a rename: migrate/queue the current login and remove or supersede the stale old-login rich record.
  - If the stored user ID no longer resolves, route to failed/penalty handling or prune according to existing data-minimization rules.
  - Never requeue a stale login when live login ownership differs from the stored user ID.
- Add focused unit coverage for the resolver and reconciliation behavior. Add an offline invariant assertion for users.jsonl vs tracker.json after repair, or document why it cannot be enforced in CI until the committed data is repaired.
- After code is fixed, perform the committed data repair in a dedicated data-sync-safe commit or document the operator/data-pipeline step required to perform it.
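The reconciliation cases listed above can be expressed as a small pure function. This is a sketch under stated assumptions: `classifyOrphan`, the action names, and the record shape `{login, id, lu}` are hypothetical, and `resolvedLogin` is whatever an ID-based lookup returned (`null` when the ID no longer resolves).

```javascript
// Sketch: decide what to do with a rich record that has no tracker entry.
// Assumptions: function name, action names, and record shape are hypothetical.
function classifyOrphan(record, resolvedLogin) {
    // The stored ID no longer resolves: hand over to failed/penalty or prune rules.
    if (resolvedLogin === null) {
        return {action: 'unresolved', login: record.login};
    }

    // Same account, same login: safe to put back on the refresh schedule.
    if (resolvedLogin.toLowerCase() === record.login.toLowerCase()) {
        return {action: 'restore', login: record.login, lastUpdate: record.lu};
    }

    // Same account, new login: queue the current login, never the stored one.
    return {action: 'rename', login: resolvedLogin, staleLogin: record.login};
}

const sample = {login: '0xBigBoss', id: 95193764, lu: '2026-05-08T15:18:19.485Z'};

console.log(classifyOrphan(sample, 'alleneubank'));
// { action: 'rename', login: 'alleneubank', staleLogin: '0xBigBoss' }
```

Because the decision keys off the stored user ID rather than the stored login, the reused 0xBigBoss login is never requeued for the account that originally owned it.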
Contract Ledger Matrix
| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| GitHub.getLoginByDatabaseId() | Current GitHub API behavior + archived rename-recovery intent #9137 | Resolves stored integer user IDs to the current login using a verified supported API path | Returns null only when the user cannot be resolved; transient API errors still throw | learn/guides/devindex/data-factory/GitHubAPI.md | Unit test or isolated API mock proving resolver handles success, missing user, and transient error |
| Updater rename recovery | #9137 + current #10117 drift evidence | Old login is removed only after current-login replacement is safely persisted or queued; username reuse does not overwrite the wrong identity | Failed/penalty path preserves retriable state without deleting valid rich history | learn/guides/devindex/data-factory/Updater.md | Unit test for rename and username-reuse scenario (0xBigBoss class) |
| Rich-store / tracker invariant | #10117 + current data counts | Rich users that should be refreshable have a tracker entry; tracker-only entries are resolved or pruned according to existing rules | Ambiguous identity records are quarantined/failed instead of blindly requeued | learn/guides/devindex/data-factory/DataHygiene.md | Offline count assertion shows no unsafe rich-user orphan drift after repair |
| Data-sync pipeline | .github/workflows/data-sync-pipeline.yml | Hourly bot runs do not reintroduce the invariant drift | If GitHub API resolver unavailable, data-sync fails loudly or records bounded failed state | Workflow logs + data files | Post-merge data-sync observation confirms orphan count does not grow |
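Acceptance Criteria
- GitHub.getLoginByDatabaseId() no longer uses invalid user(databaseId: ...) GraphQL and is covered by tests.
- The 0xBigBoss / alleneubank class is represented in test fixtures or documented as a verified manual sample.
- [...] users.jsonl, tracker.json, rich-user orphan count, tracker-only count, and sample classifications.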
Out of Scope
- Changing the maxUsers cap or contribution threshold.
- Rewriting DevIndex storage formats.
- Blind one-time JSON surgery that repairs counts without fixing the pipeline path.
- Broad spider/updater performance tuning unrelated to the invariant.
Avoided Traps
- Blind resurrection of every rich-user login into tracker. Rejected because username reuse can map the same login to a different GitHub account than the stored user ID.
- Treating #10117 as stale because it was filed in April. Rejected: current data is worse, and recent bot commits show active drift.
- Only repairing committed data. Rejected because hourly data-sync would continue recreating drift if the resolver/checkpoint/reconciliation path remains broken.
Related
- Parent investigation: #10117
- Historical rename-recovery intent: #9137
- Safe purge / fallen-hero context: #9135
- Current data-sync commits sampled: bc8c7f25d, 8383fae69, 907240408
Origin Session ID: c934160e-e886-455a-b41e-4bb2dd1f2732
Handoff Retrieval Hints:
DevIndex users.jsonl tracker.json orphan invariant 614
0xBigBoss alleneubank username reuse database id 95193764
GitHub getLoginByDatabaseId user(databaseId) invalid GraphQL
Updater saveCheckpoint Storage.updateUsers updateTracker deleteUsers order