Data Hygiene & Cleanup
The Cleanup Service (DevIndex.services.Cleanup) acts as the Garbage Collector and State Enforcer for the DevIndex data pipeline.
Because the Data Factory operates autonomously—discovering thousands of users and constantly writing to JSON files—data entropy is inevitable. The Cleanup service is invoked automatically by the Orchestrator before any major operation to ensure the system starts with a clean, consistent, and optimized state.
Core Responsibilities
The Cleanup service performs a strict, multi-pass filtering and formatting routine on the entire JSON dataset.
1. Threshold Pruning (The Meritocracy Filter)
To prevent the index from bloating with inactive or low-value data, the system enforces a strict minTotalContributions threshold (configured in config.mjs).
During cleanup, every profile in the rich users.jsonl store is evaluated. If a user's total contributions fall below this threshold, their profile is permanently deleted. Subsequently, they are also purged from the tracker.json index, ensuring the Updater doesn't waste API quota continually re-evaluating them.
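The two-step prune described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the helper name `pruneByThreshold` and the record shapes (`login`, `total_contributions`, `lastUpdate`) are assumptions, and the allowlist and Penalty Box exemptions are left out for brevity.

```javascript
// Hypothetical helper: the name and record shapes are assumptions for illustration.
function pruneByThreshold(users, tracker, minTotalContributions) {
  const pruned = new Set();

  // Pass 1: drop profiles below the threshold and remember who was dropped.
  const keptUsers = users.filter((user) => {
    if (user.total_contributions >= minTotalContributions) return true;
    pruned.add(user.login);
    return false;
  });

  // Pass 2: purge the same logins from the tracker index.
  const keptTracker = tracker.filter((entry) => !pruned.has(entry.login));

  return { users: keptUsers, tracker: keptTracker, pruned };
}

const result = pruneByThreshold(
  [
    { login: 'alice', total_contributions: 5200 },
    { login: 'bob', total_contributions: 12 },
  ],
  [
    { login: 'alice', lastUpdate: '2025-01-01T00:00:00Z' },
    { login: 'bob', lastUpdate: null },
  ],
  1000
);
console.log(result.users.map((u) => u.login)); // logs [ 'alice' ]
```

Collecting the pruned logins in a Set keeps the tracker pass linear in the size of the tracker.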
2. Blocklist Enforcement
Privacy is paramount. If a user's GitHub handle appears in blocklist.json (usually populated by the Opt-Out service), the Cleanup routine will aggressively hard-delete any trace of that user from all data files (users.jsonl, tracker.json, etc.). This is an absolute override that happens on every execution.
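A blocklist pass of this kind might look like the sketch below. It assumes blocklist.json is a flat array of login strings and that comparisons are case-insensitive; the function name `enforceBlocklist` and the shape of the `stores` object are hypothetical.

```javascript
// Hypothetical helper: assumes the blocklist is an array of login strings.
function enforceBlocklist(blocklist, stores) {
  // Normalize case (an assumption) so "Mallory" and "mallory" match.
  const blocked = new Set(blocklist.map((login) => login.toLowerCase()));
  const isAllowed = (login) => !blocked.has(login.toLowerCase());

  return {
    users: stores.users.filter((user) => isAllowed(user.login)),
    tracker: stores.tracker.filter((entry) => isAllowed(entry.login)),
  };
}

const cleaned = enforceBlocklist(['Mallory'], {
  users: [
    { login: 'alice', total_contributions: 5200 },
    { login: 'mallory', total_contributions: 9000 },
  ],
  tracker: [
    { login: 'alice', lastUpdate: null },
    { login: 'mallory', lastUpdate: null },
  ],
});
console.log(cleaned.users.map((u) => u.login)); // logs [ 'alice' ]
```

Note that the contribution count plays no role here: a blocked user is removed regardless of how active they are.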
3. Allowlist Protection & "Resurrection"
Conversely, the allowlist.json provides an absolute protective barrier. It serves two distinct functions during cleanup:
- Protection: If a user is on the allowlist, they are completely exempt from Threshold Pruning. They will remain in the index even if they have 0 contributions.
- Resurrection: Before pruning begins, the service cross-references the allowlist against the tracker queue. If an allowlisted VIP is somehow missing from the tracker, the Cleanup service will instantly "resurrect" them, injecting a new pending entry (lastUpdate: null) into the queue to ensure they are scheduled for the next Updater run.
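Both allowlist behaviors can be sketched in a few lines. The function names `resurrectAllowlisted` and `shouldPrune` are hypothetical, and the sketch assumes the allowlist is an array of logins and the tracker is an array of `{ login, lastUpdate }` entries.

```javascript
// Resurrection: re-queue any allowlisted login missing from the tracker.
// Hypothetical helper; mutates `tracker` and returns the logins it added.
function resurrectAllowlisted(allowlist, tracker) {
  const tracked = new Set(tracker.map((entry) => entry.login));
  const resurrected = [];
  for (const login of allowlist) {
    if (!tracked.has(login)) {
      tracker.push({ login, lastUpdate: null }); // pending entry for the next Updater run
      tracked.add(login);
      resurrected.push(login);
    }
  }
  return resurrected;
}

// Protection: allowlisted users are never pruned by the threshold.
function shouldPrune(user, minTotalContributions, allowlistSet) {
  if (allowlistSet.has(user.login)) return false;
  return user.total_contributions < minTotalContributions;
}

const trackerQueue = [{ login: 'alice', lastUpdate: '2025-01-01T00:00:00Z' }];
console.log(resurrectAllowlisted(['alice', 'vip'], trackerQueue)); // logs [ 'vip' ]
console.log(shouldPrune({ login: 'vip', total_contributions: 0 }, 1000, new Set(['vip']))); // logs false
```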
4. The 30-Day "Penalty Box" TTL
When the Updater encounters an error analyzing a user (e.g., a GraphQL timeout or a 404), that user is placed in failed.json—the "Penalty Box."
Users in the Penalty Box are temporarily protected from being completely pruned from the tracker, giving them a chance to be successfully processed in a future run. However, the Cleanup service enforces a strict 30-Day Time-To-Live (TTL).
// Enforce Retention Policy (Penalty Box TTL)
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;
const now = Date.now();
for (const [login, timestamp] of failed) {
    const ts = new Date(timestamp).getTime();
    if (now - ts > THIRTY_DAYS_MS) {
        console.log(`[Cleanup] Expiring failed user (TTL > 30d): ${login}`);
        failed.delete(login); // Removes Tracker protection
    }
}
If a user remains in a failed state for more than 30 days, their protection is revoked, and the standard pruning logic will expunge them from the system.
Rationale & The "Right to be Forgotten": With a 50,000 user cap and an hourly update limit of 800, the pipeline completes a full cycle of the entire index approximately every 3 days. A 30-day retention period means a user has persistently failed roughly 10 consecutive update attempts (each of which internally utilizes multiple API retries). After a month of continuous failures, it is safe to assume the profile has either been permanently banned by GitHub or the user has explicitly deleted their own account. Automatically purging these records aligns with data minimization principles and proactively respects a user's right to be forgotten if they have chosen to remove their presence from the platform.
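The arithmetic behind this rationale can be checked directly. The sketch below assumes the Updater runs every hour at its full limit; the constant names are illustrative and not taken from config.mjs.

```javascript
// Illustrative constants mirroring the figures quoted above.
const USER_CAP = 50_000;
const UPDATES_PER_HOUR = 800;
const RETENTION_DAYS = 30;

const cycleHours = USER_CAP / UPDATES_PER_HOUR; // 62.5 hours per full pass
const cycleDays = cycleHours / 24; // about 2.6 days
const attemptsBeforeExpiry = Math.floor(RETENTION_DAYS / cycleDays); // 11

console.log({ cycleHours, attemptsBeforeExpiry });
```

With the exact cycle of about 2.6 days, a user is retried up to 11 times within the window; rounding the cycle up to 3 days gives the figure of roughly 10.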
5. Canonical Sorting
The final step of the Cleanup routine is purely for Developer Experience (DX) and Git repository health.
When JSON files are modified by asynchronous services, the order of keys or array elements is often unpredictable. If committed as-is, this produces large, noisy Git diffs in which real changes are buried under reordering, making code review impractical.
The Cleanup service applies Canonical Sorting before writing any data back to disk:
- users.jsonl is strictly sorted by total_contributions (Descending).
- tracker.json is sorted alphabetically by login (Ascending).
- blocklist.json, allowlist.json, and visited.json are sorted alphabetically (Ascending).
This guarantees that any Git diff generated by the pipeline accurately reflects meaningful changes rather than arbitrary structural shuffling.
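The sort rules above can be sketched as a single pure function. Several details here are assumptions rather than documented behavior: the name `canonicalSort`, the treatment of blocklist.json, allowlist.json, and visited.json as flat arrays of strings, the case-insensitive ordering, and the tie-break by login for users with equal contribution counts.

```javascript
// Locale-independent string comparison (by UTF-16 code units).
const compareStrings = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Case-insensitive ordering with a case-sensitive tie-break (an assumption).
const compareLogins = (a, b) =>
  compareStrings(a.toLowerCase(), b.toLowerCase()) || compareStrings(a, b);

// Hypothetical helper: returns sorted copies and leaves the inputs untouched.
function canonicalSort({ users, tracker, blocklist, allowlist, visited }) {
  return {
    // Descending by contributions; equal counts fall back to login order.
    users: [...users].sort(
      (a, b) =>
        b.total_contributions - a.total_contributions ||
        compareLogins(a.login, b.login)
    ),
    tracker: [...tracker].sort((a, b) => compareLogins(a.login, b.login)),
    blocklist: [...blocklist].sort(compareLogins),
    allowlist: [...allowlist].sort(compareLogins),
    visited: [...visited].sort(compareLogins),
  };
}

const sorted = canonicalSort({
  users: [
    { login: 'bob', total_contributions: 12 },
    { login: 'alice', total_contributions: 5200 },
  ],
  tracker: [
    { login: 'bob', lastUpdate: null },
    { login: 'alice', lastUpdate: null },
  ],
  blocklist: ['zed', 'Amy'],
  allowlist: [],
  visited: ['bob', 'alice'],
});
console.log(sorted.blocklist); // logs [ 'Amy', 'zed' ]
```

An explicit comparator is used instead of `localeCompare` so that the output does not depend on the locale of the machine running the pipeline, which would otherwise reintroduce diff noise.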