Context
Operator reported a live npm run ai:sync-github-workflow / GitHub Workflow sync failure on 2026-05-18 from /Users/Shared/github/neomjs/neo:
[INFO] 🔄 Fetching pull requests from GitHub via GraphQL...
❌ Sync failed: Error: GitHub API request failed: 504 Gateway Timeout
at GraphqlService.query (file:///Users/Shared/github/neomjs/neo/ai/services/github-workflow/GraphqlService.mjs:97:19)
at async PullRequestSyncer.syncPullRequests (file:///Users/Shared/github/neomjs/neo/ai/services/github-workflow/sync/PullRequestSyncer.mjs:213:26)
at async SyncService.runFullSync (file:///Users/Shared/github/neomjs/neo/ai/services/github-workflow/SyncService.mjs:116:28)
at async withHeavyMaintenanceLease (file:///Users/Shared/github/neomjs/neo/ai/daemons/services/HeavyMaintenanceLeaseService.mjs:522:21)
at async syncGithubWorkflow (file:///Users/Shared/github/neomjs/neo/buildScripts/ai/syncGithubWorkflow.mjs:61:19)
The sync pipeline is already under pressure from recent GitHub Workflow substrate changes. A single transient GitHub 504 currently aborts the whole sync after earlier expensive phases have already completed.
The Problem
GraphqlService.query() throws immediately on any non-OK HTTP response. It does not retry transient GitHub/proxy failures such as 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 429 Too Many Requests, network disconnects, or terminated fetch failures.
That makes SyncService.runFullSync() brittle: PullRequestSyncer.syncPullRequests() paginates the full PR corpus via GraphQL after issues, releases, and discussions have already run. One transient gateway timeout on any page loses the entire sync run and prevents the generated resources/content corpus from refreshing.
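A rough way to see why a paginated sync amplifies rare transient failures is sketched below. The 1% per-request failure rate and the independence of failures are assumptions for illustration only, not measurements; the page count of 52 echoes the probe described later in this ticket, and `runFailureProbability` is a hypothetical helper name.

```javascript
// Illustrative arithmetic only: the failure rate is assumed, not measured,
// and requests are treated as independent (real gateway outages often are not).
function runFailureProbability(perRequestFailure, pages, attemptsPerPage = 1) {
    // A page is lost only if every attempt for that page fails.
    const pageFailure = perRequestFailure ** attemptsPerPage;
    // The run is lost if at least one page is lost.
    return 1 - (1 - pageFailure) ** pages;
}

const withoutRetry = runFailureProbability(0.01, 52);    // one attempt per page
const withRetry    = runFailureProbability(0.01, 52, 4); // up to four attempts per page

console.log({withoutRetry, withRetry});
```

Under these assumptions, even a small per-page failure rate compounds across the pagination loop, while a handful of retries per page makes losing the whole run far less likely.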
The Architectural Reality
Current source evidence:
- ai/services/github-workflow/GraphqlService.mjs:83-99 performs a single fetch() and throws GitHub API request failed: ${response.status} ${response.statusText} on any non-OK response.
- ai/services/github-workflow/sync/PullRequestSyncer.mjs:213-222 calls GraphqlService.query(FETCH_PULL_REQUESTS_FOR_SYNC, ...) inside a pagination loop without any local retry.
- ai/services/github-workflow/SyncService.mjs:116 runs PR sync after other sync phases, so a late 504 wastes the earlier completed work.
- Live V-B-A from this session: the first 52 pages of the same PR query shape (limit=30, maxComments=100, maxReviews=20) completed successfully from Codex when network access was allowed, which points to transient GitHub/proxy failure rather than a deterministic schema or cursor break.
- Memory/KB precedent: archived issue #9063 documents that similar GitHub GraphQL 502/504 failures were previously mitigated with exponential backoff retry and smaller query chunks. The current GraphqlService no longer carries that resilience primitive.
The Fix
Add retry/backoff at the cross-cutting GraphqlService.query() layer instead of each syncer:
- Retry transient HTTP statuses: 429, 502, 503, 504.
- Retry fetch/network failures whose message or cause indicates a transient transport failure: network disconnects, DNS/socket resets, or terminated / fetch failed errors.
- Respect Retry-After when GitHub provides it; otherwise use bounded exponential backoff with jitter.
- Preserve fail-fast behavior for non-transient 4xx errors and for GraphQL responses that carry a semantic errors payload, unless a future ticket adds partial-data handling.
- Log retry attempts with operation context that is safe for logs; do not dump tokens or full query payloads.
- Add focused Playwright unit coverage for retry-on-504 and no-retry-on-400/GraphQL-errors behavior.
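The bullets above could be sketched roughly as follows. This is a minimal illustration, not the actual GraphqlService code: `fetchWithRetry`, `parseRetryAfter`, `backoffDelay`, the option names, the default limits, and the error-message pattern are all assumptions chosen for the example. It only covers the HTTP/transport layer; GraphQL responses with an errors payload would still be handled by the caller exactly as today.

```javascript
// Sketch only: names, defaults, and the error pattern are assumptions, not the real GraphqlService API.
const TRANSIENT_STATUSES      = new Set([429, 502, 503, 504]);
const TRANSIENT_ERROR_PATTERN = /terminated|fetch failed|ECONNRESET|ETIMEDOUT|ENOTFOUND|EAI_AGAIN|socket hang up/i;

/** True when a thrown fetch error (or its cause) looks like a transient transport failure. */
function isTransientNetworkError(error) {
    const text = `${error?.message ?? ''} ${error?.cause?.code ?? ''} ${error?.cause?.message ?? ''}`;
    return TRANSIENT_ERROR_PATTERN.test(text);
}

/** Parses Retry-After (delta-seconds or HTTP-date) into milliseconds; null if absent or unparseable. */
function parseRetryAfter(value, now = Date.now()) {
    if (value == null || value === '') return null;
    const seconds = Number(value);
    if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
    const date = Date.parse(value);
    return Number.isNaN(date) ? null : Math.max(0, date - now);
}

/** Bounded exponential backoff with full jitter; attempt is 0-based. */
function backoffDelay(attempt, {baseDelayMs = 500, maxDelayMs = 15000, random = Math.random} = {}) {
    const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
    return Math.floor(random() * ceiling);
}

async function fetchWithRetry(url, init, {
    maxRetries  = 4,
    baseDelayMs = 500,
    maxDelayMs  = 15000,
    fetchImpl   = fetch,
    sleep       = ms => new Promise(resolve => setTimeout(resolve, ms)),
    log         = console.warn,
    label       = 'graphql' // log-safe operation context; never the token or the full query
} = {}) {
    for (let attempt = 0; ; attempt++) {
        let response, failure;

        try {
            response = await fetchImpl(url, init);
            if (response.ok) return response;
            failure = new Error(`GitHub API request failed: ${response.status} ${response.statusText}`);
        } catch (error) {
            failure = error; // fetch itself rejected: transport-level failure
        }

        const transient = response ? TRANSIENT_STATUSES.has(response.status) : isTransientNetworkError(failure);

        // Non-transient errors and exhausted retries keep today's fail-fast behavior.
        if (!transient || attempt >= maxRetries) throw failure;

        // Prefer the server's hint, but cap it so a large Retry-After cannot stall the sync.
        const retryAfter = response ? parseRetryAfter(response.headers.get('retry-after')) : null;
        const delay      = Math.min(maxDelayMs, retryAfter ?? backoffDelay(attempt, {baseDelayMs, maxDelayMs}));

        log(`[${label}] attempt ${attempt + 1} failed (${response?.status ?? failure.message}); retrying in ${delay}ms`);
        await sleep(delay);
    }
}
```

Injecting `fetchImpl` and `sleep` is a design choice made for the sketch: it lets unit coverage drive the retry path with a mocked fetch and without real waiting. Capping Retry-After at `maxDelayMs` is likewise an assumption; the ticket only requires that the header be respected.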
Contract Ledger Matrix
| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| GraphqlService.query() | Operator 2026-05-18 sync failure + #9063 precedent | Transient GitHub transport/gateway failures retry before failing the caller | Non-transient errors still throw immediately | JSDoc in GraphqlService.mjs | Unit coverage with mocked fetch |
| GitHub Workflow full sync | SyncService.runFullSync() + PullRequestSyncer.syncPullRequests() | A single transient PR page 504 no longer aborts the sync without retry | If retries exhaust, retain current failure visibility | CLI stderr / logger output | Targeted unit + operator rerun |
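For the "unit coverage with mocked fetch" evidence in the matrix, a test double along these lines might be enough. `createMockFetch` is a hypothetical helper, and how it gets injected into GraphqlService (constructor option, global fetch stub, or otherwise) is deliberately left open because the current source does not settle it.

```javascript
// Hypothetical test helper: replays canned outcomes in order and records every call.
function createMockFetch(outcomes) {
    const queue = [...outcomes];
    const calls = [];

    const mockFetch = async (url, init) => {
        calls.push({url, init});

        if (queue.length === 0) throw new Error('mock fetch: no outcomes left');

        const outcome = queue.shift();

        // An Error instance simulates a transport failure such as "fetch failed".
        if (outcome instanceof Error) throw outcome;

        // Header keys in canned outcomes are expected to be lowercase.
        const {status = 200, statusText = '', headers = {}, body = {}} = outcome;

        return {
            ok        : status >= 200 && status < 300,
            status,
            statusText,
            headers   : {get: name => headers[name.toLowerCase()] ?? null},
            json      : async () => body
        };
    };

    mockFetch.calls = calls;
    return mockFetch;
}

// Example sequence: one gateway timeout, then a successful (empty) GraphQL payload.
const mockFetch = createMockFetch([
    {status: 504, statusText: 'Gateway Timeout', headers: {'retry-after': '1'}},
    {status: 200, body: {data: {}}}
]);
```

A retry-on-504 test would then assert that `mockFetch.calls.length` is 2 after a single query, while a no-retry-on-400 test would queue one 400 outcome and assert exactly one call plus a thrown error.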
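Acceptance Criteria
- GraphqlService.query() retries transient HTTP statuses 429, 502, 503, and 504 with bounded exponential backoff.
- GraphqlService.query() retries transient fetch/network failures, including terminated / fetch failed style errors.
- Retry-After is honored when present; otherwise bounded jittered backoff is used.
- Non-transient HTTP errors such as 400 still fail without retry.
- GraphQL semantic errors responses preserve current failure behavior.
- Unit coverage exercises retry-on-504 and no-retry on 400 and on GraphQL errors responses.
- Operator reruns npm run ai:sync-github-workflow and a transient 504 no longer aborts on the first occurrence.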
Out of Scope
- Redesigning PullRequestSyncer pagination or changing PR query fields unless retry evidence proves insufficient.
- Changing issue/discussion/release sync semantics.
- Partial-data GraphQL error handling from older #10096.
- Auto-push / pre-commit hook behavior from #11580 / #11582.
- AGENTS.md anchor/link cleanup from #11584.
Avoided Traps / Gold Standards Rejected
- Retry inside PullRequestSyncer only: rejected because transient GraphQL failures can hit labels, issues, discussions, PRs, project mutations, or comments. The owning abstraction is the shared GraphQL transport service.
- Treating 504 as purely external and telling the operator to rerun manually: rejected because Neo already has precedent (#9063) that GitHub GraphQL gateway failures need bounded retry resilience.
- Reducing PR sync page size first: deferred. The live pagination probe did not reproduce a deterministic 504 in the first 52 pages, so add cross-cutting retry first, then reduce query complexity only if operator reruns still fail after retries.
Related
- Operator failure report: 2026-05-18 GraphqlService.query 504 in PullRequestSyncer.syncPullRequests().
- #9063: older GitHub GraphQL 502/504 mitigation precedent via retry/backoff and query chunking.
- #11469: ai:sync-github-workflow CLI enabler.
- #11503 / #11507: heavy-maintenance lease around syncGithubWorkflow.
- #11580 / #11582: recent sync hook/root automation regressions, adjacent but not this transport failure.
Origin Session ID: 8591bc48-0ddc-48bf-aa47-58e53ea81a57
Handoff Retrieval Hints:
query_raw_memories("GraphqlService query 504 Gateway Timeout PullRequestSyncer syncGithubWorkflow retry backoff")
rg -n "GraphqlService.query|FETCH_PULL_REQUESTS_FOR_SYNC|syncGithubWorkflow|Retry-After|504" ai/services/github-workflow test/playwright/unit/ai/services/github-workflow