Frontmatter
id: 11585
title: Retry transient GitHub GraphQL failures during sync
state: Closed
labels: bug, ai, regression, build, model-experience
assignees: neo-gpt
createdAt: May 18, 2026, 7:59 PM
updatedAt: May 18, 2026, 10:06 PM
githubUrl: https://github.com/neomjs/neo/issues/11585
author: neo-gpt
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 18, 2026, 10:06 PM

Retry transient GitHub GraphQL failures during sync

neo-gpt commented on May 18, 2026, 7:59 PM

Context

Operator reported a live npm run ai:sync-github-workflow / GitHub Workflow sync failure on 2026-05-18 from /Users/Shared/github/neomjs/neo:

[INFO] 🔄 Fetching pull requests from GitHub via GraphQL...
❌ Sync failed: Error: GitHub API request failed: 504 Gateway Timeout
    at GraphqlService.query (file:///Users/Shared/github/neomjs/neo/ai/services/github-workflow/GraphqlService.mjs:97:19)
    at async PullRequestSyncer.syncPullRequests (file:///Users/Shared/github/neomjs/neo/ai/services/github-workflow/sync/PullRequestSyncer.mjs:213:26)
    at async SyncService.runFullSync (file:///Users/Shared/github/neomjs/neo/ai/services/github-workflow/SyncService.mjs:116:28)
    at async withHeavyMaintenanceLease (file:///Users/Shared/github/neomjs/neo/ai/daemons/services/HeavyMaintenanceLeaseService.mjs:522:21)
    at async syncGithubWorkflow (file:///Users/Shared/github/neomjs/neo/buildScripts/ai/syncGithubWorkflow.mjs:61:19)

The sync pipeline is already under pressure from recent GitHub Workflow substrate changes. A single transient GitHub 504 currently aborts the whole sync after earlier expensive phases have already completed.

The Problem

GraphqlService.query() throws immediately on any non-OK HTTP response. It does not retry transient GitHub/proxy failures such as 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 429 Too Many Requests, network disconnects, or terminated fetch failures.

That makes SyncService.runFullSync() brittle: PullRequestSyncer.syncPullRequests() paginates the full PR corpus via GraphQL after issues, releases, and discussions have already run. One transient gateway timeout on any page loses the entire sync run and prevents the generated resources/content corpus from refreshing.

The Architectural Reality

Current source evidence:

  • ai/services/github-workflow/GraphqlService.mjs:83-99 performs a single fetch() and throws GitHub API request failed: ${response.status} ${response.statusText} on any non-OK response.
  • ai/services/github-workflow/sync/PullRequestSyncer.mjs:213-222 calls GraphqlService.query(FETCH_PULL_REQUESTS_FOR_SYNC, ...) inside a pagination loop without any local retry.
  • ai/services/github-workflow/SyncService.mjs:116 runs PR sync after other sync phases, so a late 504 wastes the earlier completed work.
  • Live V-B-A from this session: the first 52 pages of the same PR query shape (limit=30, maxComments=100, maxReviews=20) completed successfully from Codex when network access was allowed, which points to transient GitHub/proxy failure rather than a deterministic schema or cursor break.
  • Memory/KB precedent: archived issue #9063 documents that similar GitHub GraphQL 502/504 failures were previously mitigated with exponential backoff retry and smaller query chunks. The current GraphqlService no longer carries that resilience primitive.

The Fix

Add retry/backoff at the cross-cutting GraphqlService.query() layer instead of each syncer:

  • Retry transient HTTP statuses: 429, 502, 503, 504.
  • Retry fetch/network failures whose message or cause indicates a transient transport failure: network disconnects, DNS or socket resets, or terminated / fetch failed style errors.
  • Respect Retry-After when GitHub provides it; otherwise use bounded exponential backoff with jitter.
  • Preserve fail-fast behavior for non-transient 4xx errors and for GraphQL responses that carry semantic errors, unless a future ticket adds partial-data handling.
  • Log retry attempts with operation context that is safe for logs; do not dump tokens or full query payloads.
  • Add focused Playwright unit coverage for retry-on-504 and no-retry-on-400/GraphQL-errors behavior.
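
The classification and delay rules listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in GraphqlService.mjs: the helper names (isTransientStatus, isTransientNetworkError, parseRetryAfterMs, retryDelayMs), the error-message pattern, and the 500 ms / 30 s bounds are all hypothetical and chosen only for the example.

```javascript
// Hypothetical helpers; names, pattern, and defaults are illustrative assumptions.
const TRANSIENT_STATUSES = new Set([429, 502, 503, 504]);

// Message/code fragments treated as transient transport failures (an assumption;
// tune this against the errors actually observed during sync).
const TRANSIENT_ERROR_PATTERN =
    /terminated|fetch failed|ECONNRESET|ETIMEDOUT|ENOTFOUND|EAI_AGAIN|socket hang up/i;

function isTransientStatus(status) {
    return TRANSIENT_STATUSES.has(status);
}

function isTransientNetworkError(error) {
    // Walk the error and a few levels of its `cause` chain, since a fetch
    // failure often wraps the underlying socket error.
    for (let e = error, depth = 0; e && depth < 5; e = e.cause, depth++) {
        if (TRANSIENT_ERROR_PATTERN.test(`${e.code ?? ''} ${e.message ?? ''}`)) {
            return true;
        }
    }
    return false;
}

// Retry-After may be a number of seconds or an HTTP date.
// Returns milliseconds to wait, or null when the header is absent or unparseable.
function parseRetryAfterMs(headerValue, now = Date.now()) {
    if (headerValue == null || headerValue === '') return null;

    const seconds = Number(headerValue);
    if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);

    const date = Date.parse(headerValue);
    return Number.isNaN(date) ? null : Math.max(0, date - now);
}

// `attempt` is zero-based. Retry-After wins when present (capped at maxMs);
// otherwise use exponential backoff with full jitter: uniform in [0, ceiling).
function retryDelayMs(attempt, retryAfterHeader, {baseMs = 500, maxMs = 30000, random = Math.random} = {}) {
    const retryAfterMs = parseRetryAfterMs(retryAfterHeader);
    if (retryAfterMs !== null) return Math.min(retryAfterMs, maxMs);

    const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
    return Math.floor(random() * ceiling);
}
```

Capping Retry-After at maxMs is a design choice of this sketch, intended to keep a single sync run bounded; an implementation could instead fail the request when the server asks for a longer wait than the retry budget allows.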

Contract Ledger Matrix

| Target Surface | Source of Authority | Proposed Behavior | Fallback | Docs | Evidence |
| --- | --- | --- | --- | --- | --- |
| GraphqlService.query() | Operator 2026-05-18 sync failure + #9063 precedent | Transient GitHub transport/gateway failures retry before failing the caller | Non-transient errors still throw immediately | JSDoc in GraphqlService.mjs | Unit coverage with mocked fetch |
| GitHub Workflow full sync | SyncService.runFullSync() + PullRequestSyncer.syncPullRequests() | A single transient PR page 504 no longer aborts the sync without retry | If retries exhaust, retain current failure visibility | CLI stderr / logger output | Targeted unit + operator rerun |

Acceptance Criteria

  • GraphqlService.query() retries transient HTTP statuses 429, 502, 503, and 504 with bounded exponential backoff.
  • GraphqlService.query() retries transient fetch/network failures including terminated / fetch failed style errors.
  • Retry-After is honored when present; otherwise bounded jittered backoff is used.
  • Non-transient HTTP responses such as 400 still fail without retry.
  • GraphQL responses that carry semantic errors preserve current failure behavior.
  • Unit coverage verifies retry success after an initial 504 and fail-after-retry-budget behavior.
  • Unit coverage verifies no retry for non-transient 400 responses or for GraphQL responses that carry errors.
  • Operator can rerun npm run ai:sync-github-workflow and a transient 504 no longer aborts on the first occurrence.
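
One way the HTTP-status criteria above might be exercised against a mocked fetch is sketched below. Everything here is an assumption made for illustration: queryWithRetry, mockFetch, and demo are hypothetical names, the fetch and sleep functions are injected so nothing touches the network or waits for real, and the project's Playwright unit setup is not shown. Retrying thrown network errors is left out to keep the sketch short.

```javascript
// Hypothetical retry loop; not the actual GraphqlService.query() implementation.
const RETRYABLE = new Set([429, 502, 503, 504]);

async function queryWithRetry(fetchImpl, url, init, {maxRetries = 3, baseMs = 500, maxMs = 30000, sleep} = {}) {
    const wait = sleep ?? (ms => new Promise(resolve => setTimeout(resolve, ms)));

    for (let attempt = 0; ; attempt++) {
        const response = await fetchImpl(url, init);

        if (response.ok) {
            const body = await response.json();
            // GraphQL semantic errors arrive with HTTP 200: fail fast, no retry.
            if (body.errors?.length) {
                throw new Error(`GraphQL errors: ${body.errors.map(e => e.message).join('; ')}`);
            }
            return body.data;
        }

        // Non-transient status, or retry budget exhausted: surface the failure.
        if (!RETRYABLE.has(response.status) || attempt >= maxRetries) {
            throw new Error(`GitHub API request failed: ${response.status} ${response.statusText}`);
        }

        // Honor Retry-After (seconds form only in this sketch), else jittered backoff.
        const header       = response.headers.get('retry-after');
        const retryAfterMs = header ? Number(header) * 1000 : NaN;
        const delay        = Number.isFinite(retryAfterMs)
            ? Math.min(retryAfterMs, maxMs)
            : Math.floor(Math.random() * Math.min(maxMs, baseMs * 2 ** attempt));

        await wait(delay);
    }
}

// Minimal fetch stand-in: replays the scripted responses in order and
// repeats the last one once the script runs out.
function mockFetch(responses) {
    const calls = [];
    const impl  = async (url, init) => {
        calls.push({url, init});
        const {status, statusText = '', body = {}, headers = {}} =
            responses[Math.min(calls.length - 1, responses.length - 1)];
        return {
            ok     : status >= 200 && status < 300,
            status,
            statusText,
            headers: {get: name => headers[name.toLowerCase()] ?? null},
            json   : async () => body
        };
    };
    impl.calls = calls;
    return impl;
}

async function demo() {
    const sleep = async () => {}; // skip real waiting
    const url   = 'https://api.github.com/graphql';

    // A 504 followed by a success should be retried once.
    const flaky = mockFetch([
        {status: 504, statusText: 'Gateway Timeout'},
        {status: 200, body: {data: {viewer: 'ok'}}}
    ]);
    const data = await queryWithRetry(flaky, url, {}, {sleep});

    // A 400 should fail on the first call.
    const badRequest = mockFetch([{status: 400, statusText: 'Bad Request'}]);
    const error      = await queryWithRetry(badRequest, url, {}, {sleep}).catch(e => e);

    return {data, flakyCalls: flaky.calls.length, badRequestCalls: badRequest.calls.length, error: error.message};
}

demo().then(result => console.log(result));
```

Injecting fetchImpl and sleep is only a convenience for the sketch; a real test could equally stub the global fetch and use fake timers.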

Out of Scope

  • Redesigning PullRequestSyncer pagination or changing PR query fields unless retry evidence proves insufficient.
  • Changing issue/discussion/release sync semantics.
  • Partial-data GraphQL error handling from older #10096.
  • Auto-push / pre-commit hook behavior from #11580 / #11582.
  • AGENTS.md anchor/link cleanup from #11584.

Avoided Traps / Gold Standards Rejected

  • Retry inside PullRequestSyncer only: rejected because transient GraphQL failures can hit labels, issues, discussions, PRs, project mutations, or comments. The owning abstraction is the shared GraphQL transport service.
  • Treating 504 as purely external and telling the operator to rerun manually: rejected because Neo already has precedent (#9063) that GitHub GraphQL gateway failures need bounded retry resilience.
  • Reducing PR sync page size first: deferred. The live pagination probe did not reproduce deterministic 504 in the first 52 pages; add cross-cutting retry first, then reduce query complexity only if operator reruns still fail after retries.

Related

  • Operator failure report: 2026-05-18 GraphqlService.query 504 in PullRequestSyncer.syncPullRequests().
  • #9063 — older GitHub GraphQL 502/504 mitigation precedent via retry/backoff and query chunking.
  • #11469 - ai:sync-github-workflow CLI enabler.
  • #11503 / #11507 — heavy-maintenance lease around syncGithubWorkflow.
  • #11580 / #11582 — recent sync hook/root automation regressions, adjacent but not this transport failure.

Origin Session ID: 8591bc48-0ddc-48bf-aa47-58e53ea81a57

Handoff Retrieval Hints:
query_raw_memories("GraphqlService query 504 Gateway Timeout PullRequestSyncer syncGithubWorkflow retry backoff")
rg -n "GraphqlService.query|FETCH_PULL_REQUESTS_FOR_SYNC|syncGithubWorkflow|Retry-After|504" ai/services/github-workflow test/playwright/unit/ai/services/github-workflow

tobiu referenced in commit cbafce0 - "fix(github-workflow): retry transient GraphQL sync failures (#11585) (#11587)" on May 18, 2026, 10:06 PM
tobiu closed this issue on May 18, 2026, 10:06 PM