What is an LLM fallback chain?

Fallback chain — an ordered list of (provider, model) endpoints that an LLM gateway tries in sequence when the primary endpoint fails. If the first endpoint returns a 5xx, 429, or times out, the gateway cascades to the second; if that fails, the third; and so on. The caller sees one successful response; the cascade is reported in a response header.

Why fallback chains exist

LLM provider outages happen. Provider rate limits are tighter than developers realize. A single hour of GPT-5-mini throttling can stall an entire production application. Fallback chains move the failure-recovery decision from the application code to the gateway: the gateway tries another provider automatically, and the application keeps working.

The same mechanism handles transient 5xx errors, rate-limit 429s, timeout-exceeded, and any other upstream failure. From the application's perspective, the gateway either succeeds or returns 503 only when every chain entry fails.

What gets included in a chain

GammaInfra's chains per task label

How callers see the cascade

Every response includes the X-GammaInfra-Fallback-Chain response header listing the endpoints tried in order. If only the primary fired, the chain has one entry. If a cascade happened, the chain shows the full path:

X-GammaInfra-Endpoint: anthropic/claude-sonnet-4-6
X-GammaInfra-Fallback-Chain: openai/gpt-5-mini,anthropic/claude-sonnet-4-6
X-GammaInfra-Fallback-Reason: openai_rate_limit

Log these headers for visibility into how often cascades fire and which providers are responsible. A chain that fires repeatedly on the same primary is a signal to re-order the chain or investigate the primary's health.

Common questions

What triggers a fallback?
HTTP 5xx errors, 429 rate-limit errors, connection timeouts, exceeded per-provider timeout (30 seconds default), and exceeded caller's X-GammaInfra-Max-Latency-Ms budget if set. The provider's health-check failures also drop it from candidate selection for the next 30 seconds.
Can I customize the chain per request?
Yes. Pass models: [list, of, models, in, order] in the request body to use that explicit chain (fails 503 on exhaustion, no auto-router). Or use provider.only and provider.ignore to filter the default chain. Or X-GammaInfra-Routing: literal to force literal endpoint selection from a bare model name.
What happens when every chain entry fails?
The gateway returns 503 with code providers_down and includes the full cascade in X-GammaInfra-Fallback-Chain and X-GammaInfra-Fallback-Reason. This is rare in practice — chains span 3+ providers, and simultaneous outages across distinct vendors are statistically uncommon. When it does happen, retry with exponential backoff; the issue usually clears within seconds.
Does fallback work for streaming responses?
Fallback only fires before the first byte of response data has been sent. Once the gateway has begun streaming, mid-stream provider failures cannot fall back — the caller would already have received partial data, and switching providers would produce inconsistent output. Pre-stream errors (auth, rate-limit, immediate 5xx) do cascade normally.
What's the latency cost of a fallback?
Each failed attempt adds whatever time the failed provider took to return its error — typically 0.5 to 3 seconds for upstream 5xx, up to the per-provider timeout (30 seconds default) for hangs. Setting X-GammaInfra-Max-Latency-Ms caps the total time the gateway will spend cascading.

Try the gateway

Get a GammaInfra API key →

$3 free trial credit on signup, $10 minimum top-up. Pass-through provider rates plus 3% top-up fee during the launch window (5% after 2026-06-23).

Last updated 2026-05-15.