What is an LLM fallback chain?
A fallback chain is an ordered list of (provider, model) endpoints that an LLM gateway tries in sequence when the primary endpoint fails. If the first endpoint returns a 5xx or 429, or times out, the gateway cascades to the second; if that fails, to the third; and so on. The caller sees one successful response; the cascade is reported in a response header.
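The cascade described above can be sketched in a few lines. This is an illustrative sketch only, not GammaInfra's implementation: `run_chain`, `GatewayResult`, and the `call` callback are hypothetical names, and passing other 4xx responses straight through to the caller is an assumption. The 503 on exhaustion follows the behavior stated later on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def is_upstream_failure(status: int) -> bool:
    """5xx and 429 trigger a cascade (per the definition above)."""
    return status == 429 or 500 <= status <= 599


@dataclass
class GatewayResult:
    status: int
    body: str
    tried: List[str] = field(default_factory=list)  # endpoints attempted, in order


def run_chain(
    chain: List[str],
    call: Callable[[str], Tuple[int, str]],
) -> GatewayResult:
    """Try each endpoint in order and return the first non-failing response.

    `call(endpoint)` is a hypothetical upstream client that returns
    (status, body) and raises TimeoutError when the endpoint times out.
    """
    tried: List[str] = []
    for endpoint in chain:
        tried.append(endpoint)
        try:
            status, body = call(endpoint)
        except TimeoutError:
            continue  # timeout: move on to the next chain entry
        if is_upstream_failure(status):
            continue  # 5xx or 429: move on to the next chain entry
        # Assumption: anything else (including other 4xx) goes back to the caller.
        return GatewayResult(status, body, tried)
    # Every entry failed: report 503 together with the full path that was tried.
    return GatewayResult(503, "every chain entry failed", tried)
```

The `tried` list plays the role of the cascade report: it holds one entry when the primary succeeds and the full path otherwise.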
Why fallback chains exist
LLM providers have outages, and their rate limits are often tighter than developers expect. A single hour of GPT-5 Mini throttling can stall an entire production application. Fallback chains move the failure-recovery decision from application code to the gateway: the gateway tries another provider automatically, and the application keeps working.
The same mechanism handles transient 5xx errors, 429 rate limits, timeouts, and any other upstream failure. From the application's perspective, the gateway succeeds whenever any chain entry succeeds and returns 503 only when every chain entry fails.
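On the caller's side, this leaves a single failure case to handle: a 503 after the whole chain is exhausted. The FAQ at the end of this page recommends retrying that case with exponential backoff. A minimal sketch, assuming a hypothetical `send()` function that performs one gateway request and returns `(status, body)`; the attempt count, base delay, and jitter are illustrative choices, not documented defaults.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry a gateway call while it returns 503, doubling the wait each time.

    `send` is a hypothetical function that performs one request through the
    gateway. `sleep` is injectable so the function can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = 503, ""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body  # success, or an error that retrying won't fix
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            # Add up to 50% random jitter so many clients don't retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    return status, body  # still 503 after the final attempt
```

Because the gateway has already cascaded through every provider before returning 503, application-level retries can stay small; a handful of attempts is usually enough under these assumptions.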
What gets included in a chain
- Distinct providers. A chain of 3–4 models all from one provider doesn't protect against that provider's outage. Chains should span at least 3 different vendors.
- Roughly equivalent quality. The chain runs in quality-descending order, but every entry should still be acceptable for the task. Adding a much weaker model as #4 means the customer gets a noticeably worse answer on a bad day; it is usually better to fail loudly than to degrade quietly.
- Cross-cloud coverage. If you only span vendor APIs that all run in one cloud, a cloud-level network event takes down the whole chain. Mixing Bedrock (AWS-hosted Claude/Llama/Mistral) with native vendor APIs adds infrastructure independence.
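The provider and cloud criteria above are mechanical enough to check in code. The sketch below is one possible check, assuming endpoints are written as `provider/model` strings (the format used in the response headers on this page). `check_chain` and the `cloud_of` mapping are hypothetical; the mapping must be supplied by the caller, since this page does not say where each provider is hosted. The quality criterion is a judgment call and is not checked here.

```python
from typing import Dict, List, Optional


def provider_of(endpoint: str) -> str:
    """Return the provider part of a 'provider/model' endpoint string."""
    return endpoint.split("/", 1)[0]


def check_chain(
    chain: List[str],
    min_providers: int = 3,
    cloud_of: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return a list of problems found in a chain (empty list means none found)."""
    problems: List[str] = []
    providers = {provider_of(endpoint) for endpoint in chain}
    if len(providers) < min_providers:
        problems.append(
            f"chain spans {len(providers)} provider(s); expected at least {min_providers}"
        )
    if cloud_of is not None:
        # Providers missing from the mapping are lumped together as "unknown",
        # which errs on the side of reporting a problem.
        clouds = {cloud_of.get(provider, "unknown") for provider in providers}
        if len(clouds) < 2:
            problems.append("every provider runs in the same cloud")
    return problems


print(check_chain(["openai/gpt-5-mini", "anthropic/claude-sonnet-4-6"]))
# ['chain spans 2 provider(s); expected at least 3']
```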
GammaInfra's chains per task label
- reasoning / math — DeepSeek V4 Pro → GPT-5.4 → Claude Opus 4.6 → Gemini 3.1 Pro
- code — Claude Sonnet 4.6 → GPT-5.4 Mini → DeepSeek V4 Flash → Devstral 2512
- chat — GPT-5 Mini → Mistral Small 2603 → Grok 4.1 Fast → Llama 3.1 8B
- extraction — Gemini 2.5 Flash Lite → Llama 3.1 8B → GPT-5 Nano → Mistral Small 2603
- summarize — Gemini 3.1 Flash Lite → DeepSeek V4 Flash → GPT-5 Mini → Mistral Small 2603
- translation — Mistral Large 2512 → GPT-5 Mini → DeepSeek V4 Flash → Gemini 3 Flash
- tool_use — GPT-5.4 Mini → Claude Sonnet 4.6 → DeepSeek V4 Flash
- multimodal — Gemini 3.1 Pro → Claude Sonnet 4.6 → GPT-5.4 Mini → Gemini 3 Flash
How callers see the cascade
Every response includes the X-GammaInfra-Fallback-Chain response header listing the endpoints tried in order. If only the primary fired, the chain has one entry. If a cascade happened, the chain shows the full path:
X-GammaInfra-Endpoint: anthropic/claude-sonnet-4-6
X-GammaInfra-Fallback-Chain: openai/gpt-5-mini,anthropic/claude-sonnet-4-6
X-GammaInfra-Fallback-Reason: openai_rate_limit
Log these headers for visibility into how often cascades fire and which providers are responsible. A chain that fires repeatedly on the same primary is a signal to re-order the chain or investigate the primary's health.
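One way to act on this advice is to parse the three headers on each response and keep a per-primary cascade count. The sketch below is illustrative: `parse_cascade` and `CascadeStats` are hypothetical helpers, and it assumes the headers arrive as a plain mapping with exactly the names shown above (many HTTP clients expose a case-insensitive mapping instead).

```python
from collections import Counter
from typing import Any, Dict, Mapping


def parse_cascade(headers: Mapping[str, str]) -> Dict[str, Any]:
    """Extract the cascade path, serving endpoint, and reason from response headers."""
    raw = headers.get("X-GammaInfra-Fallback-Chain", "")
    chain = [entry.strip() for entry in raw.split(",") if entry.strip()]
    return {
        "chain": chain,                                      # endpoints tried, in order
        "served_by": headers.get("X-GammaInfra-Endpoint"),   # endpoint that answered
        "reason": headers.get("X-GammaInfra-Fallback-Reason"),
        "cascaded": len(chain) > 1,                          # more than the primary fired
    }


class CascadeStats:
    """Counts requests and cascades per primary endpoint."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()
        self.cascades: Counter = Counter()

    def record(self, headers: Mapping[str, str]) -> None:
        info = parse_cascade(headers)
        if not info["chain"]:
            return  # header missing; nothing to count
        primary = info["chain"][0]
        self.requests[primary] += 1
        if info["cascaded"]:
            self.cascades[primary] += 1

    def cascade_rate(self, primary: str) -> float:
        """Fraction of requests for this primary that needed a fallback."""
        total = self.requests[primary]
        return self.cascades[primary] / total if total else 0.0
```

A cascade rate that stays high for one primary is the signal mentioned above: either re-order the chain or look into that provider's health.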
Common questions
What triggers a fallback?
A 5xx response, a 429, a timeout, or exceeding the X-GammaInfra-Max-Latency-Ms budget if set. A provider that fails its health checks is also dropped from candidate selection for the next 30 seconds.
Can I customize the chain per request?
Yes. Pass models: [list, of, models, in, order] in the request body to use that explicit chain (fails 503 on exhaustion, no auto-router). Alternatively, use provider.only and provider.ignore to filter the default chain, or set X-GammaInfra-Routing: literal to force literal endpoint selection from a bare model name.
What happens when every chain entry fails?
The gateway returns 503 with providers_down and includes the full cascade in X-GammaInfra-Fallback-Chain and X-GammaInfra-Fallback-Reason. This is rare in practice: chains span 3+ providers, and simultaneous outages across distinct vendors are uncommon. When it does happen, retry with exponential backoff; the issue usually clears within seconds.
Does fallback work for streaming responses?
What's the latency cost of a fallback?
X-GammaInfra-Max-Latency-Ms caps the total time the gateway will spend cascading.
Try the gateway
$3 free trial credit on signup, $10 minimum top-up. Pass-through provider rates plus a 3% top-up fee during the launch window (5% after 2026-06-23).
Last updated 2026-05-15.