What are hedged LLM requests?
Hedged requests are a latency-reduction technique in which an LLM gateway fires two (or more) provider requests in parallel for the same prompt, returns whichever completes first, and cancels the others. Hedging reduces p95 and p99 latency by avoiding the worst-case provider hang, at the cost of roughly 2x token spend per hedged request.
Why hedging works
LLM provider latency has a long tail. Median is fast; p95 includes occasional slow responses from rate-limit queue backups, network jitter, or cold-start variance. Hedging two independent providers in parallel dramatically reduces the chance that both are simultaneously in their slow tail.
If provider A has a 5% probability of taking longer than 3 seconds and provider B independently has the same 5% probability, the probability that both are slow is 0.05 × 0.05 = 0.25%. The first to complete wins, so the user-perceived p95 drops sharply.
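The arithmetic above can be checked with a short sketch. It assumes the providers' slow tails are fully independent, which real providers sharing a network path or upstream dependency may not satisfy. The function names are illustrative only.

```python
import random


def prob_all_slow(slow_probs):
    """Probability that every leg lands in its slow tail, assuming independence."""
    result = 1.0
    for p in slow_probs:
        result *= p
    return result


def simulate_hedged_slow_rate(slow_probs, trials=200_000, seed=0):
    """Monte Carlo estimate of how often a hedged request is slow.

    A hedged request is slow only when all of its legs are slow,
    because the first leg to complete wins.
    """
    rng = random.Random(seed)
    slow = 0
    for _ in range(trials):
        if all(rng.random() < p for p in slow_probs):
            slow += 1
    return slow / trials
```

With two providers at 5% each, prob_all_slow([0.05, 0.05]) gives approximately 0.0025, and the simulated rate should land close to that value. Correlated slowdowns would push the real figure higher.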
The trade-off
- ~2x token cost for hedged requests — both providers run the full prompt; only the winner's output is used.
- Only beneficial when latency variance is high. If both providers respond in 800 ms consistently, hedging just doubles cost with no latency win.
- Cancellation must be reliable. If the gateway can't cancel the loser, the slow provider keeps tokens flowing and bills accrue for unused output. Streaming hedging is particularly tricky — once the loser has started emitting tokens, cancellation requires HTTP connection close, not a clean API cancel.
- The caller receives whichever response arrives first. Hedging doesn't pick the better answer, just the faster one, so both providers must be acceptable for the task.
How GammaInfra hedges
Hedging fires only when all four conditions hold: (a) the request targets gammainfra/fast or sets the header X-GammaInfra-Preference: latency, (b) the gateway operator has enabled hedging via the HEDGE_ENABLED=true environment variable, (c) the task's fallback chain has at least 2 distinct endpoints with similar quality, and (d) the request is non-streaming.
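The four conditions can be read as a single boolean check. The sketch below is not GammaInfra's code: HedgeRequest, should_hedge, and comparable_endpoints are hypothetical names, and it assumes the caller has already filtered the fallback chain down to endpoints of similar quality. How the gateway actually parses headers or judges quality is not described here.

```python
import os
from dataclasses import dataclass, field


@dataclass
class HedgeRequest:
    """Hypothetical view of an incoming request; not a GammaInfra type."""
    model: str
    headers: dict = field(default_factory=dict)
    stream: bool = False


def should_hedge(request, comparable_endpoints, env=None):
    """Return True only when conditions (a) through (d) all hold."""
    env = os.environ if env is None else env

    # (a) Latency-preferring request: model alias or preference header.
    preference = request.headers.get("X-GammaInfra-Preference", "")
    wants_latency = (
        request.model == "gammainfra/fast" or preference.lower() == "latency"
    )

    # (b) Operator opt-in via environment variable.
    operator_enabled = env.get("HEDGE_ENABLED", "").lower() == "true"

    # (c) At least two distinct endpoints (assumed pre-filtered for quality).
    enough_endpoints = len(set(comparable_endpoints)) >= 2

    # (d) Non-streaming requests only.
    return (
        wants_latency
        and operator_enabled
        and enough_endpoints
        and not request.stream
    )
```

Passing env explicitly makes the check easy to exercise without touching the process environment.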
When these conditions hold, the gateway fires the top-2 endpoints in parallel via asyncio.create_task, waits on asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED), returns the winner, and cancels the loser. If the first leg to complete raised an error (rather than completing successfully), the gateway keeps waiting on the other leg.
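A minimal sketch of that flow, using only the standard asyncio calls named above. It is an illustration under assumptions, not the gateway's implementation: hedged_call and fake_provider are hypothetical, and fake_provider stands in for a real provider call by sleeping.

```python
import asyncio


async def fake_provider(name, delay, fail=False):
    """Hypothetical stand-in for a provider call: sleeps, then returns or raises."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"response from {name}"


async def hedged_call(legs):
    """Start every leg in parallel; return (leg_index, result) of the first success.

    `legs` is a list of zero-argument callables that each return a coroutine.
    """
    if not legs:
        raise ValueError("need at least one leg")
    tasks = {asyncio.create_task(leg()): i for i, leg in enumerate(legs)}
    pending = set(tasks)
    last_error = None
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return tasks[task], task.result()
                # This leg errored: remember it and keep waiting on the rest.
                last_error = task.exception()
        # Every leg failed; surface the most recent error.
        raise last_error
    finally:
        # Cancel any loser still running and wait for it to unwind.
        for task in pending:
            task.cancel()
        if pending:
            await asyncio.gather(*pending, return_exceptions=True)


async def main():
    index, text = await hedged_call([
        lambda: fake_provider("primary", 0.30),
        lambda: fake_provider("secondary", 0.05),
    ])
    print(index, text)


if __name__ == "__main__":
    asyncio.run(main())
```

Cancelling the task only stops the local coroutine. Whether the provider stops generating (and billing) depends on the HTTP client closing the underlying connection when cancelled, which this sketch does not model.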
Counters on /metrics: kraken_hedge_fired_total{primary,secondary}, kraken_hedge_wins_total{position}, kraken_hedge_waste_total. Response header X-GammaInfra-Router-Version: v2_hedged identifies hedged wins.
Why streaming hedging is deferred
Streaming hedging is harder because the gateway has to decide which provider to relay to the client before the first chunk arrives. Once you start relaying provider A's chunks, you can't switch to provider B mid-stream without giving the client inconsistent output. Possible approaches include buffering both until first-byte from each (defeats the latency win), or always picking provider A and only failing over to provider B if A times out (effectively just fallback, not hedging).
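To make the second approach concrete, here is a sketch of "pick provider A, fail over to B only if A's first chunk times out". As noted above, this is fallback rather than hedging: B does not start until A has missed its deadline. The names stream_with_failover, fake_stream, and first_chunk_timeout are hypothetical, and the fake streams stand in for real provider connections.

```python
import asyncio


async def fake_stream(name, first_delay, chunks=3):
    """Hypothetical provider stream: waits before the first chunk, then emits."""
    await asyncio.sleep(first_delay)
    for i in range(chunks):
        yield f"{name}-{i}"


async def stream_with_failover(primary, secondary, first_chunk_timeout):
    """Relay the primary stream unless its first chunk misses the deadline.

    `primary` and `secondary` are zero-argument callables returning async
    generators. Once a chunk from the primary has been relayed, the choice
    is committed and the secondary is never started.
    """
    stream = primary()
    try:
        first = await asyncio.wait_for(
            stream.__anext__(), timeout=first_chunk_timeout
        )
    except asyncio.TimeoutError:
        # Primary too slow to start: drop it and relay the secondary instead.
        await stream.aclose()
        async for chunk in secondary():
            yield chunk
        return
    yield first
    async for chunk in stream:
        yield chunk


async def collect(agen):
    """Drain an async generator into a list (helper for trying the sketch)."""
    return [chunk async for chunk in agen]
```

The caller pays the full first_chunk_timeout before the secondary even begins, which is why this pattern does not deliver the latency win that parallel hedging does.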
GammaInfra's hedging currently fires only on non-streaming requests for this reason. The streaming case is a roadmap item once a clean buffering strategy is validated.
Common questions
Should I always use hedged requests?
How much does hedging actually reduce latency?
What if both hedged providers fail?
Does hedging count once or twice against my rate limit?
Is hedging the same as fallback?
Try the gateway
$3 free trial credit on signup, $10 minimum top-up. Pass-through provider rates plus 3% top-up fee during the launch window (5% after 2026-06-23).
Last updated 2026-05-15.