What are hedged LLM requests?

Hedged requests — a latency-reduction technique in which an LLM gateway fires two (or more) provider requests in parallel for the same prompt and returns whichever completes first, cancelling the others. Reduces p95 and p99 latency by avoiding the worst-case provider hang, at the cost of roughly 2x token spend per hedged request.

Why hedging works

LLM provider latency has a long tail. Median is fast; p95 includes occasional slow responses from rate-limit queue backups, network jitter, or cold-start variance. Hedging two independent providers in parallel dramatically reduces the chance that both are simultaneously in their slow tail.

If provider A has 5% probability of being >3 seconds and provider B has 5% independently, the probability that both are slow is ~0.25%. The first-to-complete wins, so the user-perceived p95 drops sharply.

The trade-off

How GammaInfra hedges

Hedging fires when (a) the request targets gammainfra/fast or X-GammaInfra-Preference: latency, (b) the gateway operator has enabled hedging via the HEDGE_ENABLED=true environment variable, (c) the task's fallback chain has at least 2 distinct endpoints with similar quality, and (d) the request is non-streaming.

When enabled, the gateway fires the top-2 endpoints in parallel via asyncio.create_task, waits on asyncio.wait(..., FIRST_COMPLETED), returns the winner, and cancels the loser. If the first-completing leg raised an error (rather than completing successfully), the gateway keeps waiting on the other.

Counters on /metrics: kraken_hedge_fired_total{primary,secondary}, kraken_hedge_wins_total{position}, kraken_hedge_waste_total. Response header X-GammaInfra-Router-Version: v2_hedged identifies hedged wins.

Why streaming hedging is deferred

Streaming hedging is harder because the gateway has to decide which provider to relay to the client before the first chunk arrives. Once you start relaying provider A's chunks, you can't switch to provider B mid-stream without giving the client inconsistent output. Possible approaches include buffering both until first-byte from each (defeats the latency win), or always picking provider A and only failing over to provider B if A times out (effectively just fallback, not hedging).

GammaInfra's hedging currently fires only on non-streaming requests for this reason. The streaming case is a roadmap item once a clean buffering strategy is validated.

Common questions

Should I always use hedged requests?
No. Hedging roughly doubles token cost. Use it when latency variance hurts your application — interactive chat with strict p95 SLAs, real-time tools, autocomplete-style features. Skip it for batch processing, background workflows, and cost-sensitive workloads where median latency is fine.
How much does hedging actually reduce latency?
Depends on provider-pair correlation. On uncorrelated providers (different vendors, different clouds), hedging typically drops p95 by 30–60% in practice. On correlated providers (e.g. two endpoints on the same vendor sharing infrastructure), the win is smaller because both tend to be slow at the same times. GammaInfra picks the top-2 across distinct providers to maximize the independence assumption.
What if both hedged providers fail?
The gateway re-raises the error and falls through to the standard fallback chain. So a hedged request that fails on both primary endpoints still tries chain entries 3, 4, etc. The customer never sees the failure unless every chain entry fails.
Does hedging count once or twice against my rate limit?
Twice — both upstream provider calls are real. The 240-rpm per-API-key cap on the gateway counts the customer's request once (it's one gateway call), but the gateway makes two upstream calls so your provider-side rate limit on each provider is hit independently. If you BYOK, this means hedged requests effectively halve your per-provider rate-limit headroom.
Is hedging the same as fallback?
No. Fallback is sequential: try provider A, if it fails try provider B. Hedging is parallel: try both A and B simultaneously, take the first success. Fallback adds latency on failure (you waited for A to fail before starting B). Hedging adds cost on success (both A and B ran, only one was needed). The two are complementary — hedging the head of a chain, fallback the tail.

Try the gateway

Get a GammaInfra API key →

$3 free trial credit on signup, $10 minimum top-up. Pass-through provider rates plus 3% top-up fee during the launch window (5% after 2026-06-23).

Last updated 2026-05-15.