Designing a continuous cost/quality dial for LLM routing
Most LLM routers force you to pick a tier: cheap, balanced, or quality. Three buckets, one knob per request. That works as a first cut, but it has two problems we kept hitting in practice.
- "Cheap" for code review isn't the same set of models as "cheap" for chat. The optimal mapping shifts by task.
- The space between buckets is wasted. If you'd take a 20% latency hit to save 40% on cost, no discrete bucket captures that exactly.
So we added a continuous dial. Send any value between 0.0 and 1.0 in a request header and the router weighs cost-vs-quality at that exact point.
    curl https://gammainfra.com/v1/chat/completions \
      -H "Authorization: Bearer sk-gammainfra-..." \
      -H "X-GammaInfra-Cost-Quality: 0.3" \
      -d '{"model":"gammainfra/auto","messages":[...]}'
0.0 = pure quality, 1.0 = pure cost. The router echoes the applied value back as X-GammaInfra-Cost-Quality-Applied: 0.3 so you can confirm it drove the decision.
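For readers calling the API from Python rather than curl, the same request can be sketched with the standard library. The endpoint, header names, and model name come from the curl example above. The helper names `build_request` and `dial_was_applied` are illustrative and not part of any GammaInfra SDK, and the check assumes the echoed header parses as a plain float.

```python
import json
import urllib.request

# Endpoint and header names are taken from the curl example above.
API_URL = "https://gammainfra.com/v1/chat/completions"
DIAL_HEADER = "X-GammaInfra-Cost-Quality"
APPLIED_HEADER = "X-GammaInfra-Cost-Quality-Applied"


def build_request(api_key: str, dial: float, messages: list) -> urllib.request.Request:
    """Build (but do not send) a chat completion request carrying the dial header."""
    if not 0.0 <= dial <= 1.0:  # also rejects NaN, since NaN comparisons are False
        raise ValueError("dial must be between 0.0 and 1.0")
    body = json.dumps({"model": "gammainfra/auto", "messages": messages})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            DIAL_HEADER: format(dial, "g"),
        },
    )


def dial_was_applied(response_headers, sent: float, tol: float = 1e-9) -> bool:
    """Return True if the echoed header is present and matches the value sent."""
    echoed = response_headers.get(APPLIED_HEADER)
    if echoed is None:
        return False
    try:
        return abs(float(echoed) - sent) <= tol
    except ValueError:
        return False
```

To send it, pass the request to `urllib.request.urlopen` and hand `resp.headers` to `dial_was_applied`.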
Why continuous
Discrete tiers force the router into one of three boxes per request. Continuous lets the router pick a model that no discrete bucket would name.
Take five calls with the same prompt, "summarize this 4-page legal contract", at five dial positions:
| Dial | Routes to | Latency | Cost |
|---|---|---|---|
| 0.0 | anthropic/claude-opus-4-7 | ~3.5s | ~$0.040 |
| 0.3 | anthropic/claude-sonnet-4-6 | ~2.1s | ~$0.015 |
| 0.5 | openai/gpt-5-mini | ~1.2s | ~$0.003 |
| 0.7 | google/gemini-3-flash-preview | ~0.9s | ~$0.0012 |
| 1.0 | groq/llama-3.1-8b-instant | ~0.4s | ~$0.0002 |
Five different models, all reasonable for the task, none of them named "tier 1/2/3." The dial lets you slide between them per-request without re-thinking the model decision.
Per-request dial vs per-customer config
Putting the dial in a request header (instead of a customer-level setting) means three things change for free:
- You can shift the dial per use case. Your nightly summary cron job sets 0.7 for cost; your customer-support reply path sets 0.2 for quality. Same customer, same API key, different cost-quality trade per code path.
- You can A/B the dial. Send half your traffic at 0.3 and half at 0.5, compare downstream metrics (user-perceived quality, click-through rate, refund rate), pick the sweet spot empirically.
- You can drift it gradually. Start at 0.2 because you trust quality. Two weeks later, drift to 0.4 because the cheap-tier models have caught up. No code change at the SDK layer.
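The per-use-case and A/B patterns in the list above can be sketched client-side in a few lines. This is a minimal illustration: the code-path names, the `DIAL_BY_PATH` table, and the hash-based split are assumptions made for the example, and only the header name and the dial values come from the text.

```python
import hashlib

DIAL_HEADER = "X-GammaInfra-Cost-Quality"

# Hypothetical per-code-path settings, using the values mentioned above.
DIAL_BY_PATH = {
    "nightly_summary": 0.7,  # cost-leaning batch job
    "support_reply": 0.2,    # quality-leaning customer-facing path
}


def dial_headers(path: str) -> dict:
    """Return the extra header for a given code path."""
    return {DIAL_HEADER: str(DIAL_BY_PATH[path])}


def ab_dial(user_id: str, arm_a: float = 0.3, arm_b: float = 0.5) -> float:
    """Deterministically assign a user to one of two dial values.

    Hashing the user id keeps each user in the same arm across requests,
    and splits a large population roughly in half.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return arm_a if digest[0] % 2 == 0 else arm_b
```

Drifting the dial then amounts to changing one number in `DIAL_BY_PATH` (or in whatever config store backs it), with no change to the request code itself.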
How the dial maps to actual model picks
Phase 1 (live now): bucketing at 0.5. Values below 0.5 route through the "quality" preset; values 0.5 and above route through the "cost" preset. Within each preset, the router still uses task-aware classification — short prompts go to fast models, reasoning prompts go to reasoning-tier models, etc.
This is a deliberate simplification. We could have built a 5-bucket or 10-bucket version, but Phase 1 wanted to ship behind a stable API surface. The header semantics are forward-compatible: when Phase 2 lands, the same header value will drive truly continuous scoring without any client-side change.
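The Phase 1 rule is small enough to write down. The sketch below only mirrors the behaviour described above (below 0.5 is "quality", 0.5 and above is "cost"); the function name is hypothetical, and the task-aware classification that runs inside each preset is not modelled.

```python
def phase1_preset(dial: float) -> str:
    """Map a dial value to the Phase 1 preset described in the text."""
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must be between 0.0 and 1.0")
    # The boundary value 0.5 itself belongs to the "cost" preset.
    return "quality" if dial < 0.5 else "cost"


for value in (0.0, 0.3, 0.49, 0.5, 0.7, 1.0):
    print(value, phase1_preset(value))
```

One consequence worth noticing: in Phase 1, 0.3 and 0.49 select the same preset, so an A/B test between two values on the same side of 0.5 would not be expected to show a difference until Phase 2 lands.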
Phase 2 (post-oracle-grid landing): truly continuous. We're building an internal benchmark grid that scores every model on a representative prompt set. The router will compute a per-model utility score as a weighted sum of (1 − dial) × quality_score + dial × cost_score, normalised across the candidate pool, then pick the argmax. No more buckets.
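As a rough illustration of that scoring rule, here is a sketch under stated assumptions. The text only says scores are "normalised across the candidate pool"; min-max normalisation is one plausible reading, not a confirmed detail. The candidate names and numbers are made up for the example and are not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float   # benchmark score, higher is better (illustrative)
    cost_usd: float  # cost per request, lower is better (illustrative)


def _min_max(values):
    """Rescale values to [0, 1]; a pool with no spread gets a flat 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_model(candidates, dial: float) -> Candidate:
    """Argmax of (1 - dial) * quality_score + dial * cost_score."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must be between 0.0 and 1.0")
    quality_scores = _min_max([c.quality for c in candidates])
    # Cheaper is better, so the cost score is the inverted normalised cost.
    cost_scores = [1.0 - x for x in _min_max([c.cost_usd for c in candidates])]
    utilities = [
        (1.0 - dial) * q + dial * c for q, c in zip(quality_scores, cost_scores)
    ]
    best = max(range(len(candidates)), key=utilities.__getitem__)
    return candidates[best]


# Placeholder pool: names and numbers are invented for illustration.
POOL = [
    Candidate("model-a", quality=0.9, cost_usd=0.040),
    Candidate("model-b", quality=0.7, cost_usd=0.010),
    Candidate("model-c", quality=0.4, cost_usd=0.001),
]
```

With this pool, a dial of 0.0 selects `model-a`, 1.0 selects `model-c`, and 0.5 selects `model-b`, which is the kind of in-between pick that a two-preset split cannot express.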
If you send X-GammaInfra-Preference: latency in addition to the dial, the explicit preference wins: the router prioritises latency over the cost-quality trade. Same for X-GammaInfra-Preference: cost and X-GammaInfra-Preference: quality. The dial is for the continuous middle; the discrete preferences are escape hatches.
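The precedence order can be summarised as a small decision function. This is a sketch of the rules as this post describes them (a pinned model first, then an explicit preference, then the dial, then the default "quality" preference). The function name and return strings are hypothetical, and the dial is assumed to be already parsed and validated.

```python
from typing import Optional

VALID_PREFERENCES = {"quality", "cost", "latency"}


def routing_driver(model: str, preference: Optional[str], dial: Optional[float]) -> str:
    """Return a label for which input decides routing for this request."""
    if model != "gammainfra/auto":
        return "pinned-model"             # a directly named model bypasses routing
    if preference in VALID_PREFERENCES:
        return f"preference:{preference}"  # explicit preference beats the dial
    if dial is not None:
        return f"dial:{dial}"
    return "preference:quality"            # default when nothing usable is sent


print(routing_driver("gammainfra/auto", "latency", 0.3))
print(routing_driver("gammainfra/auto", None, 0.3))
```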
What we tried and threw away
Two earlier designs went into the bin before this one shipped.
Per-customer profile. The first version stored a cost-quality preference in the customer record. Pros: no per-request overhead, simple SDK story. Cons: same-customer use-case nuance disappears, A/B testing requires multiple API keys, drift requires customer-facing UI. Threw it away.
Per-key profile. Second version stored it per API key. Pros: A/B becomes possible (two keys, two profiles). Cons: per-use-case nuance still requires you to mint multiple keys, manage them, attribute traffic. Threw it away in favor of the header.
The header version was strictly more flexible: anything the per-customer or per-key versions can do, the header version can do by setting the value once at the SDK-init layer and never touching it again. The opposite is not true.
Failure modes
Three things to know if you start using the dial heavily:
- Malformed values silently fall through. If you send X-GammaInfra-Cost-Quality: foo or 1.5 or NaN, the router uses your legacy X-GammaInfra-Preference (or default quality) and the request succeeds. We deliberately never return 400 on a malformed dial value because that would break customers who fat-finger the header value.
- Direct-pinned model names ignore the dial. If you call openai/gpt-5 directly, the dial has no effect because you've already picked the model. The dial only steers gammainfra/auto routing.
- The dial doesn't override capability filters. A dial value of 1.0 won't route a tool-use request to a model that doesn't support tools. The router still filters on capability before applying the dial.
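Two of these behaviours are easy to mirror in client code or tests. The sketch below shows lenient parsing (malformed values become "no dial" instead of an error) and a capability filter applied before any dial-based choice. Function names and the `supports_tools` field are assumptions for illustration; the router's actual internals are not described in this post.

```python
import math
from typing import Optional


def parse_dial(raw: Optional[str]) -> Optional[float]:
    """Return a dial in [0.0, 1.0], or None if the header is absent or malformed."""
    if raw is None:
        return None
    try:
        value = float(raw.strip())
    except ValueError:
        return None  # e.g. "foo"
    if math.isnan(value) or not 0.0 <= value <= 1.0:
        return None  # e.g. "NaN" or "1.5"
    return value


def eligible(models: list, needs_tools: bool) -> list:
    """Drop models that lack a required capability, before the dial is consulted."""
    return [m for m in models if m["supports_tools"] or not needs_tools]


# Hypothetical pool: names and capability flags are placeholders.
pool = [
    {"name": "model-a", "supports_tools": True},
    {"name": "model-c", "supports_tools": False},
]
print([m["name"] for m in eligible(pool, needs_tools=True)])
```

Because malformed values never produce a 400, checking the X-GammaInfra-Cost-Quality-Applied response header is the practical way to notice a typo in the dial.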
The takeaway
Three discrete tiers map poorly to the actual continuous trade developers want to make. A per-request header dial is cheap to add, forward-compatible, and lets you A/B and drift without code changes downstream.
If you want to play with the dial, signup is at gammainfra.com. Try it on gammainfra/auto with a short prompt at three or four different values and watch the X-GammaInfra-Endpoint header change.
Frequently asked questions
What is GammaInfra's cost-quality dial?
A request header, X-GammaInfra-Cost-Quality, valued 0.0 to 1.0: 0.0 means pure quality (pick the strongest model for the task), 1.0 means pure cost (cheapest viable), and intermediate values bias proportionally. It lets each call site express its own cost/quality position instead of being limited to three discrete buckets.
Is the dial truly continuous or does it bucket internally?
Phase 1 (live now) buckets at 0.5: any value below maps to the quality preset, and any value at or above maps to the cost preset. The router echoes the applied value back as X-GammaInfra-Cost-Quality-Applied. Phase 2 (once the oracle response grid is populated) computes a continuous per-model score so any dial value can produce a distinct pick.
Is the dial per-request or per-account?
Per-request. Send the header on each call, or set it once as a default_headers value on your OpenAI SDK client so every call from that client carries it. There is no per-customer config; the design is deliberately per-request so different call sites in one app can choose differently.
What takes precedence if I also set a preference or pin a model?
A pinned model name or an explicit X-GammaInfra-Preference (quality/cost/latency) beats the dial. The dial wins only when neither is set. A malformed dial value (NaN, out of range, non-numeric) silently falls through to the default preference; it never returns a 400.
What dial value should I use?
0.3 is a sensible production default: mostly quality-biased, dropping to cheap when the model-fit difference is marginal. 0.5 is the neutral midpoint, equivalent to sending no dial at all. 0.0 forces flagship-only (expensive); 1.0 forces cheap-only (risks quality regressions on hard prompts).