June 18, 2026

Designing a continuous cost/quality dial for LLM routing

Most LLM routers force you to pick a tier: cheap, balanced, or quality. Three buckets, one knob per request. That works as a first cut, but it has two problems we kept hitting in practice.

  1. "Cheap" for code review isn't the same set of models as "cheap" for chat. The optimal mapping shifts by task.
  2. The space between buckets is wasted. If you'd take a 20% latency hit to save 40% on cost, no discrete bucket captures that exactly.

So we added a continuous dial. Send any value between 0.0 and 1.0 in a request header and the router weighs cost-vs-quality at that exact point.

curl https://gammainfra.com/v1/chat/completions \
  -H "Authorization: Bearer sk-gammainfra-..." \
  -H "X-GammaInfra-Cost-Quality: 0.3" \
  -d '{"model":"gammainfra/auto","messages":[...]}'

0.0 = pure quality, 1.0 = pure cost. The router echoes the applied value back as X-GammaInfra-Cost-Quality-Applied: 0.3 so you can confirm it drove the decision.
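For readers calling from Python, here is a minimal sketch of the same request using only the standard library. The endpoint and header names come from the curl example above; the function names, the client-side range check, and the Content-Type header are assumptions for illustration, not part of any official SDK.

```python
import json
import urllib.request

API_URL = "https://gammainfra.com/v1/chat/completions"  # endpoint from the curl example
DIAL_HEADER = "X-GammaInfra-Cost-Quality"
APPLIED_HEADER = "X-GammaInfra-Cost-Quality-Applied"


def build_request(api_key: str, dial: float, messages: list) -> urllib.request.Request:
    """Build (but do not send) a chat completion request carrying the dial header."""
    # Client-side guard (an assumption of this sketch): reject values outside [0.0, 1.0].
    # The comparison is also False for NaN, so NaN is rejected too.
    if not 0.0 <= dial <= 1.0:
        raise ValueError(f"dial must be within [0.0, 1.0], got {dial!r}")
    body = json.dumps({"model": "gammainfra/auto", "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed; not shown in the curl example
            DIAL_HEADER: str(dial),
        },
    )


def send(request: urllib.request.Request) -> tuple:
    """Send the request and return (parsed JSON body, echoed dial value or None)."""
    with urllib.request.urlopen(request) as response:
        applied = response.headers.get(APPLIED_HEADER)
        return json.load(response), applied
```

Comparing the second element returned by send with the value you sent is one way to confirm the dial was applied.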

Why continuous

Discrete tiers force the router into one of three boxes per request. A continuous dial lets the router pick a model that no discrete bucket would name.

Take five calls with the same prompt, "summarize this 4-page legal contract", at five dial positions:

Dial  Routes to                      Latency  Cost
0.0   anthropic/claude-opus-4-7      ~3.5s    ~$0.040
0.3   anthropic/claude-sonnet-4-6    ~2.1s    ~$0.015
0.5   openai/gpt-5-mini              ~1.2s    ~$0.003
0.7   google/gemini-3-flash-preview  ~0.9s    ~$0.0012
1.0   groq/llama-3.1-8b-instant      ~0.4s    ~$0.0002

Five different models, all reasonable for the task, none of them named "tier 1/2/3." The dial lets you slide between them per-request without re-thinking the model decision.

Per-request dial vs per-customer config

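Putting the dial in a request header (instead of a customer-level setting) means three things come for free:

  1. Per-use-case nuance. Different call sites in the same app can send different values.
  2. A/B testing. You can compare two dial values without minting a second API key.
  3. Drift. You can change the value over time without a customer-facing settings UI.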

How the dial maps to actual model picks

Phase 1 (live now): bucketing at 0.5. Values below 0.5 route through the "quality" preset; values 0.5 and above route through the "cost" preset. Within each preset, the router still uses task-aware classification — short prompts go to fast models, reasoning prompts go to reasoning-tier models, etc.

This is a deliberate simplification. We could have built a 5-bucket or 10-bucket version, but for Phase 1 we wanted to ship behind a stable API surface. The header semantics are forward-compatible: when Phase 2 lands, the same header value will drive truly continuous scoring without any client-side change.
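The Phase 1 rule is small enough to sketch in a few lines. The function name and the range check are illustrative assumptions; only the 0.5 threshold and the two preset names come from the description above.

```python
def phase1_preset(dial: float) -> str:
    """Map a dial value in [0.0, 1.0] to the Phase 1 preset name."""
    # This sketch assumes the value was already validated upstream.
    if not 0.0 <= dial <= 1.0:
        raise ValueError(f"dial must be within [0.0, 1.0], got {dial!r}")
    # Below 0.5 routes through "quality"; 0.5 and above routes through "cost".
    return "quality" if dial < 0.5 else "cost"
```

Under this rule 0.1 and 0.4 land in the same preset; task-aware classification inside the preset still decides the final model.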

Phase 2 (post-oracle-grid landing): truly continuous. We're building an internal benchmark grid that scores every model on a representative prompt set. The router will compute a per-model utility score as a weighted sum of (1 − dial) × quality_score + dial × cost_score, normalised across the candidate pool, then pick the argmax. No more buckets.
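To make the Phase 2 formula concrete, here is a minimal sketch of the weighted-sum scoring. It makes several assumptions: min-max normalisation across the pool, a cost score where the cheapest candidate gets 1.0, and candidate names and numbers that are invented for illustration. The production scoring may differ in all of these.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float   # hypothetical benchmark score, higher is better
    cost_usd: float  # hypothetical cost per request, lower is better


def _min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1] across the pool (assumed normalisation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_model(candidates: List[Candidate], dial: float) -> Candidate:
    """Return the candidate maximising (1 - dial) * quality + dial * cost score."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    quality_scores = _min_max([c.quality for c in candidates])
    # Negate cost so that the cheapest candidate receives the highest cost score.
    cost_scores = _min_max([-c.cost_usd for c in candidates])
    utilities = [
        (1.0 - dial) * q + dial * k
        for q, k in zip(quality_scores, cost_scores)
    ]
    # argmax; ties resolve to the earliest candidate in the list.
    best = max(range(len(candidates)), key=utilities.__getitem__)
    return candidates[best]


# Hypothetical pool; these are not real benchmark results.
POOL = [
    Candidate("flagship", quality=0.95, cost_usd=0.040),
    Candidate("mid", quality=0.85, cost_usd=0.010),
    Candidate("small", quality=0.60, cost_usd=0.0002),
]
```

With this pool, pick_model(POOL, 0.0) returns the flagship candidate, pick_model(POOL, 1.0) returns the small one, and 0.5 lands on the mid candidate, which a two-preset split could not single out.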

Explicit preference beats the dial. If you send X-GammaInfra-Preference: latency in addition to the dial, the router prioritises latency over the cost-quality trade-off. The same holds for X-GammaInfra-Preference: cost and X-GammaInfra-Preference: quality. The dial is for the continuous middle; the discrete preferences are escape hatches.
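The precedence between the two headers can be sketched as a small resolver. The function name and return shape are hypothetical, and treating an unrecognised preference value as absent is an assumption of this sketch.

```python
from typing import Optional, Tuple, Union

PREFERENCES = ("latency", "cost", "quality")


def routing_objective(
    preference: Optional[str], dial: Optional[float]
) -> Tuple[str, Union[str, float, None]]:
    """Decide which input steers routing: explicit preference first, then the dial."""
    if preference in PREFERENCES:
        return ("preference", preference)
    # Assumption: a missing or unrecognised preference lets the dial take over.
    if dial is not None:
        return ("dial", dial)
    return ("default", None)
```

Sending both headers, as in routing_objective("latency", 0.3), resolves to the preference and the dial value goes unused.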

What we tried and threw away

Two earlier designs went into the bin before this one shipped.

Per-customer profile. The first version stored a cost-quality preference in the customer record. Pros: no per-request overhead, simple SDK story. Cons: same-customer use-case nuance disappears, A/B testing requires multiple API keys, drift requires customer-facing UI. Threw it away.

Per-key profile. Second version stored it per API key. Pros: A/B becomes possible (two keys, two profiles). Cons: per-use-case nuance still requires you to mint multiple keys, manage them, attribute traffic. Threw it away in favor of the header.

The header version was strictly more flexible: anything the per-customer or per-key versions can do, the header version can do by setting the value once at the SDK-init layer and never touching it again. The opposite is not true.
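As an illustration of that last point, a thin wrapper can hold a client-wide default and still allow per-call overrides. The class below is a hypothetical sketch, not part of any GammaInfra SDK; only the header name comes from this post.

```python
from typing import Dict, Optional

DIAL_HEADER = "X-GammaInfra-Cost-Quality"


class DialDefaults:
    """Hypothetical helper: one dial value set at init, overridable per call."""

    def __init__(self, api_key: str, default_dial: Optional[float] = None) -> None:
        self._base: Dict[str, str] = {"Authorization": f"Bearer {api_key}"}
        if default_dial is not None:
            # Behaves like a per-key or per-customer profile: set once, reused everywhere.
            self._base[DIAL_HEADER] = str(default_dial)

    def headers(self, dial: Optional[float] = None) -> Dict[str, str]:
        """Return headers for one call; a per-call dial replaces the default."""
        merged = dict(self._base)  # copy so overrides never leak into later calls
        if dial is not None:
            merged[DIAL_HEADER] = str(dial)
        return merged
```

A client built with DialDefaults("sk-...", default_dial=0.3) sends 0.3 everywhere, while a single call site can pass headers(dial=0.8) without affecting the others.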

Failure modes

Three things to know if you start using the dial heavily:

  1. Malformed values silently fall through. If you send X-GammaInfra-Cost-Quality: foo or 1.5 or NaN, the router ignores the dial, uses your legacy X-GammaInfra-Preference (or default quality), and the request succeeds. We deliberately never return 400 on a malformed dial value because that would break customers who mistype the header value.
  2. Direct-pinned model names ignore the dial. If you call openai/gpt-5 directly, the dial has no effect — you've already picked the model. The dial only steers gammainfra/auto routing.
  3. The dial doesn't override capability filters. A dial value of 1.0 won't route a tool-use request to a model that doesn't support tools. The router still filters on capability before applying the dial.
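The three rules above can be sketched together as follows. This is an illustration of the described behaviour, not the router's actual code: the function names, the capability labels, and the shape of the model pool are all assumptions.

```python
import math
from typing import Dict, List, Optional, Set

AUTO_MODEL = "gammainfra/auto"


def parse_dial(raw: Optional[str]) -> Optional[float]:
    """Return the dial as a float, or None when the header is absent or malformed."""
    if raw is None:
        return None
    try:
        value = float(raw.strip())
    except ValueError:
        return None  # non-numeric, e.g. "foo"
    if math.isnan(value) or not 0.0 <= value <= 1.0:
        return None  # e.g. "NaN" or "1.5"
    return value


def dial_applies(model: str, raw_dial: Optional[str]) -> bool:
    """The dial only steers auto routing, and only when it parses cleanly."""
    return model == AUTO_MODEL and parse_dial(raw_dial) is not None


def capable_models(pool: Dict[str, Set[str]], required: Set[str]) -> List[str]:
    """Filter on capability first; the dial only ranks what survives this step."""
    return [name for name, caps in pool.items() if required <= caps]
```

Returning None rather than raising mirrors the fall-through behaviour: the caller moves on to the preference header or the default instead of failing the request.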

The takeaway

Three discrete tiers map poorly to the continuous trade-off developers actually want to make. A per-request header dial is cheap to add, forward-compatible, and lets you A/B test and drift without downstream code changes.

If you want to play with the dial, signup is at gammainfra.com. Try it on gammainfra/auto with a short prompt at three or four different values and watch the X-GammaInfra-Endpoint header change.

Frequently asked questions

What is GammaInfra's cost-quality dial?
A continuous request-time header, X-GammaInfra-Cost-Quality, valued 0.0 to 1.0: 0.0 means pure quality (pick the strongest model for the task), 1.0 means pure cost (cheapest viable), intermediate values bias proportionally. It lets each call site express its own cost/quality position instead of being limited to three discrete buckets.

Is the dial truly continuous or does it bucket internally?
Phase 1 (current) buckets at 0.5 (any value below maps to the quality preset, at-or-above to the cost preset) and echoes the applied value back as X-GammaInfra-Cost-Quality-Applied. Phase 2 (once the oracle response grid is populated) computes a continuous per-model score so any dial value can produce a distinct pick.

Is the dial per-request or per-account?
Per request. Send it as a header on the individual call, or as a default_headers value on your OpenAI SDK client so every call from that client carries it. There is no per-customer config; the design is deliberately per-request so different call sites in one app can choose differently.

What takes precedence if I also set a preference or pin a model?
An explicit model pin always wins. Then an explicit X-GammaInfra-Preference (quality/cost/latency) beats the dial. The dial wins only when neither is set. A malformed dial value (NaN, out of range, non-numeric) silently falls through to default preference; it never returns a 400.

What dial value should I use?
0.3 is a sensible production default: mostly quality-biased, dropping to cheap when the model-fit difference is marginal. 0.5 is the neutral midpoint, equivalent to sending no dial at all. 0.0 forces flagship-only (expensive); 1.0 forces cheap-only (risks quality regressions on hard prompts).