Why every GammaInfra response carries a cost-USD header
Most LLM APIs report tokens. The unit is technically correct — tokens are what get billed — but it's the wrong unit for the question developers are actually asking. The question is "how many dollars did that completion cost me?" and you can't answer it from tokens alone.
So every GammaInfra response carries this header:
X-GammaInfra-Cost-USD: 0.000123
Six decimal places of a dollar, with the actual dollar value passed through from the underlying provider. No markup. The router can pick any model on any provider per prompt, but the number in the header is canonical for that request.
This post explains how we compute it, why it's harder than it sounds, and how to read it from the SDK you're already using.
Why dollars matter more than tokens
Token-counting was useful in 2023 when there were three frontier models and they all charged similar rates. In 2026 you can call:
- openai/gpt-5 at $1.25 per 1M input tokens, $10 output
- anthropic/claude-opus-4-7 at $5/$25
- deepseek/deepseek-v4-pro at $0.50/$2.00
- groq/llama-3.1-8b-instant at $0.05/$0.08
A request with 1,000 input tokens and 1,000 output tokens costs anywhere from $0.00013 to $0.030 depending on which model served it: a 230× spread for the same token counts. Token counts alone say nothing about cost in this world. You need the dollar value, at the request level, every time.
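The spread is easy to reproduce. The sketch below uses the list prices quoted above and assumes the figure refers to a request with 1,000 input tokens and 1,000 output tokens; the function name `request_cost` is illustrative only.

```python
from decimal import Decimal

# (input, output) prices in USD per 1M tokens, as quoted in this post.
PRICES = {
    "openai/gpt-5": (Decimal("1.25"), Decimal("10")),
    "anthropic/claude-opus-4-7": (Decimal("5"), Decimal("25")),
    "deepseek/deepseek-v4-pro": (Decimal("0.50"), Decimal("2.00")),
    "groq/llama-3.1-8b-instant": (Decimal("0.05"), Decimal("0.08")),
}

MILLION = Decimal(1_000_000)


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Dollar cost of one request at the model's per-direction rates."""
    in_rate, out_rate = PRICES[model]
    return (in_rate * input_tokens + out_rate * output_tokens) / MILLION


# Assumption: 1,000 tokens in each direction.
costs = {model: request_cost(model, 1000, 1000) for model in PRICES}
cheapest = min(costs.values())
priciest = max(costs.values())
spread = priciest / cheapest

for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model}: ${cost:.5f}")
print(f"spread: {spread:.0f}x")
```

Decimal is used instead of float so that sub-cent amounts do not pick up binary rounding noise.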
If you've ever opened your OpenAI invoice at end-of-month and tried to attribute spend to a specific feature, you know the alternative: aggregate counts grouped by API key, no way to drill down to "what did this user's session cost." The header solves it.
How the number gets computed
For each request, GammaInfra's signal collector knows three things that the upstream provider's response doesn't include directly:
- Which model actually served the request. When you call gammainfra/auto, the router picks. When you call openai/gpt-5, the fallback chain may have cascaded to a different model if OpenAI throttled. The X-GammaInfra-Endpoint response header tells you which one.
- The per-direction pricing for that model. Most models charge input and output differently, sometimes by 5×. The usage.prompt_tokens and usage.completion_tokens fields from the response, multiplied by the right per-direction rates, give the dollar cost.
- Whether any per-context surcharges apply. Some models charge double for input above a context threshold; for example, GPT-5.5 doubles the price of input tokens beyond 272K. The signal collector tracks this per-model and applies it in the calculation.
Sum the input and output costs (after surcharges) and you have a per-request dollar value. We emit it as X-GammaInfra-Cost-USD on the response. We also split it:
X-GammaInfra-Input-Cost-USD: 0.000045
X-GammaInfra-Output-Cost-USD: 0.000078
X-GammaInfra-Cost-USD: 0.000123
The split is useful for agent-loop chargeback where one direction usually dominates. A LangChain ReAct loop reading 80K tokens of context per tool call and outputting 200 has very different cost dynamics than a creative generation call doing the opposite.
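The calculation described above can be sketched in a few lines. This is not GammaInfra's implementation: the `Pricing` type and `split_cost` function are hypothetical, and the sketch assumes a surcharge applies only to the input tokens beyond the threshold, not to the whole prompt.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Tuple

MILLION = Decimal(1_000_000)


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-model price sheet, in USD per 1M tokens."""
    input_per_million: Decimal
    output_per_million: Decimal
    surcharge_threshold: Optional[int] = None  # input tokens; None means no surcharge
    surcharge_multiplier: Decimal = Decimal(1)


def split_cost(p: Pricing, prompt_tokens: int,
               completion_tokens: int) -> Tuple[Decimal, Decimal, Decimal]:
    """Return (input_cost, output_cost, total_cost) in USD."""
    base_tokens, extra_tokens = prompt_tokens, 0
    if p.surcharge_threshold is not None and prompt_tokens > p.surcharge_threshold:
        # Assumption: only tokens past the threshold pay the multiplier.
        base_tokens = p.surcharge_threshold
        extra_tokens = prompt_tokens - p.surcharge_threshold
    billable_input = base_tokens + extra_tokens * p.surcharge_multiplier
    input_cost = billable_input * p.input_per_million / MILLION
    output_cost = completion_tokens * p.output_per_million / MILLION
    return input_cost, output_cost, input_cost + output_cost


# An agent-style call: 80K tokens of context in, 200 tokens out, at gpt-5 list prices.
gpt5 = Pricing(Decimal("1.25"), Decimal("10"))
input_cost, output_cost, total = split_cost(gpt5, 80_000, 200)
print(f"input ${input_cost:.6f}  output ${output_cost:.6f}  total ${total:.6f}")
```

In this example the input side accounts for nearly all of the cost, which is the kind of imbalance the split headers make visible.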
Reading the header in different SDKs
Most OpenAI-compatible SDKs hide response headers behind a "raw response" interface. Here's how to extract the cost-USD header from common stacks:
curl
curl -i https://gammainfra.com/v1/chat/completions \
  -H "Authorization: Bearer sk-gammainfra-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"gammainfra/auto","messages":[{"role":"user","content":"hi"}]}'
The -i flag prints response headers. Look for the X-GammaInfra-Cost-USD line.
openai-python
from openai import OpenAI

client = OpenAI(
    base_url="https://gammainfra.com/v1",
    api_key="sk-gammainfra-...",
)

# with_raw_response returns the HTTP response wrapper instead of the parsed object.
resp = client.with_raw_response.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": "hi"}],
)

# Header values arrive as strings.
cost = resp.http_response.headers["X-GammaInfra-Cost-USD"]

# parse() yields the usual ChatCompletion object.
completion = resp.parse()
print(f"Cost: ${cost}, content: {completion.choices[0].message.content}")
LangChain
LangChain's ChatOpenAI doesn't expose response headers by default. Drop down to the underlying OpenAI client (above) for per-call header access. For aggregate cost tracking across many LangChain calls, the dashboard at dashboard.gammainfra.com rolls up per-request cost by model.
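If you would rather keep totals in your own process, a small ledger keyed by feature or session is enough. The sketch below is illustrative: `parse_cost` and `CostLedger` are hypothetical names, and it assumes only that the header value is a plain decimal string like the one shown earlier.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation
from typing import Dict, Mapping, Optional

COST_HEADER = "X-GammaInfra-Cost-USD"


def parse_cost(headers: Mapping[str, str]) -> Optional[Decimal]:
    """Return the request cost, or None if the header is missing or malformed."""
    # HTTP header names are case-insensitive; plain dicts are not.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(COST_HEADER.lower())
    if raw is None:
        return None
    try:
        return Decimal(raw.strip())
    except InvalidOperation:
        return None


class CostLedger:
    """Accumulates request cost under a caller-chosen label (feature, user, session)."""

    def __init__(self) -> None:
        self.totals: Dict[str, Decimal] = defaultdict(Decimal)

    def record(self, label: str, headers: Mapping[str, str]) -> None:
        cost = parse_cost(headers)
        if cost is not None:
            self.totals[label] += cost


ledger = CostLedger()
ledger.record("search", {"X-GammaInfra-Cost-USD": "0.000123"})
ledger.record("search", {"x-gammainfra-cost-usd": "0.000045"})
ledger.record("search", {})  # no header: ignored
print(dict(ledger.totals))
```

With openai-python, the headers of a raw response can be passed straight to `record`. One possible way to capture them for every call without touching call sites is an httpx response event hook on a client supplied through the SDK's `http_client` argument.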
Vercel AI SDK
import { createOpenAI } from '@ai-sdk/openai';

const kraken = createOpenAI({
  baseURL: 'https://gammainfra.com/v1',
  apiKey: process.env.KRAKEN_API_KEY,
  fetch: async (input, init) => {
    const res = await fetch(input, init);
    console.log('Cost USD:', res.headers.get('X-GammaInfra-Cost-USD'));
    return res;
  },
});
The custom fetch sees every response and can log the cost header (or push it to a metrics system, or write it to your DB alongside the request).
What the header doesn't include
Three things the X-GammaInfra-Cost-USD value is not:
- It doesn't include the top-up fee. When you top up $100 of managed credits, the 3% fee (during launch) means you get $97 of credits. Those credits then deduct at pass-through provider rates. The header reflects the pass-through rate, not your effective cost-after-fee.
- It doesn't include any failed retry costs. If the request cascaded through 3 providers because the first two rate-limited you, the cost in the header is from the provider that succeeded. Failed attempts that returned no tokens cost nothing. Look at X-GammaInfra-Fallback-Chain to see which providers were tried.
- It doesn't apply to BYOK requests with $0 deduction. If your BYOK balance is empty and you have BYOK keys configured, the request returns 402: no provider call happens, no cost incurred.
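To turn a header value into an effective cost after the top-up fee, divide by the fraction of each dollar that becomes credit. A minimal sketch, assuming the 3% launch fee quoted above; the helper name is illustrative.

```python
from decimal import Decimal


def effective_cost(header_cost: Decimal, fee_rate: Decimal = Decimal("0.03")) -> Decimal:
    """Cash cost of a request once the top-up fee is taken into account."""
    # Each dollar paid buys (1 - fee_rate) dollars of credit,
    # so one dollar of credit costs 1 / (1 - fee_rate) dollars of cash.
    return header_cost / (Decimal(1) - fee_rate)


# Spending all $97 of credit from a $100 top-up costs $100 in cash.
print(effective_cost(Decimal("97")))
# The example header value from earlier in the post, after the fee.
print(f"{effective_cost(Decimal('0.000123')):.9f}")
```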
The whole observability header surface
Cost-USD is the most-asked-about header, but it ships with siblings:
| Header | What it tells you |
|---|---|
| X-GammaInfra-Cost-USD | Total cost of this request in USD |
| X-GammaInfra-Input-Cost-USD | Cost of input tokens only |
| X-GammaInfra-Output-Cost-USD | Cost of output tokens only |
| X-GammaInfra-Endpoint | Which provider/model actually served the request |
| X-GammaInfra-Fallback-Chain | Cascade order if multiple providers were tried |
| X-GammaInfra-Logical-Model | The task classification (chat / code / reasoning / etc.) |
| X-GammaInfra-Cost-Quality-Applied | Echo of your X-GammaInfra-Cost-Quality dial value if you sent one |
| X-GammaInfra-Provider | Provider name (also derivable from X-GammaInfra-Endpoint) |
| X-GammaInfra-Attempted-Count | How many fallback attempts ran |
Together these answer most "why did this request go the way it did" questions in one HTTP round trip. The full reference is in the docs.
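For logging, it can help to collect these headers into one record per request. The sketch below is an assumption-laden illustration: `RequestTrace` and `trace_from_headers` are hypothetical names, only the cost headers are parsed as numbers, and the other values are kept as raw strings because their exact formats are not specified in this post.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping, Optional

PREFIX = "x-gammainfra-"


@dataclass(frozen=True)
class RequestTrace:
    cost_usd: Optional[Decimal]
    input_cost_usd: Optional[Decimal]
    output_cost_usd: Optional[Decimal]
    endpoint: Optional[str]
    fallback_chain: Optional[str]   # raw header value; format not assumed
    attempted_count: Optional[str]  # raw header value; format not assumed

    def costs_add_up(self) -> bool:
        """True when both split costs are present and sum to the total."""
        if None in (self.cost_usd, self.input_cost_usd, self.output_cost_usd):
            return False
        return self.input_cost_usd + self.output_cost_usd == self.cost_usd


def trace_from_headers(headers: Mapping[str, str]) -> RequestTrace:
    # Keep only the GammaInfra headers, with the prefix stripped and names lowercased.
    h = {k.lower()[len(PREFIX):]: v for k, v in headers.items()
         if k.lower().startswith(PREFIX)}

    def dec(name: str) -> Optional[Decimal]:
        # Raises decimal.InvalidOperation on a malformed value.
        return Decimal(h[name]) if name in h else None

    return RequestTrace(
        cost_usd=dec("cost-usd"),
        input_cost_usd=dec("input-cost-usd"),
        output_cost_usd=dec("output-cost-usd"),
        endpoint=h.get("endpoint"),
        fallback_chain=h.get("fallback-chain"),
        attempted_count=h.get("attempted-count"),
    )


trace = trace_from_headers({
    "X-GammaInfra-Cost-USD": "0.000123",
    "X-GammaInfra-Input-Cost-USD": "0.000045",
    "X-GammaInfra-Output-Cost-USD": "0.000078",
    "Content-Type": "application/json",
})
print(trace.cost_usd, trace.costs_add_up())
```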
The takeaway
Token counts are universal, but they don't answer the question developers actually have. Dollar amounts do. Putting them in a response header instead of a separate dashboard query means every script that touches your LLM API can react to cost without an extra round trip.
If you want to try it, signup is at gammainfra.com. $3 trial credit on signup; you'll see your first X-GammaInfra-Cost-USD line on the first request.
Frequently asked questions
What is the X-GammaInfra-Cost-USD header?
How is the cost number computed?
X-GammaInfra-Input-Cost-USD and X-GammaInfra-Output-Cost-USD for chargeback when one direction dominates (agent-loop context, for example).
Does the cost header include GammaInfra's fee?
How do I read the header in my SDK?
onFinish/stream-end callback for streaming. The header name is stable across SDK versions; consult your SDK's docs for the exact accessor.
What other observability headers does GammaInfra return?
X-GammaInfra-Endpoint (which provider/model served the call), X-GammaInfra-Fallback-Chain (the cascade if one fired), X-GammaInfra-Router-Version (which routing path), X-GammaInfra-Cost-Quality-Applied (your dial echoed back when it drove the decision), and the standard X-RateLimit-Limit/Remaining/Reset trio.