Every agent step deserves a different model
If you're running an agent loop in production, you've probably noticed: pinning one model across every step is a tax you didn't realize you were paying. A planner step needs reasoning. An extractor step should be cheap. A finalizer needs tight tool-call shape. Running all three through the same flagship model is the LLM equivalent of using your most expensive engineer to file expense reports.
This is a field guide to building agents that route per-step, bound their tail latency, survive provider hiccups, and tell you exactly how much each step cost. Code is OpenAI-shape throughout, so it works with LangGraph, OpenAI Agents SDK, AutoGen, Mastra, Letta, or any framework that lets you set a base URL. (The Claude Agent SDK is a Claude Code CLI wrapper and can't be redirected; if your loop is Anthropic-shaped, use the OpenAI SDK against a pinned Anthropic model — same shape, see below.)
A 6-step loop, two different bills
Take a concrete agent: a research-and-draft loop with six steps.
- Plan. Decompose the user's question into 3 sub-queries. Long context, requires reasoning. ~2000 input / 600 output.
- Retrieve. Three parallel web-search tool calls. Tight tool-call shape, no real generation. ~500 input / 200 output, ×3.
- Extract. Pull structured facts from each result. Short prompt, schema-bound output. ~3000 input / 400 output.
- Synthesize. Reason across the extracted facts to draft an answer. Long context, real reasoning. ~5000 input / 1500 output.
- Critique. A separate pass that grades the draft and flags weak claims. ~6000 input / 400 output.
- Finalize. Rewrite the draft using the critique. ~6500 input / 1500 output.
Total request budget across the loop: ~24K input + 5K output tokens.
One-model baseline: pin everything to claude-opus-4-7 at $5 / $25 per 1M tokens (input / output). 24K × $5/1M + 5K × $25/1M = $0.12 + $0.125 = ~$0.245 per loop. Wall-clock is bounded by Opus's typical latency at ~3s per call, so ~14s sequential (plus tool round-trips).
Now route per step against current rates:
- Plan (reasoning) → deepseek-v4-pro at $1.74 / $3.48 per 1M. 2K × $1.74/1M + 600 × $3.48/1M ≈ $0.0056.
- Retrieve × 3 (tool calls) → gpt-5-mini at $0.25 / $2 per 1M. 1.5K × $0.25/1M + 600 × $2/1M ≈ $0.0016.
- Extract (extraction) → gemini-3.1-flash-lite-preview at $0.25 / $1.50 per 1M. 3K × $0.25/1M + 400 × $1.50/1M ≈ $0.0014.
- Synthesize (reasoning) → deepseek-v4-pro. 5K × $1.74/1M + 1.5K × $3.48/1M ≈ $0.0139.
- Critique (chat-class judging) → gpt-5-mini. 6K × $0.25/1M + 400 × $2/1M ≈ $0.0023.
- Finalize (rewrite) → gpt-5-mini. 6.5K × $0.25/1M + 1.5K × $2/1M ≈ $0.0046.
Sum: ~$0.029 per loop, against the $0.245 baseline. Roughly 8× cheaper for the same outputs, and wall-clock drops to ~6s because the cheap and mid-tier steps come back faster. A more aggressive split sends more of the loop to gpt-5-mini and lands closer to $0.015; the trade is quality on the synthesis pass.
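The arithmetic above is easy to re-run when rates change. A minimal sketch, assuming the per-1M rates and token counts quoted in this post (the `RATES` and `STEPS` tables and the `call_cost` helper are illustrative names, not part of any SDK):

```python
# Per-1M-token rates as (input, output) in USD, copied from the text above.
RATES = {
    "claude-opus-4-7": (5.00, 25.00),
    "deepseek-v4-pro": (1.74, 3.48),
    "gpt-5-mini": (0.25, 2.00),
    "gemini-3.1-flash-lite-preview": (0.25, 1.50),
}

# (step, routed model, input tokens, output tokens). Retrieve is already multiplied by 3.
STEPS = [
    ("plan", "deepseek-v4-pro", 2000, 600),
    ("retrieve", "gpt-5-mini", 1500, 600),
    ("extract", "gemini-3.1-flash-lite-preview", 3000, 400),
    ("synthesize", "deepseek-v4-pro", 5000, 1500),
    ("critique", "gpt-5-mini", 6000, 400),
    ("finalize", "gpt-5-mini", 6500, 1500),
]


def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one call at the model's per-1M input and output rates."""
    rate_in, rate_out = RATES[model]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000


# Baseline: every step on the flagship. Routed: each step on its own model.
baseline = sum(call_cost("claude-opus-4-7", i, o) for _, _, i, o in STEPS)
routed = sum(call_cost(m, i, o) for _, m, i, o in STEPS)

for name, model, i, o in STEPS:
    print(f"{name:<11}{model:<32}${call_cost(model, i, o):.4f}")
print(f"baseline ${baseline:.3f}  routed ${routed:.3f}")
```

Swapping a model name in `STEPS` shows how much a single step contributes to the total before you change any routing.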
That's not a free lunch — it's the consequence of the fact that the optimal model is task-specific, and the cost spread between tiers is wide. If you're not routing per step, you're paying the full flagship rate on requests that didn't need it.
A note on rates. The per-1M prices above are current as of May 2026 and pulled from the live cost table the gateway uses to compute response-header costs. Upstream providers reprice periodically; the response-header dollars are always authoritative, this post's per-step arithmetic isn't.
The per-step pattern
The simplest version of this in code: different model string per step, but everything else identical. GammaInfra accepts the OpenAI SDK request/response shape, so the only thing that changes is what you put in model and (optionally) one request header.
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key=os.environ["GAMMAINFRA_API_KEY"],
)

# Step 1: planner. Needs reasoning, so send the prompt at the quality end of the dial.
plan = await client.chat.completions.create(
    model="gammainfra/auto",
    messages=[
        {"role": "system", "content": PLANNER_PROMPT},
        {"role": "user", "content": question},
    ],
    extra_headers={"X-GammaInfra-Cost-Quality": "0.1"},  # 0.0 = pure quality
)

# Step 3: extractor. Short prompts, structured output, push cheap.
facts = await client.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": EXTRACT_PROMPT.format(result=r)}],
    extra_headers={"X-GammaInfra-Cost-Quality": "0.8"},  # 1.0 = pure cost
    response_format={"type": "json_object"},
)

# Step 6: finalizer. Same routing call as the planner, knob set for quality.
final = await client.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": FINALIZE_PROMPT}],
    extra_headers={"X-GammaInfra-Cost-Quality": "0.2"},
)
The X-GammaInfra-Cost-Quality header takes any float from 0.0 to 1.0. 0.0 is pure quality, 1.0 is pure cost. Send any value in between and the router weighs the trade. Malformed values (NaN, out of range, non-numeric) silently fall through to default routing — we never 400 on a header typo.
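Because a malformed value falls through silently, a typo in the dial is invisible at the call site. One way to catch it early is to validate on the client before sending. A minimal sketch: `cost_quality_header` is a hypothetical helper, not a GammaInfra API, and the only thing it takes from the gateway is the documented 0.0 to 1.0 range.

```python
import math


def cost_quality_header(value: float) -> dict[str, str]:
    """Build the routing header, rejecting values the gateway would silently ignore."""
    v = float(value)  # raises ValueError for non-numeric strings
    if math.isnan(v) or not 0.0 <= v <= 1.0:
        raise ValueError(f"cost-quality must be within [0.0, 1.0], got {value!r}")
    return {"X-GammaInfra-Cost-Quality": str(v)}


# Usage: pass the result as extra_headers on the request.
planner_headers = cost_quality_header(0.1)
extractor_headers = cost_quality_header(0.8)
```

Failing loudly in your own code is a design choice. The gateway's lenient behavior protects production traffic, while the strict helper protects you during development.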
If you want to pin a specific model on a step, you can: model="anthropic/claude-opus-4-7" or model="openai/gpt-5-mini" bypasses the router entirely. The point of gammainfra/auto is that the router watches live p50 latency, capability, and per-model cost, and picks the best-fit member of the task class for that moment. You don't have to chase model releases or rebalance chains by hand.
Tail latency compounds in agent loops
Six sequential 2-second calls is a 12-second loop. That's the median. The problem is the tail.
Call any LLM provider enough and you'll see p95 outliers: a request that should take 2s sometimes takes 8s because the provider is under load, your region is far from theirs, or you happened to hit a cold cache. If each of your six steps independently has a 5% chance of being a p95 outlier, the probability that at least one step in your loop blows past the budget is roughly 1 − 0.95^6 ≈ 26%. One in four loops drags. Users feel it.
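The compounding is worth computing for your own loop length. A small sketch, assuming steps are independent and each has the same chance of landing in the tail (`p_any_slow` is an illustrative name):

```python
def p_any_slow(steps: int, p_slow: float = 0.05) -> float:
    """Probability that at least one of `steps` independent calls is a tail outlier."""
    return 1.0 - (1.0 - p_slow) ** steps


# How the chance of a dragging loop grows with loop length at a 5% per-step tail.
for n in (3, 6, 10):
    print(f"{n} steps: {p_any_slow(n):.1%} of loops hit at least one outlier")
```

Real outliers are often correlated (a provider having a bad afternoon slows every step), so treat the result as a rough guide rather than a measurement.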
Two mitigations ship in the routing layer, both worth knowing about.
Live-p50 latency-aware endpoint selection. The router refreshes per-endpoint p50 latency every 30 seconds from production traffic, over a 5-minute window. When you set X-GammaInfra-Preference: latency (or call gammainfra/fast), endpoint selection uses the current p50, not a stale static estimate. If one provider's region is having a bad afternoon, the router moves traffic off it before you have to notice.
Hedged requests for latency-preference traffic. For X-GammaInfra-Preference: latency requests, the router can fire the top-2 endpoints in parallel, take whichever returns first, and cancel the loser mid-flight. The waste is bounded (one extra in-flight request per hedged call) and the p95 reduction is meaningful — the second leg only matters when the first leg has stalled. Hedging is rolling out on request; ping us to flip it on for your key.
# Time-sensitive step: bound by latency, hedged dispatch when enabled.
await client.chat.completions.create(
    model="gammainfra/fast",
    messages=[{"role": "user", "content": "extract entities from: ..."}],
)
You don't have to think about which two endpoints get raced — the router picks based on its current latency table. The response header X-GammaInfra-Endpoint tells you which provider actually won, in case you want to log it. When hedging is enabled, X-GammaInfra-Router-Version: v2_hedged tells you the request raced.
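To make the hedging idea concrete, here is a client-side illustration of the same pattern: start two calls, keep the first success, cancel the other. This is only a sketch of the technique under simplified assumptions, not the router's implementation. `race_two` and `fake_endpoint` are hypothetical names, and the endpoints are simulated with sleeps.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def race_two(first: Callable[[], Awaitable[T]],
                   second: Callable[[], Awaitable[T]]) -> T:
    """Start both calls, return the first success, cancel whatever is still running."""
    pending = {asyncio.ensure_future(first()), asyncio.ensure_future(second())}
    last_error = None
    try:
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:
                    return task.result()
                last_error = task.exception()
        raise last_error  # both legs failed
    finally:
        for task in pending:
            task.cancel()
        # Wait for cancellation to land so the losing call is really torn down.
        await asyncio.gather(*pending, return_exceptions=True)


async def fake_endpoint(name: str, delay_s: float) -> str:
    """Stand-in for a provider call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay_s)
    return name


async def demo() -> str:
    # The stalled leg is cancelled as soon as the fast leg returns.
    return await race_two(
        lambda: fake_endpoint("fast", 0.01),
        lambda: fake_endpoint("stalled", 5.0),
    )
```

The waste is one extra in-flight call per hedged request, which matches the bound described above. Doing this in the routing layer instead of the client means the cancelled leg is closed at the provider connection rather than just abandoned locally.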
Fallback chains: surviving provider weather
The flip side of latency is reliability. When a provider 5xxs or times out, what happens?
For every task class, the router holds a multi-provider fallback chain — usually 3 or 4 different providers covering the same capability. A reasoning request might run deepseek-v4-pro → gpt-5.4 → claude-opus-4-6 → gemini-3.1-pro-preview. If the first provider 5xxs after 2 seconds, the router immediately retries on the second. The customer sees one response — possibly a slightly later one — but it succeeds, with no client-side retry logic.
The response carries an X-GammaInfra-Fallback-Chain header so your agent code can log which providers were tried and which one actually served:
X-GammaInfra-Endpoint: anthropic/claude-opus-4-6
X-GammaInfra-Fallback-Chain: deepseek-v4-pro,claude-opus-4-6
X-GammaInfra-Fallback-Reason: upstream_5xx
X-GammaInfra-Attempted-Count: 2
This is useful for two things. First, when an agent loop produces a surprising result, you can correlate it with the actual model that served, not the model you thought you asked for. Second, when a provider has a sustained outage, you'll see it in your fallback-chain logs before it shows up on a status page.
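For logging, it helps to turn those headers into one structured record per step. A minimal sketch: the header names come from the example above, while `FallbackInfo`, `parse_fallback`, and the defaults used when a header is absent are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class FallbackInfo:
    served_by: str
    chain: Tuple[str, ...]
    reason: Optional[str]
    attempts: int

    @property
    def fell_back(self) -> bool:
        """True when more than one endpoint was attempted."""
        return self.attempts > 1


def parse_fallback(headers: Mapping[str, str]) -> FallbackInfo:
    """Collect the fallback headers into one record; tolerate missing headers."""
    # Plain dicts are case-sensitive, so normalize header names first.
    h = {k.lower(): v for k, v in headers.items()}
    raw_chain = h.get("x-gammainfra-fallback-chain", "")
    return FallbackInfo(
        served_by=h.get("x-gammainfra-endpoint", ""),
        chain=tuple(p.strip() for p in raw_chain.split(",") if p.strip()),
        reason=h.get("x-gammainfra-fallback-reason"),
        # Assumption: an absent count means the first endpoint served the request.
        attempts=int(h.get("x-gammainfra-attempted-count", "1")),
    )


info = parse_fallback({
    "X-GammaInfra-Endpoint": "anthropic/claude-opus-4-6",
    "X-GammaInfra-Fallback-Chain": "deepseek-v4-pro,claude-opus-4-6",
    "X-GammaInfra-Fallback-Reason": "upstream_5xx",
    "X-GammaInfra-Attempted-Count": "2",
})
```

Attaching `info` to each step's trace lets you group surprising outputs by the model that actually served them.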
Per-step cost observability
Every response from POST /v1/chat/completions carries the dollar cost as a header:
X-GammaInfra-Cost-USD: 0.000123
X-GammaInfra-Input-Cost-USD: 0.000045
X-GammaInfra-Output-Cost-USD: 0.000078
Costs are reported to six decimal places of USD (fractions of a cent), passed through from the underlying provider's rate. The input/output split matters for agent loops because the two directions are usually wildly asymmetric: a critique pass reads 6000 tokens of context and outputs 400 tokens, so input cost dominates and you can target the input direction when you optimize.
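To see which direction dominates on each step, you can keep a small per-step ledger keyed by step name. A sketch under stated assumptions: `CostLedger` is a hypothetical helper, the header names are the ones shown above, and `Decimal` is used so six-decimal amounts add up exactly.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Mapping


class CostLedger:
    """Accumulates input and output cost per named agent step."""

    def __init__(self) -> None:
        self.by_step = defaultdict(lambda: {"input": Decimal("0"), "output": Decimal("0")})

    def record(self, step: str, headers: Mapping[str, str]) -> None:
        # Missing headers count as zero rather than raising.
        row = self.by_step[step]
        row["input"] += Decimal(headers.get("X-GammaInfra-Input-Cost-USD", "0"))
        row["output"] += Decimal(headers.get("X-GammaInfra-Output-Cost-USD", "0"))

    def total(self) -> Decimal:
        return sum((r["input"] + r["output"] for r in self.by_step.values()), Decimal("0"))

    def input_share(self, step: str) -> Decimal:
        """Fraction of a step's spend that went to input tokens."""
        row = self.by_step[step]
        spent = row["input"] + row["output"]
        return row["input"] / spent if spent else Decimal("0")


ledger = CostLedger()
ledger.record("critique", {
    "X-GammaInfra-Input-Cost-USD": "0.000045",
    "X-GammaInfra-Output-Cost-USD": "0.000078",
})
```

Header lookups here are case-sensitive because the example passes a plain dict. The headers object on an SDK raw response is case-insensitive, so it can be passed in directly.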
Accumulating cost across a loop is straightforward with the OpenAI SDK's raw-response interface:
class BudgetExceeded(Exception):
    """Raised when cumulative loop spend passes the configured cap."""


class AgentBudget:
    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    async def call(self, **kwargs):
        # with_raw_response exposes the HTTP headers alongside the parsed body.
        resp = await client.chat.completions.with_raw_response.create(**kwargs)
        cost = float(resp.http_response.headers.get("X-GammaInfra-Cost-USD", "0"))
        self.spent_usd += cost
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(f"loop spent ${self.spent_usd:.4f}")
        return resp.parse()


budget = AgentBudget(max_usd=0.10)
plan = await budget.call(model="gammainfra/auto", messages=[...])
# ...if any step pushes total spend over $0.10, the loop short-circuits.
The reason this matters for agents specifically: a runaway loop is the most expensive failure mode in production. A planner that gets stuck calling itself, an extractor that keeps requesting more context, a tool-call that retries 50 times — every one of those is a real money leak. Bounding per-loop cost in your agent state, with the cost header, is one of the cheapest wins available.
Per-step latency budgets
Cost has a per-loop budget. Latency wants one too. Send X-GammaInfra-Max-Latency-Ms on any request and the gateway will cancel the upstream call if it overruns:
await client.chat.completions.create(
    model="gammainfra/auto",
    messages=[...],
    extra_headers={"X-GammaInfra-Max-Latency-Ms": "3000"},
)
# If the upstream provider hasn't responded in 3s, returns
# HTTP 504 with code "max_latency_exceeded".
Valid range is 60 to 600000 ms. Out-of-range or malformed values are silently dropped — same design as the cost-quality dial, header typos never break the request. When the budget is exceeded, cancellation propagates into the in-flight upstream call so the provider connection actually closes, not just your local task.
This is meaningfully different from wrapping the SDK call in asyncio.wait_for. The local wait-for stops your code from blocking, but the upstream HTTP request keeps draining tokens — you pay for output you'll never read. The header-driven cancel lives at the proxy, so the upstream socket closes and the provider stops generating. For agent loops where one step can hang and bottleneck the whole loop, this is the cleaner primitive.
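If the loop as a whole has a deadline, each step's header value can be derived from the time left. A minimal sketch, assuming you track remaining milliseconds yourself: `latency_budget_header` is a hypothetical helper, and the only gateway facts it uses are the header name and the 60 to 600000 ms range stated above.

```python
MIN_MS, MAX_MS = 60, 600_000  # valid header range quoted in this post


def latency_budget_header(remaining_ms: float, step_cap_ms: int) -> dict[str, str]:
    """Give a step the smaller of its own cap and the loop's remaining time."""
    budget = min(int(remaining_ms), step_cap_ms)
    if budget < MIN_MS:
        # Below the valid range the header would be dropped, so fail locally instead.
        raise TimeoutError("loop deadline exhausted before this step could start")
    return {"X-GammaInfra-Max-Latency-Ms": str(min(budget, MAX_MS))}


# Usage: 10s left in the loop, this step capped at 3s.
headers = latency_budget_header(remaining_ms=10_000, step_cap_ms=3_000)
```

Raising before the call matters because an out-of-range value is silently dropped, which would leave the step with no latency bound at all.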
The tool_call.id papercut
Here's the under-discussed cross-provider gotcha that has bitten more agent codebases than anything else on this list.
OpenAI returns tool calls with IDs shaped like call_abc123.... Anthropic returns them shaped like toolu_01abc.... Same OpenAI-style wire format, different ID convention. Agent code that hard-codes id.startswith("call_") (or the inverse) silently breaks the first time it sees the other provider.
The mid-conversation roundtrip is worse: when you send a tool result back, Anthropic validates that the tool_use_id in the user turn matches an ID it issued on the assistant turn. If your code accumulated a toolu_* ID in conversation history and you swap providers between turns, the next call 400s on ID mismatch.
GammaInfra translates these IDs at the boundary, in both directions. Outbound (provider → client): every toolu_* from Anthropic becomes call_* in the response. Inbound (client → provider): every call_* in tool-role messages or assistant tool_calls[] gets translated back to toolu_* before the upstream Anthropic call. Same-provider conversations round-trip cleanly; agent code that assumes one ID shape works against any model behind gammainfra/auto.
# This loop works whether gammainfra/auto picks OpenAI or Anthropic
# under the hood. The id shape stays consistent on the client side.
resp = await client.chat.completions.create(
    model="gammainfra/auto",
    messages=conversation,
    tools=TOOLS,
)
message = resp.choices[0].message
# The assistant turn that issued the tool calls must precede the tool results.
conversation.append(message)
for call in message.tool_calls or []:  # tool_calls is None when no tool was requested
    assert call.id.startswith("call_")  # always true
    result = await execute_tool(call)
    conversation.append({"role": "tool", "tool_call_id": call.id, "content": result})
# Next turn: the same call.id goes back through the proxy.
next_resp = await client.chat.completions.create(
    model="gammainfra/auto",
    messages=conversation,
    tools=TOOLS,
)
Streaming gets the same treatment for the tool-call index field. Anthropic emits content_block.index as the absolute position in the response array, which can be 1 or higher if any text precedes the tool_use. OpenAI semantics want a 0-based per-stream counter for the tool-call sequence. The proxy maintains the per-stream counter so agent code that assumes delta.tool_calls[0].index == 0 for the first tool call is correct regardless of upstream provider.
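The index normalization amounts to a small per-stream counter. The sketch below shows the idea only; it is not the proxy's code, and `ToolIndexRemapper` is a hypothetical name. It assumes it is called only for tool-use blocks, once per streamed delta.

```python
class ToolIndexRemapper:
    """Maps absolute content-block indexes to a 0-based tool-call counter, per stream."""

    def __init__(self) -> None:
        self._seen: dict[int, int] = {}

    def remap(self, block_index: int) -> int:
        # The first time a tool-use block appears it takes the next counter value;
        # later deltas for the same block reuse that value.
        if block_index not in self._seen:
            self._seen[block_index] = len(self._seen)
        return self._seen[block_index]


# A response with text at block 0 and tool calls at blocks 1 and 2:
remapper = ToolIndexRemapper()
first = remapper.remap(1)   # first tool call
again = remapper.remap(1)   # another delta for the same tool call
second = remapper.remap(2)  # second tool call
```

One remapper instance per stream is essential, since the counter must restart at 0 for every response.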
One key, every framework
Because the wire format is OpenAI-shape, every agent framework that lets you set a base URL just works.
OpenAI Agents SDK
import os

from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel

client = AsyncOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key=os.environ["GAMMAINFRA_API_KEY"],
)

agent = Agent(
    name="Researcher",
    instructions="...",
    model=OpenAIChatCompletionsModel(
        model="gammainfra/auto",
        openai_client=client,
    ),
)

result = await Runner.run(agent, "What's the current draft of the SEC ruling?")
LangGraph
ChatOpenAI carries its default_headers at instantiation; per-node header overrides need a separate client per persona. .bind() doesn't forward headers to the underlying HTTP request, so a one-client-many-binds shortcut quietly drops the routing hint.
import os

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph

# One client per persona, because headers are fixed at instantiation.
planner_llm = ChatOpenAI(
    model="gammainfra/auto",
    base_url="https://api.gammainfra.com/v1",
    api_key=os.environ["GAMMAINFRA_API_KEY"],
    default_headers={"X-GammaInfra-Cost-Quality": "0.2"},
)

extractor_llm = ChatOpenAI(
    model="gammainfra/auto",
    base_url="https://api.gammainfra.com/v1",
    api_key=os.environ["GAMMAINFRA_API_KEY"],
    default_headers={
        "X-GammaInfra-Cost-Quality": "0.8",
        "X-GammaInfra-Max-Latency-Ms": "5000",
    },
)

graph = StateGraph(...)
graph.add_node("plan", planner_llm)
graph.add_node("extract", extractor_llm)
Claude Agent SDK / Anthropic-shaped tool loops
If your agent code is built around Anthropic's tool-use loop pattern, you get the most underrated upgrade in this list for free: provider portability. Pin anthropic/claude-opus-4-7 today, swap to gammainfra/auto tomorrow and your tool-handling code doesn't change a line. The proxy normalizes toolu_* ↔ call_* in both directions and 0-bases the streaming tool index, so the same agent loop runs against any model behind gammainfra/auto.
One mechanical note: the OpenAI Python SDK is the wiring of choice here because Claude Agent SDK itself is a Claude Code CLI wrapper without a custom base-URL hook today. Use the OpenAI SDK with an Anthropic model pin and the loop stays Anthropic-shaped end to end — same role/content schema, same tool-use semantics, with the cross-provider portability layered in.
client = AsyncOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key=os.environ["GAMMAINFRA_API_KEY"],
)

# Pin to a specific Anthropic model for the agent loop.
resp = await client.chat.completions.create(
    model="anthropic/claude-opus-4-7",
    messages=conversation,
    tools=TOOLS,
)

# Same code, swap the model, route across every provider:
resp = await client.chat.completions.create(
    model="gammainfra/auto",
    messages=conversation,
    tools=TOOLS,
)
The point: zero framework-specific wiring. The OpenAI-shape proxy is intentional precisely so existing agent code keeps working when you swap in smarter routing.
What we deliberately don't do
One last thing worth saying out loud: we do not log the contents of your prompts. The router classifies the prompt to pick a model, tracks aggregate decisions across the population of requests, and writes per-request metadata (model dispatched, latency, token counts, cost) for billing and reliability. We don't store prompt text, response text, or tool arguments.
That matters for agent loops specifically because agent traces contain everything: user data, internal tool schemas, partial reasoning chains. The routing layer doesn't need to see any of that to do its job, so it doesn't. Every routing decision is auditable on the customer side via response headers — you can see exactly which model served, which provider was attempted, and what it cost, on the request that just returned.
Before you launch
Two operational notes worth knowing before you point production traffic at the proxy:
- Default rate limit is 240 requests/minute per key. Long-running agents with parallel tool fan-outs can blow through that. If your loop fans out wider — or you're running many agents in parallel — get in touch before launch and we'll provision higher.
- Hedged dispatch is rolling out on request. The latency-tier endpoint selection is live for every key; the parallel-leg hedging on top is gated per account during rollout. Ping us if you want it flipped on for your key.
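For the rate limit in the first note, a client-side throttle keeps fan-outs under the per-key ceiling instead of discovering it through rejected requests. A minimal sliding-window sketch: `SlidingWindowLimiter` is a hypothetical helper, the default of 240 per 60 seconds mirrors the figure above, and whether the gateway itself counts with a sliding window is an assumption.

```python
import asyncio
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most `limit` calls in any `window_s`-second window (single process)."""

    def __init__(self, limit: int = 240, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._stamps = deque()

    def reserve(self) -> float:
        """Claim a slot and return 0.0, or return the seconds to wait for one."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            self._stamps.append(now)
            return 0.0
        return self.window_s - (now - self._stamps[0])

    async def acquire(self) -> None:
        """Sleep until a slot is free, then claim it."""
        while True:
            delay = self.reserve()
            if delay <= 0:
                return
            await asyncio.sleep(delay)
```

Call `await limiter.acquire()` before each request and share one limiter across every agent that uses the same key. It does not coordinate across processes, so a multi-worker deployment needs a shared counter or a lower per-worker limit.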
The takeaway
Agent loops compound every weakness of using one model for everything. Per-step model variance, per-step latency budgets, per-step cost tracking, and resilient fallback chains all have to live somewhere. They can live in your agent code, written by hand, framework by framework. Or they can live in the routing layer, exposed through request and response headers, with the OpenAI wire format you're already using.
If you want to try it on your agent loop, signup is at gammainfra.com. $3 trial credit, $10 minimum top-up, no per-token markup on managed credits. Point your agent framework at api.gammainfra.com/v1 and watch the headers do the work. The marketing summary is at /agents; the wire-format spec is at /docs.
Frequently asked questions
Why use a different model for each agent step?
How do I route each agent step to a different model?
Set model to a specific endpoint (e.g. anthropic/claude-opus-4-7 for the reasoning step, gammainfra/cheap for extraction). Or use gammainfra/auto everywhere and let the task-aware router classify each step's prompt and pick, optionally biased per call with the X-GammaInfra-Cost-Quality header. Only base_url and api_key change vs the OpenAI SDK.
Does switching models mid-agent-loop break tool calls?
GammaInfra translates tool-call IDs (Anthropic's toolu_* vs OpenAI's call_*) at the boundary so OpenAI-shaped agent code round-trips cleanly. Cross-provider mid-conversation continuity is not possible because each provider validates IDs it issued; the recommended pattern is to commit a tool-heavy session to one provider and switch providers between sessions, not within them.
How do I cap latency per agent step?
Send the X-GammaInfra-Max-Latency-Ms request header (range 60 to 600000 ms). If the upstream call exceeds the budget, GammaInfra cancels it and returns 504 max_latency_exceeded so one slow provider can't stall the whole loop. Malformed values are silently dropped; the header never causes a 400.
What happens if a provider fails mid-loop?
The router retries on the next endpoint in the task class's fallback chain and reports the providers it tried in the X-GammaInfra-Fallback-Chain response header. The step only fails if every endpoint in the chain fails.