Built for agent workflows

Smart routing for agent loops.
One model per step. One API.

Agents call models in loops — planning, extracting, calling tools, reflecting. Each step wants a different model. GammaInfra picks the right one per call, fails over across every major provider when one breaks, and enforces per-call cost and latency budgets you set in a header.

Get an API key · Read the docs
OpenAI SDK · LangGraph · OpenAI Agents SDK · AutoGen · Mastra · Letta

One model for every step is the wrong default.

A real agent loop runs 5–50 model calls before it returns. Picking one flagship for all of them burns money on trivial steps. Picking one cheap model breaks on the hard ones. And when a provider hiccups mid-loop, all the prior work is wasted.

01

Per-step model variance

A planner needs reasoning. An extractor needs cheap. A tool-caller needs structured-output reliability. The model that wins one step loses the next — and writing per-step logic to pick is a maintenance trap.

02

Tail latency compounds

An agent with 20 sequential calls inherits the worst p95 of any one of them. One slow provider on call 14 drags the whole run. You need a router that picks based on live latency, not a static config.
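
A rough way to see the compounding is a back-of-the-envelope calculation. The sketch below assumes each call's latency is independent and treats "slow" as exceeding that call's own p95; both are simplifying assumptions, not GammaInfra measurements, and `prob_any_slow` is a hypothetical helper name.

```python
def prob_any_slow(calls: int, per_call_tail: float = 0.05) -> float:
    """Probability that at least one of `calls` sequential calls lands in its tail.

    per_call_tail=0.05 means "slower than that call's p95".
    Assumes the calls are independent, which real providers only approximate.
    """
    return 1.0 - (1.0 - per_call_tail) ** calls


if __name__ == "__main__":
    for n in (1, 5, 20, 50):
        print(f"{n:>2} calls: {prob_any_slow(n):.0%} chance of at least one p95-slow call")
```

Under those assumptions a 20-call run has roughly a 64% chance of hitting at least one p95-slow call, which is why per-call tail behavior matters more for agents than for single requests.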

03

One provider down breaks the loop

An outage at step 18 of a 20-step task throws away every prior tool result. Fallbacks need to be automatic and cross-provider — not "retry the same broken endpoint with backoff".
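
To make the difference from same-endpoint retries concrete, here is a minimal sketch of a cross-provider fallback loop. It illustrates the general technique only and is not GammaInfra's router: `UpstreamError`, `call_with_fallback`, and the endpoint names are hypothetical, and the retryable statuses are the two this page names (429 and 503).

```python
from typing import Callable, List, Optional, Sequence, Tuple

RETRYABLE_STATUS = {429, 503}  # statuses that trigger a move to the next endpoint


class UpstreamError(Exception):
    """Hypothetical error carrying the failing endpoint and its HTTP status."""

    def __init__(self, endpoint: str, status: int):
        super().__init__(f"{endpoint} returned {status}")
        self.endpoint = endpoint
        self.status = status


def call_with_fallback(
    chain: Sequence[str], send: Callable[[str], str]
) -> Tuple[str, List[str]]:
    """Try each endpoint in order; return (result, endpoints attempted so far)."""
    attempted: List[str] = []
    last_error: Optional[UpstreamError] = None
    for endpoint in chain:
        attempted.append(endpoint)
        try:
            return send(endpoint), attempted
        except UpstreamError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # non-retryable failures surface immediately
            last_error = err  # remember the failure and move to the next provider
    raise RuntimeError(f"all endpoints failed: {attempted}") from last_error


if __name__ == "__main__":
    def fake_send(endpoint: str) -> str:
        # Simulate an outage on the primary only.
        if endpoint == "provider-a/model-x":
            raise UpstreamError(endpoint, 503)
        return f"ok from {endpoint}"

    print(call_with_fallback(
        ["provider-a/model-x", "provider-b/model-y", "provider-c/model-z"], fake_send
    ))
```

The attempted list plays the same role as a fallback-chain record: it shows which endpoints were tried before one succeeded.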

Agent-shaped routing, built into the API.

Every feature below maps to a real pain point of running agents in production. The list reflects what ships in the gateway today; hedged dispatch is rolling out on request.

Per-step model variance
Task-aware routing picks the best model per call
gammainfra/auto classifies each prompt into one of eight task labels — reasoning, code, creative, rewrite, chat, extraction, summarize, translation — and dispatches to the best-fit model at that moment. Requests with a tools param or image content short-circuit through dedicated tool-use / multimodal chains. Your planner step lands on a reasoning model; your extractor step lands on a cheap one. No per-step config.
Tail latency compounds
Latency-aware routing from live p50
Endpoint selection reads live p50 latency on a 30-second refresh window, not a static config. Add X-GammaInfra-Preference: latency on hot-path steps to bias selection. Hedged dispatch for latency-preference traffic is available on request and is enabled per key.
Provider outage mid-workflow
Cross-provider fallback chains
Every task class has a 3–4 deep fallback chain across different providers. When the primary 429s or 503s, the router moves to the next chain member automatically — your step 18 doesn't die because one provider hiccuped. The full chain is in X-GammaInfra-Fallback-Chain.
Cost runaway in long loops
Per-direction cost split in every response
Every response carries X-GammaInfra-Cost-USD, X-GammaInfra-Input-Cost-USD, and X-GammaInfra-Output-Cost-USD. Sum them inside your agent loop and cut the run when the budget tips. No client-side token math against drifting price tables.
Step-level timeout budgets
Per-call max-latency budget enforced by the gateway
Send X-GammaInfra-Max-Latency-Ms: 5000 on any step. If the upstream blows the budget, GammaInfra cancels the call in-flight and returns 504 with max_latency_exceeded — your agent loop catches a typed error instead of hanging on a 99-second provider tail.
Tool-call ID schemas differ
Automatic tool-call ID translation
Anthropic emits toolu_*, OpenAI emits call_*. Agent code that asserts id.startswith("call_") breaks on Anthropic. GammaInfra translates both directions at the OpenAI-compat boundary so the same agent loop works against any provider.
Streaming tool-call index quirks
0-based per-stream tool indexing
Anthropic streams emit an absolute content-block index (often 1 or higher when text precedes the tool_use). OpenAI clients expect a 0-based per-stream index. GammaInfra normalizes the index on streaming deltas, so your parallel tool-call accumulator code is portable across providers.
Multi-model agents are painful
One API key for every major LLM
OpenAI, Anthropic, Google, Mistral, Groq, DeepSeek, xAI, and Amazon Bedrock — all behind one sk-gammainfra-* key. Direct-pin openai/gpt-5-mini, logical-name claude-opus-4-7, or let the router decide. Add a provider and your code doesn't change.
Compliance and residency
Region and provider constraints per call
Add X-GammaInfra-Region: eu to constrain endpoint selection to a region group, or pass provider.only: ["bedrock"] in the request body for strict per-provider routing. The served endpoint and region echo back in response headers for audit.
Cost vs quality dial per call
Continuous cost-quality preference
Set X-GammaInfra-Cost-Quality: 0.0..1.0 per call — 0.0 biases toward the best model, 1.0 toward the cheapest. Run the planner step at 0.2 and the extractor at 0.8 in the same loop without swapping models in your agent code.
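
The cost headers listed above can drive a run-level budget guard inside the agent loop. The following is a minimal sketch, assuming the header value is a plain decimal string such as 0.000162; `RunBudget` and `BudgetExceeded` are hypothetical names, not part of any GammaInfra client.

```python
from typing import Mapping

COST_HEADER = "x-gammainfra-cost-usd"  # compared case-insensitively


class BudgetExceeded(RuntimeError):
    """Raised when the accumulated run cost passes the configured ceiling."""


class RunBudget:
    """Accumulates per-call cost from response headers and stops the run on overspend."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, headers: Mapping[str, str]) -> float:
        """Add this call's cost to the running total and return the new total."""
        cost = 0.0  # a missing header counts as zero cost in this sketch
        for name, value in headers.items():
            if name.lower() == COST_HEADER:
                cost = float(value)
                break
        self.spent_usd += cost
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.6f} of ${self.limit_usd:.6f} budget"
            )
        return self.spent_usd


if __name__ == "__main__":
    budget = RunBudget(limit_usd=0.001)
    try:
        for step in range(5):
            # In a real loop these headers would come from each step's response.
            budget.record({"X-GammaInfra-Cost-USD": "0.0004"})
            print(f"step {step}: total ${budget.spent_usd:.4f}")
    except BudgetExceeded as err:
        print(f"stopping run: {err}")
```

Calling `record` once per step keeps the budget check next to the call that incurred the cost, so the loop can stop before the next step is dispatched.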

The framework you already use.

Each example uses the OpenAI SDK shape — pass base_url="https://api.gammainfra.com/v1" and the routing happens server-side. Frameworks that accept a custom base URL drop in unchanged.

openai_sdk_agent_loop.py
# The OpenAI Python SDK is the lowest-common-denominator client for
# agent loops. Point base_url at GammaInfra and the same code routes
# across every major provider with per-step header controls.
from openai import OpenAI

client = OpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
)

# Planner step: bias toward reasoning, give it a 30s budget.
plan = client.chat.completions.create(
  model="gammainfra/auto",
  messages=[{"role": "user", "content": "Plan the steps to refactor this module..."}],
  extra_headers={
    "X-GammaInfra-Cost-Quality": "0.2",        # quality-biased
    "X-GammaInfra-Max-Latency-Ms": "30000",    # cancel if upstream tails
  },
)

# Extractor step in the same loop: bias toward cheap.
extract = client.chat.completions.create(
  model="gammainfra/auto",
  messages=[{"role": "user", "content": "Extract function names as JSON..."}],
  extra_headers={"X-GammaInfra-Cost-Quality": "0.8"},   # cost-biased
)

# Every response carries the cost split + which model served you:
#   X-GammaInfra-Endpoint:           deepseek/deepseek-v4-pro
#   X-GammaInfra-Input-Cost-USD:     0.000034
#   X-GammaInfra-Output-Cost-USD:    0.000128
#   X-GammaInfra-Fallback-Chain:     deepseek-v4-pro,gpt-5.4,claude-opus-4-6

# LangGraph nodes accept any ChatOpenAI-compatible client. Point at
# GammaInfra and per-node routing comes from headers, not model swaps.
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState

planner = ChatOpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  model="gammainfra/auto",
  default_headers={"X-GammaInfra-Preference": "quality"},
)

extractor = ChatOpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  model="gammainfra/auto",
  default_headers={
    "X-GammaInfra-Preference": "cost",
    "X-GammaInfra-Max-Latency-Ms": "5000",
  },
)

def plan_node(state: MessagesState):
  return {"messages": [planner.invoke(state["messages"])]}

def extract_node(state: MessagesState):
  return {"messages": [extractor.invoke(state["messages"])]}

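graph = StateGraph(MessagesState)
graph.add_node("plan", plan_node)
graph.add_node("extract", extract_node)
graph.set_entry_point("plan")        # the graph starts at the planner
graph.add_edge("plan", "extract")
graph.set_finish_point("extract")    # and ends after extraction
app = graph.compile()                # compile before calling app.invoke(...)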

# Different nodes route to different models without ever importing
# a second SDK. Header changes; same OpenAI shape underneath.

# OpenAI Agents SDK wraps a custom OpenAI client in OpenAIChatCompletionsModel.
# Per-call routing goes through ModelSettings(extra_headers=...) on a RunConfig.
from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel
from agents.model_settings import ModelSettings
from agents.run import RunConfig

client = AsyncOpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
)

agent = Agent(
  name="research_agent",
  instructions="Plan, search, summarize.",
  model=OpenAIChatCompletionsModel(
    model="gammainfra/auto",
    openai_client=client,
  ),
)

# Per-run constraint: never spend more than 8 seconds on one upstream call.
run_config = RunConfig(
  model_settings=ModelSettings(
    extra_headers={"X-GammaInfra-Max-Latency-Ms": "8000"},
  ),
)

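# Runner.run_sync wraps the async Runner.run so this works as a plain script.
# Inside an existing event loop, use `await Runner.run(...)` instead.
result = Runner.run_sync(
  agent,
  input="Find the three biggest changes in HTTP/3 vs HTTP/2.",
  run_config=run_config,
)
print(result.final_output)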

# The agent step gets routed to the best tool-capable model
# in the moment. Tool-call IDs are normalized to call_* on the way out,
# so the OpenAI SDK's assertions hold across providers.

# AutoGen's OpenAIChatCompletionClient takes base_url + default_headers.
# Wire one client per persona, then assemble agents on top of those clients.
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_agentchat.agents import AssistantAgent

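# AutoGen requires an explicit model_info when the model name is not one it
# recognizes. The capability flags below are assumptions; set them to match
# the models your key can route to.
routed_model_info = {
  "vision": False,
  "function_calling": True,
  "json_output": True,
  "structured_output": True,
  "family": "unknown",
}

# Reasoning persona: quality-biased.
planner_model = OpenAIChatCompletionClient(
  model="gammainfra/auto",
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  model_info=routed_model_info,
  default_headers={
    "X-GammaInfra-Cost-Quality": "0.2",
    "X-GammaInfra-Max-Latency-Ms": "30000",
  },
)

# Extraction persona: cost-biased, tighter latency budget.
extractor_model = OpenAIChatCompletionClient(
  model="gammainfra/auto",
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  model_info=routed_model_info,
  default_headers={
    "X-GammaInfra-Cost-Quality": "0.8",
    "X-GammaInfra-Max-Latency-Ms": "5000",
  },
)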

planner = AssistantAgent("planner", model_client=planner_model)
extractor = AssistantAgent("extractor", model_client=extractor_model)

# Different agents land on different models without ever swapping SDKs.
# Each client carries its own per-step headers; GammaInfra routes accordingly.

Anything that takes a custom base URL works. See all integrations →

Every routing choice is auditable.

When an agent loop misbehaves, the debug path is response headers — not a black-box dashboard you have to wait for. Every step tells you exactly which model served it, what it cost, and what the router considered.
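
With the OpenAI Python SDK, response headers are not attached to the parsed completion object; they are reachable through the SDK's `with_raw_response` accessor. The sketch below shows one way to collect the routing headers per step. The header names come from this page; `routing_summary` and `traced_completion` are hypothetical helpers, and the header values are assumed to be plain strings in the formats shown in the samples above.

```python
from typing import Any, Dict, Mapping, Tuple


def routing_summary(headers: Mapping[str, str]) -> Dict[str, Any]:
    """Pull the routing-related headers into a small dict for step logs."""
    lowered = {name.lower(): value for name, value in headers.items()}
    chain = lowered.get("x-gammainfra-fallback-chain", "")
    cost = lowered.get("x-gammainfra-cost-usd")
    return {
        "endpoint": lowered.get("x-gammainfra-endpoint"),
        "cost_usd": float(cost) if cost is not None else None,
        "fallback_chain": [e.strip() for e in chain.split(",") if e.strip()],
        "fallback_reason": lowered.get("x-gammainfra-fallback-reason"),
    }


def traced_completion(client: Any, **kwargs: Any) -> Tuple[Any, Dict[str, Any]]:
    """Run one chat completion and return (completion, routing summary).

    `client` is expected to be an openai.OpenAI instance whose base_url
    points at the gateway; kwargs are passed to chat.completions.create.
    """
    raw = client.chat.completions.with_raw_response.create(**kwargs)
    return raw.parse(), routing_summary(raw.headers)


if __name__ == "__main__":
    sample = {
        "X-GammaInfra-Endpoint": "deepseek/deepseek-v4-pro",
        "X-GammaInfra-Cost-USD": "0.000162",
        "X-GammaInfra-Fallback-Chain": "deepseek-v4-pro,gpt-5.4,claude-opus-4-6",
    }
    print(routing_summary(sample))
```

Logging the summary next to each step's trace gives per-step attribution without a separate accounting pipeline.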

X-GammaInfra-Endpoint
The physical provider/model that served this call — even if you used a logical name like claude-opus-4-7 that resolved to either native or Bedrock.
X-GammaInfra-Cost-USD
Total cost in USD. Split available as X-GammaInfra-Input-Cost-USD + X-GammaInfra-Output-Cost-USD.
X-GammaInfra-Fallback-Chain
Comma-separated list of every endpoint the router attempted. If the primary failed mid-call, the chain shows what it cascaded through.
X-GammaInfra-Fallback-Reason
Why the router moved off the primary — rate limit, 5xx, timeout, or budget breach. Typed strings, not free-form text.
X-GammaInfra-Logical-Model
The task class the classifier picked — reasoning, code, extraction, chat. Useful for debugging when an agent step lands on a model you didn't expect.
X-GammaInfra-Region-Used
AWS region of the served endpoint when applicable (e.g., Bedrock). Echoes back when you constrained selection with X-GammaInfra-Region.
X-GammaInfra-Router-Version
Which routing path served the call: v2, v2_keyword, v2_hedged, or v1 fallback. Tells you exactly which decision path fired.
X-RateLimit-Remaining
Standard rate-limit headers (Limit, Remaining, Reset) on every response. Your agent's rate-limit accountant doesn't have to guess.

Prompts are never logged. Headers carry decisions and costs only.

Pass-through token rates. Pay on top-ups, not requests.

Agents make a lot of calls. Per-request markups compound. GammaInfra charges its fee on top-ups, not on each token — and lets you bring your own provider keys if you'd rather pay upstream direct.

Managed

Use GammaInfra's provider keys. 0% markup on tokens — you pay exactly what the upstream provider charges. Top-up fee on funding the balance (launch-window rate active through 2026-06-23, then standard). $3 free credit on signup, $10 minimum top-up.

See full pricing →

BYOK

Bring your own provider keys. Smart routing, fallback, observability — all still apply. Small per-request routing fee deducted from a prepaid balance (launch-window rate active through 2026-06-23). $5 minimum top-up, no top-up fee.

See full pricing →

Default rate limit is 240 rpm per key. Agent fan-outs that exceed it can be provisioned higher — contact us before launch.

Common questions about agents on smart routing.

How does GammaInfra help with agent loops?
Every step in an agent loop can request a different model: reasoning steps go to gammainfra/auto (quality-preferred), extraction steps to gammainfra/cheap, and tool calls can be pinned to a specific provider. The router picks per call. A real research-loop example reduces cost from $0.245 to $0.029 (about 8.4× cheaper) by mixing models per step rather than running everything through one flagship.
Does GammaInfra work with the OpenAI Agents SDK, LangGraph, AutoGen, Mastra, and Letta?
Yes, any agent framework that accepts a custom OpenAI base URL works. The code samples above cover the OpenAI Agents SDK (OpenAIChatCompletionsModel + RunConfig.model_settings.extra_headers), LangGraph (ChatOpenAI with default_headers per persona), and AutoGen (OpenAIChatCompletionClient with default_headers); Mastra and Letta connect the same way, through a custom base URL. Per-persona model pinning is via headers or by passing different model names per step.
How do I set a max latency budget per agent step?
Set X-GammaInfra-Max-Latency-Ms: <ms> on the step's chat completion call (range 60 to 600000). If the upstream provider call exceeds the budget, GammaInfra cancels it and returns a 504 max_latency_exceeded response. This prevents one slow provider from holding up an entire agent run.
What happens if tool_call.id shapes differ between providers?
GammaInfra translates Anthropic's toolu_* IDs to OpenAI's call_* shape (and the reverse on the inbound tool-result direction) at the gateway boundary. Same-provider conversations round-trip cleanly. Cross-provider mid-conversation continuity isn't possible (each provider validates IDs it issued) — design agent loops to commit to one provider per session for tool-heavy work, then switch providers between sessions.
Can I see per-step cost in agent logs?
Yes — every step response carries X-GammaInfra-Cost-USD, X-GammaInfra-Input-Cost-USD, X-GammaInfra-Output-Cost-USD, and X-GammaInfra-Endpoint. Log these alongside your step trace and you get full per-step cost attribution without a separate accounting pipeline.
Does GammaInfra hedge requests for lower agent-step latency?
On gammainfra/fast with hedging enabled in production, the router fires the top-2 endpoints in parallel and takes the first success, cancelling the loser. Counters are exposed at /metrics. Hedging for streaming requests is deferred to a later release; hedged dispatch itself is currently rolling out per customer on request.
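
The hedging behavior described in the last answer (fire the top candidates in parallel, take the first success, cancel the rest) can be sketched with asyncio. This shows the general technique only, under the assumption that the caller has already chosen which endpoints to race; `hedged_call` and the endpoint names are hypothetical and say nothing about the gateway's internals.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple


async def hedged_call(
    endpoints: Sequence[str], send: Callable[[str], Awaitable[str]]
) -> Tuple[str, str]:
    """Race `send` across all endpoints; return (winning endpoint, its result)."""
    tasks = {asyncio.create_task(send(ep)): ep for ep in endpoints}
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return tasks[task], task.result()
                # A failed task is skipped; the loop keeps waiting on the others.
        raise RuntimeError("every hedged endpoint failed")
    finally:
        # Cancel whatever is still in flight and wait for the cancellations to land.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)


if __name__ == "__main__":
    async def fake_send(endpoint: str) -> str:
        # Simulated latencies: one endpoint is much slower than the other.
        delay = {"fast-endpoint": 0.01, "slow-endpoint": 0.2}[endpoint]
        await asyncio.sleep(delay)
        return f"reply from {endpoint}"

    print(asyncio.run(hedged_call(["slow-endpoint", "fast-endpoint"], fake_send)))
```

Hedging trades extra upstream calls for a shorter tail, which is one reason to reserve it for latency-sensitive steps rather than applying it to every call.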

Start routing in under a minute.

Verify your email, get a key, change one base URL in your agent code. Smart routing kicks in immediately on gammainfra/auto.

Prompts never logged · Credits never expire · No subscriptions · Cancel anytime