Agents call models in loops — planning, extracting, calling tools, reflecting. Each step wants a different model. GammaInfra picks the right one per call, fails over across every major provider when one breaks, and enforces per-call cost and latency budgets you set in a header.
A real agent loop runs 5–50 model calls before it returns. Picking one flagship for all of them burns money on trivial steps. Picking one cheap model breaks on the hard ones. And when a provider hiccups mid-loop, all the prior work is wasted.
A planner needs reasoning. An extractor needs cheap. A tool-caller needs structured-output reliability. The model that wins one step loses the next — and writing per-step logic to pick is a maintenance trap.
An agent with 20 sequential calls inherits the worst p95 of any one of them. One slow provider on call 14 drags the whole run. You need a router that picks based on live latency, not a static config.
An outage at step 18 of a 20-step task throws away every prior tool result. Fallbacks need to be automatic and cross-provider — not "retry the same broken endpoint with backoff".
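To put a rough number on the tail-latency point above: if each call independently has a 5% chance of landing beyond its own p95, the chance that a run contains at least one such call grows quickly with loop length. The independence assumption is ours for illustration; real provider slowdowns are often correlated, so read this as a back-of-envelope sketch rather than a measurement.

```python
def p_any_slow_call(n_calls: int, p_slow: float = 0.05) -> float:
    """Probability that at least one of n independent calls is 'slow'.

    p_slow=0.05 corresponds to a call landing beyond its own p95.
    Independence between calls is an assumption made for illustration.
    """
    if n_calls < 0:
        raise ValueError("n_calls must be non-negative")
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be in [0, 1]")
    # Complement rule: 1 - P(every call is fast).
    return 1.0 - (1.0 - p_slow) ** n_calls


for n in (1, 5, 20, 50):
    print(f"{n:>2} calls: {p_any_slow_call(n):.1%} chance of at least one beyond-p95 call")
```

Under that assumption a 20-call run hits at least one beyond-p95 call roughly 64% of the time, which is why per-call latency budgets matter more for loops than for single requests.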
Every feature below maps to a real pain point of running agents in production. The list reflects what ships in the gateway today; hedged dispatch is rolling out on request.
gammainfra/auto classifies each prompt into one of eight task labels (reasoning, code, creative, rewrite, chat, extraction, summarize, translation) and dispatches to the best-fit model at that moment. Requests with a tools param or image content short-circuit through dedicated tool-use / multimodal chains. Your planner step lands on a reasoning model; your extractor step lands on a cheap one. No per-step config.
Set X-GammaInfra-Preference: latency on hot-path steps to bias selection. Hedged dispatch for latency-preference traffic is available on request; enable it on your key.
The chain the router considered comes back in X-GammaInfra-Fallback-Chain.
Every response carries X-GammaInfra-Cost-USD, X-GammaInfra-Input-Cost-USD, and X-GammaInfra-Output-Cost-USD. Sum them inside your agent loop and cut the run when the budget tips. No client-side token math against drifting price tables.
Set X-GammaInfra-Max-Latency-Ms: 5000 on any step. If the upstream blows the budget, GammaInfra cancels the call in-flight and returns 504 with max_latency_exceeded, so your agent loop catches a typed error instead of hanging on a 99-second provider tail.
Anthropic emits toolu_*, OpenAI emits call_*. Agent code that asserts id.startswith("call_") breaks on Anthropic. GammaInfra translates both directions at the OpenAI-compat boundary so the same agent loop works against any provider.
One sk-gammainfra-* key. Direct-pin openai/gpt-5-mini, logical-name claude-opus-4-7, or let the router decide. Add a provider and your code doesn't change.
Set X-GammaInfra-Region: eu to constrain endpoint selection to a region group, or pass provider.only: ["bedrock"] in the request body for strict per-provider routing. The served endpoint and region echo back in response headers for audit.
Set X-GammaInfra-Cost-Quality to a value from 0.0 to 1.0 per call; 0.0 biases toward the best model, 1.0 toward the cheapest. Run the planner step at 0.2 and the extractor at 0.8 in the same loop without swapping models in your agent code.
Each example uses the OpenAI SDK shape: pass base_url="https://api.gammainfra.com/v1" and the routing happens server-side. Frameworks that accept a custom base URL drop in unchanged.
    # The OpenAI Python SDK is the lowest-common-denominator client for
    # agent loops. Point base_url at GammaInfra and the same code routes
    # across every major provider with per-step header controls.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
    )

    # Planner step: bias toward reasoning, give it a 30s budget.
    plan = client.chat.completions.create(
        model="gammainfra/auto",
        messages=[{"role": "user", "content": "Plan the steps to refactor this module..."}],
        extra_headers={
            "X-GammaInfra-Cost-Quality": "0.2",      # quality-biased
            "X-GammaInfra-Max-Latency-Ms": "30000",  # cancel if upstream tails
        },
    )

    # Extractor step in the same loop: bias toward cheap.
    extract = client.chat.completions.create(
        model="gammainfra/auto",
        messages=[{"role": "user", "content": "Extract function names as JSON..."}],
        extra_headers={"X-GammaInfra-Cost-Quality": "0.8"},  # cost-biased
    )

    # Every response carries the cost split + which model served you:
    #   X-GammaInfra-Endpoint: deepseek/deepseek-v4-pro
    #   X-GammaInfra-Input-Cost-USD: 0.000034
    #   X-GammaInfra-Output-Cost-USD: 0.000128
    #   X-GammaInfra-Fallback-Chain: deepseek-v4-pro,gpt-5.4,claude-opus-4-6
    # LangGraph nodes accept any ChatOpenAI-compatible client. Point at
    # GammaInfra and per-node routing comes from headers, not model swaps.
    from langchain_openai import ChatOpenAI
    from langgraph.graph import END, START, MessagesState, StateGraph

    planner = ChatOpenAI(
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
        model="gammainfra/auto",
        default_headers={"X-GammaInfra-Preference": "quality"},
    )

    extractor = ChatOpenAI(
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
        model="gammainfra/auto",
        default_headers={
            "X-GammaInfra-Preference": "cost",
            "X-GammaInfra-Max-Latency-Ms": "5000",
        },
    )


    def plan_node(state: MessagesState):
        return {"messages": [planner.invoke(state["messages"])]}


    def extract_node(state: MessagesState):
        return {"messages": [extractor.invoke(state["messages"])]}


    graph = StateGraph(MessagesState)
    graph.add_node("plan", plan_node)
    graph.add_node("extract", extract_node)
    graph.add_edge(START, "plan")  # the graph needs an entry point before compile()
    graph.add_edge("plan", "extract")
    graph.add_edge("extract", END)
    app = graph.compile()

    # Different nodes route to different models without ever importing
    # a second SDK. Header changes; same OpenAI shape underneath.
    # OpenAI Agents SDK wraps a custom OpenAI client in OpenAIChatCompletionsModel.
    # Per-call routing goes through ModelSettings(extra_headers=...) on a RunConfig.
    import asyncio

    from openai import AsyncOpenAI
    from agents import Agent, Runner, OpenAIChatCompletionsModel
    from agents.model_settings import ModelSettings
    from agents.run import RunConfig

    client = AsyncOpenAI(
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
    )

    agent = Agent(
        name="research_agent",
        instructions="Plan, search, summarize.",
        model=OpenAIChatCompletionsModel(
            model="gammainfra/auto",
            openai_client=client,
        ),
    )

    # Per-run constraint: never spend more than 8 seconds on one upstream call.
    run_config = RunConfig(
        model_settings=ModelSettings(
            extra_headers={"X-GammaInfra-Max-Latency-Ms": "8000"},
        ),
    )


    async def main() -> None:
        # Runner.run is a coroutine, so it has to be awaited inside an event loop.
        result = await Runner.run(
            agent,
            input="Find the three biggest changes in HTTP/3 vs HTTP/2.",
            run_config=run_config,
        )
        print(result.final_output)


    asyncio.run(main())

    # The agent step gets routed to the best tool-capable model
    # in the moment. Tool-call IDs are normalized to call_* on the way out,
    # so the OpenAI SDK's assertions hold across providers.
    # AutoGen's OpenAIChatCompletionClient takes base_url + default_headers.
    # Wire one client per persona, then assemble agents on top of those clients.
    from autogen_ext.models.openai import OpenAIChatCompletionClient
    from autogen_agentchat.agents import AssistantAgent

    # AutoGen asks for model_info when it does not recognize the model name.
    # The capability flags below are assumptions; adjust them to what you route to.
    ROUTED_MODEL_INFO = {
        "vision": False,
        "function_calling": True,
        "json_output": True,
        "structured_output": True,
        "family": "unknown",
    }

    # Reasoning persona: quality-biased.
    planner_model = OpenAIChatCompletionClient(
        model="gammainfra/auto",
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
        model_info=ROUTED_MODEL_INFO,
        default_headers={
            "X-GammaInfra-Cost-Quality": "0.2",
            "X-GammaInfra-Max-Latency-Ms": "30000",
        },
    )

    # Extraction persona: cost-biased, tighter latency budget.
    extractor_model = OpenAIChatCompletionClient(
        model="gammainfra/auto",
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
        model_info=ROUTED_MODEL_INFO,
        default_headers={
            "X-GammaInfra-Cost-Quality": "0.8",
            "X-GammaInfra-Max-Latency-Ms": "5000",
        },
    )

    planner = AssistantAgent("planner", model_client=planner_model)
    extractor = AssistantAgent("extractor", model_client=extractor_model)

    # Different agents land on different models without ever swapping SDKs.
    # Each client carries its own per-step headers; GammaInfra routes accordingly.
Anything that takes a custom base URL works.
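The latency budgets used in the examples above end in a 504 with max_latency_exceeded when the upstream runs long. A minimal sketch of how an agent loop might branch on that result follows. It assumes the error code appears somewhere in the response body (the exact JSON shape is not specified on this page, so the code matches on raw text), and `classify_step` and `widened_budget_ms` are hypothetical helper names, not part of any SDK.

```python
from dataclasses import dataclass

MAX_LATENCY_CODE = "max_latency_exceeded"   # error code named on this page
MIN_BUDGET_MS, MAX_BUDGET_MS = 60, 600_000  # header range stated on this page


@dataclass(frozen=True)
class StepOutcome:
    kind: str         # "ok", "latency_budget_exceeded", or "other_error"
    status_code: int


def classify_step(status_code: int, body_text: str) -> StepOutcome:
    """Map a raw HTTP result to a coarse outcome the agent loop can branch on."""
    if 200 <= status_code < 300:
        return StepOutcome("ok", status_code)
    # Assumption: the code string is present in the body of the 504 response.
    if status_code == 504 and MAX_LATENCY_CODE in body_text:
        return StepOutcome("latency_budget_exceeded", status_code)
    return StepOutcome("other_error", status_code)


def widened_budget_ms(current_ms: int, factor: float = 2.0) -> int:
    """Hypothetical retry policy: grow the budget, clamped to the allowed range."""
    return max(MIN_BUDGET_MS, min(MAX_BUDGET_MS, int(current_ms * factor)))


# The body below is an assumed shape, used only to exercise the branch.
outcome = classify_step(504, '{"error": {"code": "max_latency_exceeded"}}')
if outcome.kind == "latency_budget_exceeded":
    print("retry this step with a budget of", widened_budget_ms(5000), "ms")
```

Keeping the classification separate from the HTTP client means the same branch works whether the step was issued through the OpenAI SDK, LangGraph, or AutoGen.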
When an agent loop misbehaves, the debug path is response headers — not a black-box dashboard you have to wait for. Every step tells you exactly which model served it, what it cost, and what the router considered.
The served endpoint, for example a logical claude-opus-4-7 that resolved to either native or Bedrock.
The cost split: X-GammaInfra-Input-Cost-USD + X-GammaInfra-Output-Cost-USD.
The task label: reasoning, code, extraction, chat. Useful for debugging when an agent step lands on a model you didn't expect.
The region that served the call, to check against X-GammaInfra-Region.
The router decision path: v2, v2_keyword, v2_hedged, or v1 fallback. Tells you exactly which decision path fired.
Rate-limit headers (Limit, Remaining, Reset) on every response. Your agent's rate-limit accountant doesn't have to guess.
Prompts are never logged. Headers carry decisions and costs only.
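Summing the cost headers inside a loop can be sketched as a small accumulator. The header names come from this page; `RunBudget` and `BudgetExceeded` are hypothetical names, and the fallback from the total header to the input/output split is an assumption about which headers are present on a given response.

```python
from decimal import Decimal
from typing import Mapping

COST_HEADER = "x-gammainfra-cost-usd"
INPUT_HEADER = "x-gammainfra-input-cost-usd"
OUTPUT_HEADER = "x-gammainfra-output-cost-usd"


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend for a run passes its cap."""


class RunBudget:
    """Accumulates per-step cost from response headers and enforces a cap."""

    def __init__(self, cap_usd: str) -> None:
        # Decimal avoids float rounding drift when summing many tiny amounts.
        self.cap = Decimal(cap_usd)
        self.spent = Decimal("0")

    def record(self, headers: Mapping[str, str]) -> Decimal:
        """Add one step's cost to the running total; raise if over the cap."""
        # HTTP header names are case-insensitive, so normalize before lookup.
        h = {k.lower(): v for k, v in headers.items()}
        if COST_HEADER in h:
            step = Decimal(h[COST_HEADER])
        else:
            # Assumed fallback: use the input/output split if no total is sent.
            step = Decimal(h.get(INPUT_HEADER, "0")) + Decimal(h.get(OUTPUT_HEADER, "0"))
        self.spent += step
        if self.spent > self.cap:
            raise BudgetExceeded(f"spent ${self.spent} of ${self.cap}")
        return step


budget = RunBudget("0.0002")
budget.record({
    "X-GammaInfra-Input-Cost-USD": "0.000034",
    "X-GammaInfra-Output-Cost-USD": "0.000128",
})
print("spent so far:", budget.spent)
```

With the OpenAI Python SDK, the headers for a call should be reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`; pass `.headers` to `record` after each step and catch `BudgetExceeded` at the top of the loop.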
Agents make a lot of calls. Per-request markups compound. GammaInfra charges its fee on top-ups, not on each token — and lets you bring your own provider keys if you'd rather pay upstream direct.
Use GammaInfra's provider keys. 0% markup on tokens — you pay exactly what the upstream provider charges. Top-up fee on funding the balance (launch-window rate active through 2026-06-23, then standard). $3 free credit on signup, $10 minimum top-up.
Bring your own provider keys. Smart routing, fallback, observability — all still apply. Small per-request routing fee deducted from a prepaid balance (launch-window rate active through 2026-06-23). $5 minimum top-up, no top-up fee.
Default rate limit is 240 rpm per key. Agent fan-outs that exceed it can be provisioned higher — contact us before launch.
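For fan-outs that sit close to the 240 rpm default, client-side pacing keeps the loop from tripping the limit in the first place. The sketch below is a plain sliding-window limiter; how the gateway actually windows its limit is not stated here, so the 60-second sliding window is an assumption, and `SlidingWindowLimiter` is a hypothetical helper.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Client-side pacing: at most max_calls per window_s seconds."""

    def __init__(
        self,
        max_calls: int = 240,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._sent: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_calls:
            return 0.0
        # The oldest call in the window decides when a slot frees up.
        return self.window_s - (now - self._sent[0])

    def mark_sent(self) -> None:
        """Record that a call was just dispatched."""
        self._sent.append(self.clock())


limiter = SlidingWindowLimiter()
delay = limiter.wait_time()
if delay > 0:
    time.sleep(delay)
limiter.mark_sent()  # call this right before issuing the model request
```

If several workers share one key, they would need to share one limiter (or split the quota between them); this sketch covers the single-process case only.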
Send planning steps to gammainfra/auto (quality-preferred), extraction steps to gammainfra/cheap, and pin tool calls to a specific provider. The router picks per call. A real research-loop example reduces cost from $0.245 to $0.029 (8.3× cheaper) by mixing models per step rather than running everything through one flagship.
Works with the OpenAI Agents SDK (OpenAIChatCompletionsModel + RunConfig.model_settings.extra_headers), LangGraph (ChatOpenAI with default_headers per persona), AutoGen (OpenAIChatCompletionClient with default_headers), Mastra, and Letta. Per-persona model pinning is via headers or by passing different model names per step.
Set X-GammaInfra-Max-Latency-Ms: <ms> on the step's chat completion call (range 60 to 600,000 ms). If the upstream provider call exceeds the budget, GammaInfra cancels it and returns a 504 max_latency_exceeded response. This prevents one slow provider from holding up an entire agent run.
What happens when tool_call.id shapes differ between providers?
GammaInfra translates Anthropic's toolu_* IDs to OpenAI's call_* shape (and the reverse on the inbound tool-result direction) at the gateway boundary. Same-provider conversations round-trip cleanly. Cross-provider mid-conversation continuity isn't possible (each provider validates IDs it issued), so design agent loops to commit to one provider per session for tool-heavy work, then switch providers between sessions.
Every response carries X-GammaInfra-Cost-USD, X-GammaInfra-Input-Cost-USD, X-GammaInfra-Output-Cost-USD, and X-GammaInfra-Endpoint. Log these alongside your step trace and you get full per-step cost attribution without a separate accounting pipeline.
On gammainfra/fast with hedging enabled in production, the router fires the top-2 endpoints in parallel and takes the first success, cancelling the loser. Counters are exposed at /metrics. Streaming hedging is deferred to a later release; hedging is currently rolling out per-customer on request.
Verify your email, get a key, change one base URL in your agent code. Smart routing kicks in immediately on gammainfra/auto.
Prompts never logged · Credits never expire · No subscriptions · Cancel anytime