GammaInfra for agents — Smart routing built for agentic workflows

Q: How do I set a max latency budget per agent step?

Set the request header X-GammaInfra-Max-Latency-Ms: <ms> on the step's chat completion call (range 60 to 600000 ms). If the upstream provider call exceeds the budget, GammaInfra cancels it and returns a 504 max_latency_exceeded response. This prevents one slow provider from holding up an entire agent run.

Q: What happens if tool_call.id shapes differ between providers (toolu vs call_)?

GammaInfra translates Anthropic's toolu_* IDs to OpenAI's call_* shape (and the reverse on the inbound tool-result direction) at the gateway boundary. Same-provider conversations round-trip cleanly. Cross-provider mid-conversation continuity isn't possible because each provider validates IDs it issued — design agent loops to commit to one provider per session for tool-heavy work, then switch providers between sessions.

Q: Can I see per-step cost in agent logs?

Yes — every step response carries X-GammaInfra-Cost-USD, X-GammaInfra-Input-Cost-USD, X-GammaInfra-Output-Cost-USD, and X-GammaInfra-Endpoint headers. Log these alongside your step trace and you get full per-step cost attribution without a separate accounting pipeline.

Q: Does GammaInfra hedge requests for lower agent-step latency?

On gammainfra/fast with HEDGE_ENABLED set in production, the router fires the top-2 endpoints in parallel and takes the first success, cancelling the loser. Counters are exposed at /metrics (kraken_hedge_fired_total, kraken_hedge_wins_total). Streaming hedging is deferred to a later release. Hedging is currently rolling out per customer, on request.

Why agents need smart routing

One model for every step is the wrong default.

A real agent loop runs 5–50 model calls before it returns. Picking one flagship for all of them burns money on trivial steps. Picking one cheap model breaks on the hard ones. And when a provider hiccups mid-loop, all the prior work is wasted.

Per-step model variance

A planner needs reasoning. An extractor needs cheap. A tool-caller needs structured-output reliability. The model that wins one step loses the next — and writing per-step logic to pick is a maintenance trap.

Tail latency compounds

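An agent with 20 sequential calls pays for every slow call in the chain. If each call independently has a 5% chance of landing beyond its p95, about 64% of 20-call runs hit at least one tail-latency call. One slow provider on call 14 drags the whole run. You need a router that picks based on live latency, not a static config.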
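
To put a rough number on how tail latency compounds, the sketch below computes the chance that a run of sequential calls contains at least one call slower than its own p95. It assumes the calls are independent, which real providers only approximate, and the function name is illustrative.

```python
def p_at_least_one_tail(n_calls: int, quantile: float = 0.95) -> float:
    """Chance that at least one of n independent calls exceeds the latency quantile."""
    if n_calls < 0 or not 0.0 <= quantile <= 1.0:
        raise ValueError("n_calls must be >= 0 and quantile must be within [0, 1]")
    # Each call stays under the quantile with probability `quantile`;
    # all n stay under it with probability quantile ** n.
    return 1.0 - quantile ** n_calls


for n in (1, 5, 20, 50):
    print(f"{n:>2} calls: {p_at_least_one_tail(n):.0%} of runs hit a >p95 call")
```

Under that independence assumption, a 20-call loop sees a beyond-p95 call in roughly 64% of runs, so the "rare" slow path is the common case for long loops.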

One provider down breaks the loop

An outage at step 18 of a 20-step task throws away every prior tool result. Fallbacks need to be automatic and cross-provider — not "retry the same broken endpoint with backoff".

What you get

Agent-shaped routing, built into the API.

Every feature below maps to a real pain point of running agents in production. The list reflects what ships in the gateway today; hedged dispatch is rolling out on request.

Per-step model variance

Task-aware routing picks the best model per call

gammainfra/auto classifies each prompt into one of eight task labels — reasoning, code, creative, rewrite, chat, extraction, summarize, translation — and dispatches to the best-fit model at that moment. Requests with a tools param or image content short-circuit through dedicated tool-use / multimodal chains. Your planner step lands on a reasoning model; your extractor step lands on a cheap one. No per-step config.

Tail latency compounds

Latency-aware routing from live p50

Endpoint selection reads live p50 latency on a 30-second refresh window, not a static config. Add X-GammaInfra-Preference: latency on hot-path steps to bias selection. Hedged dispatch for latency-preference traffic is available on request — enable on your key.

Provider outage mid-workflow

Cross-provider fallback chains

Every task class has a 3–4 deep fallback chain across different providers. When the primary 429s or 503s, the router moves to the next chain member automatically — your step 18 doesn't die because one provider hiccuped. The full chain is in X-GammaInfra-Fallback-Chain.
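
The fallback behavior can be pictured with a small client-side sketch. This is not GammaInfra's implementation: `send`, `RETRYABLE_STATUSES`, and the endpoint names are hypothetical stand-ins. The only behavior taken from the text is that a 429 or 503 moves the request to the next chain member.

```python
from typing import Callable, List, Tuple

# Statuses that trigger a move to the next chain member (from the text: 429 and 503).
RETRYABLE_STATUSES = {429, 503}


def call_with_fallback(
    chain: List[str],
    send: Callable[[str], Tuple[int, str]],
) -> Tuple[str, str, List[str]]:
    """Try each endpoint in order; return (endpoint, body, attempted endpoints)."""
    attempted: List[str] = []
    for endpoint in chain:
        attempted.append(endpoint)
        status, body = send(endpoint)
        if status in RETRYABLE_STATUSES:
            continue  # rate-limited or unavailable: fall through to the next provider
        if status >= 400:
            raise RuntimeError(f"{endpoint} failed with non-retryable status {status}")
        return endpoint, body, attempted
    raise RuntimeError(f"every endpoint in the chain failed: {attempted}")


# Hypothetical transport: the first provider is rate-limited, the second answers.
def fake_send(endpoint: str) -> Tuple[int, str]:
    return (429, "") if endpoint == "provider-a/model-x" else (200, "ok")


served_by, body, attempted = call_with_fallback(
    ["provider-a/model-x", "provider-b/model-y", "provider-c/model-z"], fake_send
)
print(served_by, attempted)
```

The `attempted` list plays the role that X-GammaInfra-Fallback-Chain plays in a real response: it records what was tried before the call succeeded.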

Cost runaway in long loops

Per-direction cost split in every response

Every response carries X-GammaInfra-Cost-USD, X-GammaInfra-Input-Cost-USD, and X-GammaInfra-Output-Cost-USD. Accumulate the total (X-GammaInfra-Cost-USD) inside your agent loop and cut the run when it crosses your budget; use the input and output headers for the per-direction breakdown. No client-side token math against drifting price tables.
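
A minimal sketch of such a budget guard follows. `RunBudget` and `BudgetExceeded` are illustrative names, not part of any SDK, and the guard assumes the total-cost header parses as a plain decimal string, as in the sample values on this page.

```python
from typing import Mapping


class BudgetExceeded(RuntimeError):
    """Raised when accumulated spend crosses the run budget."""


class RunBudget:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, headers: Mapping[str, str]) -> float:
        """Add one step's total cost; raise once the run is over budget."""
        # HTTP header names are case-insensitive, so normalize before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        self.spent_usd += float(lowered.get("x-gammainfra-cost-usd", "0"))
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.6f} of ${self.limit_usd:.6f}"
            )
        return self.spent_usd


budget = RunBudget(limit_usd=0.0003)
budget.record({"X-GammaInfra-Cost-USD": "0.000162"})
try:
    budget.record({"X-GammaInfra-Cost-USD": "0.000162"})
except BudgetExceeded as exc:
    print("stopping run:", exc)
```

With the OpenAI Python SDK, response headers are available by calling `client.chat.completions.with_raw_response.create(...)`, which returns an object exposing `.headers` and `.parse()`.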

Step-level timeout budgets

Per-call max-latency budget enforced by the gateway

Send X-GammaInfra-Max-Latency-Ms: 5000 on any step. If the upstream blows the budget, GammaInfra cancels the call in-flight and returns 504 with max_latency_exceeded — your agent loop catches a typed error instead of hanging on a 99-second provider tail.
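
One way an agent loop might recognize that typed error is sketched below. The body layout is an assumption: the text only says the response is a 504 carrying max_latency_exceeded, so the OpenAI-style `{"error": {...}}` envelope and the `code`/`type` fields checked here are guesses to adapt to the real payload.

```python
from typing import Any, Mapping, Optional


def is_latency_budget_error(status_code: int, body: Optional[Mapping[str, Any]]) -> bool:
    """True when a failed step looks like a gateway-enforced latency cutoff."""
    if status_code != 504 or not body:
        return False
    error = body.get("error", body)  # assumed: OpenAI-style {"error": {...}} envelope
    if not isinstance(error, Mapping):
        return False
    marker = "max_latency_exceeded"
    return marker in (error.get("code"), error.get("type"))


# Assumed body shapes, for illustration only.
timed_out = {"error": {"code": "max_latency_exceeded", "message": "budget exceeded"}}
print(is_latency_budget_error(504, timed_out))
print(is_latency_budget_error(504, {"error": {"code": "upstream_unavailable"}}))
```

With the OpenAI Python SDK, a 5xx surfaces as an `openai.APIStatusError` subclass carrying `status_code` and `body`. The SDK also retries 5xx responses by default, so a strict per-step budget likely calls for `max_retries=0` on the client.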

Tool-call ID schemas differ

Automatic tool-call ID translation

Anthropic emits toolu_*, OpenAI emits call_*. Agent code that asserts id.startswith("call_") breaks on Anthropic. GammaInfra translates both directions at the OpenAI-compat boundary so the same agent loop works against any provider.
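
As an illustration of what two-way translation involves, here is a sketch that keeps a per-conversation mapping table. The class name and the table-based approach are assumptions made for the example; the page does not describe how the gateway performs the translation internally.

```python
import uuid
from typing import Dict


class ToolCallIdTranslator:
    """Maps provider-native tool-call IDs to call_* IDs and back, per conversation."""

    def __init__(self) -> None:
        self._to_openai: Dict[str, str] = {}
        self._to_native: Dict[str, str] = {}

    def outbound(self, native_id: str) -> str:
        """Provider -> client: return a stable call_* ID for a native ID."""
        if native_id.startswith("call_"):
            return native_id  # already in the OpenAI shape
        if native_id not in self._to_openai:
            openai_id = "call_" + uuid.uuid4().hex[:24]
            self._to_openai[native_id] = openai_id
            self._to_native[openai_id] = native_id
        return self._to_openai[native_id]

    def inbound(self, openai_id: str) -> str:
        """Client -> provider: restore the native ID on a tool-result message."""
        return self._to_native.get(openai_id, openai_id)


translator = ToolCallIdTranslator()
client_id = translator.outbound("toolu_01ABC")
print(client_id.startswith("call_"), translator.inbound(client_id))
```

The inbound lookup is what lets a tool result sent with a call_* ID reach the provider under the ID that provider originally issued.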

Streaming tool-call index quirks

0-based per-stream tool indexing

Anthropic streams emit absolute content-block index (often 1+ when text precedes the tool_use). OpenAI clients expect 0-based per-stream. GammaInfra normalizes the index on streaming deltas — your parallel tool-call accumulator code is portable across providers.
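
The renumbering itself is small. The sketch below assumes the caller feeds in only the block indexes of tool_use deltas, in arrival order; `ToolIndexNormalizer` is an illustrative name, not a GammaInfra API.

```python
from typing import Dict


class ToolIndexNormalizer:
    """Renumbers absolute content-block indexes to 0-based tool-call indexes."""

    def __init__(self) -> None:
        self._seen: Dict[int, int] = {}

    def normalize(self, block_index: int) -> int:
        # The first tool_use block seen becomes 0, the next new one 1, and so on.
        if block_index not in self._seen:
            self._seen[block_index] = len(self._seen)
        return self._seen[block_index]


# Text occupies block 0, so two tool_use blocks arrive as absolute indexes 1 and 2.
normalizer = ToolIndexNormalizer()
print([normalizer.normalize(i) for i in (1, 1, 2, 2, 1)])  # [0, 0, 1, 1, 0]
```

One normalizer instance per stream keeps the numbering per-stream, which is what an OpenAI-style delta accumulator keyed on `index` expects.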

Multi-model agents are painful

One API key for every major LLM

OpenAI, Anthropic, Google, Mistral, Groq, DeepSeek, xAI, and Amazon Bedrock, all behind one sk-gammainfra-* key. Pin a provider model directly (openai/gpt-5-mini), use a logical name (claude-opus-4-7), or let the router decide. Add a provider and your code doesn't change.

Compliance and residency

Region and provider constraints per call

Add X-GammaInfra-Region: eu to constrain endpoint selection to a region group, or pass provider.only: ["bedrock"] in the request body for strict per-provider routing. The served endpoint and region echo back in response headers for audit.
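
A small helper can keep these constraints in one place. In the sketch below, `residency_kwargs` is a hypothetical name, and the nested `{"provider": {"only": [...]}}` body shape is inferred from the dotted `provider.only` notation above rather than confirmed. `extra_headers` and `extra_body` are the OpenAI Python SDK's standard per-request options.

```python
from typing import Any, Dict, List, Optional


def residency_kwargs(
    region: Optional[str] = None,
    only_providers: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build extra_headers / extra_body for one constrained call."""
    kwargs: Dict[str, Any] = {}
    if region:
        kwargs["extra_headers"] = {"X-GammaInfra-Region": region}
    if only_providers:
        # Assumed body shape for provider.only: {"provider": {"only": [...]}}
        kwargs["extra_body"] = {"provider": {"only": list(only_providers)}}
    return kwargs


print(residency_kwargs(region="eu", only_providers=["bedrock"]))
```

The result can be splatted into a call, for example `client.chat.completions.create(model="gammainfra/auto", messages=..., **residency_kwargs(region="eu"))`.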

Cost vs quality dial per call

Continuous cost-quality preference

Set X-GammaInfra-Cost-Quality: 0.0..1.0 per call — 0.0 biases toward the best model, 1.0 toward the cheapest. Run the planner step at 0.2 and the extractor at 0.8 in the same loop without swapping models in your agent code.

Drop-in

The framework you already use.

Each example uses the OpenAI SDK shape — pass base_url="https://api.gammainfra.com/v1" and the routing happens server-side. Frameworks that accept a custom base URL drop in unchanged.

openai_sdk_agent_loop.py

# The OpenAI Python SDK is the lowest-common-denominator client for
# agent loops. Point base_url at GammaInfra and the same code routes
# across every major provider with per-step header controls.
from openai import OpenAI

client = OpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
)

# Planner step: bias toward reasoning, give it a 30s budget.
plan = client.chat.completions.create(
  model="gammainfra/auto",
  messages=[{"role": "user", "content": "Plan the steps to refactor this module..."}],
  extra_headers={
    "X-GammaInfra-Cost-Quality": "0.2",        # quality-biased
    "X-GammaInfra-Max-Latency-Ms": "30000",    # cancel if upstream tails
  },
)

# Extractor step in the same loop: bias toward cheap.
extract = client.chat.completions.create(
  model="gammainfra/auto",
  messages=[{"role": "user", "content": "Extract function names as JSON..."}],
  extra_headers={"X-GammaInfra-Cost-Quality": "0.8"},   # cost-biased
)

# Every response carries the cost split + which model served you:
#   X-GammaInfra-Endpoint:           deepseek/deepseek-v4-pro
#   X-GammaInfra-Input-Cost-USD:     0.000034
#   X-GammaInfra-Output-Cost-USD:    0.000128
#   X-GammaInfra-Fallback-Chain:     deepseek-v4-pro,gpt-5.4,claude-opus-4-6

# LangGraph nodes accept any ChatOpenAI-compatible client. Point at
# GammaInfra and per-node routing comes from headers, not model swaps.
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState

planner = ChatOpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  model="gammainfra/auto",
  default_headers={"X-GammaInfra-Preference": "quality"},
)

extractor = ChatOpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  model="gammainfra/auto",
  default_headers={
    "X-GammaInfra-Preference": "cost",
    "X-GammaInfra-Max-Latency-Ms": "5000",
  },
)

def plan_node(state: MessagesState):
  return {"messages": [planner.invoke(state["messages"])]}

def extract_node(state: MessagesState):
  return {"messages": [extractor.invoke(state["messages"])]}

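graph = StateGraph(MessagesState)
graph.add_node("plan", plan_node)
graph.add_node("extract", extract_node)
graph.set_entry_point("plan")      # the graph needs an entry node to compile
graph.add_edge("plan", "extract")
graph.set_finish_point("extract")  # end the run after extraction
app = graph.compile()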

# Different nodes route to different models without ever importing
# a second SDK. Header changes; same OpenAI shape underneath.

# OpenAI Agents SDK wraps a custom OpenAI client in OpenAIChatCompletionsModel.
# Per-call routing goes through ModelSettings(extra_headers=...) on a RunConfig.
from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel
from agents.model_settings import ModelSettings
from agents.run import RunConfig

client = AsyncOpenAI(
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
)

agent = Agent(
  name="research_agent",
  instructions="Plan, search, summarize.",
  model=OpenAIChatCompletionsModel(
    model="gammainfra/auto",
    openai_client=client,
  ),
)

# Per-run constraint: never spend more than 8 seconds on one upstream call.
run_config = RunConfig(
  model_settings=ModelSettings(
    extra_headers={"X-GammaInfra-Max-Latency-Ms": "8000"},
  ),
)

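import asyncio  # Runner.run is a coroutine; drive it from an event loop

async def main():
  result = await Runner.run(
    agent,
    input="Find the three biggest changes in HTTP/3 vs HTTP/2.",
    run_config=run_config,
  )
  print(result.final_output)

asyncio.run(main())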

# The agent step gets routed to the best tool-capable model
# in the moment. Tool-call IDs are normalized to call_* on the way out,
# so the OpenAI SDK's assertions hold across providers.

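# AutoGen's OpenAIChatCompletionClient takes base_url + default_headers.
# Wire one client per persona, then assemble agents on top of those clients.
# Note: for model names AutoGen does not recognize, the client may also need a
# model_info argument describing the model's capabilities; check the AutoGen docs.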
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_agentchat.agents import AssistantAgent

# Reasoning persona — quality-biased.
planner_model = OpenAIChatCompletionClient(
  model="gammainfra/auto",
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  default_headers={
    "X-GammaInfra-Cost-Quality": "0.2",
    "X-GammaInfra-Max-Latency-Ms": "30000",
  },
)

# Extraction persona — cost-biased, tighter latency budget.
extractor_model = OpenAIChatCompletionClient(
  model="gammainfra/auto",
  base_url="https://api.gammainfra.com/v1",
  api_key="sk-gammainfra-...",
  default_headers={
    "X-GammaInfra-Cost-Quality": "0.8",
    "X-GammaInfra-Max-Latency-Ms": "5000",
  },
)

planner = AssistantAgent("planner", model_client=planner_model)
extractor = AssistantAgent("extractor", model_client=extractor_model)

# Different agents land on different models without ever swapping SDKs.
# Each client carries its own per-step headers; GammaInfra routes accordingly.

Anything that takes a custom base URL works.

Transparent decisions

Every routing choice is auditable.

When an agent loop misbehaves, the debug path is response headers — not a black-box dashboard you have to wait for. Every step tells you exactly which model served it, what it cost, and what the router considered.

X-GammaInfra-Endpoint

The physical provider/model that served this call, even if you used a logical name like claude-opus-4-7 that can resolve to either the native provider API or Bedrock.

X-GammaInfra-Cost-USD

Total cost in USD. Split available as X-GammaInfra-Input-Cost-USD + X-GammaInfra-Output-Cost-USD.

X-GammaInfra-Fallback-Chain

Comma-separated list of every endpoint the router attempted. If the primary failed mid-call, the chain shows what it cascaded through.

X-GammaInfra-Fallback-Reason

Why the router moved off the primary — rate limit, 5xx, timeout, or budget breach. Typed strings, not free-form text.

X-GammaInfra-Logical-Model

The task class the classifier picked, for example reasoning, code, extraction, or chat. Useful for debugging when an agent step lands on a model you didn't expect.

X-GammaInfra-Region-Used

AWS region of the served endpoint when applicable (e.g., Bedrock). Echoes back when you constrained selection with X-GammaInfra-Region.

X-GammaInfra-Router-Version

Which routing path served the call: v2, v2_keyword, v2_hedged, or v1 fallback. Tells you exactly which decision path fired.

X-RateLimit-Remaining

Standard rate-limit headers (Limit, Remaining, Reset) on every response. Your agent's rate-limit accountant doesn't have to guess.

Prompts are never logged. Headers carry decisions and costs only.
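
To turn these headers into a step trace, something like the following sketch could sit next to each model call. The function name and the flattened key names are illustrative choices; the header names are the ones listed above, and the fallback chain is split on commas as described.

```python
from typing import Dict, Mapping

PREFIX = "x-gammainfra-"
AUDIT_HEADERS = (
    "X-GammaInfra-Endpoint",
    "X-GammaInfra-Cost-USD",
    "X-GammaInfra-Input-Cost-USD",
    "X-GammaInfra-Output-Cost-USD",
    "X-GammaInfra-Fallback-Chain",
    "X-GammaInfra-Fallback-Reason",
    "X-GammaInfra-Logical-Model",
    "X-GammaInfra-Region-Used",
    "X-GammaInfra-Router-Version",
)


def routing_audit_record(step: str, headers: Mapping[str, str]) -> Dict[str, object]:
    """Collect the routing headers for one step into a flat, loggable dict."""
    lowered = {k.lower(): v for k, v in headers.items()}
    record: Dict[str, object] = {"step": step}
    for name in AUDIT_HEADERS:
        value = lowered.get(name.lower())
        if value is None:
            continue  # not every header is present on every response
        key = name.lower()[len(PREFIX):].replace("-", "_")
        if key == "fallback_chain":
            record[key] = [part.strip() for part in value.split(",") if part.strip()]
        else:
            record[key] = value
    return record


record = routing_audit_record("plan", {
    "X-GammaInfra-Endpoint": "deepseek/deepseek-v4-pro",
    "X-GammaInfra-Fallback-Chain": "deepseek-v4-pro,gpt-5.4,claude-opus-4-6",
})
print(record)
```

Emitting one such record per step gives a log that can be grouped by endpoint or task class without touching prompt content.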

FAQ

Common questions about agents on smart routing.

How does GammaInfra help with agent loops?

Every step in an agent loop can request a different model: reasoning steps to gammainfra/auto (quality-preferred), extraction steps to gammainfra/cheap, tool calls pinned to a specific provider. The router picks per call. A real research-loop example reduces cost from $0.245 to $0.029 (roughly 8× cheaper) by mixing models per step rather than running everything through one flagship.

Does GammaInfra work with the OpenAI Agents SDK, LangGraph, AutoGen, Mastra, and Letta?

Yes, with any agent framework that accepts a custom OpenAI base URL. The code samples above cover the OpenAI SDK, the OpenAI Agents SDK (OpenAIChatCompletionsModel + RunConfig.model_settings.extra_headers), LangGraph (ChatOpenAI with default_headers per persona), and AutoGen (OpenAIChatCompletionClient with default_headers); Mastra and Letta connect the same way, through a custom base URL. Per-persona model pinning is via headers or by passing different model names per step.

How do I set a max latency budget per agent step?

Set X-GammaInfra-Max-Latency-Ms: <ms> on the step's chat completion call (range 60 to 600 000). If the upstream provider call exceeds the budget, GammaInfra cancels it and returns a 504 max_latency_exceeded response. This prevents one slow provider from holding up an entire agent run.
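
Since the accepted range is stated, it can be cheaper to reject a bad budget client-side than to discover it mid-run. The sketch below is one way to do that. The helper and constant names are illustrative, and whether the gateway itself rejects or clamps out-of-range values is not stated here.

```python
from typing import Dict

MIN_BUDGET_MS = 60
MAX_BUDGET_MS = 600_000


def latency_budget_header(budget_ms: int) -> Dict[str, str]:
    """Return the max-latency header for one step, rejecting out-of-range values."""
    if not MIN_BUDGET_MS <= budget_ms <= MAX_BUDGET_MS:
        raise ValueError(
            f"budget_ms must be between {MIN_BUDGET_MS} and {MAX_BUDGET_MS}, got {budget_ms}"
        )
    return {"X-GammaInfra-Max-Latency-Ms": str(budget_ms)}


print(latency_budget_header(5000))
```

The returned dict can be passed directly as `extra_headers` on a chat completion call.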

What happens if tool_call.id shapes differ between providers?

GammaInfra translates Anthropic's toolu_* IDs to OpenAI's call_* shape (and the reverse on the inbound tool-result direction) at the gateway boundary. Same-provider conversations round-trip cleanly. Cross-provider mid-conversation continuity isn't possible (each provider validates IDs it issued) — design agent loops to commit to one provider per session for tool-heavy work, then switch providers between sessions.

Can I see per-step cost in agent logs?

Yes — every step response carries X-GammaInfra-Cost-USD, X-GammaInfra-Input-Cost-USD, X-GammaInfra-Output-Cost-USD, and X-GammaInfra-Endpoint. Log these alongside your step trace and you get full per-step cost attribution without a separate accounting pipeline.

Does GammaInfra hedge requests for lower agent-step latency?

On gammainfra/fast with hedging enabled in production, the router fires the top-2 endpoints in parallel and takes the first success, cancelling the loser. Counters are exposed at /metrics. Streaming hedging is deferred to a later release; currently rolling out per-customer on request.
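
For intuition, the fire-in-parallel, first-success-wins pattern looks roughly like this in asyncio. It is a client-side illustration of the general technique, not the router's code; `hedged_call`, `fake_send`, and the simulated latencies are all made up for the example.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def hedged_call(
    endpoints: Sequence[str],
    send: Callable[[str], Awaitable[str]],
) -> str:
    """Fire every endpoint at once, return the first success, cancel the rest."""
    tasks = [asyncio.ensure_future(send(endpoint)) for endpoint in endpoints]
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("all hedged endpoints failed")
    finally:
        for task in tasks:
            task.cancel()  # no-op for finished tasks, cancels the loser


async def fake_send(endpoint: str) -> str:
    # Hypothetical latencies: the "fast" endpoint answers first.
    await asyncio.sleep(0.01 if endpoint == "fast" else 0.2)
    return f"answer from {endpoint}"


print(asyncio.run(hedged_call(["slow", "fast"], fake_send)))
```

A failed endpoint does not end the race: the loop keeps waiting on the remaining task, so hedging also covers the case where the faster endpoint errors out.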

Smart routing for agent loops.
One model per step. One API.

Pass-through token rates. Pay on top-ups, not requests.

Managed

BYOK

Common questions about agents on smart routing.

Start routing in under a minute.
