Use LangChain (and LlamaIndex) with GammaInfra
If your chain code uses ChatOpenAI or OpenAI, you can route every chain through GammaInfra by changing one parameter: base_url. Existing chains work unchanged. You just start seeing per-request cost in response headers and get automatic fallback across providers.
The pain
LangChain and LlamaIndex apps using more than one provider typically end up with:
- Multiple client instances (ChatOpenAI(), ChatAnthropic(), ChatGoogleGenerativeAI()), each with its own SDK version, retry logic, and rate-limit handling.
- Provider-specific quirks leaking through LangChain's wrappers (tool-call ID shapes, streaming chunk shapes, system-prompt placement).
- No unified cost picture. Each provider reports differently in its dashboard, and aggregating them requires custom code.
- Fallback strategies hand-rolled inside chains: except RateLimitError blocks that switch providers and re-instantiate clients.
What changes with GammaInfra
Use one ChatOpenAI instance, pointed at GammaInfra. Behind it, GammaInfra talks to every major LLM provider, classifies prompts, picks the best-fit model, falls back when one provider throttles, and reports cost in response headers. Your LangChain code doesn't know or care.
LangChain setup
1. Install
pip install -U langchain-openai
You don't need langchain-anthropic, langchain-google-genai, etc. — the OpenAI client is all you need, since GammaInfra speaks OpenAI's protocol to every backend.
2. Construct the client
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",  # your GammaInfra key
    model="gammainfra/auto",  # or any specific model
)
response = llm.invoke("Explain hedged requests in three sentences.")
print(response.content)
3. Pick a model name
Some choices:
- gammainfra/auto: task-aware routing. Reasonable default. GammaInfra classifies your prompt and picks an appropriate model.
- gammainfra/cheap: cost-optimized.
- gammainfra/fast: latency-optimized, with hedging when enabled.
- openai/gpt-5, openai/gpt-5-mini, openai/gpt-5-nano: OpenAI direct pins.
- anthropic/claude-opus-4-7, anthropic/claude-sonnet-4-6, anthropic/claude-haiku-4-5: Anthropic direct pins.
- google/gemini-3.1-pro-preview, google/gemini-3-flash-preview: Google.
- deepseek/deepseek-v4-pro, deepseek/deepseek-v4-flash: DeepSeek.
- mistral/mistral-large-2512, mistral/devstral-2512: Mistral.
- groq/llama-3.3-70b-versatile, groq/llama-3.1-8b-instant: Llama via Groq.
- bedrock/us.anthropic.claude-opus-4-7: Claude Opus via Amazon Bedrock (cross-cloud reliability).
Full list at GET /v1/models.
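To see which pins are available to your key, you can fetch the model list and group the IDs by provider prefix. The following is a minimal sketch, assuming the endpoint returns OpenAI-style model objects with an id field and that IDs follow the provider/model pattern shown above. group_by_provider and list_model_ids are hypothetical helpers written for this example, not part of any SDK.

```python
from collections import defaultdict


def group_by_provider(model_ids):
    """Group 'provider/model' IDs by the prefix before the first slash."""
    groups = defaultdict(list)
    for model_id in model_ids:
        provider, _, name = model_id.partition("/")
        # IDs without a slash are filed under "unknown" rather than dropped.
        if not name:
            provider, name = "unknown", model_id
        groups[provider].append(name)
    return {provider: sorted(names) for provider, names in groups.items()}


def list_model_ids(client):
    """Return model IDs from an OpenAI-compatible client (makes a network call).

    `client` is expected to be an `openai.OpenAI` instance created with
    base_url="https://api.gammainfra.com/v1" and your GammaInfra key.
    """
    return [model.id for model in client.models.list()]


# Offline example using a few IDs from the list above:
sample_ids = ["openai/gpt-5", "openai/gpt-5-mini", "anthropic/claude-haiku-4-5"]
for provider, names in sorted(group_by_provider(sample_ids).items()):
    print(provider, names)
```

With a real client, group_by_provider(list_model_ids(client)) would produce the same kind of mapping for whatever the endpoint actually returns.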
Tool calling, structured output, streaming — work unchanged
Everything LangChain does on top of the OpenAI protocol works through GammaInfra:
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get the current weather in a city."""
    return f"It's 72°F and sunny in {city}."

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="anthropic/claude-opus-4-7",  # pin Anthropic for this chain
)
llm_with_tools = llm.bind_tools([get_weather])
response = llm_with_tools.invoke("What's the weather in Tokyo?")
print(response.tool_calls)
Anthropic uses toolu_* IDs while OpenAI uses call_*. GammaInfra translates these at the boundary so LangChain's tool loop round-trips correctly whether the underlying model is OpenAI or Anthropic. You won't notice the translation unless you're inspecting raw IDs.
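The round trip works as long as your own code treats the tool-call ID as an opaque string and echoes it back unchanged. The sketch below illustrates that step in plain Python, assuming the shape LangChain exposes on AIMessage.tool_calls (dicts with "name", "args", and "id" keys). run_tool_calls is a hypothetical helper, and the ID in the example is made up.

```python
def run_tool_calls(tool_calls, registry):
    """Execute each requested tool and pair the result with its call ID.

    `tool_calls` is a list of dicts with "name", "args", and "id" keys.
    `registry` maps tool names to plain Python callables.
    """
    results = []
    for call in tool_calls:
        func = registry.get(call["name"])
        if func is None:
            output = f"Unknown tool: {call['name']}"
        else:
            output = str(func(**call["args"]))
        # The ID is passed through untouched, whatever its prefix looks like.
        results.append((call["id"], output))
    return results


def get_weather(city: str) -> str:
    return f"It's 72°F and sunny in {city}."


# Hand-written tool call standing in for a model response:
calls = [{"name": "get_weather", "args": {"city": "Tokyo"}, "id": "call_abc123"}]
print(run_tool_calls(calls, {"get_weather": get_weather}))
```

In a LangChain loop, each (id, output) pair would typically become a ToolMessage(content=output, tool_call_id=id) appended to the message history before the next invoke.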
Structured output
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class WeatherReport(BaseModel):
    city: str
    temperature_f: int
    conditions: str

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/auto",
)
structured = llm.with_structured_output(WeatherReport)
result = structured.invoke("What's the weather in Tokyo?")
print(result.temperature_f)
GammaInfra honors response_format across every provider — even ones whose native API doesn't support JSON mode directly. Anthropic and Amazon Nova use a system-prompt + stop-sequence trick; Mistral and Cohere use their native JSON mode. The output reaches LangChain in the expected shape.
Streaming
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/fast",
)
for chunk in llm.stream("Write a haiku about hedged requests."):
    print(chunk.content, end="", flush=True)
GammaInfra streams Server-Sent Events (SSE) in OpenAI's chunk format regardless of the underlying provider. Anthropic, Google, and Bedrock chunks are transformed on the fly. LangChain's streaming iterator works as-is.
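If you ever need to inspect the stream without LangChain, the OpenAI chunk format is simple to read by hand: each event is a data: line carrying a JSON chunk, and the stream ends with data: [DONE]. The sketch below parses such lines with the standard library only. The sample lines are hand-written for illustration, not captured from GammaInfra, and iter_sse_content is a hypothetical helper.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from OpenAI-style chat-completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separator lines and any non-data SSE fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk usually carries only the role, so content may be absent.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Two requests race;"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " first wins."}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```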
LlamaIndex setup
Same pattern with LlamaIndex's OpenAI LLM wrapper:
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/auto",
)
response = llm.complete("Explain task-aware routing in two sentences.")
print(response.text)
Embeddings: GammaInfra doesn't currently offer an embedding endpoint (the gateway focuses on chat completions). For embeddings, point LlamaIndex's embedding model directly at your provider — OpenAIEmbedding(api_key=...) against api.openai.com works as normal.
Bonus features only GammaInfra adds
Headers you can pass via default_headers on ChatOpenAI (or extra_headers on an individual call in the underlying OpenAI client):
- X-GammaInfra-Cost-Quality: 0.0..1.0. Continuous dial. 0.0 = pure quality, 1.0 = pure cost. Echoed back in X-GammaInfra-Cost-Quality-Applied.
- X-GammaInfra-Max-Latency-Ms: 30000. Hard latency budget. Returns 504 if the underlying provider exceeds it. Useful for time-sensitive workflows where you'd rather fail fast than wait.
- X-GammaInfra-Region: us|eu|apac. Constrains the served endpoint to a region. For data-residency-sensitive workloads when paired with provider.only=["bedrock"].
- X-GammaInfra-Routing: off. Disables v2 routing and forces the literal model name through.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/auto",
    # default_headers is sent with every request made by this instance
    default_headers={
        "X-GammaInfra-Cost-Quality": "0.3",
        "X-GammaInfra-Max-Latency-Ms": "20000",
    },
)
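Because header values are plain strings, a typo or an out-of-range number is easy to miss until a request behaves oddly. One option is to build the header dict through a small validating helper. The sketch below is a hypothetical convenience function, not part of GammaInfra or LangChain; the accepted ranges and region codes are taken from the list above.

```python
ALLOWED_REGIONS = {"us", "eu", "apac"}


def routing_headers(cost_quality=None, max_latency_ms=None, region=None):
    """Build a dict of GammaInfra routing headers, validating each value."""
    headers = {}
    if cost_quality is not None:
        if not 0.0 <= cost_quality <= 1.0:
            raise ValueError("cost_quality must be between 0.0 and 1.0")
        headers["X-GammaInfra-Cost-Quality"] = str(cost_quality)
    if max_latency_ms is not None:
        if max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be positive")
        headers["X-GammaInfra-Max-Latency-Ms"] = str(int(max_latency_ms))
    if region is not None:
        if region not in ALLOWED_REGIONS:
            raise ValueError(f"region must be one of {sorted(ALLOWED_REGIONS)}")
        headers["X-GammaInfra-Region"] = region
    return headers


print(routing_headers(cost_quality=0.3, max_latency_ms=20000))
```

The returned dict can be passed wherever you set request headers on the client.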
Reading the cost header
LangChain doesn't expose response headers by default. To read X-GammaInfra-Cost-USD directly from a single call, drop down one layer to the OpenAI client:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
)
resp = client.with_raw_response.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": "Hello"}],
)
print("Cost:", resp.http_response.headers["X-GammaInfra-Cost-USD"])
print("Endpoint:", resp.http_response.headers["X-GammaInfra-Endpoint"])
print("Fallback chain:", resp.http_response.headers["X-GammaInfra-Fallback-Chain"])

# parse the completion
completion = resp.parse()
print(completion.choices[0].message.content)
For aggregate cost across many LangChain calls, the dashboard at dashboard.gammainfra.com shows per-request itemization and daily/weekly rollups.
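If you want a running total inside your own process as well, you can accumulate the header value call by call. The sketch below assumes X-GammaInfra-Cost-USD carries a plain decimal string such as "0.0012"; that format is an assumption, so check a real response before relying on it. CostTracker is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation


class CostTracker:
    """Accumulate per-request cost from response headers."""

    def __init__(self):
        self.total = Decimal("0")
        self.calls = 0

    def record(self, headers):
        """Add one response's cost; return it, or None if missing or unparsable."""
        raw = headers.get("X-GammaInfra-Cost-USD")
        if raw is None:
            return None
        try:
            cost = Decimal(raw)
        except InvalidOperation:
            return None
        self.total += cost
        self.calls += 1
        return cost


tracker = CostTracker()
# Made-up header values for illustration:
tracker.record({"X-GammaInfra-Cost-USD": "0.0012"})
tracker.record({"X-GammaInfra-Cost-USD": "0.0030"})
print(tracker.calls, tracker.total)
```

With the raw-response pattern above, you would call tracker.record(resp.http_response.headers) after each request. Decimal is used instead of float so that many small amounts sum without rounding drift.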
A/B compare providers without rewriting chains
One genuine win: you can compare providers head-to-head on your own eval suite without re-instantiating clients or installing more SDKs.
from langchain_openai import ChatOpenAI

def eval_chain(model_name: str):
    llm = ChatOpenAI(
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
        model=model_name,
    )
    # ...rest of your eval logic, unchanged...
    return scores

for model in [
    "openai/gpt-5-mini",
    "anthropic/claude-haiku-4-5",
    "google/gemini-3-flash-preview",
    "deepseek/deepseek-v4-flash",
]:
    print(model, eval_chain(model))
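The eval logic itself is up to you. As one possible shape, the sketch below scores each model by exact-match accuracy over a shared set of (prompt, expected) cases. The scoring rule and the function names are illustrative assumptions, and the model call is injected as a plain callable so the harness can be tried offline with a stub.

```python
def exact_match_accuracy(ask, cases):
    """Fraction of (prompt, expected) cases where the answer matches exactly.

    Comparison ignores surrounding whitespace and letter case.
    """
    if not cases:
        return 0.0
    hits = sum(
        1
        for prompt, expected in cases
        if ask(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def compare_models(models, cases, make_ask):
    """Score every model on the same cases.

    `make_ask(model_name)` must return a callable mapping a prompt string to
    an answer string, for example a thin wrapper around llm.invoke(...).content.
    """
    return {model: exact_match_accuracy(make_ask(model), cases) for model in models}


# Offline demonstration with a stub instead of a real model call:
cases = [("2+2?", "4"), ("Capital of France?", "Paris")]
stub = lambda model: (lambda prompt: "4" if prompt == "2+2?" else "paris")
print(compare_models(["stub-model"], cases, stub))
```

Exact match is a deliberately crude metric; swap in whatever scoring your eval suite already uses.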
Trade-offs to know about
- Latency. ~10–50 ms overhead vs going direct. Negligible for chain workloads; can be net-negative for gammainfra/fast with hedging enabled.
- Cost. 3% top-up fee during the launch window (5% after). Pass-through provider rates on tokens, with no markup. BYOK alternative at 1–2% per request.
- No embeddings yet. GammaInfra focuses on chat completions. Embeddings continue to go direct to your provider.
- Privacy. Prompts and responses aren't logged by default. See privacy policy.
Ready to try it?
$3 free trial credit on signup, $10 minimum top-up. Pass-through provider token rates plus 3% top-up fee during the launch window (5% after 2026-06-23).
Frequently asked questions
How do I configure ChatOpenAI() to use GammaInfra?
Pass base_url='https://api.gammainfra.com/v1' and api_key='sk-gammainfra-...' to ChatOpenAI(). Set model='gammainfra/auto' (or any specific model). All standard LangChain features (streaming, callbacks, structured output via with_structured_output, tool binding) work unchanged because the wire format is identical.
Does GammaInfra work with LangChain Expression Language (LCEL)?
Yes. LCEL composes ChatOpenAI() through .pipe() chains the same way regardless of base URL. Async, streaming, parallelism via .with_config, and batch invocations via .batch all work. The ChatOpenAI instance is the only thing that changes; everything downstream of it is provider-agnostic.
Can I see per-call cost in LangSmith?
Not automatically. LangSmith does not pick up X-GammaInfra-Cost-USD headers from custom-endpoint calls. To get per-call cost into LangSmith, attach a custom callback that reads the response headers from the ChatOpenAI response and logs them as run metadata. GammaInfra's dashboard always has the authoritative cost.
How do I switch between models per chain step?
Instantiate one ChatOpenAI per step with different model strings, e.g. extract_llm = ChatOpenAI(model='gammainfra/cheap'), reason_llm = ChatOpenAI(model='anthropic/claude-opus-4-7'). Compose them in an LCEL chain with .pipe(). For per-step header control, use the default_headers parameter to set X-GammaInfra-Cost-Quality per ChatOpenAI instance.
Does GammaInfra support LangChain's streaming callbacks?
Yes. Pass streaming=True to ChatOpenAI() and attach a BaseCallbackHandler. on_llm_new_token fires per token chunk as expected. The gateway forwards the upstream SSE stream verbatim (after wire-format normalization) so callback timing matches what you'd see calling OpenAI directly.