Use LangChain (and LlamaIndex) with GammaInfra

If your chain code uses ChatOpenAI or OpenAI, you can route every chain through GammaInfra by changing how the client is constructed: point base_url at GammaInfra and pass your GammaInfra API key. Existing chains work unchanged. You also get per-request cost in response headers and automatic fallback across providers.

The pain

LangChain and LlamaIndex apps using more than one provider typically end up with:

Multiple client instances — ChatOpenAI(), ChatAnthropic(), ChatGoogleGenerativeAI() — each with its own SDK version, retry logic, and rate-limit handling.
Provider-specific quirks leaking through LangChain's wrappers (tool-call ID shapes, streaming chunk shapes, system-prompt placement).
No unified cost picture. Each provider reports differently in its dashboard, and aggregating them requires custom code.
Fallback strategies hand-rolled inside chains — except RateLimitError blocks that switch providers and re-instantiate clients.

What changes with GammaInfra

Use one ChatOpenAI instance, pointed at GammaInfra. Behind it, GammaInfra talks to every major LLM provider, classifies prompts, picks the best-fit model, falls back when one provider throttles, and reports cost in response headers. Your LangChain code doesn't know or care.

LangChain setup

1. Install

pip install -U langchain-openai

You don't need langchain-anthropic, langchain-google-genai, etc. — the OpenAI client is all you need, since GammaInfra speaks OpenAI's protocol to every backend.

2. Construct the client

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",       # your GammaInfra key
    model="gammainfra/auto",            # or any specific model
)

response = llm.invoke("Explain hedged requests in three sentences.")
print(response.content)

3. Pick a model name

Some choices:

gammainfra/auto — task-aware routing. Reasonable default. GammaInfra classifies your prompt and picks an appropriate model.
gammainfra/cheap — cost-optimized.
gammainfra/fast — latency-optimized with hedging when enabled.
openai/gpt-5, openai/gpt-5-mini, openai/gpt-5-nano — OpenAI direct pins.
anthropic/claude-opus-4-7, anthropic/claude-sonnet-4-6, anthropic/claude-haiku-4-5 — Anthropic direct pins.
google/gemini-3.1-pro-preview, google/gemini-3-flash-preview — Google.
deepseek/deepseek-v4-pro, deepseek/deepseek-v4-flash — DeepSeek.
mistral/mistral-large-2512, mistral/devstral-2512 — Mistral.
groq/llama-3.3-70b-versatile, groq/llama-3.1-8b-instant — Llama via Groq.
bedrock/us.anthropic.claude-opus-4-7 — Claude Opus via Amazon Bedrock (cross-cloud reliability).

Full list at GET /v1/models.

Tool calling, structured output, streaming — work unchanged

Everything LangChain does on top of the OpenAI protocol works through GammaInfra:

from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get the current weather in a city."""
    return f"It's 72°F and sunny in {city}."

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="anthropic/claude-opus-4-7",   # pin Anthropic for this chain
)

llm_with_tools = llm.bind_tools([get_weather])
response = llm_with_tools.invoke("What's the weather in Tokyo?")
print(response.tool_calls)

Tool-call IDs: Anthropic uses toolu_* IDs while OpenAI uses call_*. GammaInfra translates these at the boundary so LangChain's tool-loop round-trips correctly whether the underlying model is OpenAI or Anthropic. You won't notice the translation unless you're inspecting raw IDs.
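
The round trip that depends on those IDs can be sketched without a network call. The example below assumes the tool_calls shape LangChain exposes on an AIMessage (a list of dicts with name, args, and id) and builds OpenAI-style tool-result messages from it. TOOLS, run_tool_calls, and the sample ID are hypothetical names used only for illustration.

```python
from typing import Any, Callable


def get_weather(city: str) -> str:
    """Stand-in for the tool defined in the example above."""
    return f"It's 72°F and sunny in {city}."


# Hypothetical registry mapping tool names to plain Python callables.
TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Execute each requested tool and echo its call ID back in the result."""
    results = []
    for call in tool_calls:
        output = TOOLS[call["name"]](**call["args"])
        # The tool_call_id must match the ID the model issued, which is why
        # a consistent ID format across providers matters for the tool loop.
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": output}
        )
    return results


# Hand-written sample in the shape of response.tool_calls.
sample_calls = [
    {"name": "get_weather", "args": {"city": "Tokyo"}, "id": "call_abc123"}
]
messages = run_tool_calls(sample_calls)
```

Each result message carries the same ID the model issued, so the next model turn can pair the result with its request.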

Structured output

from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class WeatherReport(BaseModel):
    city: str
    temperature_f: int
    conditions: str

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/auto",
)

structured = llm.with_structured_output(WeatherReport)
result = structured.invoke("What's the weather in Tokyo?")
print(result.temperature_f)

GammaInfra honors response_format across every provider — even ones whose native API doesn't support JSON mode directly. Anthropic and Amazon Nova use a system-prompt + stop-sequence trick; Mistral and Cohere use their native JSON mode. The output reaches LangChain in the expected shape.

Streaming

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/fast",
)

for chunk in llm.stream("Write a haiku about hedged requests."):
    print(chunk.content, end="", flush=True)

GammaInfra streams Server-Sent Events (SSE) in OpenAI's chunk format regardless of the underlying provider. Anthropic, Google, and Bedrock chunks are transformed on the fly. LangChain's streaming iterator works as-is.
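
To make the chunk format concrete, here is a minimal sketch of what a client does with an OpenAI-style SSE stream: read each data: line, stop at the [DONE] sentinel, and concatenate choices[].delta.content. LangChain and the OpenAI SDK handle this for you. The function name iter_content and the sample lines are hand-written for illustration and are not captured from GammaInfra.

```python
import json
from typing import Iterable, Iterator


def iter_content(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style chat completion SSE lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream in OpenAI's chunk shape.
sample_stream = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Two requests race,"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " one wins."}}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_content(sample_stream))
```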

LlamaIndex setup

Same pattern with LlamaIndex's OpenAI LLM wrapper:

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/auto",
)

response = llm.complete("Explain task-aware routing in two sentences.")
print(response.text)

Embeddings: GammaInfra doesn't currently offer an embedding endpoint (the gateway focuses on chat completions). For embeddings, point LlamaIndex's embedding model directly at your provider — OpenAIEmbedding(api_key=...) against api.openai.com works as normal.

Bonus features only GammaInfra adds

Headers you can set through ChatOpenAI's default_headers parameter, which is forwarded to the underlying OpenAI client and sent with every request from that instance:

X-GammaInfra-Cost-Quality: 0.0..1.0 — continuous dial. 0.0 = pure quality, 1.0 = pure cost. Echoed back in X-GammaInfra-Cost-Quality-Applied.
X-GammaInfra-Max-Latency-Ms: 30000 — hard latency budget. Returns 504 if the underlying provider exceeds it. Useful for time-sensitive workflows where you'd rather fail fast than wait.
X-GammaInfra-Region: us|eu|apac — constrain the served endpoint to a region. Useful for data-residency-sensitive workloads when paired with provider.only=["bedrock"].
X-GammaInfra-Routing: off — disable v2 routing and force the literal model name through.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
    model="gammainfra/auto",
    default_headers={                      # sent with every request from this instance
        "X-GammaInfra-Cost-Quality": "0.3",
        "X-GammaInfra-Max-Latency-Ms": "20000",
    },
)
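
If these header values come from configuration, a small helper can catch out-of-range values before a request is sent. This is a minimal sketch: routing_headers and ALLOWED_REGIONS are hypothetical names, the range and region checks mirror the values listed above, and the positive-integer check on the latency budget is an assumption.

```python
from typing import Optional

# Regions listed for X-GammaInfra-Region above.
ALLOWED_REGIONS = {"us", "eu", "apac"}


def routing_headers(
    cost_quality: Optional[float] = None,
    max_latency_ms: Optional[int] = None,
    region: Optional[str] = None,
) -> dict[str, str]:
    """Build a validated dict of GammaInfra routing headers."""
    headers: dict[str, str] = {}
    if cost_quality is not None:
        if not 0.0 <= cost_quality <= 1.0:
            raise ValueError("cost_quality must be between 0.0 and 1.0")
        headers["X-GammaInfra-Cost-Quality"] = str(cost_quality)
    if max_latency_ms is not None:
        if max_latency_ms <= 0:  # assumption: budgets must be positive
            raise ValueError("max_latency_ms must be a positive integer")
        headers["X-GammaInfra-Max-Latency-Ms"] = str(max_latency_ms)
    if region is not None:
        if region not in ALLOWED_REGIONS:
            raise ValueError(f"region must be one of {sorted(ALLOWED_REGIONS)}")
        headers["X-GammaInfra-Region"] = region
    return headers


headers = routing_headers(cost_quality=0.3, max_latency_ms=20000)
```

The resulting dict can be passed wherever the headers are set in the snippet above.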

Reading the cost header

LangChain doesn't expose response headers by default. To read X-GammaInfra-Cost-USD directly from a single call, drop down one layer to the OpenAI client:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.gammainfra.com/v1",
    api_key="sk-gammainfra-...",
)

resp = client.with_raw_response.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": "Hello"}],
)

# Gateway metadata arrives as HTTP response headers; .get() returns None if one is absent.
headers = resp.http_response.headers
print("Cost:", headers.get("X-GammaInfra-Cost-USD"))
print("Endpoint:", headers.get("X-GammaInfra-Endpoint"))
print("Fallback chain:", headers.get("X-GammaInfra-Fallback-Chain"))

# Parse the raw response into the usual ChatCompletion object.
completion = resp.parse()
print(completion.choices[0].message.content)

For aggregate cost across many LangChain calls, the dashboard at dashboard.gammainfra.com shows per-request itemization and daily/weekly rollups.
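
For a quick local tally across calls made with the raw client, the per-request header can be summed in code. The sketch below assumes the X-GammaInfra-Cost-USD value is a plain decimal string such as "0.000120"; that format is an assumption, and the helper names and sample values are hypothetical.

```python
from decimal import Decimal
from typing import Iterable, Mapping, Optional

COST_HEADER = "x-gammainfra-cost-usd"


def cost_from_headers(headers: Mapping[str, str]) -> Optional[Decimal]:
    """Return the per-request cost, or None if the header is absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(COST_HEADER)
    return Decimal(raw) if raw is not None else None


def total_cost(all_headers: Iterable[Mapping[str, str]]) -> Decimal:
    """Sum the cost header across many responses, skipping any without it."""
    costs = (cost_from_headers(h) for h in all_headers)
    return sum((c for c in costs if c is not None), Decimal("0"))


# Hand-written sample header sets, one per response.
sample = [
    {"X-GammaInfra-Cost-USD": "0.000120"},
    {"x-gammainfra-cost-usd": "0.000380"},
    {"Content-Type": "application/json"},  # no cost header on this one
]
total = total_cost(sample)
```

Decimal is used instead of float so that many small per-request amounts add up without rounding drift.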

A/B compare providers without rewriting chains

One genuine win: you can compare providers head-to-head on your own eval suite without re-instantiating clients or installing more SDKs.

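from langchain_openai import ChatOpenAI

def eval_chain(model_name: str):
    llm = ChatOpenAI(
        base_url="https://api.gammainfra.com/v1",
        api_key="sk-gammainfra-...",
        model=model_name,
    )
    # ...rest of your eval logic, unchanged; it should produce `scores`...
    scores = ...  # placeholder: replace with the results of your eval run
    return scores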

for model in [
    "openai/gpt-5-mini",
    "anthropic/claude-haiku-4-5",
    "google/gemini-3-flash-preview",
    "deepseek/deepseek-v4-flash",
]:
    print(model, eval_chain(model))

Trade-offs to know about

Latency. ~10–50 ms overhead vs going direct. Negligible for chain workloads; with gammainfra/fast and hedging enabled, end-to-end latency can come out lower than going direct.
Cost. 3% top-up fee during the launch window (5% after). Pass-through provider rates on tokens — no markup. BYOK alternative at 1–2% per request.
No embeddings yet. GammaInfra focuses on chat completions. Embeddings continue to go direct to your provider.
Privacy. Prompts and responses aren't logged by default. See privacy policy.
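
To make the cost bullet concrete, here is the top-up fee arithmetic as a short sketch. It uses the 3% and 5% rates and the 2026-06-23 cutoff stated in this guide. Treating the cutoff date itself as still inside the launch window, and rounding to whole cents, are assumptions. top_up_fee is a hypothetical helper, not part of any GammaInfra SDK.

```python
from datetime import date
from decimal import Decimal

# Assumption: the cutoff date itself still counts as the launch window.
LAUNCH_WINDOW_END = date(2026, 6, 23)


def top_up_fee(amount_usd: Decimal, on: date) -> Decimal:
    """Fee charged on a top-up of amount_usd made on the given date."""
    rate = Decimal("0.03") if on <= LAUNCH_WINDOW_END else Decimal("0.05")
    return (amount_usd * rate).quantize(Decimal("0.01"))


launch_fee = top_up_fee(Decimal("10"), date(2026, 6, 1))  # 3% of $10
later_fee = top_up_fee(Decimal("10"), date(2026, 7, 1))   # 5% of $10
```

On the $10 minimum top-up this works out to $0.30 during the launch window and $0.50 afterwards; provider token charges are separate and passed through.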

Ready to try it?

Get a GammaInfra API key →

$3 free trial credit on signup, $10 minimum top-up. Pass-through provider token rates plus 3% top-up fee during the launch window (5% after 2026-06-23).

Frequently asked questions

How do I configure ChatOpenAI() to use GammaInfra?

Pass base_url='https://api.gammainfra.com/v1' and api_key='sk-gammainfra-...' to ChatOpenAI(). Set model='gammainfra/auto' (or any specific model). All standard LangChain features — streaming, callbacks, structured output via with_structured_output, tool binding — work unchanged because the wire format is identical.

Does GammaInfra work with LangChain Expression Language (LCEL)?

Yes. LCEL composes ChatOpenAI() into chains with the | operator or .pipe() the same way regardless of base URL. Async calls, streaming, concurrency limits set via .with_config, and batch invocations via .batch all work. The ChatOpenAI instance is the only thing that changes; everything downstream of it is provider-agnostic.

Can I see per-call cost in LangSmith?

LangSmith doesn't currently parse X-GammaInfra-Cost-USD headers from custom-endpoint calls. To get per-call cost into LangSmith, attach a custom callback that reads the response headers from the ChatOpenAI response and logs them as run metadata. GammaInfra's dashboard always has the authoritative cost.
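
One way to get at those headers, as a hedged sketch: recent langchain-openai versions document an include_response_headers=True option on ChatOpenAI that places the HTTP headers under response_metadata["headers"] on the returned message. Assuming that option is available in your installed version, the helper below picks out the GammaInfra headers so a callback can log them as run metadata. cost_run_metadata and the sample dict are hypothetical.

```python
from typing import Any, Mapping


def cost_run_metadata(response_metadata: Mapping[str, Any]) -> dict[str, str]:
    """Pick GammaInfra headers out of a message's response_metadata."""
    # Assumption: headers sit under the "headers" key when the model was
    # built with include_response_headers=True; otherwise the key is absent.
    headers = response_metadata.get("headers") or {}
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith("x-gammainfra-")
    }


# Hand-written sample in the shape of AIMessage.response_metadata.
sample_metadata = {
    "model_name": "gammainfra/auto",
    "headers": {
        "x-gammainfra-cost-usd": "0.000120",
        "content-type": "application/json",
    },
}
run_metadata = cost_run_metadata(sample_metadata)
```

If the headers key is missing, the helper returns an empty dict, so the callback can run safely against models that were not configured to include headers.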

How do I switch between models per chain step?

Instantiate one ChatOpenAI per step with different model strings, e.g. extract_llm = ChatOpenAI(model='gammainfra/cheap', ...) and reason_llm = ChatOpenAI(model='anthropic/claude-opus-4-7', ...), each with the same GammaInfra base_url and api_key. Compose them in an LCEL chain with .pipe(). For per-step header control, use the default_headers parameter to set X-GammaInfra-Cost-Quality per ChatOpenAI instance.

Does GammaInfra support LangChain's streaming callbacks?

Yes. Pass streaming=True to ChatOpenAI() and attach a BaseCallbackHandler. on_llm_new_token fires per token chunk as expected. The gateway forwards upstream chunks as they arrive, changing only the wire format where a provider's differs from OpenAI's, so callback timing closely matches what you'd see calling OpenAI directly.