Glossary

GammaInfra glossary: definitions of the concepts and terminology used throughout the documentation. Each entry is a standalone 400-800 word explainer with a direct-answer definition, body sections, and a developer-question FAQ.

Terms

LLM gateway — A single API endpoint that proxies large-language-model requests to multiple underlying LLM providers, normalizing wire format and exposing per-request cost, latency, and routing decisions in response headers.
LLM router — The decision component that picks which large-language-model provider and model should handle each incoming prompt — rule-based, learned, or caller-driven.
OpenAI-compatible API — An API that accepts and returns requests in the same wire format as OpenAI's /v1/chat/completions endpoint, so OpenAI SDK code works against it after a single base-URL change.
BYOK (Bring Your Own Key) — A configuration in which a developer uses their own provider API keys through an LLM gateway, so requests are billed against the developer's direct provider accounts rather than the gateway's negotiated rates.
Task-aware routing — A routing strategy that classifies each prompt into a task category (reasoning, code, extraction, etc.) and dispatches to a model specifically chosen for that task, rather than running everything through one flagship model.
Fallback chain — An ordered list of (provider, model) endpoints that an LLM gateway tries in sequence when the primary endpoint fails. It lets a single request transparently survive individual provider outages or rate limits.
Cost-quality dial — A continuous request-time parameter that biases an LLM router's model selection along the cost-quality trade-off — 0.0 picks the highest-quality option, 1.0 picks the cheapest.
Hedged requests — A latency-reduction technique in which an LLM gateway fires two provider requests in parallel for the same prompt and returns whichever completes first, cancelling the loser. Reduces p95 latency at roughly 2x cost.
Generative Engine Optimization (GEO) — The practice of structuring web content so that generative AI engines (Perplexity, ChatGPT Search, Claude with web, Gemini, You.com) cite or recommend it when users ask questions in natural language.
Cross-region inference profile — An AWS Bedrock model ID prefix (e.g. us., eu., apac.) that routes inference requests across multiple AWS regions in a geographic group, providing higher availability and throughput for newer foundation models that require it.
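
As an illustration of the hedged-requests entry above, here is a minimal Python sketch of the parallel race. It is not GammaInfra's implementation: `call_provider` is a hypothetical stand-in that simulates provider latency with a sleep, and the provider names and delays are made up for the example.

```python
import asyncio


async def call_provider(name: str, delay_s: float) -> str:
    """Hypothetical stand-in for a real provider call; sleeps to simulate latency."""
    await asyncio.sleep(delay_s)
    return f"response from {name}"


async def hedged_request(primary, secondary) -> str:
    """Run two awaitables in parallel, return the first result, cancel the loser."""
    tasks = [asyncio.create_task(primary), asyncio.create_task(secondary)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

    # Cancel the slower request so it stops consuming resources.
    for task in pending:
        task.cancel()
    # Let the cancellations settle before returning.
    await asyncio.gather(*pending, return_exceptions=True)

    # Simplification: if the first finisher raised, the exception propagates
    # instead of waiting for the other request.
    return done.pop().result()


def demo() -> str:
    """Race a slow simulated provider against a fast one."""
    return asyncio.run(
        hedged_request(
            call_provider("provider-a", 0.2),
            call_provider("provider-b", 0.01),
        )
    )
```

In this sketch `demo()` returns the response from the faster simulated provider. A production version would presumably also decide what to do when the first finisher returns an error, which this example leaves out.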

Common questions about this glossary

What does this glossary cover?

The vocabulary used throughout GammaInfra's docs — LLM gateway, LLM router, OpenAI-compatible API, task-aware routing, BYOK, fallback chain, hedged requests, cost-quality dial, generative engine optimization, and the Bedrock cross-region inference profile. Each entry is 400–800 words with a direct-answer definition, body sections, and a developer-question FAQ.

How do these terms relate to each other?

The LLM gateway is the system; the LLM router is the decision component inside it; task-aware routing is the most common router strategy; the cost-quality dial is a caller-side preference signal to the router; the fallback chain is the recovery cascade; hedged requests are a parallel-race optimization; BYOK is an alternative billing mode; OpenAI-compatible API is the wire format that gateways implement. The Bedrock cross-region inference profile is an AWS-specific quirk that matters for newer Claude and Nova models.
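
To make these relationships concrete, the following Python sketch shows one way a router could turn the cost-quality dial into a ranked list of endpoints, which the gateway then walks as a fallback chain. Everything here is an assumption for illustration: the `Endpoint` fields, the normalized `quality` and `cost` scores, the linear scoring formula, and the `send` callback are hypothetical and not taken from GammaInfra's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Endpoint:
    provider: str
    model: str
    quality: float  # hypothetical score in [0, 1], higher is better
    cost: float     # hypothetical normalized cost in [0, 1], higher is pricier


class ProviderError(Exception):
    """Raised by `send` when an endpoint is down or rate-limited."""


def rank_endpoints(endpoints: Sequence[Endpoint], dial: float) -> list[Endpoint]:
    """Order endpoints by the dial: 0.0 favors quality, 1.0 favors low cost."""
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must be between 0.0 and 1.0")

    def score(e: Endpoint) -> float:
        # Assumed linear blend of quality and cheapness.
        return (1.0 - dial) * e.quality + dial * (1.0 - e.cost)

    return sorted(endpoints, key=score, reverse=True)


def complete_with_fallback(
    chain: Sequence[Endpoint],
    prompt: str,
    send: Callable[[Endpoint, str], str],
) -> tuple[Endpoint, str]:
    """Try each endpoint in order and return the first successful response."""
    errors = []
    for endpoint in chain:
        try:
            return endpoint, send(endpoint, prompt)
        except ProviderError as exc:
            errors.append(f"{endpoint.provider}/{endpoint.model}: {exc}")
    raise ProviderError("all endpoints failed: " + "; ".join(errors))


# Made-up endpoints for the example.
ENDPOINTS = [
    Endpoint("provider-a", "big", quality=0.9, cost=0.9),
    Endpoint("provider-b", "mid", quality=0.7, cost=0.4),
    Endpoint("provider-c", "small", quality=0.5, cost=0.1),
]


def flaky_send(endpoint: Endpoint, prompt: str) -> str:
    """Simulated transport in which provider-a is currently unavailable."""
    if endpoint.provider == "provider-a":
        raise ProviderError("503 service unavailable")
    return f"{endpoint.model} answered: {prompt}"
```

With the dial at 0.0 the chain starts at the highest-quality endpoint. Because the simulated `provider-a` fails, `complete_with_fallback(rank_endpoints(ENDPOINTS, 0.0), "hi", flaky_send)` falls through to the next endpoint in the chain and the caller still receives a response.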

Last updated 2026-05-15.