What is an LLM router?
LLM router — the decision component that picks which large-language-model provider and model should handle each incoming prompt. Routing decisions can be rule-based (keyword or task classification), learned (an embedding model plus a classifier trained on quality and cost signals), or caller-driven (the request explicitly pins a model).
Why routing exists
Every LLM provider has its own strengths, weaknesses, and prices. A reasoning-heavy step might justify Claude Opus 4.7 at $5/$25 per million input/output tokens; a short extraction step does not. Selecting a model by hand for each call doesn't scale: production applications can have hundreds of distinct call sites, and the right model for each one shifts as providers ship new versions weekly. An LLM router solves this by making the model-selection decision at request time rather than at code-write time.
Routing strategies in practice
Rule-based routing
The simplest approach: regex-match the prompt against keyword patterns, map the match to a task label, and dispatch to a hard-coded chain. A prompt containing "summarize this" hits the summarize chain. An image in the messages array hits a multimodal-only chain. Predictable, debuggable, and works on day one with no training data, but limited to the patterns the rules cover.
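A minimal sketch of rule-based routing, for illustration only. The patterns, label names, and chain names are hypothetical, and the image check assumes OpenAI-style content parts with type "image_url"; a real router's message format may differ.

```python
import re

# Hypothetical keyword rules: (compiled pattern, task label). First match wins.
RULES = [
    (re.compile(r"\bsummar(y|ize|ise)\b", re.IGNORECASE), "summarize"),
    (re.compile(r"\btranslate\b", re.IGNORECASE), "translation"),
    (re.compile(r"\b(extract|parse)\b", re.IGNORECASE), "extraction"),
]

# Hypothetical mapping from task label to a hard-coded chain name.
CHAINS = {
    "summarize": "summarize-chain",
    "translation": "translation-chain",
    "extraction": "extraction-chain",
    "multimodal": "multimodal-chain",
    "default": "general-chain",
}


def has_image(messages):
    """Return True if any message carries an image part (assumed message shape)."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "image_url"
            for part in content
        ):
            return True
    return False


def route_by_rules(messages):
    """Pick a chain using only hand-written rules."""
    if has_image(messages):
        return CHAINS["multimodal"]
    # Join all plain-text message bodies and test each rule in order.
    text = " ".join(
        m["content"] for m in messages if isinstance(m.get("content"), str)
    )
    for pattern, label in RULES:
        if pattern.search(text):
            return CHAINS[label]
    return CHAINS["default"]
```

Anything the rules do not cover falls through to the default chain, which is exactly the limitation described above.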
Learned (ML-based) routing
An embedding model (typically MiniLM or BGE-base) converts the prompt into a fixed-size vector. A trained classifier (often logistic regression for speed) maps that vector to a logical-label distribution. The label resolves through a per-label endpoint registry that incorporates cost, live p50 latency, and quality signals. Captures subtler patterns than rules but needs accumulated quality data to train on.
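The inference path of a learned router can be sketched as below. This is a toy under loud assumptions: embed is a hashed bag-of-words stand-in for a real sentence-embedding model such as MiniLM, and the weights are hand-built placeholders rather than trained coefficients. Only the shape of the computation (vector in, linear scores, softmax, label distribution out) mirrors multinomial logistic regression.

```python
import hashlib
import math

# Illustrative subset of labels; a real router would define its own set.
LABELS = ["reasoning", "code", "summarize", "translation"]
DIM = 64


def embed(text):
    """Stand-in embedding: hashed bag of words, L2-normalised (NOT a real model)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Placeholder weights built from seed phrases. In practice these would be
# coefficients learned from accumulated quality and cost data.
SEEDS = {
    "reasoning": "prove explain why step by step",
    "code": "function bug python compile error",
    "summarize": "summarize summary tldr shorten",
    "translation": "translate french german spanish",
}
WEIGHTS = {label: embed(SEEDS[label]) for label in LABELS}
BIAS = {label: 0.0 for label in LABELS}


def classify(prompt):
    """Return a probability distribution over LABELS for the prompt."""
    x = embed(prompt)
    logits = {
        label: sum(w * xi for w, xi in zip(WEIGHTS[label], x)) + BIAS[label]
        for label in LABELS
    }
    # Numerically stable softmax over the linear scores.
    peak = max(logits.values())
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}
```

The resulting distribution is what a downstream endpoint registry would consume, with the top probability serving as the confidence of the decision.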
Caller-driven routing
The caller pins a specific model in the request — model=anthropic/claude-opus-4-7, model=openai/gpt-5-mini. The router bypasses smart selection and dispatches directly. Useful for high-stakes calls where the model choice is part of the application's design decision, not a routing variable.
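Detecting a pin can be as simple as inspecting the model field. The sketch below assumes pins follow the provider/model form shown above and that gammainfra/auto is the value that asks for smart routing; the function name and error handling are illustrative, not a documented API.

```python
AUTO_MODEL = "gammainfra/auto"  # assumed to be the "let the router decide" value


def parse_model_pin(model):
    """Return (provider, model_name) for an explicit pin, or None for smart routing."""
    if not model or model == AUTO_MODEL:
        return None
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name
```

A router that receives a non-None result here would skip classification and dispatch straight to the named endpoint.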
Hybrid (most production routers)
Rules for high-confidence shortcuts (images present → multimodal endpoint, tools array set → tool-capable endpoint, explicit pin → respect it). A learned classifier for the rest. Caller-driven preferences (X-GammaInfra-Preference, X-GammaInfra-Cost-Quality) bias the learned-router output.
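The layering can be sketched as a short chain of checks that ends in the classifier. The precedence shown (pin first, then images, then tools) is an assumption, as are the function names, the OpenAI-style image part shape, and the use of gammainfra/auto as the non-pinned model value; classify stands for whatever learned classifier the router uses.

```python
AUTO_MODEL = "gammainfra/auto"  # assumed value that requests smart routing


def contains_image(messages):
    """Return True if any message has an image part (assumed message shape)."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "image_url"
            for part in content
        ):
            return True
    return False


def hybrid_route(request, classify):
    """Return (path, target): shortcuts first, learned classifier last."""
    model = request.get("model")
    if model and model != AUTO_MODEL:
        return ("pinned", model)            # explicit pin: respect it
    if contains_image(request.get("messages", [])):
        return ("rule", "multimodal")       # images present
    if request.get("tools"):
        return ("rule", "tool-capable")     # non-empty tools array
    return ("learned", classify(request))   # everything else
```

Keeping the shortcuts ahead of the classifier means the cheap, certain decisions never pay for an embedding pass.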
How GammaInfra's router works
GammaInfra runs a hybrid router: a capability pre-check (Stage 0) followed by a two-stage learned path (Stages A and B):
- Stage 0 — capability shortcuts. Image content in messages, non-empty tools array, or an explicit direct pin bypasses learned routing entirely.
- Stage A — MiniLM embedding + logistic-regression classifier. 8 logical labels: reasoning, code, creative, rewrite, chat, extraction, summarize, translation. Outputs a confidence-calibrated label distribution.
- Stage B — endpoint resolution. The label resolves through a registry of (provider, model) candidates ordered by cost, quality preset, or live p50 latency depending on the caller's preference hint. Low-confidence outputs fall through to the v1 rule-based router as a safety net.
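Stage B can be pictured roughly as follows. This is a sketch, not GammaInfra's implementation: the Endpoint fields, the 0.5 confidence threshold, the "latency" preference value, and the choice to rank endpoints without recent samples last are all assumptions. The 5-minute window matches the latency monitoring window described in the FAQ below.

```python
import statistics
import time
from collections import deque
from dataclasses import dataclass, field

WINDOW_SECONDS = 300  # 5-minute latency window


@dataclass
class Endpoint:
    provider: str
    model: str
    cost_per_mtok: float  # hypothetical blended price per million tokens
    quality: float        # hypothetical quality score in [0, 1]
    samples: deque = field(default_factory=deque)  # (timestamp, latency_ms)

    def record(self, latency_ms, now=None):
        """Store one observed latency sample."""
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_ms))

    def p50(self, now=None):
        """Median latency over the window; infinity when there is no recent data."""
        now = time.monotonic() if now is None else now
        while self.samples and now - self.samples[0][0] > WINDOW_SECONDS:
            self.samples.popleft()
        if not self.samples:
            return float("inf")
        return statistics.median(latency for _, latency in self.samples)


def resolve(label_probs, registry, preference="cost", min_confidence=0.5, now=None):
    """Return candidates for the top label, best first, or None on low confidence."""
    label, confidence = max(label_probs.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return None  # caller should fall back to the rule-based router
    sort_keys = {
        "cost": lambda e: e.cost_per_mtok,
        "quality": lambda e: -e.quality,
        "latency": lambda e: e.p50(now),
    }
    return sorted(registry[label], key=sort_keys[preference])
```

Returning an ordered list rather than a single endpoint leaves room for the caller to try the next candidate if the first one fails.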
Every routing decision is reported back in the X-GammaInfra-Router-Version response header — values include v2 (default learned path), v2_keyword (rule shortcut hit), v2_short_prompt (short-prompt guard), v2_hedged (parallel-race), direct (caller pinned), v1 (learned router fell back).
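On the client side, the header can be turned into a log-friendly description with a small lookup. The function name is hypothetical, and because the list above says values "include" these six, the sketch treats any other value as unrecognised rather than as an error.

```python
# Values and meanings as listed above; the set may not be exhaustive.
ROUTER_VERSIONS = {
    "v2": "default learned path",
    "v2_keyword": "rule shortcut hit",
    "v2_short_prompt": "short-prompt guard",
    "v2_hedged": "parallel race",
    "direct": "caller pinned a model",
    "v1": "learned router fell back to rules",
}


def describe_routing(headers):
    """Explain the routing path from a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    version = lowered.get("x-gammainfra-router-version")
    if version is None:
        return "no routing header present"
    return ROUTER_VERSIONS.get(version, f"unrecognised value: {version}")
```

Logging this alongside latency and cost makes it easy to see how often requests take the fallback or hedged paths.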
Common questions
What does an LLM router actually decide?
Rule-based vs learned routing — which is better?
Can callers override the router?
Yes, in two ways. (1) Pin a model: model=anthropic/claude-opus-4-7 bypasses smart routing entirely. (2) Use preference hints: X-GammaInfra-Preference: quality biases toward stronger models, X-GammaInfra-Preference: cost biases toward cheaper ones, and X-GammaInfra-Cost-Quality: 0.3 is a continuous dial.
How does the router avoid getting stuck on a slow provider?
First, a per-request latency budget set with X-GammaInfra-Max-Latency-Ms: the upstream call is cancelled and the request 504s if the budget is exceeded. Second, live p50 latency monitoring (refreshed every 30 seconds, 5-minute window) updates the router's preference ordering so a chronically slow provider drops in priority automatically.
What inputs does a learned LLM router use?
The prompt embedding, request fields such as response_format, the caller's preference signal, and live operational signals (provider health, live p50 latency, cost). The classifier maps these to a logical-label distribution, which then resolves through a per-label endpoint registry.
Try the router
$3 free trial credit on signup. The router runs by default on every request to gammainfra/auto — no configuration needed.
Last updated 2026-05-15.