What is an LLM router?

LLM router — the decision component that picks which large-language-model provider and model should handle each incoming prompt. Routing decisions can be rule-based (keyword or task classification), learned (an embedding model plus a classifier trained on quality and cost signals), or caller-driven (the request explicitly pins a model).

Why routing exists

Every LLM provider has strengths and weaknesses, priced separately. A reasoning-heavy step might justify Claude Opus 4.7 at $5/$25 per million tokens; a short extraction step does not. Manual model selection per call doesn't scale — production applications have hundreds of distinct call sites and the right model for each shifts as providers ship new versions weekly. The LLM router solves this by making the model-selection decision at request time rather than at code-write time.

Routing strategies in practice

Rule-based routing

The simplest approach: regex-match the prompt against keyword patterns, map to a task label, dispatch to a hard-coded chain. "summarize this" hits the summarize chain. Images in the messages array hits a multimodal-only chain. Predictable, debuggable, works on day one with no training data. Limited to the patterns the rules cover.

Learned (ML-based) routing

An embedding model (typically MiniLM or BGE-base) converts the prompt into a fixed-size vector. A trained classifier (often logistic regression for speed) maps that vector to a logical-label distribution. The label resolves through a per-label endpoint registry that incorporates cost, live p50 latency, and quality signals. Captures subtler patterns than rules but needs accumulated quality data to train on.

Caller-driven routing

The caller pins a specific model in the request — model=anthropic/claude-opus-4-7, model=openai/gpt-5-mini. The router bypasses smart selection and dispatches directly. Useful for high-stakes calls where the model choice is part of the application's design decision, not a routing variable.

Hybrid (most production routers)

Rules for high-confidence shortcuts (images present → multimodal endpoint, tools array set → tool-capable endpoint, explicit pin → respect it). A learned classifier for the rest. Caller-driven preferences (X-GammaInfra-Preference, X-GammaInfra-Cost-Quality) bias the learned-router output.

How GammaInfra's router works

GammaInfra runs a two-stage hybrid router:

  1. Stage 0 — capability shortcuts. Image content in messages, non-empty tools array, or an explicit direct pin bypasses learned routing entirely.
  2. Stage A — MiniLM embedding + logistic-regression classifier. 8 logical labels: reasoning, code, creative, rewrite, chat, extraction, summarize, translation. Outputs a confidence-calibrated label distribution.
  3. Stage B — endpoint resolution. The label resolves through a registry of (provider, model) candidates ordered by cost, quality preset, or live p50 latency depending on the caller's preference hint. Low-confidence outputs fall through to the v1 rule-based router as a safety net.

Every routing decision is reported back in the X-GammaInfra-Router-Version response header — values include v2 (default learned path), v2_keyword (rule shortcut hit), v2_short_prompt (short-prompt guard), v2_hedged (parallel-race), direct (caller pinned), v1 (learned router fell back).

Common questions

What does an LLM router actually decide?
For each incoming request, the router picks (a) which provider to dispatch to, (b) which specific model on that provider, (c) the order of fallback candidates if the first choice fails. Some routers also decide whether to hedge — fire two providers in parallel and take the first success.
Rule-based vs learned routing — which is better?
Rule-based routing is predictable, debuggable, and works on day one with no training data. Learned routing requires accumulated quality signals to train on but can capture subtle prompt-to-model fit that rules miss. Production routers usually combine both — rules for high-confidence shortcuts, a learned classifier for the rest.
Can callers override the router?
Yes, in two ways. (1) Pin a specific model in the request: model=anthropic/claude-opus-4-7 bypasses smart routing entirely. (2) Use preference hints: X-GammaInfra-Preference: quality biases toward stronger models, X-GammaInfra-Preference: cost biases toward cheaper ones, X-GammaInfra-Cost-Quality: 0.3 is a continuous dial.
How does the router avoid getting stuck on a slow provider?
Two mechanisms. First, a max-latency budget per request via X-GammaInfra-Max-Latency-Ms — the upstream call is cancelled and the request 504s if the budget is exceeded. Second, live p50 latency monitoring (refreshed every 30 seconds, 5-minute window) updates the router's preference ordering so a chronically slow provider drops in priority automatically.
What inputs does a learned LLM router use?
Typically: a prompt embedding (e.g. MiniLM, BGE-base), the presence of attachments (images, audio), the tools/functions array, the requested response_format, the caller's preference signal, and live operational signals (provider health, live p50 latency, cost). The classifier maps these to a logical-label distribution which then resolves through a per-label endpoint registry.

Try the router

Get a GammaInfra API key →

$3 free trial credit on signup. The router runs by default on every request to gammainfra/auto — no configuration needed.

Last updated 2026-05-15.