What is task-aware LLM routing?

Task-aware routing — a routing strategy that classifies each incoming prompt into a task category (e.g. reasoning, code, creative, rewrite, chat, extraction, summarize, translation) and dispatches to a model specifically chosen for that task. Replaces the "everything goes through one flagship model" default with a per-prompt model-fit decision, dramatically lowering cost and often improving quality.

Why one model is the wrong default

Production LLM applications typically have one or two flagship models hardcoded. That's expensive: a $5/$25-per-million flagship is overkill for a 30-token extraction step that a $0.25/$2-per-million model would handle identically. And it's brittle: when the flagship has an outage or rate-limits, the entire application stalls.

Task-aware routing trades the single-model simplicity for per-prompt model selection. A short summarization run goes to the cheap fast summarization-specialist. A reasoning-heavy step goes to the flagship. Each request gets the cheapest model that actually answers the question well.

The 8 labels GammaInfra classifies into

How the classifier works

GammaInfra's classifier is a MiniLM-based embedding model plus a logistic-regression head. The embedding runs in ~3 ms on CPU per prompt; the LR head adds negligible latency. The classifier outputs a calibrated probability distribution over the 8 labels.

High-confidence predictions (max-probability above the threshold) dispatch immediately to the label's endpoint chain. Low-confidence predictions fall through to the v1 keyword-rule router as a safety net. Capability flags (image content present, tools array set) bypass classification entirely and route to multimodal- or tool-capable endpoints.

The measured cost win

On a representative 6-step agentic research loop, running every step through Claude Opus 4.7 cost $0.245. Task-aware routing — reasoning steps to DeepSeek V4 Pro, extraction steps to GPT-5-mini, summarization to Gemini 3.1 Flash Lite — cost $0.029. That's an 8.3x reduction with equivalent end-to-end output quality.

The math holds because most agent-loop steps are not reasoning. Extraction and summarization steps make up the bulk of token count and they don't benefit from flagship-tier inference.

Common questions

How accurate is the task classifier?
GammaInfra's classifier hits roughly 73% top-1 accuracy on a held-out validation set of 88 prompts. That sounds low until you realize that adjacent labels (creative vs rewrite, chat vs summarize) often have equivalent best-fit models, so misclassifications don't always cost anything. The endpoint registry collapses some adjacent labels to the same chain head for exactly this reason.
Can I override the routing decision per request?
Yes. Pin a specific model with model=anthropic/claude-opus-4-7. Or bias the router with X-GammaInfra-Preference: quality (forces toward stronger models) or X-GammaInfra-Cost-Quality: 0.3 (continuous dial, 0=quality, 1=cost). Explicit pins bypass classification entirely.
What if my prompt doesn't fit any of the 8 labels well?
The classifier outputs probabilities for all 8 labels. If no single label exceeds the confidence threshold, the request falls through to the v1 keyword-rule router (10 task types, broader patterns). If that doesn't match either, it dispatches to the chat chain as a universal default. The X-GammaInfra-Router-Version response header tells you which path served the request: v2 (learned), v2_keyword (rule shortcut), v1 (fallback), or direct (explicit pin).
Does task-aware routing handle multi-turn conversations?
Yes — each turn is classified independently. This sometimes routes different turns of one conversation to different providers, which is fine for text-only chat. For tool-heavy agent loops, switching providers mid-conversation can break tool_call.id continuity (each provider validates IDs it issued), so the recommended pattern is to pin a provider for one session and switch providers between sessions.
How does task-aware routing relate to model routing in research papers (e.g. RouteLLM, FrugalGPT)?
It's the practical production form of those ideas. RouteLLM and FrugalGPT propose learning a per-prompt model-selection policy that maximizes quality-per-dollar. GammaInfra's classifier is one concrete implementation: discrete labels, a small embedding model, a calibrated classifier, and a per-label endpoint registry. The continuous cost-quality dial (X-GammaInfra-Cost-Quality) is the production hook for the cost-quality trade-off those papers parametrize.

Try the gateway

Get a GammaInfra API key →

$3 free trial credit on signup, $10 minimum top-up. Pass-through provider rates plus 3% top-up fee during the launch window (5% after 2026-06-23).

Last updated 2026-05-15.