What is task-aware LLM routing?
Task-aware routing — a routing strategy that classifies each incoming prompt into a task category (e.g. reasoning, code, creative, rewrite, chat, extraction, summarize, translation) and dispatches to a model specifically chosen for that task. Replaces the "everything goes through one flagship model" default with a per-prompt model-fit decision, dramatically lowering cost and often improving quality.
Why one model is the wrong default
Production LLM applications typically have one or two flagship models hardcoded. That's expensive: a $5/$25-per-million flagship is overkill for a 30-token extraction step that a $0.25/$2-per-million model would handle identically. And it's brittle: when the flagship has an outage or rate-limits, the entire application stalls.
Task-aware routing trades the single-model simplicity for per-prompt model selection. A short summarization run goes to the cheap fast summarization-specialist. A reasoning-heavy step goes to the flagship. Each request gets the cheapest model that actually answers the question well.
The 8 labels GammaInfra classifies into
- reasoning — multi-step analysis, math, root-cause questions. Dispatched to flagship reasoning models (Claude Opus 4.6, DeepSeek V4 Pro thinking-mode, GPT-5.4).
- code — code generation, debugging, refactoring, code review. Claude Sonnet 4.6 leads (best on real-world code by recent evals).
- creative — story writing, brainstorming, marketing copy. GPT-5-mini and Claude Sonnet 4.6 in chain.
- rewrite — paraphrasing, tone shifting, translation between styles. Mid-tier models work well here at fraction of flagship cost.
- chat — general conversation, simple questions. GPT-5-mini, Mistral Small, Llama 3.1 8B — cheap, fast, plenty good enough.
- extraction — pulling structured data from text. Gemini 2.5 Flash Lite, GPT-5-nano, Mistral Small.
- summarize — condensing long content. Gemini 3.1 Flash Lite leads, DeepSeek V4 Flash backup.
- translation — language translation. Mistral Large 2512 leads.
How the classifier works
GammaInfra's classifier is a MiniLM-based embedding model plus a logistic-regression head. The embedding runs in ~3 ms on CPU per prompt; the LR head adds negligible latency. The classifier outputs a calibrated probability distribution over the 8 labels.
High-confidence predictions (max-probability above the threshold) dispatch immediately to the label's endpoint chain. Low-confidence predictions fall through to the v1 keyword-rule router as a safety net. Capability flags (image content present, tools array set) bypass classification entirely and route to multimodal- or tool-capable endpoints.
The measured cost win
On a representative 6-step agentic research loop, running every step through Claude Opus 4.7 cost $0.245. Task-aware routing — reasoning steps to DeepSeek V4 Pro, extraction steps to GPT-5-mini, summarization to Gemini 3.1 Flash Lite — cost $0.029. That's an 8.3x reduction with equivalent end-to-end output quality.
The math holds because most agent-loop steps are not reasoning. Extraction and summarization steps make up the bulk of token count and they don't benefit from flagship-tier inference.
Common questions
How accurate is the task classifier?
Can I override the routing decision per request?
model=anthropic/claude-opus-4-7. Or bias the router with X-GammaInfra-Preference: quality (forces toward stronger models) or X-GammaInfra-Cost-Quality: 0.3 (continuous dial, 0=quality, 1=cost). Explicit pins bypass classification entirely.What if my prompt doesn't fit any of the 8 labels well?
X-GammaInfra-Router-Version response header tells you which path served the request: v2 (learned), v2_keyword (rule shortcut), v1 (fallback), or direct (explicit pin).Does task-aware routing handle multi-turn conversations?
tool_call.id continuity (each provider validates IDs it issued), so the recommended pattern is to pin a provider for one session and switch providers between sessions.How does task-aware routing relate to model routing in research papers (e.g. RouteLLM, FrugalGPT)?
X-GammaInfra-Cost-Quality) is the production hook for the cost-quality trade-off those papers parametrize.Try the gateway
$3 free trial credit on signup, $10 minimum top-up. Pass-through provider rates plus 3% top-up fee during the launch window (5% after 2026-06-23).
Last updated 2026-05-15.