vLLM Semantic Router
Signal-driven decision routing for mixture-of-modality deployments.
16 papers / 2025-2026
Papers and infrastructure work around routing as a systems problem.
A research archive spanning semantic routing, agent behavior, and infrastructure efficiency.
Proposes token-budget-aware pool routing that estimates each request's token budget online and dispatches it to short- or long-context serving pools, reducing GPU cost while improving stability for LLM inference.
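The core routing decision can be sketched in a few lines. This is an illustrative approximation, not the paper's method: the function names, the word-count stand-in for a tokenizer, and the 4096-token pool boundary are all assumptions.

```python
# Hypothetical sketch of token-budget-aware pool routing. A real deployment
# would use an actual tokenizer and a learned output-length estimator.

def estimate_token_budget(prompt: str, history_tokens: int = 0) -> int:
    """Crude online estimate: prompt length plus headroom for the output."""
    prompt_tokens = len(prompt.split())  # stand-in for a real tokenizer
    return history_tokens + prompt_tokens * 2  # assume output <= prompt size

def route_to_pool(prompt: str, short_ctx_limit: int = 4096) -> str:
    """Dispatch to the short-context pool when the budget fits, else long."""
    budget = estimate_token_budget(prompt)
    return "short-context-pool" if budget <= short_ctx_limit else "long-context-pool"
```

The point of the split is that short-context pools can pack requests more densely, so keeping long-context requests out of them stabilizes latency for the majority of traffic.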
Introduces vLLM Semantic Router as a signal-driven routing framework for mixture-of-modality deployments, composing heterogeneous signals into deployment-specific policies across cost, privacy, latency, and safety.
Synthesizes recent routing, fleet, multimodal, and governance work into the Workload-Router-Pool architecture, framing routing as one layer in a broader inference optimization stack.
Formalizes the visual confused deputy as a security failure mode for computer-use agents and proposes a dual-channel guardrail for validating click targets and action reasoning before execution.
Presents OATS, an offline embedding-refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.
Proposes Adaptive VLM Routing to estimate step difficulty in computer-use agents and route each action to the cheapest model that can still satisfy a target reliability threshold.
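The cheapest-sufficient-model idea above can be sketched as a threshold scan over a cost-ordered model list. Everything here is illustrative: the model names, costs, base reliabilities, and the linear difficulty-to-reliability mapping are assumptions, not the paper's estimator.

```python
# Hypothetical sketch of difficulty-aware model routing: pick the cheapest
# model whose predicted per-step reliability meets the target.

MODELS = [  # sorted by ascending cost
    {"name": "small-vlm",  "cost": 1.0,  "base_reliability": 0.95},
    {"name": "medium-vlm", "cost": 4.0,  "base_reliability": 0.98},
    {"name": "large-vlm",  "cost": 12.0, "base_reliability": 0.995},
]

def route_step(difficulty: float, target: float = 0.9) -> str:
    """difficulty in [0, 1]; harder steps degrade each model's reliability."""
    for model in MODELS:
        predicted = model["base_reliability"] * (1.0 - 0.5 * difficulty)
        if predicted >= target:
            return model["name"]
    return MODELS[-1]["name"]  # fall back to the strongest model
```

Easy steps stay on the small model; once predicted reliability falls below the target for every model, the router falls back to the strongest one rather than refusing the action.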
Combines flash attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds on lightweight shared serving hardware.
Introduces a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets without requiring up-front hardware profiling.
Derives the minimum-cost two-pool LLM fleet directly from workload distributions and P99 TTFT targets, then maps the optimal boundary to a deployable compress-and-route strategy.
Derives the 1/W law, showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing a first-order energy-efficiency lever.
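The stated relationship is that serving efficiency scales roughly as 1/W in the context window W. A minimal numeric sketch, with a purely illustrative proportionality constant:

```python
# Sketch of the 1/W law: tokens-per-watt scales roughly inversely with the
# serving context window W. The constant k is illustrative, not measured.

def tokens_per_watt(context_window: int, k: float = 1.0e7) -> float:
    """Efficiency model: throughput-per-watt ~ k / W."""
    return k / context_window

# Doubling the context window roughly halves efficiency under this model:
ratio = tokens_per_watt(8192) / tokens_per_watt(16384)
```

Under this model, routing a short request to a short-context pool instead of a long-context one is directly an energy win, which is what makes context-length routing an efficiency lever rather than only a latency one.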
Shows how probabilistic ML predicates in policy languages can silently co-fire on the same query and adds conflict detection plus a softmax prevention mechanism in the Semantic Router DSL.
Extends the Semantic Router DSL from stateless per-request routing to multi-step agent workflows, compiling verified decision artifacts across orchestration, Kubernetes, and protocol layers.
Shows that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a much larger model's performance on persistent user-specific queries while dramatically reducing cost.
Adds a real-time verification layer for long-document RAG systems, handling contexts up to 32K tokens while balancing latency with grounding coverage.
Routes prompts by reasoning requirements so reasoning is only invoked when it pays off, improving accuracy while cutting token usage and latency versus always-on reasoning.
Proposes category-aware semantic caching with category-specific similarity thresholds, TTLs, and quotas, using a hybrid split between in-memory HNSW retrieval and external document storage.
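The per-category policy idea can be sketched without the HNSW index or the external document store: below, a flat in-memory scan stands in for approximate nearest-neighbor retrieval, and the category names, thresholds, and TTLs are illustrative assumptions.

```python
import time

# Minimal sketch of category-aware semantic caching: each category gets its
# own similarity threshold and TTL. A real system would back this with an
# HNSW index and an external document store, per the summary above.

CATEGORY_POLICY = {
    "code":    {"threshold": 0.92, "ttl_s": 600},
    "general": {"threshold": 0.85, "ttl_s": 3600},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self):
        self.entries = []  # (category, embedding, answer, expires_at)

    def put(self, category, embedding, answer):
        ttl = CATEGORY_POLICY[category]["ttl_s"]
        self.entries.append((category, embedding, answer, time.time() + ttl))

    def get(self, category, embedding):
        threshold = CATEGORY_POLICY[category]["threshold"]
        now = time.time()
        for cat, emb, answer, expires_at in self.entries:
            if cat == category and expires_at > now and cosine(emb, embedding) >= threshold:
                return answer
        return None
```

Scoping thresholds and TTLs per category lets volatile categories expire quickly and exact-match-sensitive categories demand higher similarity before a cache hit is served.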
Maintainer, steering, and reviewer roles across gateways, service mesh, and inference infrastructure.
Manages Envoy Proxy as a standalone or Kubernetes-based application gateway.
Manages unified access to generative AI services built on Envoy Gateway.
Cost-efficient and pluggable infrastructure components for GenAI inference.
AI gateway and AI-native API gateway.
Connects, secures, controls, and observes services.
Observability console for the Istio service mesh.
Manages any layer-7 protocol in a service mesh.
Uses eBPF to speed up service mesh data paths.
Role-oriented, portable, and expressive interfaces for Kubernetes networking.
Converts Ingress resources to Gateway API resources.