vLLM Semantic Router
Signal-driven decision routing for mixture-of-modality deployments.
17 papers / 2025-2026
Papers and infrastructure work around routing as a systems problem.
A research archive spanning semantic routing, agent behavior, and infrastructure efficiency.
Presents a real-time verification layer for long-document RAG systems, handling contexts up to 32K tokens while balancing latency with grounding coverage for interactive production deployments.
Introduces a personal-model-first agent architecture in which a personal AI grows a correctable understanding of Identity, World, Pulse, and Journey through user-paced curiosity and reflection after each turn.
Proposes token-budget-aware pool routing that estimates each request's token budget online and dispatches it to short- or long-context serving pools, reducing GPU cost while improving stability for LLM inference.
Introduces vLLM Semantic Router as a signal-driven routing framework for mixture-of-modality deployments, composing heterogeneous signals into deployment-specific policies across cost, privacy, latency, and safety.
Synthesizes recent routing, fleet, multimodal, and governance work into the Workload-Router-Pool architecture, framing routing as one layer in a broader inference optimization stack.
Formalizes the visual confused deputy as a security failure mode for computer-use agents and proposes a dual-channel guardrail for validating click targets and action reasoning before execution.
Presents OATS, an offline embedding-refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.
Proposes Adaptive VLM Routing to estimate step difficulty in computer-use agents and route each action to the cheapest model that can still satisfy a target reliability threshold.
Combines flash attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds on lightweight shared serving hardware.
Introduces a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets without requiring up-front hardware profiling.
Derives the minimum-cost two-pool LLM fleet directly from workload distributions and P99 TTFT targets, then maps the optimal boundary to a deployable compress-and-route strategy.
Derives the 1/W law, showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing a first-order energy-efficiency lever.
Shows how probabilistic ML predicates in policy languages can silently co-fire on the same query and adds conflict detection plus a softmax prevention mechanism in the Semantic Router DSL.
Extends the Semantic Router DSL from stateless per-request routing to multi-step agent workflows, compiling verified decision artifacts across orchestration, Kubernetes, and protocol layers.
Shows that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a much larger model's performance on persistent user-specific queries while dramatically reducing cost.
Routes prompts by reasoning requirements so reasoning is only invoked when it pays off, improving accuracy while cutting token usage and latency versus always-on reasoning.
Proposes category-aware semantic caching with category-specific similarity thresholds, TTLs, and quotas, using a hybrid split between in-memory HNSW retrieval and external document storage.
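As a rough illustration of the category-aware caching idea in the entry above, here is a minimal sketch in which each category carries its own similarity threshold, TTL, and quota. All names (`CategoryPolicy`, `CategoryCache`, the category labels, the numeric values) are hypothetical and not taken from the paper. A brute-force cosine scan stands in for HNSW retrieval, and responses are held in memory rather than in external document storage.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class CategoryPolicy:
    threshold: float    # minimum cosine similarity that counts as a hit
    ttl_seconds: float  # how long an entry stays valid
    quota: int          # maximum number of entries kept for the category


@dataclass
class Entry:
    embedding: list
    response: str
    stored_at: float


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CategoryCache:
    """Hypothetical cache: one bucket per category, each with its own policy."""

    def __init__(self, policies):
        self.policies = policies
        self.entries = {name: [] for name in policies}

    def put(self, category, embedding, response, now=None):
        now = time.time() if now is None else now
        bucket = self.entries[category]
        bucket.append(Entry(embedding, response, now))
        # Enforce the per-category quota by evicting the oldest entries.
        overflow = len(bucket) - self.policies[category].quota
        if overflow > 0:
            del bucket[:overflow]

    def get(self, category, embedding, now=None):
        now = time.time() if now is None else now
        policy = self.policies[category]
        # Drop entries older than this category's TTL.
        live = [e for e in self.entries[category]
                if now - e.stored_at <= policy.ttl_seconds]
        self.entries[category] = live
        # Brute-force nearest neighbour (assumption: stands in for HNSW).
        best = max(live, key=lambda e: cosine(e.embedding, embedding),
                   default=None)
        if best is not None and cosine(best.embedding, embedding) >= policy.threshold:
            return best.response
        return None


# Illustrative policies: strict and long-lived for code, loose and short-lived for chat.
cache = CategoryCache({
    "code": CategoryPolicy(threshold=0.95, ttl_seconds=60, quota=2),
    "chat": CategoryPolicy(threshold=0.70, ttl_seconds=10, quota=2),
})
```

With these illustrative policies, the same near-duplicate query can be a hit in a loosely matched category and a miss in a strictly matched one, which is the point of keeping thresholds per category rather than global.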
Maintainer, steering, and reviewer roles across gateways, service mesh, and inference infrastructure.
Personal-model-first, self-evolving AI agent that grows a correctable understanding and gets curious at the user's pace.
Manages Envoy Proxy as a standalone or Kubernetes-based application gateway.
Manages unified access to generative AI services built on Envoy Gateway.
Cost-efficient and pluggable infrastructure components for GenAI inference.
AI gateway and AI-native API gateway.
Connects, secures, controls, and observes services.
Observability console for the Istio service mesh.
Manages any layer-7 protocol in a service mesh.
Uses eBPF to speed up service mesh data paths.
Role-oriented, portable, and expressive interfaces for Kubernetes networking.
Converts Ingress resources to Gateway API resources.