vLLM Semantic Router
Signal-driven decision routing for mixture-of-modality deployments.
16 papers / 2025-2026
Papers and infrastructure work around routing as a systems problem.
A research archive spanning semantic routing, agent behavior, and infrastructure efficiency.
Proposes token-budget-aware pool routing that estimates each request's token budget online and dispatches it to short- or long-context serving pools, reducing GPU cost while improving stability for LLM inference.
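The core routing decision can be sketched in a few lines. This is an illustrative approximation, not the paper's method: the function names, the word-count stand-in for a tokenizer, and the 4096-token pool boundary are all assumptions.

```python
# Hypothetical sketch of token-budget-aware pool routing. A real deployment
# would use an actual tokenizer and a learned output-length estimator.

def estimate_token_budget(prompt: str, history_tokens: int = 0) -> int:
    """Crude online estimate: prompt length plus headroom for the output."""
    prompt_tokens = len(prompt.split())  # stand-in for a real tokenizer
    return history_tokens + prompt_tokens * 2  # assume output <= prompt size

def route_to_pool(prompt: str, short_ctx_limit: int = 4096) -> str:
    """Dispatch to the short-context pool when the budget fits, else long."""
    budget = estimate_token_budget(prompt)
    return "short-context-pool" if budget <= short_ctx_limit else "long-context-pool"
```

The point of the split is that short-context pools can pack requests more densely, so keeping long-context requests out of them stabilizes latency for the majority of traffic.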
Introduces vLLM Semantic Router as a signal-driven routing framework for mixture-of-modality deployments, composing heterogeneous signals into deployment-specific policies across cost, privacy, latency, and safety.
Synthesizes recent routing, fleet, multimodal, and governance work into the Workload-Router-Pool architecture, framing routing as one layer in a broader inference optimization stack.
Formalizes the visual confused deputy as a security failure mode for computer-use agents and proposes a dual-channel guardrail for validating click targets and action reasoning before execution.
Presents OATS, an offline embedding-refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.
Proposes Adaptive VLM Routing to estimate step difficulty in computer-use agents and route each action to the cheapest model that can still satisfy a target reliability threshold.
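The cheapest-sufficient-model idea above can be sketched as a threshold scan over a cost-ordered model list. Everything here is illustrative: the model names, costs, base reliabilities, and the linear difficulty-to-reliability mapping are assumptions, not the paper's estimator.

```python
# Hypothetical sketch of difficulty-aware model routing: pick the cheapest
# model whose predicted per-step reliability meets the target.

MODELS = [  # sorted by ascending cost
    {"name": "small-vlm",  "cost": 1.0,  "base_reliability": 0.95},
    {"name": "medium-vlm", "cost": 4.0,  "base_reliability": 0.98},
    {"name": "large-vlm",  "cost": 12.0, "base_reliability": 0.995},
]

def route_step(difficulty: float, target: float = 0.9) -> str:
    """difficulty in [0, 1]; harder steps degrade each model's reliability."""
    for model in MODELS:
        predicted = model["base_reliability"] * (1.0 - 0.5 * difficulty)
        if predicted >= target:
            return model["name"]
    return MODELS[-1]["name"]  # fall back to the strongest model
```

Easy steps stay on the small model; once predicted reliability falls below the target for every model, the router falls back to the strongest one rather than refusing the action.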
Combines flash attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds on lightweight shared serving hardware.
Introduces a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets without requiring up-front hardware profiling.
Derives the minimum-cost two-pool LLM fleet directly from workload distributions and P99 TTFT targets, then maps the optimal boundary to a deployable compress-and-route strategy.
Derives the 1/W law, showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing a first-order energy-efficiency lever.
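The stated relationship is that serving efficiency scales roughly as 1/W in the context window W. A minimal numeric sketch, with a purely illustrative proportionality constant:

```python
# Sketch of the 1/W law: tokens-per-watt scales roughly inversely with the
# serving context window W. The constant k is illustrative, not measured.

def tokens_per_watt(context_window: int, k: float = 1.0e7) -> float:
    """Efficiency model: throughput-per-watt ~ k / W."""
    return k / context_window

# Doubling the context window roughly halves efficiency under this model:
ratio = tokens_per_watt(8192) / tokens_per_watt(16384)
```

Under this model, routing a short request to a short-context pool instead of a long-context one is directly an energy win, which is what makes context-length routing an efficiency lever rather than only a latency one.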
Shows how probabilistic ML predicates in policy languages can silently co-fire on the same query and adds conflict detection plus a softmax prevention mechanism in the Semantic Router DSL.
Extends the Semantic Router DSL from stateless per-request routing to multi-step agent workflows, compiling verified decision artifacts across orchestration, Kubernetes, and protocol layers.
Shows that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a much larger model's performance on persistent user-specific queries while dramatically reducing cost.
Adds a real-time verification layer for long-document RAG systems, handling contexts up to 32K tokens while balancing latency with grounding coverage.
Routes prompts by reasoning requirements so reasoning is only invoked when it pays off, improving accuracy while cutting token usage and latency versus always-on reasoning.
Proposes category-aware semantic caching with category-specific similarity thresholds, TTLs, and quotas, using a hybrid split between in-memory HNSW retrieval and external document storage.
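The per-category policy idea can be sketched without the HNSW index or the external document store: below, a flat in-memory scan stands in for approximate nearest-neighbor retrieval, and the category names, thresholds, and TTLs are illustrative assumptions.

```python
import time

# Minimal sketch of category-aware semantic caching: each category gets its
# own similarity threshold and TTL. A real system would back this with an
# HNSW index and an external document store, per the summary above.

CATEGORY_POLICY = {
    "code":    {"threshold": 0.92, "ttl_s": 600},
    "general": {"threshold": 0.85, "ttl_s": 3600},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self):
        self.entries = []  # (category, embedding, answer, expires_at)

    def put(self, category, embedding, answer):
        ttl = CATEGORY_POLICY[category]["ttl_s"]
        self.entries.append((category, embedding, answer, time.time() + ttl))

    def get(self, category, embedding):
        threshold = CATEGORY_POLICY[category]["threshold"]
        now = time.time()
        for cat, emb, answer, expires_at in self.entries:
            if cat == category and expires_at > now and cosine(emb, embedding) >= threshold:
                return answer
        return None
```

Scoping thresholds and TTLs per category lets volatile categories expire quickly and exact-match-sensitive categories demand higher similarity before a cache hit is served.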
Maintainer, steering, and reviewer roles across gateways, service mesh, and inference infrastructure.
Manages Envoy Proxy as a standalone or Kubernetes-based application gateway.
Manages unified access to generative AI services built on Envoy Gateway.
Cost-efficient and pluggable infrastructure components for GenAI inference.
AI gateway and AI-native API gateway.
Connects, secures, controls, and observes services.
Observability console for the Istio service mesh.
Manages any layer-7 protocol in a service mesh.
Uses eBPF to speed up service mesh data paths.
Role-oriented, portable, and expressive interfaces for Kubernetes networking.
Converts Ingress resources to Gateway API resources.