-
Cross-Instance KV Cache Sharing for Disaggregated LLM Serving: Cutting TTFT with Mooncake and LMCache
How cross-instance KV cache sharing with Mooncake + LMCache reduces TTFT by 24% in multi-instance disaggregated LLM serving
-
NIXL for KV Cache in Disaggregated Serving
How NIXL accelerates KV cache transfer in Prefill/Decode disaggregated LLM serving, its architecture, vLLM integration, and a real-world memory leak debugging story
-
CUDA Graph in vLLM: Eliminating CPU Overhead in LLM Inference
How CUDA Graph reduces CPU launch overhead in LLM decode, memory management with Private Pools, and vLLM's graph capture modes
-
Multi-Node P/D Disagg vLLM Serving: How Does EFA Compare to InfiniBand?
Multi-node GPU communication on AWS EFA, an InfiniBand vs EFA comparison, and vLLM P/D Disagg setup
-
MoE Expert FFN Backend: experts_implementation
Selecting Expert FFN computation backends (eager, batched_mm, grouped_mm) in HuggingFace Transformers and benchmarking with Solar-Open 100B