-
Cross-Instance KV Cache Sharing for Disaggregated LLM Serving: Cutting TTFT with Mooncake and LMCache
How cross-instance KV cache sharing with Mooncake + LMCache reduces TTFT by 24% in multi-instance disaggregated LLM serving
-
NIXL for KV Cache in Disaggregated Serving
How NIXL accelerates KV cache transfer in Prefill/Decode disaggregated LLM serving, its architecture, vLLM integration, and a real-world memory leak debugging story
-
CUDA Graph in vLLM: Eliminating CPU Overhead in LLM Inference
How CUDA Graph reduces CPU launch overhead in LLM decode, memory management with Private Pools, and vLLM's graph capture modes
-
Multi-Node P/D Disagg vLLM Serving: How Does EFA Compare to InfiniBand?
Multi-node GPU communication on AWS EFA, an InfiniBand vs EFA comparison, and vLLM P/D Disagg setup
-
MoE Expert FFN Backend: experts_implementation
Selecting Expert FFN computation backends (eager, batched_mm, grouped_mm) in HuggingFace Transformers and benchmarking with Solar-Open 100B