Kuntai Du (UChicago) - Optimizing Communication for Distributed LLM Inference
Abstract: Previous work has identified GPU memory capacity as the primary bottleneck in LLM inference, necessitating effective KV cache management strategies. However, based on a long-term collaboration with vLLM, the most popular open-source serving engine, we observe that the landscape is shifting toward distributed LLM inference, driven by emerging trends such as long-context KV cache reuse, disaggregated prefilling, and multi-modal LLM inference. The central challenge thus shifts to designing efficient KV cache communication mechanisms. This talk explores potential solutions for optimizing communication in distributed systems and argues that effective communication requires two new roles: an orchestrator and a KV cache store. These roles must collaborate closely to meet stringent service-level objectives, paving the way for scalable and efficient distributed LLM-serving systems.
Speaker
Kuntai Du
Kuntai Du is a sixth-year PhD student at UChicago. His research focuses on data transfer for distributed inference systems, including analytics-aware video streaming in distributed video analytics settings and effective KV cache transfer for distributed LLM inference. He is a recipient of the Siebel Scholarship.