Figure 1: Performance gains in tokens per second across different LLM serving frameworks, highlighting optimizations in vLLM. Source: 2025-02-27 - vLLM Office Hours - DeepSeek and vLLM.

Open Infra Week contributions
During Open Infra Week in February 2025, DeepSeek released a series of inference kernel improvements aimed at speeding up model execution. Our team has been working to integrate these optimizations into vLLM and improve its performance.
Key contributions from Open Infra Week include:
• FlashMLA (Multi-Head Latent Attention): A kernel for MLA that speeds up batched decoding.
• DPP (Dynamic Partitioning for Parallelism): A new method to balance computational loads across distributed environments.
• Speculative decoding enhancements: Techniques that boost inference speed while maintaining accuracy (a minimal sketch follows the next list).
• Expert Parallelism (EP): Assigns specific experts to dedicated GPUs, ensuring efficient utilization and reducing redundancy (a toy routing sketch appears at the end of this section).
• Data Parallelism (DP): Distributes batched sequences across GPUs for the attention layers, avoiding KV cache duplication to improve memory efficiency.
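To make the speculative decoding idea concrete, here is a minimal sketch of the greedy draft-and-verify loop. This is not vLLM's implementation: `draft_model` and `target_model` are hypothetical callables that map a token sequence to next-token logits, and a real engine verifies all proposals in a single batched forward pass rather than a Python loop.

```python
# Minimal sketch of greedy draft-and-verify speculative decoding.
# `draft_model` and `target_model` are hypothetical stand-ins, not vLLM's
# API: each maps a list of token ids to next-token logits.

import numpy as np

def greedy_next(logits: np.ndarray) -> int:
    return int(np.argmax(logits))

def speculative_step(target_model, draft_model, tokens: list[int], k: int = 4) -> list[int]:
    """Propose k tokens with the cheap draft model, then check them
    against the target model, accepting the longest agreeing prefix."""
    # 1) Draft: autoregressively propose k candidate tokens.
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        nxt = greedy_next(draft_model(draft))
        proposed.append(nxt)
        draft.append(nxt)

    # 2) Verify: compare each proposal with the target model's choice.
    accepted = []
    for i in range(k):
        target_choice = greedy_next(target_model(tokens + accepted))
        if target_choice == proposed[i]:
            accepted.append(proposed[i])    # draft agrees: token accepted for free
        else:
            accepted.append(target_choice)  # first mismatch: take the target's token
            break
    else:
        # All k drafts accepted: the target's own next token comes as a bonus.
        accepted.append(greedy_next(target_model(tokens + accepted)))
    return accepted
```

Output quality matches the target model exactly under greedy decoding; the speedup depends on how often the draft's proposals are accepted.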
These techniques allow us to distribute computational load effectively, enabling more scalable inference. Check out the Office Hours recording on distributed inference with vLLM.
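As an illustration of how expert parallelism distributes load, the toy sketch below groups tokens by the GPU rank that owns their top-1 expert. All names here (`num_experts`, `world_size`, `route`) are illustrative assumptions, not vLLM internals; a real deployment replaces the final grouping with an all-to-all exchange between ranks.

```python
# Toy sketch of expert-parallel dispatch: each rank owns a disjoint slice
# of experts, and tokens are bucketed by destination rank. Names are
# illustrative, not vLLM internals.

import numpy as np

num_experts, world_size = 8, 4
experts_per_rank = num_experts // world_size  # experts 0-1 -> rank 0, etc.

def route(gate_logits: np.ndarray) -> np.ndarray:
    """Top-1 gating: pick the highest-scoring expert per token."""
    return np.argmax(gate_logits, axis=-1)

gate_logits = np.random.randn(16, num_experts)   # 16 tokens in the batch
expert_ids = route(gate_logits)
dest_ranks = expert_ids // experts_per_rank      # which GPU owns each expert

# Group token indices by destination rank; in a real system these buckets
# feed an all-to-all so each GPU only runs the experts it hosts.
buckets = {r: np.where(dest_ranks == r)[0] for r in range(world_size)}
for rank, idx in buckets.items():
    print(f"rank {rank} receives tokens {idx.tolist()}")
```

Because each expert's weights live on exactly one GPU, no expert is replicated, which is what keeps memory use flat as the expert count grows.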
Future roadmap and next steps