Multi-head Latent Attention (MLA) compression reduces the KV cache size per token from 516 KB to 70 KB, significantly lowering memory demands during inference.
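As a back-of-envelope check on the 70 KB figure, the sketch below uses DeepSeek-V3's published dimensions (61 layers, a 512-dim compressed KV latent plus a 64-dim decoupled RoPE key, cached in BF16); treat the exact cache layout as an assumption of this sketch.

```python
# Reconstruct the ~70 KB/token MLA cache figure from model dimensions.
# Assumptions: 61 transformer layers; per token, each layer caches a
# 512-dim compressed KV latent plus a 64-dim decoupled RoPE key, in BF16.
N_LAYERS = 61
KV_LATENT_DIM = 512   # compressed latent c_KV, shared across heads
ROPE_DIM = 64         # decoupled per-token RoPE key
BYTES_PER_ELEM = 2    # BF16

bytes_per_token = N_LAYERS * (KV_LATENT_DIM + ROPE_DIM) * BYTES_PER_ELEM
print(f"MLA KV cache per token: {bytes_per_token / 1e3:.1f} KB")  # ~70.3 KB
```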
Only 37 billion of the 671 billion total parameters are activated per token through Mixture-of-Experts (MoE) routing, dramatically reducing compute and memory requirements without compromising model performance.
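To make the sparse-activation idea concrete, here is a minimal top-k gating sketch in PyTorch. It is illustrative only: the expert count, dimensions, and plain softmax gate are placeholders, not DeepSeek-V3's actual DeepSeekMoE router.

```python
import torch

n_experts, top_k, d = 8, 2, 16
gate = torch.nn.Linear(d, n_experts, bias=False)          # routing scores
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, d). Each token runs only its top-k experts, so the
    activated parameter count is a small fraction of the total."""
    probs = gate(x).softmax(dim=-1)
    weights, idx = probs.topk(top_k, dim=-1)               # top-k per token
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t].tolist()):
            out[t] += w * experts[e](x[t])                 # only k experts run
    return out

print(moe_forward(torch.randn(4, d)).shape)                # torch.Size([4, 16])
```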
DeepSeek-V3 requires just 250 GFLOPS per token, compared to 2,448 GFLOPS for dense models like LLaMA-3.1 405B, highlighting its computational efficiency.
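These figures are consistent with the common rule of thumb that transformer training costs roughly 6N FLOPs per token for N active parameters (an approximation that ignores attention overhead, which accounts for the small remaining gap):

```latex
\[
  \text{FLOPs/token} \approx 6\,N_{\text{active}}
\]
\[
  \text{DeepSeek-V3:}\; 6 \times 37{\times}10^{9} \approx 222\ \text{GFLOPS},
  \qquad
  \text{LLaMA-3.1 405B:}\; 6 \times 405{\times}10^{9} \approx 2{,}430\ \text{GFLOPS}.
\]
```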
Achieves up to 67 tokens per second (TPS) on a 400 Gbps InfiniBand network, with the potential to scale to 1,200 TPS using advanced interconnects like NVL72.
Multi-Token Prediction (MTP) improves generation speed by 1.8×, with a token acceptance rate of 80-90%, enhancing inference throughput.
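The 1.8× figure follows directly from the acceptance rate: assuming MTP drafts one extra token per decoding step and that draft is accepted with probability p, the expected output per step is

```latex
\[
  \mathbb{E}[\text{tokens/step}] = 1 + p,
  \qquad p \in [0.8,\ 0.9] \;\Rightarrow\; 1.8\text{--}1.9\times .
\]
```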
FP8 mixed-precision training enables faster computation with less than 0.25% accuracy degradation, validated through extensive small-scale ablations.
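The sketch below simulates one ingredient of such a recipe: fine-grained per-tile scaling before casting to FP8 E4M3, using PyTorch's float8_e4m3fn dtype. The tile size and scaling scheme here are assumptions for illustration, not DeepSeek-V3's exact training recipe.

```python
import torch

FP8_MAX = 448.0  # max representable magnitude in FP8 E4M3

def quantize_tilewise(x: torch.Tensor, tile: int = 128):
    """Per-tile FP8 quantization: each run of 128 values along the last
    dim gets its own scale, so a single outlier can't swamp the range."""
    orig_shape = x.shape
    x = x.reshape(-1, tile)                          # group into tiles
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_MAX
    scale = scale.clamp(min=1e-12)                   # avoid divide-by-zero
    q = (x / scale).to(torch.float8_e4m3fn)          # cast rounds to FP8
    return q.reshape(orig_shape), scale

def dequantize_tilewise(q: torch.Tensor, scale: torch.Tensor, tile: int = 128):
    x = q.to(torch.float32).reshape(-1, tile) * scale
    return x.reshape(q.shape)

x = torch.randn(4, 1024)
q, s = quantize_tilewise(x)
err = (dequantize_tilewise(q, s) - x).abs().mean() / x.abs().mean()
print(f"mean relative error: {err:.4f}")             # typically a few percent
```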
Capable of running on a $10,000 server equipped with a consumer-grade GPU, delivering nearly 20 TPS, making high-performance LLMs more accessible.