“Different from Gloeckle et al. (2024), which parallelly predicts D additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth.”
Figure from the DeepSeek-V3 paper.
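To make the distinction in the quote concrete, here is a minimal sketch of which inputs each extra prediction reads under the two schemes. It only traces data flow and does not compute anything. The function names, the dictionary layout, and the labels `trunk[i]` and `h{k}[i]` are hypothetical, and the sequential rule (depth k at position i reads the depth k-1 state at i plus the embedding of token i+k, then predicts token i+k+1) is my reading of the DeepSeek-V3 description, not code from either paper.

```python
def parallel_heads(tokens, depth):
    """Parallel scheme: each extra head reads only the shared trunk state at position i."""
    plan = []
    for i in range(len(tokens)):
        for k in range(1, depth + 1):
            target = i + 1 + k  # head k looks k tokens past the ordinary next token
            if target < len(tokens):
                plan.append({"head": k, "reads": [f"trunk[{i}]"], "predicts": tokens[target]})
    return plan


def sequential_chain(tokens, depth):
    """Sequential scheme: depth k reads the depth k-1 state plus the embedding of token i+k."""
    plan = []
    for k in range(1, depth + 1):
        for i in range(len(tokens)):
            target = i + k + 1
            if target < len(tokens):
                plan.append({
                    "depth": k,
                    # h0 stands for the main model's output state (assumed naming)
                    "reads": [f"h{k - 1}[{i}]", f"emb({tokens[i + k]})"],
                    "predicts": tokens[target],
                })
    return plan


tokens = ["t1", "t2", "t3", "t4", "t5"]
for row in parallel_heads(tokens, depth=2):
    print(row)
for row in sequential_chain(tokens, depth=2):
    print(row)
```

Both plans predict exactly the same targets. The difference is that each sequential step also conditions on the previous depth and on the intermediate token, which is what "keep the complete causal chain" refers to.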
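The Meta paper (Gloeckle et al., 2024) also compared the parallel architecture against a causal architecture in practice. Their conclusion was that the two perform on par, so they chose the parallel architecture because it works better for speculative decoding. Note, however, that although Meta's variant and DeepSeek's design are both called "causal", they are quite different in practice.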
DeepSeek, however, uses MTP mainly to make pre-training more effective; using it as a draft model for speculative decoding is only an optional extra.
Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally.
On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens.
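As a rough illustration of what "densifies the training signals" means, the sketch below counts how many supervised targets one sequence provides with and without MTP. The helper `supervised_targets` is hypothetical, the sequence length and depth are illustrative values, and the count only considers targets that lie inside the sequence, which simplifies the paper's indexing.

```python
def supervised_targets(seq_len, depth):
    """Return (next-token targets, extra MTP targets) for one sequence.

    Assumption: depth k at position i predicts token i + k + 1, so depth k
    contributes seq_len - 1 - k targets that fall inside the sequence.
    """
    base = seq_len - 1
    extra = sum(max(seq_len - 1 - k, 0) for k in range(1, depth + 1))
    return base, extra


base, extra = supervised_targets(seq_len=4096, depth=1)
print(base, extra)             # 4095 4094
print((base + extra) / base)   # just under 2x the supervised targets per sequence
```

The extra targets are correlated with the ordinary next-token targets, so this is a count of the loss terms and not a measure of how much new information the model receives.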