1. DAPO: An Open-Source LLM Reinforcement Learning System at Scale, 2025.
2. OpenAI. Learning to reason with LLMs, 2024.
3. Guo, D., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
4. Schulman, J., et al. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.