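As AI model capabilities advance rapidly, traditional benchmarks such as MMLU and GPQA can no longer effectively distinguish frontier models from one another. Humanity's Last Exam (HLE) was created in response. It is an expert-level benchmark released jointly by CAIS (Center for AI Safety) and Scale AI, designed to evaluate how well AI models reason on multidisciplinary, high-difficulty questions.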
1. arXiv paper: Humanity's Last Exam: A Challenging Benchmark for AI Models
2. Nature paper: Humanity's Last Exam: Evaluating AI's Expert-Level Reasoning
3. Scale AI official page: Humanity's Last Exam
4. Hugging Face dataset: CAIS HLE Dataset
5. Scale AI leaderboard: Humanity's Last Exam Leaderboard
6. Artificial Analysis: Humanity's Last Exam Evaluation
7. FutureHouse audit report: HLE Exam Audit
8. Official website: Humanity's Last Exam