This page presents the performance of several mainstream large language models on common evaluation benchmarks, including MMLU, GSM8K, HumanEval, and other standard datasets. The results are updated continuously to help developers and researchers understand how different models perform across a range of tasks. Users can also pick specific models and benchmarks to compare side by side and quickly see each model's strengths and weaknesses in practical applications.
For a detailed introduction to each benchmark, see: LLM Benchmark List and Overview (LLM 评测基准列表与介绍). All scores in the table below are percentages; "–" indicates that no result is available for that model on that benchmark.
Model | MMLU | MMLU Pro | GSM8K | HumanEval | MBPP | MATH | BBH | GPQA Diamond |
---|---|---|---|---|---|---|---|---|
OpenAI o1 | 90.8 | 91.04 | – | – | – | 94.8 | – | 77.3 |
Qwen2.5-Max | 87.9 | 69.0 | 94.5 | 73.2 | 80.6 | 68.5 | – | – |
Qwen2.5-14B | – | 63.69 | – | – | – | – | – | – |
Gemini 2.0 Pro Experimental | – | 79.1 | – | – | – | 91.8 | – | 64.7 |
Llama3.1-70B | – | 52.47 | – | – | – | – | – | – |
Llama3-70B | – | 52.78 | – | – | – | – | – | – |
Llama3-70B-Instruct | – | 56.2 | – | – | – | – | – | – |
Mixtral-8x22B-Instruct-v0.1 | – | 56.33 | – | – | – | – | – | – |
Gemma2-27B | – | 56.54 | – | – | – | – | – | – |
Claude3-Sonnet | – | 56.8 | – | – | – | – | – | – |
Llama3.1-405B | – | 61.6 | – | – | – | – | – | – |
Claude 3.5 Haiku | – | 62.12 | – | – | – | – | – | – |
Llama3.1-70B-Instruct | – | 62.84 | – | – | – | – | – | – |
GPT-4o mini | – | 63.09 | – | – | – | – | – | – |
DeepSeek-R1 | – | 84.0 | – | – | – | – | – | – |
Llama3.3-70B-Instruct | – | 65.92 | – | – | – | – | – | – |
Claude3-Opus | – | 68.45 | – | – | – | – | – | – |
Qwen2.5-32B | – | 69.23 | – | – | – | – | – | – |
Gemini 1.5 Pro | – | 70.25 | – | – | – | – | – | – |
Phi 4 - 14B | – | 70.4 | – | – | – | – | – | – |
QwQ-32B-Preview | – | 70.97 | – | – | – | – | – | – |
Qwen2.5-72B | – | 71.59 | – | – | – | – | – | – |
Llama3.1-405B Instruct | – | 73.3 | – | – | – | – | – | – |
DeepSeek-V3 | – | 75.87 | – | – | – | – | – | – |
Gemini 2.0 Flash Experimental | – | 76.24 | – | – | – | – | – | – |
Claude 3.5 Sonnet | – | 77.64 | – | – | – | – | – | – |
GPT-4o | – | 77.9 | – | – | – | – | – | – |
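As a minimal sketch of the kind of side-by-side comparison described above, the snippet below ranks a handful of models from the table by their MMLU Pro score. The scores are copied from the table; the dictionary name, the subset of models, and the choice of benchmark are illustrative only, not part of this page's data pipeline.

```python
# Minimal sketch: rank a subset of models from the table above by MMLU Pro.
# Scores are percentages as listed; models without a published score are omitted.
MMLU_PRO_SCORES = {
    "OpenAI o1": 91.04,
    "DeepSeek-R1": 84.0,
    "GPT-4o": 77.9,
    "Claude 3.5 Sonnet": 77.64,
    "DeepSeek-V3": 75.87,
    "Qwen2.5-Max": 69.0,
}


def rank_models(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_models(MMLU_PRO_SCORES):
        print(f"{model:<20} {score:>6.2f}")
```

The same pattern extends to any column of the table: swap in a different benchmark's scores to compare models on, say, MATH or GPQA Diamond.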