LLM Evaluation Benchmarks and Performance Comparison

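This page shows how several mainstream large language models perform on a range of evaluation benchmarks, including standard datasets such as MMLU, GSM8K, and HumanEval. The results are updated in real time to help developers and researchers understand how different models perform across tasks. Users can select their own set of models and benchmarks to compare and quickly see each model's relative strengths and weaknesses in practical use.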

For a detailed introduction to each benchmark, see: LLM Evaluation Benchmarks: List and Introduction

Custom Benchmark Selection

Benchmark categories, as labeled on the page: MMLU Pro and MMLU, knowledge Q&A; GSM8K and MATH, mathematical reasoning; GPQA Diamond, commonsense reasoning; BBH, comprehensive evaluation; HumanEval and MBPP, code generation. A value of 0.00 appears to mean that no score is recorded for that model and benchmark, not that the model scored zero.

Model                           MMLU Pro  MMLU   GSM8K  MATH   GPQA Diamond  BBH    HumanEval  MBPP
OpenAI o1                       91.04     91.80   0.00  96.40  77.30          0.00   0.00       0.00
DeepSeek-R1                     84.00     90.80   0.00   0.00  71.50          0.00   0.00       0.00
OpenAI o1-mini                  80.30     85.20   0.00   0.00  60.00          0.00  92.40       0.00
Gemini 2.0 Pro Experimental     79.10     86.50   0.00  91.80  64.70          0.00   0.00       0.00
Claude 3.5 Sonnet New           78.00     88.30   0.00  78.30  65.00          0.00  93.70       0.00
GPT-4o                          77.90      0.00   0.00   0.00   0.00          0.00   0.00       0.00
GPT-4o(2024-11-20)              77.90     85.70   0.00  68.50   0.00          0.00  90.20       0.00
Claude 3.5 Sonnet               77.64     88.30   0.00  71.10  59.40          0.00  92.00       0.00
Gemini 2.0 Flash Experimental   76.24      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Gemini 1.5 Pro                  76.10     87.10   0.00  82.90  53.50          0.00  89.00      87.80
DeepSeek-V3                     75.90     88.50   0.00   0.00  59.10          0.00   0.00       0.00
Grok 2                          75.50     87.50   0.00  76.10  56.00          0.00  88.40       0.00
Llama3.1-405B Instruct          73.40     88.60   0.00  73.90  49.00          0.00  89.00      88.60
QwQ-32B-Preview                 70.97      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Phi 4 - 14B                     70.40      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Qwen2.5-32B                     69.23      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Qwen2.5-Max                     69.00     87.90  94.50  68.50   0.00          0.00  73.20      80.60
Llama3.3-70B-Instruct           68.90     86.00   0.00  77.00  50.50          0.00  88.40      87.60
Claude3-Opus                    68.45     86.80  95.00  60.10  50.40          0.00  84.90       0.00
Llama3.1-70B-Instruct           66.40     86.00   0.00  67.80  48.00          0.00  80.50      86.00
Qwen2.5-14B                     63.69      0.00   0.00   0.00   0.00          0.00   0.00       0.00
GPT-4o mini                     63.09     82.00   0.00  70.20   0.00          0.00  87.20       0.00
Claude 3.5 Haiku                62.12      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Llama3.1-405B                   61.60      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Qwen2.5-72B                     58.10     86.10  91.50  62.10  45.90         86.30  59.10      84.70
Claude3-Sonnet                  56.80      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Gemma2-27B                      56.54      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Mixtral-8x22B-Instruct-v0.1     56.33      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Llama3-70B-Instruct             56.20      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Llama3-70B                      52.78      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Llama3.1-70B                    52.47      0.00   0.00   0.00   0.00          0.00   0.00       0.00
Grok-1.5                        51.00     81.30   0.00  50.60  35.90          0.00  74.10       0.00
OpenAI o3-mini (high)            0.00     86.90   0.00  97.90   0.00          0.00  97.60       0.00
Amazon Nova Pro                  0.00     85.90   0.00  76.60   0.00          0.00  89.00       0.00
Kimi k1.5 (Short-CoT)            0.00     87.40   0.00   0.00   0.00          0.00   0.00       0.00
DeepSeek-R1-Distill-Llama-70B    0.00      0.00   0.00   0.00  65.20          0.00   0.00       0.00
Grok 3                           0.00      0.00   0.00   0.00  75.00          0.00   0.00       0.00
Grok 3 mini                      0.00      0.00   0.00   0.00  65.00          0.00   0.00       0.00
Grok-3 mini - Reasoning          0.00      0.00   0.00   0.00  84.00          0.00   0.00       0.00
Grok-3 - Reasoning Beta          0.00      0.00   0.00   0.00  85.00          0.00   0.00       0.00
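
To compare a custom set of models on one benchmark offline, the rows above can be parsed with a few lines of Python. The following is a minimal sketch, not the page's own implementation: the helper names `parse_row` and `rank` are hypothetical, the column order is taken from the table header, and 0.00 is assumed to mean "no score reported", so those entries are skipped when ranking.

```python
from typing import Optional

# Column order as it appears in the table header.
BENCHMARKS = ["MMLU Pro", "MMLU", "GSM8K", "MATH",
              "GPQA Diamond", "BBH", "HumanEval", "MBPP"]


def parse_row(line: str) -> tuple[str, dict[str, Optional[float]]]:
    """Split one table row into (model name, {benchmark: score or None})."""
    # The last eight whitespace-separated fields are scores;
    # everything before them is the model name (which may contain spaces).
    parts = line.rsplit(maxsplit=len(BENCHMARKS))
    name, raw_scores = parts[0].strip(), parts[1:]
    scores: dict[str, Optional[float]] = {}
    for bench, raw in zip(BENCHMARKS, raw_scores):
        value = float(raw)
        # Assumption: 0.00 marks a missing result, not a true zero.
        scores[bench] = None if value == 0.0 else value
    return name, scores


def rank(rows: list[str], benchmark: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs with a reported score, best first."""
    reported = []
    for row in rows:
        name, scores = parse_row(row)
        score = scores[benchmark]
        if score is not None:
            reported.append((name, score))
    return sorted(reported, key=lambda item: item[1], reverse=True)


# A few rows copied from the table above.
ROWS = [
    "OpenAI o1 91.04 91.80 0.00 96.40 77.30 0.00 0.00 0.00",
    "Claude 3.5 Sonnet New 78.00 88.30 0.00 78.30 65.00 0.00 93.70 0.00",
    "OpenAI o3-mini (high) 0.00 86.90 0.00 97.90 0.00 0.00 97.60 0.00",
    "Qwen2.5-72B 58.10 86.10 91.50 62.10 45.90 86.30 59.10 84.70",
]

for model, score in rank(ROWS, "HumanEval"):
    print(f"{model}: {score:.2f}")
```

Skipping unreported entries keeps a model with no published result from appearing at the bottom of a ranking as if it had scored zero.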