LLM Benchmarks and Performance Comparison

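This page shows how mainstream large language models perform on standard benchmarks, including MMLU, GSM8K, and HumanEval. The results are updated in real time to help developers and researchers see how different models perform across tasks. Users can select their own set of models and benchmarks to compare and quickly see each model's strengths and weaknesses in practical applications.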

For a detailed description of each benchmark, see: LLM Benchmark List and Introductions

Custom Benchmark Selection

| Model | MMLU Pro (knowledge QA) | MMLU (knowledge QA) | GSM8K (math reasoning) | MATH (math reasoning) | GPQA Diamond (commonsense reasoning) | HumanEval (code generation) | MATH-500 (math reasoning) | LiveCodeBench (code generation) | Parameters (×100M) | Open source | Publisher |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI o1 | 91.04 | 91.80 | 0.00 | 96.40 | 77.30 | 0.00 | 96.40 | 71.00 | Unknown | | OpenAI |
| Hunyuan-T1 | 87.20 | 0.00 | 0.00 | 0.00 | 69.30 | 0.00 | 96.20 | 64.90 | Unknown | | Tencent AI Lab |
| GPT-4.5 | 86.10 | 0.00 | 0.00 | 0.00 | 71.40 | 0.00 | 90.70 | 46.40 | Unknown | | OpenAI |
| DeepSeek-R1 | 84.00 | 90.80 | 0.00 | 0.00 | 71.50 | 0.00 | 97.30 | 65.90 | 6710.0 | | DeepSeek-AI |
| Llama 4 Behemoth Instruct | 82.20 | 0.00 | 0.00 | 0.00 | 73.70 | 0.00 | 95.00 | 49.40 | 20000.0 | | Facebook AI Research |
| DeepSeek-V3-0324 | 81.20 | 0.00 | 0.00 | 0.00 | 68.40 | 0.00 | 94.00 | 49.20 | 6810.0 | | DeepSeek-AI |
| Llama 4 Maverick Instruct | 80.50 | 0.00 | 0.00 | 0.00 | 69.80 | 0.00 | 0.00 | 43.40 | 4000.0 | | |
| OpenAI o1-mini | 80.30 | 85.20 | 0.00 | 0.00 | 60.00 | 92.40 | 90.00 | 52.00 | Unknown | | OpenAI |
| Gemini 2.0 Pro Experimental | 79.10 | 86.50 | 0.00 | 91.80 | 64.70 | 0.00 | 0.00 | 0.00 | Unknown | | DeepMind |
| Hunyuan-TurboS | 79.00 | 89.50 | 0.00 | 89.70 | 57.50 | 91.00 | 0.00 | 32.00 | Unknown | | Tencent AI Lab |
| Claude 3.5 Sonnet New | 78.00 | 88.30 | 0.00 | 78.30 | 65.00 | 93.70 | 78.00 | 38.70 | Unknown | | Anthropic |
| GPT-4o | 77.90 | 88.70 | 0.00 | 75.90 | 53.60 | 90.00 | 75.90 | 35.10 | Unknown | | OpenAI |
| GPT-4o(2024-11-20) | 77.90 | 85.70 | 0.00 | 68.50 | 0.00 | 90.20 | 0.00 | 0.00 | Unknown | | OpenAI |
| Claude 3.5 Sonnet | 77.64 | 88.30 | 0.00 | 71.10 | 59.40 | 92.00 | 0.00 | 0.00 | Unknown | | Anthropic |
| Gemini 2.0 Flash Experimental | 76.24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Unknown | | DeepMind |
| Qwen2.5-Max | 76.10 | 87.90 | 94.50 | 68.50 | 0.00 | 73.20 | 0.00 | 0.00 | Unknown | | Alibaba |
| Gemini 1.5 Pro | 76.10 | 87.10 | 0.00 | 82.90 | 53.50 | 89.00 | 0.00 | 0.00 | Unknown | | Google DeepMind |
| QwQ-32B | 76.00 | 0.00 | 0.00 | 0.00 | 58.00 | 19.00 | 91.00 | 0.00 | 325.0 | | Alibaba |
| DeepSeek-V3 | 75.90 | 88.50 | 0.00 | 87.80 | 59.10 | 89.00 | 87.80 | 34.60 | 6810.0 | | DeepSeek-AI |
| Grok 2 | 75.50 | 87.50 | 0.00 | 76.10 | 56.00 | 88.40 | 0.00 | 0.00 | Unknown | | xAI |
| Llama 4 Scout Instruct | 74.30 | 0.00 | 0.00 | 0.00 | 57.20 | 0.00 | 0.00 | 32.80 | 1090.0 | | |
| Llama3.1-405B Instruct | 73.40 | 88.60 | 0.00 | 73.90 | 49.00 | 89.00 | 0.00 | 30.20 | 4050.0 | | Facebook AI Research |
| QwQ-32B-Preview | 70.97 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 90.60 | 0.00 | 320.0 | | Alibaba |
| Phi 4 - 14B | 70.40 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 140.0 | | Microsoft |
| Qwen2.5-32B | 69.23 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 320.0 | | Alibaba |
| Llama3.3-70B-Instruct | 68.90 | 86.00 | 0.00 | 77.00 | 50.50 | 88.40 | 0.00 | 33.30 | 700.0 | | Facebook AI Research |
| Claude3-Opus | 68.45 | 86.80 | 95.00 | 60.10 | 50.40 | 84.90 | 0.00 | 0.00 | Unknown | | Anthropic |
| Gemma 3 - 27B (IT) | 67.50 | 76.90 | 0.00 | 89.00 | 42.40 | 87.80 | 0.00 | 29.70 | 270.0 | | Google DeepMind |
| Mistral-Small-3.1-24B-Instruct-2503 | 66.76 | 80.62 | 0.00 | 69.30 | 45.96 | 88.41 | 0.00 | 0.00 | 240.0 | | MistralAI |
| Llama3.1-70B-Instruct | 66.40 | 86.00 | 0.00 | 67.80 | 48.00 | 80.50 | 0.00 | 33.30 | 700.0 | | Facebook AI Research |
| Claude 3.5 Haiku | 65.00 | 77.60 | 0.00 | 69.20 | 41.60 | 88.10 | 0.00 | 0.00 | Unknown | | Anthropic |
| Qwen2.5-14B | 63.69 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 140.0 | | Alibaba |
| Llama 4 Maverick | 62.90 | 85.50 | 0.00 | 61.20 | 0.00 | 0.00 | 0.00 | 0.00 | 4000.0 | | |
| GPT-4o mini | 61.70 | 82.00 | 91.30 | 70.20 | 41.10 | 87.20 | 0.00 | 0.00 | Unknown | | OpenAI |
| Llama3.1-405B | 61.60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4050.0 | | Facebook AI Research |
| Gemma 3 - 12B (IT) | 60.60 | 0.00 | 0.00 | 83.80 | 40.90 | 0.00 | 0.00 | 24.60 | 120.0 | | Google DeepMind |
| Llama 4 Scout | 58.20 | 79.60 | 0.00 | 50.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1090.0 | | Facebook AI Research |
| Qwen2.5-72B | 58.10 | 86.10 | 91.50 | 62.10 | 45.90 | 59.10 | 0.00 | 0.00 | 727.0 | | Alibaba |
| Claude3-Sonnet | 56.80 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Unknown | | Anthropic |
| Gemma2-27B | 56.54 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 270.0 | | Google DeepMind |
| Mixtral-8x22B-Instruct-v0.1 | 56.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1410.0 | | MistralAI |
| Llama3-70B-Instruct | 56.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 700.0 | | Facebook AI Research |
| Phi-4-mini-instruct (3.8B) | 52.80 | 67.30 | 88.60 | 64.00 | 36.00 | 74.40 | 71.80 | 0.00 | 38.0 | | Microsoft |
| Llama3-70B | 52.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 700.0 | | Facebook AI Research |
| Llama3.1-70B | 52.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 700.0 | | Facebook AI Research |
| Grok-1.5 | 51.00 | 81.30 | 0.00 | 50.60 | 35.90 | 74.10 | 0.00 | 0.00 | Unknown | | xAI |
| C4AI Aya Vision 32B | 47.16 | 72.14 | 0.00 | 69.30 | 33.84 | 62.20 | 0.00 | 0.00 | 320.0 | | CohereAI |
| Qwen2.5-7B | 45.00 | 74.20 | 85.40 | 49.80 | 36.40 | 57.90 | 0.00 | 0.00 | 70.0 | | Alibaba |
| Gemma 2 - 9B | 44.70 | 71.30 | 70.70 | 37.70 | 32.80 | 37.80 | 0.00 | 0.00 | 90.0 | | Google Research |
| Llama3.1-8B-Instruct | 44.00 | 68.10 | 82.40 | 47.60 | 26.30 | 66.50 | 0.00 | 0.00 | 80.0 | | Facebook AI Research |
| Moonlight-16B-A3B-Instruct | 42.40 | 70.00 | 77.40 | 45.30 | 0.00 | 48.10 | 0.00 | 0.00 | 160.0 | | Moonshot AI |
| Llama3.1-8B | 35.40 | 66.60 | 55.30 | 20.50 | 25.80 | 33.50 | 0.00 | 0.00 | 80.0 | | Facebook AI Research |
| Qwen2.5-3B | 34.60 | 65.60 | 79.10 | 42.60 | 24.30 | 42.10 | 0.00 | 0.00 | 30.0 | | Alibaba |
| Mistral-7B-Instruct-v0.3 | 30.90 | 64.20 | 36.20 | 10.20 | 24.70 | 29.30 | 0.00 | 0.00 | 70.0 | | MistralAI |
| Llama-3.2-3B | 25.00 | 54.75 | 34.00 | 8.50 | 26.60 | 28.00 | 0.00 | 0.00 | 32.0 | | Facebook AI Research |
| QwQ-Max-Preview | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.60 | Unknown | | Alibaba |
| DeepSeek-R1-Distill-Qwen-7B | 0.00 | 0.00 | 0.00 | 0.00 | 49.50 | 0.00 | 91.40 | 0.00 | 70.0 | | DeepSeek-AI |
| Amazon Nova Pro | 0.00 | 85.90 | 0.00 | 76.60 | 0.00 | 89.00 | 0.00 | 0.00 | Unknown | | Amazon |
| Kimi k1.5 (Short-CoT) | 0.00 | 87.40 | 0.00 | 0.00 | 0.00 | 0.00 | 94.60 | 0.00 | Unknown | | Moonshot AI |
| Kimi k1.5 (Long-CoT) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 96.20 | 0.00 | Unknown | | Moonshot AI |
| DeepSeek-R1-Distill-Llama-70B | 0.00 | 0.00 | 0.00 | 0.00 | 65.20 | 0.00 | 94.50 | 0.00 | 700.0 | | DeepSeek-AI |
| Kimi-k1.6-IOI-high | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 73.80 | Unknown | | Moonshot AI |
| Grok 3 mini | 0.00 | 0.00 | 0.00 | 0.00 | 65.00 | 0.00 | 0.00 | 0.00 | Unknown | | xAI |
| Grok 3 | 0.00 | 0.00 | 0.00 | 0.00 | 80.20 | 0.00 | 0.00 | 70.60 | Unknown | | xAI |
| Grok-3 - Reasoning Beta | 0.00 | 0.00 | 0.00 | 0.00 | 84.60 | 0.00 | 0.00 | 79.40 | Unknown | | xAI |
| Gemini 2.5 Pro Experimental 03-25 | 0.00 | 0.00 | 0.00 | 0.00 | 84.00 | 0.00 | 0.00 | 70.40 | Unknown | | Google DeepMind |
| Grok-3 mini - Reasoning | 0.00 | 0.00 | 0.00 | 0.00 | 84.00 | 0.00 | 0.00 | 0.00 | Unknown | | xAI |
| OpenAI o3-mini (medium) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 67.40 | Unknown | | OpenAI |
| Kimi-k1.6-IOI | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.90 | Unknown | | Moonshot AI |
| OpenAI o3-mini (high) | 0.00 | 86.90 | 0.00 | 97.90 | 79.70 | 97.60 | 97.90 | 69.50 | Unknown | | OpenAI |
| Claude Sonnet 3.7-64K Extended Thinking | 0.00 | 0.00 | 0.00 | 0.00 | 84.80 | 0.00 | 96.20 | 0.00 | Unknown | | Anthropic |
| Claude Sonnet 3.7 | 0.00 | 0.00 | 0.00 | 0.00 | 68.00 | 0.00 | 82.20 | 0.00 | Unknown | | Anthropic |
| Phi-4-instruct (reasoning-trained) | 0.00 | 0.00 | 0.00 | 0.00 | 49.00 | 0.00 | 90.40 | 0.00 | 38.0 | | Microsoft |
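
The custom comparison described at the top of the page can be approximated offline. The sketch below is illustrative only and is not the site's implementation: it copies four rows from the table above, and it rests on two assumptions that the page does not state explicitly. First, a score of 0.00 is treated as "no result recorded" rather than a measured zero. Second, the parameter column is read as units of 100 million, which matches model names such as Llama3.1-405B (4050.0). The names `SCORES`, `compare`, and `params_in_billions` are hypothetical helpers introduced for this example.

```python
# Benchmark order matches the score columns of the table above.
BENCHMARKS = [
    "MMLU Pro", "MMLU", "GSM8K", "MATH",
    "GPQA Diamond", "HumanEval", "MATH-500", "LiveCodeBench",
]

# A small sample of rows copied from the table (scores only).
SCORES = {
    "OpenAI o1":   [91.04, 91.80, 0.00, 96.40, 77.30, 0.00, 96.40, 71.00],
    "DeepSeek-R1": [84.00, 90.80, 0.00, 0.00, 71.50, 0.00, 97.30, 65.90],
    "GPT-4o":      [77.90, 88.70, 0.00, 75.90, 53.60, 90.00, 75.90, 35.10],
    "Qwen2.5-72B": [58.10, 86.10, 91.50, 62.10, 45.90, 59.10, 0.00, 0.00],
}


def compare(models, benchmarks):
    """Rank the chosen models on each chosen benchmark, best score first.

    Assumption: 0.00 means "no result recorded", so such entries are skipped.
    """
    ranking = {}
    for bench in benchmarks:
        col = BENCHMARKS.index(bench)
        rows = [(m, SCORES[m][col]) for m in models if SCORES[m][col] > 0.0]
        ranking[bench] = sorted(rows, key=lambda row: row[1], reverse=True)
    return ranking


def params_in_billions(value_100m):
    """Convert the table's parameter figure (assumed unit: 100 million) to billions."""
    return value_100m / 10.0


if __name__ == "__main__":
    chosen = ["OpenAI o1", "DeepSeek-R1", "Qwen2.5-72B"]
    for bench, rows in compare(chosen, ["MATH-500", "GSM8K"]).items():
        print(bench, rows)
    print(params_in_billions(6710.0), "billion parameters (DeepSeek-R1)")
```

Skipping 0.00 entries keeps a model with no reported result from being ranked last on a benchmark it may never have been evaluated on; if a benchmark can legitimately score zero, that filter would need to change.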