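MMLU - A benchmark of large language models' language understanding and one of the best-known evaluations of its kind. Its tasks cover a wide range of knowledge, all in English, and it is used to assess a model's basic knowledge coverage and comprehension.
C-Eval - A comprehensive Chinese evaluation suite for foundation models. It contains 13,948 multiple-choice questions spanning 52 disciplines and four difficulty levels, and is used to assess a model's Chinese-language understanding.
AGIEval - A benchmark released by Microsoft for the foundational abilities of large models. It assesses general human-centric cognition and problem solving, drawing on 20 official, public, high-standard admission and qualification exams from around the world that are intended for general human test-takers. It includes both Chinese and English data.
GSM8K - A benchmark released by OpenAI for the mathematical reasoning of large models, consisting of about 8,500 high-quality grade-school-level math word problems. Compared with earlier math word problem datasets, it is larger, more linguistically diverse, and more challenging.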
- Free for commercial use
- Paid commercial license
- Open source, non-commercial use only
- Not open source
Model | Parameters (×100M) | MMLU | C-Eval | AGIEval | GSM8K | MATH | BBH | MT-Bench |
---|---|---|---|---|---|---|---|---|
DeepSeek-R1 | 6710.0 | 90.8 | 91.8 | / | / | 97.3 | / | / |
OpenAI o1 | | 90.8 | / | / | / | 94.8 | / | / |
Hunyuan-TurboS | | 89.5 | / | / | / | 89.7 | 92.2 | / |
GPT-4o | | 88.7 | / | / | 90.5 | 76.6 | / | / |
Claude 3.5 Sonnet | | 88.7 | / | / | 96.4 | 71.1 | / | / |
DeepSeek-V3 | 6810.0 | 88.5 | 86.5 | / | / | 90.2 | / | / |
Qwen2.5-Max | | 87.9 | / | / | 94.5 | / | / | / |
Grok 2 | | 87.5 | / | / | / | 76.1 | / | / |
Kimi k1.5 (Short-CoT) | | 87.4 | / | / | / | 94.6 | / | / |
Llama3.1-405B Instruct | 4050.0 | 87.3 | / | / | 96.8 | 73.8 | / | / |
DeepSeek-V3-Base | 6810.0 | 87.1 | 90.1 | 79.6 | 89.3 | 61.6 | 87.5 | / |
OpenAI o3-mini (high) | | 86.9 | / | 97.9 | / | 97.9 | / | / |
GPT-4 | 1750.0 | 86.4 | 68.7 | / | 87.1 | 42.5 | / | 9.32 |
Llama3-400B-Instruct-InTraining | 4000.0 | 86.1 | / | / | 94.1 | 57.8 | / | / |
C4AI Command A (202503) | 1110.0 | 86.0 | / | / | / | / | / | / |
Amazon Nova Pro | | 85.9 | / | / | 94.8 | 76.6 | / | / |
OpenAI o3-mini (medium) | | 85.9 | / | / | / | 97.3 | / | / |
GPT-4o(2024-11-20) | | 85.7 | / | / | / | 68.5 | / | / |
Llama3.1-405B | 4050.0 | 85.2 | / | / | / | / | / | / |
OpenAI o1-mini | | 85.2 | / | / | / | 90.0 | / | / |
OpenAI o3-mini (low) | | 84.9 | / | / | / | 95.8 | / | / |
Llama3-400B-InTraining | 4000.0 | 84.8 | / | / | / | / | / | / |
Grok-1.5 | | 81.3 | / | / | 90.0 | 50.6 | / | / |
Amazon Nova Lite | | 80.5 | / | / | 94.5 | 73.3 | / | / |
Qwen1.5-110B | 1100.0 | 80.4 | / | / | 85.4 | 49.6 | 74.8 | 8.88 |
DeepSeek V2.5 | 2360.0 | 80.4 | / | / | 95.1 | 74.7 | / | / |
DeepSeek-V2-236B | 2360.0 | 78.5 | 81.7 | / | 79.2 | 43.6 | 78.9 | / |
PaLM 2 | 3400.0 | 78.3 | / | / | 80.7 | / | / | / |
Mixtral-8×22B-MoE | 1410.0 | 77.75 | / | / | 78.6 | 41.8 | / | / |
Amazon Nova Micro | | 77.6 | / | / | 92.3 | 69.3 | / | / |
DBRX Instruct | 1320.0 | 73.7 | / | / | 72.8 | / | / | 8.39 |
Grok-1 | 3140.0 | 73.0 | / | / | 62.9 | / | / | / |
DeepSeek-V2-236B-Chat | 2360.0 | 71.1 | 65.2 | / | 84.4 | 32.6 | 71.7 | / |
GPT-3.5 | 1750.0 | 70.0 | 54.4 | / | 57.1 | / | / | 8.39 |
PaLM | 5400.0 | 69.3 | / | / | 56.5 | / | / | / |
GPT-3 | 1750.0 | 53.9 | / | / | / | / | / | / |
GLM-130B | 1300.0 | 44.8 | 44.0 | / | / | / | / | / |
OPT | 1750.0 | 25.2 | 25.0 | 24.2 | / | / | / | / |
WizardLM-2 8x22B | 1760.0 | / | / | / | / | / | / | 9.12 |
DeepSeek-R1-Lite-Preview | | / | / | / | / | 91.6 | / | / |
Gemini 2.0 Flash Experimental | | / | / | / | / | 89.7 | / | / |
Gemini 2.0 Pro Experimental | | / | / | / | / | 91.8 | / | / |
Gemini 2.0 Flash-Lite | | / | / | / | / | 86.8 | / | / |
Kimi k1.5 (Long-CoT) | | / | / | / | / | 96.2 | / | / |
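Data note: All figures come from papers or from evaluation results published on GitHub. Official papers are the primary source, and some figures come from third-party evaluations.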
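
Because many cells in the table hold `/` rather than a number, any comparison across models has to skip those entries instead of treating them as zero. The following is a minimal sketch of that idea, assuming `/` means the score was not reported (the page does not say so explicitly). The names `ROWS`, `COLUMNS`, `parse_score`, and `rank_by` are hypothetical and used only for illustration; the four rows are copied from the table.

```python
from typing import Optional

# A few rows copied from the table: (model, MMLU, GSM8K, MATH).
# "/" is assumed to mean "score not reported".
ROWS = [
    ("DeepSeek-R1", "90.8", "/", "97.3"),
    ("GPT-4o", "88.7", "90.5", "76.6"),
    ("Claude 3.5 Sonnet", "88.7", "96.4", "71.1"),
    ("Kimi k1.5 (Long-CoT)", "/", "/", "96.2"),
]

# Position of each benchmark inside a row tuple (hypothetical layout).
COLUMNS = {"MMLU": 1, "GSM8K": 2, "MATH": 3}


def parse_score(cell: str) -> Optional[float]:
    """Return the score as a float, or None when the cell is "/" or empty."""
    cell = cell.strip()
    return None if cell in ("", "/") else float(cell)


def rank_by(benchmark: str) -> list[tuple[str, float]]:
    """Rank the models that report the benchmark, highest score first."""
    idx = COLUMNS[benchmark]
    scored = [(row[0], parse_score(row[idx])) for row in ROWS]
    # Drop models with no reported score rather than counting them as 0.
    reported = [(name, score) for name, score in scored if score is not None]
    return sorted(reported, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by("MATH"):
        print(f"{name}: {score}")
```

Ranking by `"GSM8K"` with these rows returns only Claude 3.5 Sonnet and GPT-4o, since the other two models have no GSM8K entry in the table. Scores collected from different papers and third-party evaluations may also use different prompting setups, so a ranking like this is only a rough guide.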