Model details and parameters
Phi-4 is Microsoft's latest generation of open-source large language model. Compared with the Phi-3 series, it delivers substantial gains, primarily in mathematics and reasoning.
The Phi-4 14B model posts strong benchmark results; a comparison with other models is shown below:
Benchmark | phi-4 14B | phi-3 14B | Qwen 2.5 14B instruct | GPT-4o-mini | Llama-3.3 70B instruct | Qwen 2.5 72B instruct | GPT-4o |
---|---|---|---|---|---|---|---|
MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3 | 80.0 | 74.6 |
HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 |
MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
SimpleQA | 3.0 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 | 39.4 |
DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |
MMLUPro | 70.4 | 51.3 | 63.2 | 63.4 | 64.4 | 69.6 | 73.0 |
HumanEval+ | 82.8 | 69.2 | 79.1 | 82.0 | 77.9 | 78.4 | 88.0 |
ArenaHard | 75.4 | 45.8 | 70.2 | 76.2 | 65.5 | 78.4 | 75.6 |
LiveBench | 47.6 | 28.1 | 46.6 | 48.1 | 57.6 | 55.3 | 57.6 |
IFEval | 63.0 | 57.9 | 78.7 | 80.0 | 89.3 | 85.0 | 84.8 |
PhiBench (internal) | 56.2 | 43.9 | 49.8 | 58.7 | 57.1 | 64.6 | 72.4 |
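As a quick way to try the model locally, the sketch below loads Phi-4 with Hugging Face transformers and asks a simple math question. The repository id `microsoft/phi-4`, the bf16 dtype, and `device_map="auto"` are assumptions for illustration; check the official model card for the exact id and recommended generation settings.

```python
# Minimal sketch: load Phi-4 and generate a reply to a math prompt.
# The model id "microsoft/phi-4" is an assumed Hugging Face repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"  # assumption; verify against the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 14B weights in bf16 need roughly 28 GB of memory
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"},
]
# Build the chat-formatted prompt and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Strip the prompt tokens and print only the newly generated answer.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For GPUs with less memory, the same sketch can be adapted to load a quantized variant (for example via `load_in_4bit=True` with bitsandbytes installed) at some cost in accuracy.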