Evals

MMLU Benchmark

MMLU (Measuring Massive Multitask Language Understanding) is a multiple-choice benchmark spanning 57 subjects, ranging from STEM and international law to nutrition and religion. It contains 15,908 questions with four answer options each, of which 1,540 are reserved for hyperparameter tuning. Per its overview, "It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024."
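To make the four-option format concrete, the following is a minimal sketch of how one question might be rendered as a prompt and how predictions might be scored. The prompt layout, the function names `format_question` and `accuracy`, and the sample question are illustrative assumptions, not the benchmark's official harness.

```python
from typing import Sequence

LETTERS = "ABCD"  # each MMLU question has four answer options


def format_question(question: str, choices: Sequence[str]) -> str:
    """Render one question as a lettered prompt (the layout is an assumption)."""
    if len(choices) != len(LETTERS):
        raise ValueError("expected exactly four answer options")
    lines = [question.strip()]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of predicted letters that match the gold letters."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Made-up question for illustration; it is not taken from the dataset.
    print(format_question("What is 2 + 2?", ["3", "4", "5", "6"]))
    print(accuracy(["B", "c", "A"], ["B", "C", "D"]))
```

Scoring by exact letter match keeps the metric simple. A real evaluation would also need a rule for mapping free-form model output to a single letter, which this sketch leaves out.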

SDKs

GitHubRepository

https://github.com/hendrycks/test

Other Resources

Paper

https://arxiv.org/abs/2009.03300

Dataset

https://huggingface.co/datasets/cais/mmlu
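As a rough sketch of reading the dataset linked above, the snippet below uses the third-party Hugging Face `datasets` package. The configuration name `"all"`, the split name `"validation"`, and the row fields `question`, `choices`, and `answer` (an integer index) are assumptions about the dataset card and should be checked against it; `row_to_item` and `load_validation_items` are hypothetical helper names.

```python
from typing import Any, Dict, List, Tuple

LETTERS = "ABCD"


def row_to_item(row: Dict[str, Any]) -> Tuple[str, List[str], str]:
    """Convert one row to (question, choices, gold letter).

    Assumes the row has `question`, `choices`, and an integer `answer` index.
    """
    answer_index = int(row["answer"])
    if not 0 <= answer_index < len(LETTERS):
        raise ValueError(f"answer index out of range: {answer_index}")
    return row["question"], list(row["choices"]), LETTERS[answer_index]


def load_validation_items(subject: str = "all") -> List[Tuple[str, List[str], str]]:
    """Download the validation split for one configuration (needs network access)."""
    # Imported lazily so row_to_item can be used without the package installed.
    from datasets import load_dataset

    split = load_dataset("cais/mmlu", subject, split="validation")
    return [row_to_item(row) for row in split]
```

Calling `load_validation_items()` downloads data on first use, so the pure `row_to_item` helper is kept separate for offline testing.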

API entry from apis.yml

name: MMLU Benchmark
description: MMLU (Measuring Massive Multitask Language Understanding) is a multiple-choice benchmark
  spanning 57 subjects from STEM and international law to nutrition and religion. It contains 15,908 multiple-choice
  questions (four options each), of which 1,540 are reserved for hyperparameter tuning. Per its overview,
  "It was one of the most commonly used benchmarks for comparing the capabilities of large language models,
  with over 100 million downloads as of July 2024."
humanURL: https://github.com/hendrycks/test
baseURL: https://github.com/hendrycks/test
tags:
- Benchmark
- Knowledge
- Multiple Choice
- Multitask
- Reference-Based
properties:
- type: GitHubRepository
  url: https://github.com/hendrycks/test
- type: Paper
  url: https://arxiv.org/abs/2009.03300
- type: Dataset
  url: https://huggingface.co/datasets/cais/mmlu
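
A consumer of this entry will typically want to look up a resource URL by its `type`. The sketch below assumes the entry has already been parsed into a Python dict (for example with a YAML parser); the description field is omitted for brevity, and `property_url` is a hypothetical helper name rather than part of any apis.yml tooling.

```python
from typing import Any, Dict, Optional

# The entry above written out as the dict a YAML parser would be expected
# to return (description omitted for brevity).
ENTRY: Dict[str, Any] = {
    "name": "MMLU Benchmark",
    "humanURL": "https://github.com/hendrycks/test",
    "baseURL": "https://github.com/hendrycks/test",
    "tags": ["Benchmark", "Knowledge", "Multiple Choice", "Multitask", "Reference-Based"],
    "properties": [
        {"type": "GitHubRepository", "url": "https://github.com/hendrycks/test"},
        {"type": "Paper", "url": "https://arxiv.org/abs/2009.03300"},
        {"type": "Dataset", "url": "https://huggingface.co/datasets/cais/mmlu"},
    ],
}


def property_url(entry: Dict[str, Any], prop_type: str) -> Optional[str]:
    """Return the URL of the first property with the given type, or None."""
    for prop in entry.get("properties", []):
        if prop.get("type") == prop_type:
            return prop.get("url")
    return None


if __name__ == "__main__":
    print(property_url(ENTRY, "Paper"))
    print(property_url(ENTRY, "Dataset"))
```

Returning `None` for an unknown type lets callers decide whether a missing resource is an error.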