Evals

HumanEval Benchmark

HumanEval is OpenAI's evaluation harness for code-generation models, described in its README as "an evaluation harness for the HumanEval problem solving dataset described in the paper 'Evaluating Large Language Models Trained on Code'." Functional correctness is measured by executing model-generated code against unit tests, reported as pass@1, pass@10, and pass@100 by default.
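The pass@k figures mentioned above are usually computed with the unbiased estimator from the cited paper. For a problem with n generated samples, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), and the result is averaged over all problems. The sketch below is a minimal, standard-library illustration of that formula. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are illustrative assumptions and are not claimed to match the harness's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the probability that a random size-k subset
    # of the n samples contains no passing sample. math.comb returns 0 when
    # k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems: (samples generated, samples passing).
example = [(10, 0), (10, 5), (10, 10)]
print(mean_pass_at_k(example, k=1))
```

Computing the estimate from n > k samples per problem, rather than drawing exactly k, reduces the variance of the reported score. For very large n, a product-form evaluation may be preferable to exact binomial coefficients.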


SDKs

GitHubRepository: https://github.com/openai/human-eval

Other Resources

Paper: https://arxiv.org/abs/2107.03374

Dataset: https://huggingface.co/datasets/openai/openai_humaneval

API entry from apis.yml

name: HumanEval Benchmark
description: HumanEval is OpenAI's evaluation harness for code-generation models, described in its README
  as "an evaluation harness for the HumanEval problem solving dataset described in the paper 'Evaluating
  Large Language Models Trained on Code'." Functional correctness is measured by executing model-generated
  code against unit tests, reported as pass@1, pass@10, and pass@100 by default.
humanURL: https://github.com/openai/human-eval
baseURL: https://github.com/openai/human-eval
tags:
- Benchmark
- Code Generation
- Functional Correctness
- Pass@k
- Reference-Based
properties:
- type: GitHubRepository
  url: https://github.com/openai/human-eval
- type: Paper
  url: https://arxiv.org/abs/2107.03374
- type: Dataset
  url: https://huggingface.co/datasets/openai/openai_humaneval