HumanEval Benchmark

HumanEval is OpenAI's evaluation harness for code-generation models, described in its README as "an evaluation harness for the HumanEval problem solving dataset described in the paper 'Evaluating Large Language Models Trained on Code'." Functional correctness is measured by executing model-generated code against unit tests, reported as pass@1, pass@10, and pass@100 by default.
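
As a rough illustration of what a pass@k score means, the sketch below computes the unbiased estimator given in the paper for a single problem: 1 - C(n-c, k) / C(n, k), where n samples are generated and c of them pass the unit tests. The function name `pass_at_k` and its argument checks are assumptions made for this example; it is not presented as the harness's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: the k in pass@k (must not exceed n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 200 samples for one problem, 30 of them correct.
    for k in (1, 10, 100):
        print(f"pass@{k} = {pass_at_k(200, 30, k):.4f}")
```

A benchmark-level score would then be the mean of this per-problem value over all problems. For very large n, a product form of the same expression avoids computing large binomial coefficients.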

API entry from apis.yml

name: HumanEval Benchmark
description: HumanEval is OpenAI's evaluation harness for code-generation models, described in its README
  as "an evaluation harness for the HumanEval problem solving dataset described in the paper 'Evaluating
  Large Language Models Trained on Code'." Functional correctness is measured by executing model-generated
  code against unit tests, reported as pass@1, pass@10, and pass@100 by default.
humanURL: https://github.com/openai/human-eval
baseURL: https://github.com/openai/human-eval
tags:
- Benchmark
- Code Generation
- Functional Correctness
- Pass@k
- Reference-Based
properties:
- type: GitHubRepository
  url: https://github.com/openai/human-eval
- type: Paper
  url: https://arxiv.org/abs/2107.03374
- type: Dataset
  url: https://huggingface.co/datasets/openai/openai_humaneval