OpenAI Evals

OpenAI Evals is OpenAI's open-source framework for evaluating large language models and LLM-based systems. The README states "Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs." The repo bundles a registry of benchmark evals, model-graded evals that need no custom code, support for private evals built from your own data, optional logging of results to Snowflake, and support for evaluating prompt chains and tool-using agents. Written primarily in Python, the project has roughly 18.5k stars and 3k forks on GitHub.

API entry from apis.yml

name: OpenAI Evals
description: OpenAI Evals is OpenAI's open-source framework for evaluating large language models and
  LLM-based systems. The README states "Evals provide a framework for evaluating large language
  models (LLMs) or systems built using LLMs." The repo bundles a registry of benchmark evals, model-graded
  evals that need no custom code, support for private evals built from your own data, optional logging
  of results to Snowflake, and support for evaluating prompt chains and tool-using agents. Written
  primarily in Python, the project has roughly 18.5k stars and 3k forks on GitHub.
humanURL: https://github.com/openai/evals
baseURL: https://github.com/openai/evals
tags:
- OpenAI
- Open Source
- Model Graded
- Benchmark Registry
- Python
properties:
- type: GitHubRepository
  url: https://github.com/openai/evals
- type: Documentation
  url: https://github.com/openai/evals/tree/main/docs
- type: License
  url: https://github.com/openai/evals/blob/main/LICENSE.md
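
A consumer of this entry will typically want to look up one of the `properties` links by its `type`. The following is a minimal Python sketch of that lookup, not part of the catalog or of Evals itself. The `entry` dict is a hand-copied, in-memory stand-in for the YAML above, and `property_url` is a hypothetical helper name. A real consumer would presumably parse apis.yml with a YAML library rather than hard-code the data.

```python
from typing import Optional

# Assumption: a hand-copied stand-in for the apis.yml entry above.
# A real consumer would parse the file with a YAML library instead.
entry = {
    "name": "OpenAI Evals",
    "humanURL": "https://github.com/openai/evals",
    "baseURL": "https://github.com/openai/evals",
    "tags": ["OpenAI", "Open Source", "Model Graded", "Benchmark Registry", "Python"],
    "properties": [
        {"type": "GitHubRepository", "url": "https://github.com/openai/evals"},
        {"type": "Documentation", "url": "https://github.com/openai/evals/tree/main/docs"},
        {"type": "License", "url": "https://github.com/openai/evals/blob/main/LICENSE.md"},
    ],
}


def property_url(entry: dict, prop_type: str) -> Optional[str]:
    """Return the URL of the first property whose type matches, or None."""
    for prop in entry.get("properties", []):
        if prop.get("type") == prop_type:
            return prop.get("url")
    return None


if __name__ == "__main__":
    print(property_url(entry, "Documentation"))
```

Returning `None` for a missing type, rather than raising, keeps the helper usable for entries that omit optional links such as a license.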