Evals

Braintrust

Braintrust is a commercial evaluation platform that captures eval runs as immutable, comparable experiment snapshots. The product supports code-based scorers, built-in autoevals, and LLM-as-a-judge evaluators for both offline and production use. Datasets are collections of test cases (input, optional expected output, metadata) sourced from production logs, user feedback, or manual curation. Experiments slot into CI/CD pipelines to detect regressions "before they reach production."
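
The concepts in that description (dataset records, a code-based scorer, and a regression check between two experiment runs) can be sketched in plain Python. This is a minimal illustration, not Braintrust's SDK: `Case`, `exact_match`, `run_experiment`, and `has_regressed` are hypothetical names invented for this example, and the two "tasks" are hard-coded lookups standing in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Case:
    """One dataset record: input, optional expected output, free-form metadata."""
    input: str
    expected: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def exact_match(output: str, expected: Optional[str]) -> float:
    """Hypothetical code-based scorer: 1.0 on a case-insensitive match, else 0.0."""
    if expected is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    task: Callable[[str], str],
    dataset: Sequence[Case],
    scorer: Callable[[str, Optional[str]], float],
) -> float:
    """Run the task on every case and return the mean score for the run."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Flag a regression when the candidate falls below the baseline by more than tolerance."""
    return candidate < baseline - tolerance


# A tiny dataset; the metadata values are illustrative only.
dataset = [
    Case("2+2", "4", {"source": "manual"}),
    Case("capital of France", "Paris", {"source": "production_log"}),
]

# Stand-ins for two versions of the system under test.
baseline_answers = {"2+2": "4", "capital of France": "paris"}
candidate_answers = {"2+2": "4"}

baseline_score = run_experiment(lambda q: baseline_answers.get(q, ""), dataset, exact_match)
candidate_score = run_experiment(lambda q: candidate_answers.get(q, ""), dataset, exact_match)
regressed = has_regressed(baseline_score, candidate_score)

print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f} regressed={regressed}")
```

In a CI job, a check like `has_regressed` would typically decide the exit code, so that a drop in score fails the build. How Braintrust itself stores snapshots and compares experiments is described in its documentation, not by this sketch.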

Documentation

Documentation: https://www.braintrust.dev/docs

Other Resources

EvaluationGuide: https://www.braintrust.dev/docs/guides/evals
Pricing: https://www.braintrust.dev/pricing

API entry from apis.yml

name: Braintrust
description: Braintrust is a commercial evaluation platform that captures eval runs as immutable, comparable
  experiment snapshots. The product supports code-based scorers, built-in autoevals, and LLM-as-a-judge
  evaluators for both offline and production use. Datasets are collections of test cases (input, optional
  expected output, metadata) sourced from production logs, user feedback, or manual curation. Experiments
  slot into CI/CD pipelines to detect regressions "before they reach production."
humanURL: https://www.braintrust.dev/
baseURL: https://www.braintrust.dev
tags:
- Commercial
- LLM as a Judge
- CI/CD
- Experiments
- Regression Detection
properties:
- type: Documentation
  url: https://www.braintrust.dev/docs
- type: EvaluationGuide
  url: https://www.braintrust.dev/docs/guides/evals
- type: Pricing
  url: https://www.braintrust.dev/pricing