Evals

AgentBench

AgentBench is described by its authors as the first benchmark designed to evaluate large language models acting as agents (LLM-as-Agent) across a diverse spectrum of environments. It bundles 8 environments: 5 newly created (Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles) and 3 adapted from existing datasets (House-Holding from ALFWorld, Web Shopping from WebShop, Web Browsing from Mind2Web). Evaluating a single model takes roughly 4,000 interactions on the dev set and roughly 13,000 on the test set.
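The composition described above can be written down as a small data structure. This is only an illustrative sketch: the `Environment` class, the `ENVIRONMENTS` list, and `APPROX_INTERACTIONS` are hypothetical names made up for this example and are not part of the AgentBench codebase. The environment names and the approximate interaction counts are taken from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    """Hypothetical record for one AgentBench environment (not AgentBench's own API)."""
    name: str
    adapted_from: Optional[str] = None  # None means newly created for AgentBench


# Environment names exactly as listed in the description.
ENVIRONMENTS = [
    Environment("Operating System"),
    Environment("Database"),
    Environment("Knowledge Graph"),
    Environment("Digital Card Game"),
    Environment("Lateral Thinking Puzzles"),
    Environment("House-Holding", adapted_from="ALFWorld"),
    Environment("Web Shopping", adapted_from="WebShop"),
    Environment("Web Browsing", adapted_from="Mind2Web"),
]

# Approximate per-model interaction counts quoted in the description.
APPROX_INTERACTIONS = {"dev": 4_000, "test": 13_000}

new = [e.name for e in ENVIRONMENTS if e.adapted_from is None]
adapted = [e.name for e in ENVIRONMENTS if e.adapted_from is not None]

print(f"{len(ENVIRONMENTS)} environments: {len(new)} new, {len(adapted)} adapted")
print(f"~{sum(APPROX_INTERACTIONS.values()):,} interactions per model in total")
```

Running the dev and test sets together therefore comes to roughly 17,000 interactions per model, which is worth keeping in mind when budgeting API calls for an evaluation run.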

SDKs

GitHub Repository

https://github.com/THUDM/AgentBench

Other Resources

Paper

https://arxiv.org/abs/2308.03688

Leaderboard

https://llmbench.ai/agent

API entry from apis.yml

name: AgentBench
description: AgentBench is the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum
  of environments. It bundles 8 environments — 5 newly created (Operating System, Database, Knowledge
  Graph, Digital Card Game, Lateral Thinking Puzzles) and 3 adapted (House-Holding from ALFWorld, Web
  Shopping from WebShop, Web Browsing from Mind2Web). The benchmark requires roughly 4,000 dev-set and
  13,000 test-set interactions per model.
humanURL: https://github.com/THUDM/AgentBench
baseURL: https://github.com/THUDM/AgentBench
tags:
- Benchmark
- AI Agents
- Multi-Environment
- LLM-as-Agent
- Tsinghua
properties:
- type: GitHubRepository
  url: https://github.com/THUDM/AgentBench
- type: Paper
  url: https://arxiv.org/abs/2308.03688
- type: Leaderboard
  url: https://llmbench.ai/agent
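As a rough illustration of how a consumer might use this entry, the sketch below mirrors it as a plain Python dict and looks up a link by its property `type`. The `ENTRY` dict and the `property_url` helper are hypothetical names introduced for this example. The `description` field is omitted for brevity, and in practice the dict would more likely come from parsing the YAML file (for example with PyYAML's `yaml.safe_load`) than from a literal.

```python
from typing import Optional

# The apis.yml entry mirrored by hand as a dict (description omitted).
ENTRY = {
    "name": "AgentBench",
    "humanURL": "https://github.com/THUDM/AgentBench",
    "baseURL": "https://github.com/THUDM/AgentBench",
    "tags": ["Benchmark", "AI Agents", "Multi-Environment", "LLM-as-Agent", "Tsinghua"],
    "properties": [
        {"type": "GitHubRepository", "url": "https://github.com/THUDM/AgentBench"},
        {"type": "Paper", "url": "https://arxiv.org/abs/2308.03688"},
        {"type": "Leaderboard", "url": "https://llmbench.ai/agent"},
    ],
}


def property_url(entry: dict, prop_type: str) -> Optional[str]:
    """Return the URL of the first property whose type matches, or None."""
    for prop in entry.get("properties", []):
        if prop.get("type") == prop_type:
            return prop.get("url")
    return None


print(property_url(ENTRY, "Paper"))
print(property_url(ENTRY, "Leaderboard"))
```

Returning `None` for an unknown type, rather than raising, keeps the helper usable on entries that list only some of the property types.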