Evals

GAIA Benchmark

GAIA is "a benchmark for General AI Assistants," published in 2023 (arXiv 2311.12983). It tests general-purpose AI agent capability across reasoning, tool use, multi-modality, and web browsing, with a public leaderboard hosted on Hugging Face for community submissions. The benchmark has become a reference point for evaluating agentic systems that combine an LLM with tools and a browser.

Other Resources

Dataset: https://huggingface.co/datasets/gaia-benchmark/GAIA

Paper: https://arxiv.org/abs/2311.12983

Leaderboard: https://huggingface.co/spaces/gaia-benchmark/leaderboard

API entry from apis.yml

name: GAIA Benchmark
description: GAIA is "a benchmark for General AI Assistants," published in 2023 (arXiv 2311.12983). It
  tests general-purpose AI agent capability across reasoning, tool use, multi-modality, and web browsing,
  with a public leaderboard hosted on Hugging Face for community submissions. The benchmark has become
  a reference point for evaluating agentic systems that combine an LLM with tools and a browser.
humanURL: https://huggingface.co/gaia-benchmark
baseURL: https://huggingface.co/gaia-benchmark
tags:
- Benchmark
- AI Agents
- Reasoning
- Tool Use
- Leaderboard
properties:
- type: Dataset
  url: https://huggingface.co/datasets/gaia-benchmark/GAIA
- type: Paper
  url: https://arxiv.org/abs/2311.12983
- type: Leaderboard
  url: https://huggingface.co/spaces/gaia-benchmark/leaderboard
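
As an illustration of how an entry like this might be consumed, the following is a minimal sketch that looks up a resource URL by its `type` field. The dictionary mirrors the fields of the entry above (with the long `description` omitted). The names `GAIA_ENTRY` and `property_url` are hypothetical and are not part of any published tooling. In practice the entry would more likely be read from `apis.yml` with a YAML parser such as PyYAML's `yaml.safe_load`. It is hard-coded here only so the example has no third-party dependencies.

```python
from typing import Optional

# Hypothetical in-memory copy of the apis.yml entry (description omitted).
# Assumption: a YAML parser would produce this same nested dict/list shape.
GAIA_ENTRY = {
    "name": "GAIA Benchmark",
    "humanURL": "https://huggingface.co/gaia-benchmark",
    "baseURL": "https://huggingface.co/gaia-benchmark",
    "tags": ["Benchmark", "AI Agents", "Reasoning", "Tool Use", "Leaderboard"],
    "properties": [
        {"type": "Dataset",
         "url": "https://huggingface.co/datasets/gaia-benchmark/GAIA"},
        {"type": "Paper",
         "url": "https://arxiv.org/abs/2311.12983"},
        {"type": "Leaderboard",
         "url": "https://huggingface.co/spaces/gaia-benchmark/leaderboard"},
    ],
}


def property_url(entry: dict, prop_type: str) -> Optional[str]:
    """Return the URL of the first property whose type matches, or None.

    The comparison is case-insensitive, so "paper" and "Paper" are equivalent.
    """
    for prop in entry.get("properties", []):
        if prop.get("type", "").lower() == prop_type.lower():
            return prop.get("url")
    return None


if __name__ == "__main__":
    # Print each resource listed in the entry, one per line.
    for kind in ("Dataset", "Paper", "Leaderboard"):
        print(f"{kind}: {property_url(GAIA_ENTRY, kind)}")
```

Returning `None` for an unknown type, rather than raising, lets callers handle catalog entries that lack a given resource without extra error handling.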