
Our Methodology

We believe in full transparency. Here's exactly how we collect, verify, and update benchmark data — and what each benchmark actually measures.

✓ Editorial Independence

AI Benchmarks is an independent platform. We are not affiliated with, funded by, or in any commercial relationship with OpenAI, Anthropic, Google, Meta, xAI, or Mistral AI. No AI company can pay to improve their ranking.

Data Sources

All benchmark scores are sourced from peer-reviewed publications, official technical reports, and the LMSYS Chatbot Arena leaderboard. We do not run benchmarks ourselves — we aggregate and verify scores from primary sources.

Our primary sources include official model cards and technical reports from each AI provider, the LMSYS Chatbot Arena human preference leaderboard, arXiv preprints for independently replicated results, and community-verified benchmark repositories on GitHub.

When scores differ between sources, we use the most recently published, peer-reviewed figure. We note discrepancies in the model detail pages.
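
For illustration, here is a minimal sketch of that tie-breaking rule. The field names (value, publishedAt, peerReviewed) are hypothetical, not our actual schema:

```js
// Illustrative sketch of the rule above: prefer peer-reviewed figures,
// then take the most recently published one. Field names are assumptions.
function resolveScore(candidates) {
  if (candidates.length === 0) return null;
  const reviewed = candidates.filter((c) => c.peerReviewed);
  const pool = reviewed.length > 0 ? reviewed : candidates;
  // Most recently published figure wins.
  return pool.reduce((best, c) =>
    new Date(c.publishedAt) > new Date(best.publishedAt) ? c : best
  );
}

// Example: two conflicting figures for the same model and benchmark.
const pick = resolveScore([
  { value: 86.4, publishedAt: '2024-03-04', peerReviewed: false },
  { value: 85.9, publishedAt: '2024-06-20', peerReviewed: true },
]);
console.log(pick.value); // 85.9, the newer, peer-reviewed figure
```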

Benchmark Definitions & Weights

MMLU (Massive Multitask Language Understanding) - 20% weight

57 subjects across STEM, the social sciences, and the humanities. Tests broad world knowledge and problem-solving. Higher scores indicate more comprehensive knowledge.

HumanEval (Human-Evaluated Code Generation) - 20% weight

164 hand-crafted programming challenges. Models generate Python code that is evaluated by running unit tests. The gold standard for coding ability.

MATH (Mathematics Problem Solving) - 15% weight

12,500 competition mathematics problems across algebra, calculus, geometry, and more. Tests deep mathematical reasoning.

GSM8K (Grade School Math 8K) - 10% weight

8,500 grade school math word problems requiring multi-step reasoning. A good proxy for everyday numerical reasoning.

GPQA (Graduate-Level Google-Proof Q&A) - 15% weight

448 expert-crafted questions in biology, chemistry, and physics. Designed so that Google search cannot easily answer them.

BBH (BIG-Bench Hard) - 10% weight

23 challenging tasks from the BIG-Bench benchmark that resist simple few-shot prompting, requiring genuine reasoning.

Arena Elo (LMSYS Chatbot Arena Elo Rating) - 10% weight

Human preference rating from blind A/B comparisons at LMSYS Chatbot Arena. The most realistic measure of real-world usefulness.
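
To make the weighting concrete, below is a minimal sketch of how a weighted composite could be computed from these seven benchmarks. It assumes the six accuracy benchmarks are reported on a 0-100 scale and uses an illustrative min-max normalization for Arena Elo; the Elo range and the handling of missing scores are assumptions for the sketch, not our production formula.

```js
// Benchmark weights as defined above (they sum to 1.0).
const WEIGHTS = {
  mmlu: 0.20,
  humaneval: 0.20,
  math: 0.15,
  gsm8k: 0.10,
  gpqa: 0.15,
  bbh: 0.10,
  arenaElo: 0.10,
};

// Illustrative min-max normalization for Arena Elo, which is not a
// percentage. The 1000-1400 range is an assumption for this sketch.
function normalizeElo(elo, lo = 1000, hi = 1400) {
  return Math.min(100, Math.max(0, ((elo - lo) / (hi - lo)) * 100));
}

// Weighted composite over whichever scores are present; missing
// benchmarks are skipped and the remaining weights are rescaled.
function compositeScore(scores) {
  let total = 0;
  let weightUsed = 0;
  for (const [key, weight] of Object.entries(WEIGHTS)) {
    if (scores[key] == null) continue;
    const value = key === 'arenaElo' ? normalizeElo(scores[key]) : scores[key];
    total += weight * value;
    weightUsed += weight;
  }
  return weightUsed > 0 ? total / weightUsed : null;
}

// Example with made-up numbers:
console.log(compositeScore({
  mmlu: 86.4, humaneval: 90.2, math: 60.1,
  gsm8k: 92.0, gpqa: 50.3, bbh: 83.1, arenaElo: 1250,
}));
```

Rescaling the remaining weights when a score is missing keeps composites comparable across models that report different benchmark subsets.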

Update Process

Arena Elo scores are updated weekly via the LMSYS Chatbot Arena public leaderboard API. All other benchmark scores are reviewed monthly and updated when new official figures are published.

Our automated update script (scripts/fetch-arena.js) fetches the latest Elo data and opens a pull request on our GitHub repository. A human reviewer verifies the changes before merging.
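
As a rough illustration of that pipeline, here is a minimal sketch of what such a script might look like. The leaderboard URL and the response shape are placeholder assumptions, not the real LMSYS endpoint, and the pull-request step is left out:

```js
// Hypothetical sketch of a fetch-arena-style script. The URL and the
// response shape are placeholder assumptions, not the real LMSYS API.
const fs = require('node:fs/promises');

const LEADERBOARD_URL = 'https://example.com/arena/leaderboard.json'; // placeholder

async function fetchArenaElo() {
  const res = await fetch(LEADERBOARD_URL); // global fetch, Node 18+
  if (!res.ok) throw new Error(`Fetch failed: ${res.status}`);
  const rows = await res.json(); // assumed shape: [{ model, elo }, ...]
  const elo = Object.fromEntries(rows.map((r) => [r.model, r.elo]));
  // Write the snapshot; in the real pipeline a pull request is opened
  // from this change and reviewed by a human before merging.
  await fs.writeFile('data/arena-elo.json', JSON.stringify(elo, null, 2));
}

fetchArenaElo().catch((err) => {
  console.error(err);
  process.exit(1);
});
```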

When a provider releases a new model version, we add it to our database within 72 hours of the official announcement, using scores from the official technical report.

Known Limitations

Benchmarks are not perfect. MMLU and HumanEval scores may be inflated for models whose training data included the benchmark questions (data contamination). We flag known contamination issues when they are reported in peer-reviewed literature.

Performance on benchmarks does not always predict real-world usefulness. We include the LMSYS Arena Elo (human preference) rating specifically to counterbalance this limitation.

Pricing data reflects public list prices and may not reflect negotiated enterprise discounts. Always verify pricing directly with the provider before making purchasing decisions.

Found an error? Contact us with a source link and we'll review it within 48 hours.