andersch.dev

<2025-03-10 Mon>
[ ai ]

LLM Benchmarking

LLM benchmarking evaluates language models on standardized tasks to measure their capabilities in areas such as reasoning, language understanding, and text or code generation.

Prominent Examples

  • Chatbot Arena: crowdsourced evaluation where users vote between two anonymous model responses; votes are aggregated into a leaderboard (see the rating sketch below the list)
  • MMLU (Massive Multitask Language Understanding): multiple-choice questions across 57 subjects, testing broad knowledge
  • HumanEval: coding benchmark where generated solutions are checked against unit tests (see the pass@1 sketch below)
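
Chatbot Arena aggregates those pairwise votes into rankings (originally Elo-style ratings, later a Bradley-Terry model). A minimal sketch of an online Elo-style update in Python; the model names and votes are made up for illustration:

  # Elo-style rating update from pairwise "A beat B" votes.
  K = 32  # step size: how strongly a single vote moves the ratings

  def expected_score(r_a: float, r_b: float) -> float:
      """Probability that A beats B under the Elo model."""
      return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

  def update(ratings: dict, a: str, b: str, a_won: bool) -> None:
      """Shift both ratings toward the observed outcome."""
      e_a = expected_score(ratings[a], ratings[b])
      s_a = 1.0 if a_won else 0.0
      ratings[a] += K * (s_a - e_a)
      ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))

  ratings = {"model-a": 1000.0, "model-b": 1000.0}  # hypothetical models
  for a, b, a_won in [("model-a", "model-b", True),
                      ("model-a", "model-b", True),
                      ("model-a", "model-b", False)]:
      update(ratings, a, b, a_won)

  print(sorted(ratings.items(), key=lambda kv: -kv[1]))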

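HumanEval-style scoring runs each generated solution against the problem's unit tests and reports the fraction of problems solved (pass@1). A minimal sketch with a single made-up problem; the real harness covers 164 Python problems and runs completions in a sandbox:

  # Check model-generated code against test cases and compute pass@1.
  problems = [{
      "prompt": "def add(a, b):\n",
      "completion": "    return a + b\n",  # model-generated body
      "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
  }]

  def passes(problem: dict) -> bool:
      """Execute prompt + completion + tests; any exception counts as a fail."""
      env: dict = {}
      try:
          exec(problem["prompt"] + problem["completion"] + problem["test"], env)
          return True
      except Exception:
          return False

  pass_at_1 = sum(passes(p) for p in problems) / len(problems)
  print(f"pass@1 = {pass_at_1:.2f}")
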
Resources