LLM Benchmarking
LLM benchmarking evaluates large language models on standardized tasks to measure their capabilities in areas such as reasoning, language understanding, and text generation.
Prominent Examples
- Chatbot Arena: crowdsourced evaluation in which users compare anonymized LLM responses in head-to-head conversations and vote for the better one
- MMLU (Massive Multitask Language Understanding): multiple-choice questions spanning dozens of subjects, testing broad knowledge
- HumanEval: coding benchmark in which generated Python functions are checked against unit test cases (see the sketch below)
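
A HumanEval-style check boils down to executing a model's completion together with hidden unit tests and counting the fraction of problems that pass (pass@1). Below is a minimal sketch, assuming a single hypothetical problem and completion rather than the official harness:

```python
# Minimal HumanEval-style check (hypothetical problem/completion data,
# not the official harness): a completion passes only if its unit tests
# run without raising an exception.

# Hypothetical example problem: a prompt plus hidden test cases.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n",
}

# Stand-in for a model-generated completion of the prompt.
completion = "    return a + b\n"


def passes_tests(prompt: str, completion: str, test: str) -> bool:
    """Execute the candidate solution plus its tests; True if nothing fails."""
    program = prompt + completion + "\n" + test
    namespace: dict = {}
    try:
        # NOTE: the real harness sandboxes this step; exec'ing untrusted
        # model output directly is unsafe outside a sandbox.
        exec(program, namespace)
        return True
    except Exception:
        return False


# pass@1 over a problem set is simply the fraction of problems that pass.
results = [passes_tests(problem["prompt"], completion, problem["test"])]
print(f"pass@1 = {sum(results) / len(results):.2f}")
```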
Resources
- Aider LLM Leaderboards: coding benchmarks that rank LLMs on code editing tasks