All Evals Listings
- lm-evaluation-harness (EleutherAI): A framework for few-shot evaluation of language models.
- Hallucination Leaderboard (Vectara): Leaderboard comparing LLM performance at producing hallucinations when summarizing short documents.
- Adversarial Robustness Toolbox (ART): Python library for machine learning security, covering evasion, poisoning, extraction, and inference attacks for red and blue teams.
- SWE-bench (ICLR 2024): Can language models resolve real-world GitHub issues?
- HumanEval (OpenAI): Code for the paper "Evaluating Large Language Models Trained on Code" (see the usage sketch after this list).
- Evals (OpenAI): A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- EvalPlus: Rigorous evaluation of LLM-synthesized code (NeurIPS 2023 & COLM 2024).
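As a concrete illustration of how one of these benchmarks is typically driven, below is a minimal sketch of generating completions for HumanEval with the openai/human-eval package. It assumes the package is installed (for example via `pip install human-eval`); `generate_one_completion` is a hypothetical placeholder for your own model call, and scoring afterwards uses the package's `evaluate_functional_correctness` command.

```python
# Minimal sketch: produce a samples.jsonl file for HumanEval scoring.
# generate_one_completion is a placeholder for a real model call.
from human_eval.data import read_problems, write_jsonl


def generate_one_completion(prompt: str) -> str:
    # Replace with an actual model call; returning an empty string keeps
    # the sketch runnable but will score 0 on every problem.
    return ""


# task_id -> problem dict containing at least "prompt", "entry_point", "test"
problems = read_problems()

samples = [
    {"task_id": task_id, "completion": generate_one_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Functional-correctness scoring is then run with the package's CLI, e.g.:
#   evaluate_functional_correctness samples.jsonl
```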