All Evals Listings
evals (/explore/evals)
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
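As a quick orientation, the evals package ships an `oaieval` command-line runner; the sketch below shows one way to call it from Python. The model and eval names are illustrative placeholders, and the CLI surface may differ between versions.

```python
import subprocess

# Illustrative sketch: run a registered eval against a model with the
# `oaieval` CLI bundled with the evals package. "gpt-3.5-turbo" and
# "test-match" are placeholder model/eval names, not a recommendation.
subprocess.run(["oaieval", "gpt-3.5-turbo", "test-match"], check=True)
```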
lm-evaluation-harness (/explore/lm-evaluation-harness)
A framework for few-shot evaluation of language models.
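For context, the harness exposes both a CLI (`lm_eval`) and a Python entry point; the sketch below uses `lm_eval.simple_evaluate`, with the checkpoint and task chosen purely as placeholders.

```python
import lm_eval

# Minimal sketch of a zero-shot run through the harness's Python API.
# The pretrained checkpoint and task name are illustrative placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])
```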
PyRIT (/explore/pyrit)
The Python Risk Identification Tool for generative AI (PyRIT) is an open-source framework built to empower security professionals and engineers to proactively identify risks in generative AI systems.
hallucination-leaderboard (/explore/hallucination-leaderboard)
A leaderboard comparing LLM performance at producing hallucinations when summarizing short documents.
human-eval (/explore/human-eval)
Code for the paper "Evaluating Large Language Models Trained on Code".
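The snippet below follows the repository's documented workflow: read the problems, attach a completion per task, write a samples.jsonl file, and then score it with the bundled `evaluate_functional_correctness` command. The completion function here is a stub standing in for a real model call.

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()

def generate_one_completion(prompt: str) -> str:
    # Stand-in for a real model call; returns a trivially wrong body.
    return "    pass\n"

samples = [
    {"task_id": task_id, "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Score afterwards with: evaluate_functional_correctness samples.jsonl
```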
evalplus (/explore/evalplus)
Rigorous evaluation of LLM-synthesized code (NeurIPS 2023 & COLM 2024).
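To show the shape of a run: evalplus is typically driven from the command line after a samples file has been generated. The invocation below mirrors the project's `evalplus.evaluate` entry point; the flag names are assumed from its documentation and the samples path is a placeholder.

```python
import subprocess

# Sketch: score HumanEval+ samples with the evalplus CLI.
# Flag names are assumed from the project's documentation and may change;
# "samples.jsonl" is a placeholder path.
subprocess.run(
    ["evalplus.evaluate", "--dataset", "humaneval", "--samples", "samples.jsonl"],
    check=True,
)
```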
SWE-bench (/explore/swe-bench)
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
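To make the benchmark's input concrete, the sketch below writes a single prediction record in the field layout the SWE-bench harness expects (instance_id, model_name_or_path, model_patch). The values are placeholders, and the scoring step via `swebench.harness.run_evaluation` is noted only in a comment because its flags vary by release.

```python
import json

# Sketch of SWE-bench's prediction format: one JSON record per attempted issue.
# All values below are placeholders, not real model output.
prediction = {
    "instance_id": "astropy__astropy-12907",  # example task instance id
    "model_name_or_path": "my-model",
    "model_patch": "diff --git a/pkg/module.py b/pkg/module.py\n...",
}
with open("predictions.jsonl", "w") as fh:
    fh.write(json.dumps(prediction) + "\n")
# Scoring is then run through the harness, e.g.
# `python -m swebench.harness.run_evaluation`, whose exact flags depend on the release.
```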