All Evals Listings

evals (/explore/evals)
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

PyRIT (/explore/pyrit)
The Python Risk Identification Tool for generative AI (PyRIT) is an open-source framework built to empower security professionals and engineers to proactively identify risks in generative AI systems.

hallucination-leaderboard (/explore/hallucination-leaderboard)
Leaderboard comparing LLM performance at producing hallucinations when summarizing short documents.

human-eval (/explore/human-eval)
Code for the paper "Evaluating Large Language Models Trained on Code".
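The repository ships the HumanEval problems plus a functional-correctness scorer. A minimal sketch of the generate-then-evaluate flow, assuming the human-eval package is installed; generate_one_completion is a hypothetical stand-in for your own model call:

    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt: str) -> str:
        # Hypothetical stand-in: call your own model and return the code
        # that should follow the prompt (i.e. the function body).
        raise NotImplementedError

    # read_problems() returns a dict keyed by task_id; each problem has a "prompt" field.
    problems = read_problems()
    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)

    # pass@k is then computed with the bundled scorer:
    #   evaluate_functional_correctness samples.jsonl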

evalplus (/explore/evalplus)
Rigorous evaluation of LLM-synthesized code (NeurIPS 2023 & COLM 2024).
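EvalPlus extends HumanEval and MBPP with much larger test suites. A minimal sketch of producing samples it can score, assuming the evalplus package is installed; solve is a hypothetical stand-in for your own model call:

    from evalplus.data import get_human_eval_plus, write_jsonl

    def solve(prompt: str) -> str:
        # Hypothetical stand-in: return a complete, self-contained solution.
        raise NotImplementedError

    samples = [
        dict(task_id=task_id, solution=solve(problem["prompt"]))
        for task_id, problem in get_human_eval_plus().items()
    ]
    write_jsonl("samples.jsonl", samples)

    # Scoring against the extended HumanEval+ tests is then done with the CLI:
    #   evalplus.evaluate --dataset humaneval --samples samples.jsonl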

SWE-bench (/explore/swe-bench)
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
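SWE-bench task instances are distributed as a Hugging Face dataset. A minimal sketch of loading and inspecting them, assuming the datasets package is installed and that the princeton-nlp/SWE-bench dataset id and the field names shown are current:

    from datasets import load_dataset

    # Each instance pairs a real GitHub issue with the repository state it was
    # filed against; a system must produce a patch that resolves the issue.
    swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

    example = swe_bench[0]
    print(example["instance_id"])        # unique task identifier
    print(example["repo"])               # source repository, e.g. owner/name
    print(example["problem_statement"])  # the issue text the model must resolve

Generated patches are then scored with the repository's evaluation harness, which applies each patch and runs the project's tests.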

Showing 1-12 of 13 listings (page 1 of 2).