Filter

All Listings

AI Engineering 🤖

lm-evaluation-harness

A framework for few-shot evaluation of language models.

Open Source Framework Evals

AI Engineering 🤖

hallucination-leaderboard

Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents

Open Source Evals

AI Engineering 🤖

ragas

Supercharge Your LLM Application Evaluations 🚀

Open Source Evals

AI Engineering 🤖

adversarial-robustness-toolbox

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

Open Source Evals

Model Development 📊🧠

deepeval

The LLM Evaluation Framework

Open Source Evals

AI Engineering 🤖

SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world Github Issues?

Open Source Evals

AI Engineering 🤖

ARC

The Abstraction and Reasoning Corpus

Open Source Evals

AI Engineering 🤖

human-eval

Code for the paper "Evaluating Large Language Models Trained on Code"

Open Source Evals

AI Engineering 🤖

phoenix

AI Observability & Evaluation

Open Source Artificial Intelligence Evals

AI Engineering 🤖

evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Open Source Framework Evals

AI Engineering 🤖

PurpleLlama

Set of tools to assess and improve LLM security.

Open Source Evals

AI Engineering 🤖

evalplus

Rigourous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

Open Source Evals

Listings per page

Showing1 - 12of13listings

Page 1 of 2

Loading

...

Loading

...

Loading

...

Loading

...

Loading

...

Loading

...