An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: ai-evaluation

rungalileo/agent-leaderboard

Ranking LLMs on agentic tasks

Language: Jupyter Notebook - Size: 10.9 MB - Last synced at: about 5 hours ago - Pushed at: about 6 hours ago - Stars: 113 - Forks: 11

mhamzaerol/Cost-of-Pass

Cost-of-Pass: An Economic Framework for Evaluating Language Models

Language: Python - Size: 939 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 3 - Forks: 0

DanielButler1/AI-Stats

The Most Comprehensive Set of AI Model Benchmark Scores, Prices & Information

Language: TypeScript - Size: 1.69 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

lechmazur/confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

Language: HTML - Size: 23.4 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 122 - Forks: 3

METR/vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

Language: TypeScript - Size: 17.6 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 89 - Forks: 31

lechmazur/deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

Size: 36.1 KB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 26 - Forks: 2

kereva-dev/kereva-scanner

Code scanner to check for issues in prompts and LLM calls

Language: Python - Size: 7.12 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 29 - Forks: 2

aloth/JudgeGPT

JudgeGPT - (Fake) News Evaluation, a research project

Language: Python - Size: 627 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 1

taoAIGC/AI-Shortcuts

One click to open multiple AI sites and view their AI results

Size: 47.9 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 62 - Forks: 3

bigdata-ustc/CAT4AI

Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)

Language: Python - Size: 163 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

dpc10ster/RJafrocRocBook

ROC methodology explained with R-examples

Language: TeX - Size: 82.3 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 3 - Forks: 0

dpc10ster/RJafrocQuickStart

RJafroc quick start for those already familiar with Windows JAFROC

Language: TeX - Size: 76.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

dpc10ster/RJafrocFrocBook

FROC methodology explained with R-examples

Language: TeX - Size: 133 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 2

dpc10ster/WindowsJAFROC

Installation files for Windows JAFROC software

Size: 16.2 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 1

dpc10ster/datasets

ROC/FROC datasets from my collaborations

Size: 984 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0