GitHub topics: ai-evaluation
rungalileo/agent-leaderboard
Ranking LLMs on agentic tasks
Language: Jupyter Notebook - Size: 10.9 MB - Last synced at: about 5 hours ago - Pushed at: about 6 hours ago - Stars: 113 - Forks: 11

mhamzaerol/Cost-of-Pass
Cost-of-Pass: An Economic Framework for Evaluating Language Models
Language: Python - Size: 939 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 3 - Forks: 0

DanielButler1/AI-Stats
The Most Comprehensive Set of AI Model Benchmark Scores, Prices & Information
Language: TypeScript - Size: 1.69 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

lechmazur/confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Language: HTML - Size: 23.4 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 122 - Forks: 3

METR/vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Language: TypeScript - Size: 17.6 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 89 - Forks: 31

lechmazur/deception
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Size: 36.1 KB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 26 - Forks: 2

kereva-dev/kereva-scanner
Code scanner to check for issues in prompts and LLM calls
Language: Python - Size: 7.12 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 29 - Forks: 2

aloth/JudgeGPT
JudgeGPT - (Fake) News Evaluation, a research project
Language: Python - Size: 627 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 1

taoAIGC/AI-Shortcuts
One click to open multiple AI sites and view their results
Size: 47.9 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 62 - Forks: 3

bigdata-ustc/CAT4AI
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
Language: Python - Size: 163 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

dpc10ster/RJafrocRocBook
ROC methodology explained with R examples
Language: TeX - Size: 82.3 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 3 - Forks: 0

dpc10ster/RJafrocQuickStart
RJafroc quick start for those already familiar with Windows JAFROC
Language: TeX - Size: 76.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

dpc10ster/RJafrocFrocBook
FROC methodology explained with R examples
Language: TeX - Size: 133 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 2

dpc10ster/WindowsJAFROC
Installation files for Windows JAFROC software
Size: 16.2 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 1

dpc10ster/datasets
ROC/FROC datasets from my collaborations
Size: 984 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0
