GitHub topics: ai-evaluation

Repositories

rungalileo/agent-leaderboard

Ranking LLMs on agentic tasks

Language: Jupyter Notebook - Size: 10.9 MB - Last synced at: about 5 hours ago - Pushed at: about 6 hours ago - Stars: 113 - Forks: 11

mhamzaerol/Cost-of-Pass

Cost-of-Pass: An Economic Framework for Evaluating Language Models

Language: Python - Size: 939 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 3 - Forks: 0

DanielButler1/AI-Stats

The Most Comprehensive Set of AI Model Benchmark Scores, Prices & Information

Language: TypeScript - Size: 1.69 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

lechmazur/confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

Language: HTML - Size: 23.4 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 122 - Forks: 3

METR/vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

Language: TypeScript - Size: 17.6 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 89 - Forks: 31

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

Size: 36.1 KB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 26 - Forks: 2

kereva-dev/kereva-scanner

Code scanner to check for issues in prompts and LLM calls

Language: Python - Size: 7.12 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 29 - Forks: 2

aloth/JudgeGPT

JudgeGPT - (Fake) News Evaluation, a research project

Language: Python - Size: 627 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 1

taoAIGC/AI-Shortcuts

Open multiple AI sites with one click and view the AI results

Size: 47.9 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 62 - Forks: 3

bigdata-ustc/CAT4AI

Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)

Language: Python - Size: 163 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

dpc10ster/RJafrocRocBook

ROC methodology explained with R-examples

Language: TeX - Size: 82.3 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 3 - Forks: 0

dpc10ster/RJafrocQuickStart

RJafroc quick start for those already familiar with Windows JAFROC

Language: TeX - Size: 76.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

dpc10ster/RJafrocFrocBook

FROC methodology explained with R-examples

Language: TeX - Size: 133 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 2

dpc10ster/WindowsJAFROC

Installation files for Windows JAFROC software

Size: 16.2 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 1

dpc10ster/datasets

ROC/FROC datasets from my collaborations

Size: 984 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Related Keywords

ai-evaluation (15), ai (5), llm (5), gemini (3), claude (3), language-model (3), ai-security (2), machine-learning (2), nlp (2), llm-benchmarking (2), llama (2), book (2), evaluation (2), benchmark (2), r (2), ai-benchmarks (2), crowdsource-experiments (1), ai-ethics (1), security (1), explainable-ai (1), fake-news (1), fake-news-analysis (1), red-teaming (1), fake-news-challenge (1), generative-ai (1), prompt-injection (1), human-ai-interaction (1), misinformation (1), ai-agents (1), mongodb (1), research-project (1), streamlit (1), streamlit-webapp (1), survey (1), survey-app (1), chatgpt (1), perplexity (1), poe (1), adaptive-testing (1), psychometrics (1), roc (1), rjafroc (1), pdf (1), jafroc (1), windows (1), datasets (1), llms (1), cost-efficiency (1), cost-performance (1), economics (1), evaluation-framework (1), confabulations (1), deepseek-r1 (1), gemini-pro (1), hallucinations (1), leaderboard (1), o1 (1), o3-mini (1), rag (1), elicitation (1), evals (1), ai-safety (1), disinformation (1), gpt4o (1), mistral (1), model-evaluation (1), ai-code-review (1), ai-performance (1), ai-red-teaming (1), cli (1), code-scanning (1), hallucination (1), linter (1), llm-evaluation (1), llm-performance (1), llm-security (1), owasp-llm-top-10 (1)
