An open API service providing repository metadata for many open source software ecosystems.

Topic: "ai-evaluation"

lechmazur/confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

Language: HTML - Size: 23.4 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 122 - Forks: 3

rungalileo/agent-leaderboard

Ranking LLMs on agentic tasks

Language: Jupyter Notebook - Size: 10.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 113 - Forks: 11

METR/vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

Language: TypeScript - Size: 17.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 89 - Forks: 31

taoAIGC/AI-Shortcuts

One click to open multiple AI sites and view the AI results

Size: 47.9 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 62 - Forks: 3

kereva-dev/kereva-scanner

Code scanner to check for issues in prompts and LLM calls

Language: Python - Size: 7.12 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 29 - Forks: 2

lechmazur/deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

Size: 36.1 KB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 26 - Forks: 2

aloth/JudgeGPT

JudgeGPT - (Fake) News Evaluation, a research project

Language: Python - Size: 627 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 4 - Forks: 1

mhamzaerol/Cost-of-Pass

Cost-of-Pass: An Economic Framework for Evaluating Language Models

Language: Python - Size: 939 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 0

dpc10ster/RJafrocRocBook

ROC methodology explained with R examples

Language: TeX - Size: 82.3 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

dpc10ster/RJafrocFrocBook

FROC methodology explained with R examples

Language: TeX - Size: 133 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 2

DanielButler1/AI-Stats

The Most Comprehensive Set of AI Model Benchmark Scores, Prices & Information

Language: TypeScript - Size: 1.69 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

bigdata-ustc/CAT4AI

Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)

Language: Python - Size: 163 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

dpc10ster/RJafrocQuickStart

RJafroc quick start for those already familiar with Windows JAFROC

Language: TeX - Size: 76.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

dpc10ster/WindowsJAFROC

Installation files for Windows JAFROC software

Size: 16.2 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 1

cvs-health/uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

Size: 8.79 KB - Last synced at: about 3 hours ago - Pushed at: about 5 hours ago - Stars: 0 - Forks: 0

dpc10ster/datasets

ROC/FROC datasets from my collaborations

Size: 984 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0