GitHub topics: llm-evaluation-framework
petmal/MindTrial
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek), custom tasks in YAML, and HTML/CSV reports.
Language: Go - Size: 4.41 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

confident-ai/deepeval
The LLM Evaluation Framework
Language: Python - Size: 85 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 8,451 - Forks: 733

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 223 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 7,305 - Forks: 584

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 243 - Forks: 41

zli12321/qa_metrics
An easy Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.
Language: Python - Size: 16.6 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 50 - Forks: 4

cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Language: Python - Size: 30.7 MB - Last synced at: 4 days ago - Pushed at: 13 days ago - Stars: 216 - Forks: 32

msoedov/agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Language: Python - Size: 21.8 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1,481 - Forks: 225

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.13 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 39 - Forks: 2

NetoAI/Netoai-Llm-Eval
A modular Python framework for evaluating Large Language Models (LLMs) using another LLM as a judge. This framework generates responses to questions and evaluates them against ground truth answers, producing numerical scores.
Language: Jupyter Notebook - Size: 582 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

wassname/open_pref_eval
Hackable, simple LLM evals on preference datasets
Language: Python - Size: 15.8 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

Addepto/contextcheck
MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
Language: Python - Size: 464 KB - Last synced at: 21 days ago - Pushed at: 7 months ago - Stars: 72 - Forks: 9

yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
Language: TypeScript - Size: 553 KB - Last synced at: 24 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 78 - Forks: 7

MLD3/steerability
An open-source evaluation framework for measuring LLM steerability.
Language: Jupyter Notebook - Size: 73.9 MB - Last synced at: 28 days ago - Pushed at: 29 days ago - Stars: 2 - Forks: 0

flexpa/llm-fhir-eval
Benchmarking Large Language Models for FHIR
Language: TypeScript - Size: 119 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 30 - Forks: 3

ronniross/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 143 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

nhsengland/evalsense
Tools for systematic large language model evaluations
Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.
Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 5

Fbxfax/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 96.7 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
Language: Python - Size: 27.3 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 16 - Forks: 1

zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Language: Python - Size: 10.6 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 36 - Forks: 2

yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
Size: 334 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 2.94 MB - Last synced at: 4 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 1

Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
Language: Jupyter Notebook - Size: 4.74 MB - Last synced at: 23 days ago - Pushed at: 8 months ago - Stars: 9 - Forks: 1

stair-lab/melt
Multilingual Evaluation Toolkits
Language: Python - Size: 204 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 3 - Forks: 3

aws-samples/fm-leaderboarder
FM-Leaderboard-er allows you to create a leaderboard to find the best LLM/prompt for your own business use case based on your data, tasks, and prompts
Language: Python - Size: 514 KB - Last synced at: 26 days ago - Pushed at: 8 months ago - Stars: 18 - Forks: 5

yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
Language: TypeScript - Size: 2.3 MB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

jaaack-wang/multi-problem-eval-llm
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
Language: Jupyter Notebook - Size: 23.1 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

nagababumo/Building-and-Evaluating-Advanced-RAG
Language: Jupyter Notebook - Size: 51.8 KB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 1
