Topic: "llm-eval"
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 223 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 7,226 - Forks: 575

Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 337 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 5,985 - Forks: 460

Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
Language: Python - Size: 176 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 4,614 - Forks: 327

iterative/datachain
ETL, Analytics, Versioning for Unstructured Data
Language: Python - Size: 10.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,584 - Forks: 114

truera/trulens
Evaluation and Tracking for LLM Experiments
Language: Python - Size: 344 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,571 - Forks: 217

uptrain-ai/uptrain
UpTrain is an open-source, unified platform to evaluate and improve generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and offers insights on how to resolve them.
Language: Python - Size: 36.9 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 2,265 - Forks: 198

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 85 - Forks: 6

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 76 - Forks: 7

kuk/rulm-sbs2
A benchmark comparing Russian ChatGPT analogues: Saiga, YandexGPT, GigaChat
Language: Jupyter Notebook - Size: 19.9 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 61 - Forks: 2

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.13 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 39 - Forks: 2

whitecircle-ai/circle-guard-bench
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Language: Python - Size: 20.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 39 - Forks: 2

Auto-Playground/ragrank
🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
Language: Python - Size: 577 KB - Last synced at: about 7 hours ago - Pushed at: 10 days ago - Stars: 39 - Forks: 13

alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
Language: Python - Size: 6.77 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 27 - Forks: 1
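
prompto's own interface isn't shown here, but the core pattern it targets, firing many LLM requests concurrently instead of one at a time, can be sketched with plain asyncio and the OpenAI client (the model name and prompts below are placeholders, not prompto's actual API):

```python
import asyncio
from openai import AsyncOpenAI  # assumes the official openai package is installed

async def query_all(prompts, model="gpt-4o-mini"):
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    async def one(prompt):
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    # launch all requests concurrently and collect the completions in order
    return await asyncio.gather(*(one(p) for p in prompts))

answers = asyncio.run(query_all(["What is 2+2?", "Name a prime number."]))
```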

Supahands/llm-comparison-backend
An open-source project for comparing two LLMs head to head on a given prompt. This repository covers the backend, which lets LLM APIs be incorporated and consumed by the front-end.
Language: Python - Size: 402 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 21 - Forks: 2

honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
Language: Python - Size: 27.3 MB - Last synced at: 23 days ago - Pushed at: 7 months ago - Stars: 16 - Forks: 1

genia-dev/vibraniumdome
LLM Security Platform.
Language: Python - Size: 2.87 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 10 - Forks: 2

Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
Language: Jupyter Notebook - Size: 4.74 MB - Last synced at: 13 days ago - Pushed at: 8 months ago - Stars: 9 - Forks: 1

amplifying-ai/ai-product-bench
Language: HTML - Size: 1.95 MB - Last synced at: 7 days ago - Pushed at: 25 days ago - Stars: 7 - Forks: 1

pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.
Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 5
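
The workshop materials aren't reproduced here, but the general idea of folding an LLM check into an ordinary test suite can be sketched as a pytest test; `generate_answer`, the golden set, and the keyword-based metric are all placeholders for whatever your app and evaluation criteria actually are:

```python
import pytest

def generate_answer(question: str) -> str:
    # stand-in for your application's LLM call; replace with the real thing
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "Who wrote Hamlet?": "Hamlet was written by William Shakespeare.",
    }
    return canned[question]

# a tiny golden set: question plus keywords the answer must mention
GOLDEN = [
    ("What is the capital of France?", ["Paris"]),
    ("Who wrote Hamlet?", ["Shakespeare"]),
]

@pytest.mark.parametrize("question,required", GOLDEN)
def test_answer_contains_required_keywords(question, required):
    answer = generate_answer(question).lower()
    missing = [kw for kw in required if kw.lower() not in answer]
    assert not missing, f"answer missing keywords: {missing}"
```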

prompt-foundry/python-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Python
Language: Python - Size: 20.7 MB - Last synced at: 27 days ago - Pushed at: 9 months ago - Stars: 7 - Forks: 0

attogram/ollama-multirun
A bash shell script to run a single prompt against any or all of your locally installed ollama models, saving the output and performance statistics as easily navigable web pages.
Language: Shell - Size: 4.02 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 6 - Forks: 1
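
The repository itself is a bash script, but the loop it describes, running one prompt against every locally installed model and recording output plus timing, looks roughly like this in Python (the web-page report generation is omitted; only the `ollama list` and `ollama run` CLI commands are assumed):

```python
import subprocess
import time

PROMPT = "Summarize the plot of Hamlet in one sentence."

# `ollama list` prints a table; the first column of each non-header row is the model name
listing = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
models = [line.split()[0] for line in listing.stdout.splitlines()[1:] if line.strip()]

for model in models:
    start = time.monotonic()
    result = subprocess.run(["ollama", "run", model, PROMPT],
                            capture_output=True, text=True)
    elapsed = time.monotonic() - start
    print(f"=== {model} ({elapsed:.1f}s) ===")
    print(result.stdout.strip())
```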

prompt-foundry/typescript-sdk
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
Language: TypeScript - Size: 20.9 MB - Last synced at: 15 days ago - Pushed at: 9 months ago - Stars: 6 - Forks: 1

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 2.94 MB - Last synced at: 26 days ago - Pushed at: 5 months ago - Stars: 4 - Forks: 1

yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
Language: TypeScript - Size: 553 KB - Last synced at: 14 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

genia-dev/vibraniumdome-docs
LLM Security Platform Docs
Language: MDX - Size: 635 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

IAAR-Shanghai/GuessArena
[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Language: Python - Size: 498 KB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
Size: 334 KB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

prompt-foundry/ruby-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Ruby.
Size: 5.86 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

prompt-foundry/go-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Go.
Size: 1000 Bytes - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

Justinjody/mirror-model-eval-tests
LLM behavior QA: tone collapse, false consent, and reroute logic scoring.
Size: 30.3 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

regankight/mirror-model-eval-tests
LLM behavior QA: tone collapse, false consent, and reroute logic scoring.
Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

UmberellaxCorp/PHOENIX
🐦 Phoenix AI is a chatbot built with React, TailwindCSS, Node.js, Appwrite, and the Gemini API. It delivers real-time, intelligent conversations with a sleek UI, secure authentication, and seamless scalability for a smooth, interactive user experience.
Language: JavaScript - Size: 202 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

daqh/llm-eval
This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational context. Using GPT-4o-mini via the OpenAI API, the system generates scores (on a 0-5 or 0-100 scale) for four evaluation metrics: context, grammar, relevance, and appropriateness.
Language: Python - Size: 5.17 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
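
The description above gives enough to sketch the single-prompt scoring idea: ask GPT-4o-mini to grade a response on the four dimensions in one call and return machine-readable scores. A minimal version follows; the prompt wording and JSON schema are assumptions, not the project's actual code:

```python
import json
from openai import OpenAI  # official openai package; reads OPENAI_API_KEY

client = OpenAI()

def llm_eval_scores(context: str, response: str) -> dict:
    """Ask the judge model for 0-5 scores on four dimensions, returned as JSON."""
    prompt = (
        "Rate the response to the dialogue context on a 0-5 scale for each of: "
        "context, grammar, relevance, appropriateness. "
        "Answer with a JSON object mapping each dimension to an integer.\n\n"
        f"Context:\n{context}\n\nResponse:\n{response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

scores = llm_eval_scores("A: How was your weekend?", "B: Great, I went hiking with friends.")
```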

cuiyuheng/opencompass Fork of open-compass/opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Size: 5.71 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
Language: TypeScript - Size: 2.3 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

jaaack-wang/multi-problem-eval-llm
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
Language: Jupyter Notebook - Size: 23.1 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

prompt-foundry/dotnet-sdk
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET
Size: 5.86 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

prompt-foundry/kotlin-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.
Size: 6.84 KB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

kdcyberdude/punjabi-llm-eval Fork of gordicaleksa/serbian-llm-eval
First Punjabi LLM Eval.
Language: Python - Size: 12.3 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

harshagrawal523/GenerativeAgents
Generative agents: computational software agents that simulate believable human behavior using OpenAI LLM models. The main focus is a game, "Werewolves of Miller's Hollow", that aims to replicate human-like behavior.
Language: Python - Size: 249 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

awesome-software/evals Fork of openai/evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Size: 1.94 MB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0
