Topic: "llm-eval"
promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 223 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 7,226 - Forks: 575

Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 337 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 5,985 - Forks: 460

Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
Language: Python - Size: 176 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 4,614 - Forks: 327

iterative/datachain
ETL, Analytics, Versioning for Unstructured Data
Language: Python - Size: 10.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,584 - Forks: 114

truera/trulens
Evaluation and Tracking for LLM Experiments
Language: Python - Size: 344 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,571 - Forks: 217

uptrain-ai/uptrain
UpTrain is an open-source, unified platform to evaluate and improve generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and offers insights on how to resolve them.
Language: Python - Size: 36.9 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 2,265 - Forks: 198

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 85 - Forks: 6

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 76 - Forks: 7

kuk/rulm-sbs2
A benchmark comparing Russian ChatGPT analogues: Saiga, YandexGPT, GigaChat
Language: Jupyter Notebook - Size: 19.9 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 61 - Forks: 2

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.13 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 39 - Forks: 2

whitecircle-ai/circle-guard-bench
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Language: Python - Size: 20.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 39 - Forks: 2

Auto-Playground/ragrank
🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
Language: Python - Size: 577 KB - Last synced at: about 7 hours ago - Pushed at: 10 days ago - Stars: 39 - Forks: 13

alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
Language: Python - Size: 6.77 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 27 - Forks: 1
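
prompto's own interface isn't shown here, but the core pattern it targets, firing many LLM requests concurrently instead of one at a time, can be sketched with plain asyncio and the OpenAI client (the model name and prompts below are placeholders, not prompto's actual API):

```python
import asyncio
from openai import AsyncOpenAI  # assumes the official openai package is installed

async def query_all(prompts, model="gpt-4o-mini"):
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    async def one(prompt):
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    # launch all requests concurrently and collect the completions in order
    return await asyncio.gather(*(one(p) for p in prompts))

answers = asyncio.run(query_all(["What is 2+2?", "Name a prime number."]))
```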

Supahands/llm-comparison-backend
An open-source project for comparing two LLMs head to head on a given prompt. This repository covers the backend, which lets LLM APIs be incorporated and consumed by the front-end.
Language: Python - Size: 402 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 21 - Forks: 2

honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
Language: Python - Size: 27.3 MB - Last synced at: 23 days ago - Pushed at: 7 months ago - Stars: 16 - Forks: 1

genia-dev/vibraniumdome
LLM Security Platform.
Language: Python - Size: 2.87 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 10 - Forks: 2

Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
Language: Jupyter Notebook - Size: 4.74 MB - Last synced at: 13 days ago - Pushed at: 8 months ago - Stars: 9 - Forks: 1

amplifying-ai/ai-product-bench
Language: HTML - Size: 1.95 MB - Last synced at: 7 days ago - Pushed at: 25 days ago - Stars: 7 - Forks: 1

pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.
Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 5
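
The workshop materials aren't reproduced here, but the general idea of folding an LLM check into an ordinary test suite can be sketched as a pytest test; `generate_answer`, the golden set, and the keyword-based metric are all placeholders for whatever your app and evaluation criteria actually are:

```python
import pytest

def generate_answer(question: str) -> str:
    # stand-in for your application's LLM call; replace with the real thing
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "Who wrote Hamlet?": "Hamlet was written by William Shakespeare.",
    }
    return canned[question]

# a tiny golden set: question plus keywords the answer must mention
GOLDEN = [
    ("What is the capital of France?", ["Paris"]),
    ("Who wrote Hamlet?", ["Shakespeare"]),
]

@pytest.mark.parametrize("question,required", GOLDEN)
def test_answer_contains_required_keywords(question, required):
    answer = generate_answer(question).lower()
    missing = [kw for kw in required if kw.lower() not in answer]
    assert not missing, f"answer missing keywords: {missing}"
```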

prompt-foundry/python-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Python
Language: Python - Size: 20.7 MB - Last synced at: 27 days ago - Pushed at: 9 months ago - Stars: 7 - Forks: 0

attogram/ollama-multirun
A bash shell script to run a single prompt against any or all of your locally installed ollama models, saving the output and performance statistics as easily navigable web pages.
Language: Shell - Size: 4.02 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 6 - Forks: 1
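
The repository itself is a bash script, but the loop it describes, running one prompt against every locally installed model and recording output plus timing, looks roughly like this in Python (the web-page report generation is omitted; only the `ollama list` and `ollama run` CLI commands are assumed):

```python
import subprocess
import time

PROMPT = "Summarize the plot of Hamlet in one sentence."

# `ollama list` prints a table; the first column of each non-header row is the model name
listing = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
models = [line.split()[0] for line in listing.stdout.splitlines()[1:] if line.strip()]

for model in models:
    start = time.monotonic()
    result = subprocess.run(["ollama", "run", model, PROMPT],
                            capture_output=True, text=True)
    elapsed = time.monotonic() - start
    print(f"=== {model} ({elapsed:.1f}s) ===")
    print(result.stdout.strip())
```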

prompt-foundry/typescript-sdk
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
Language: TypeScript - Size: 20.9 MB - Last synced at: 15 days ago - Pushed at: 9 months ago - Stars: 6 - Forks: 1

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 2.94 MB - Last synced at: 26 days ago - Pushed at: 5 months ago - Stars: 4 - Forks: 1

yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
Language: TypeScript - Size: 553 KB - Last synced at: 14 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

genia-dev/vibraniumdome-docs
LLM Security Platform Docs
Language: MDX - Size: 635 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

IAAR-Shanghai/GuessArena
[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Language: Python - Size: 498 KB - Last synced at: 23 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
Size: 334 KB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

prompt-foundry/ruby-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Ruby.
Size: 5.86 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

prompt-foundry/go-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Go.
Size: 1000 Bytes - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

Justinjody/mirror-model-eval-tests
LLM behavior QA: tone collapse, false consent, and reroute logic scoring.
Size: 30.3 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

regankight/mirror-model-eval-tests
LLM behavior QA: tone collapse, false consent, and reroute logic scoring.
Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

UmberellaxCorp/PHOENIX
🐦 Phoenix AI is a chatbot built with React, TailwindCSS, Node.js, Appwrite, and the Gemini API. It delivers real-time, intelligent conversations with a sleek UI, secure authentication, and seamless scalability for a smooth, interactive user experience.
Language: JavaScript - Size: 202 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

daqh/llm-eval
This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational context. Using GPT-4o-mini via the OpenAI API, the system generates scores (on a 0-5 or 0-100 scale) for four evaluation metrics: context, grammar, relevance, and appropriateness.
Language: Python - Size: 5.17 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
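
The description above gives enough to sketch the single-prompt scoring idea: ask GPT-4o-mini to grade a response on the four dimensions in one call and return machine-readable scores. A minimal version follows; the prompt wording and JSON schema are assumptions, not the project's actual code:

```python
import json
from openai import OpenAI  # official openai package; reads OPENAI_API_KEY

client = OpenAI()

def llm_eval_scores(context: str, response: str) -> dict:
    """Ask the judge model for 0-5 scores on four dimensions, returned as JSON."""
    prompt = (
        "Rate the response to the dialogue context on a 0-5 scale for each of: "
        "context, grammar, relevance, appropriateness. "
        "Answer with a JSON object mapping each dimension to an integer.\n\n"
        f"Context:\n{context}\n\nResponse:\n{response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

scores = llm_eval_scores("A: How was your weekend?", "B: Great, I went hiking with friends.")
```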

cuiyuheng/opencompass Fork of open-compass/opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Size: 5.71 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
Language: TypeScript - Size: 2.3 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

jaaack-wang/multi-problem-eval-llm
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
Language: Jupyter Notebook - Size: 23.1 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

prompt-foundry/dotnet-sdk
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET
Size: 5.86 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

prompt-foundry/kotlin-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.
Size: 6.84 KB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

kdcyberdude/punjabi-llm-eval Fork of gordicaleksa/serbian-llm-eval
First Punjabi LLM Eval.
Language: Python - Size: 12.3 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

harshagrawal523/GenerativeAgents
Generative agents: computational software agents that simulate believable human behavior using OpenAI LLM models. The main focus is a game, "Werewolves of Miller's Hollow", that aims to replicate human-like behavior.
Language: Python - Size: 249 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

awesome-software/evals Fork of openai/evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Size: 1.94 MB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0
