Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: llm-eval
parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.11 MB - Last synced: about 21 hours ago - Pushed: about 21 hours ago - Stars: 42 - Forks: 4
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use-cases), perform root cause analysis on failure cases, and give insights on how to resolve them.
Language: Python - Size: 35.7 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 2,045 - Forks: 171
Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 74.1 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 2,589 - Forks: 182
promptfoo/promptfoo
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Language: TypeScript - Size: 14.7 MB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 3,018 - Forks: 201
parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 450 KB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 4 - Forks: 1
prompt-foundry/typescript-sdk
The TypeScript SDK for Prompt Foundry, a prompt engineering, prompt management, and prompt testing tool
Language: TypeScript - Size: 327 KB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 0 - Forks: 0
athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.01 MB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 139 - Forks: 11
Auto-Playground/ragrank
🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
Language: Python - Size: 569 KB - Last synced: 21 days ago - Pushed: 27 days ago - Stars: 20 - Forks: 8
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing framework for LLMs and ML models
Language: Python - Size: 176 MB - Last synced: 29 days ago - Pushed: about 1 month ago - Stars: 3,163 - Forks: 199
kdcyberdude/punjabi-llm-eval Fork of gordicaleksa/serbian-llm-eval
First Punjabi LLM Eval.
Language: Python - Size: 12.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0
Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced: about 2 months ago - Pushed: 4 months ago - Stars: 62 - Forks: 4
Networks-Learning/prediction-powered-ranking
Code for the paper Prediction-Powered Ranking of Large Language Models, Arxiv 2024.
Language: Python - Size: 21.5 KB - Last synced: about 2 months ago - Pushed: 3 months ago - Stars: 3 - Forks: 1
kuk/rulm-sbs2
A benchmark comparing Russian ChatGPT analogues: Saiga, YandexGPT, Gigachat
Language: Jupyter Notebook - Size: 19.1 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 22 - Forks: 0
awesome-software/evals Fork of openai/evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Size: 1.94 MB - Last synced: 9 months ago - Pushed: 12 months ago - Stars: 0 - Forks: 0
harshagrawal523/GenerativeAgents
Generative agents: computational software agents that use OpenAI LLMs to simulate believable human behavior. Our main focus was developing a game, “Werewolves of Miller’s Hollow”, aiming to replicate human-like behavior.
Language: Python - Size: 249 MB - Last synced: 9 months ago - Pushed: 10 months ago - Stars: 0 - Forks: 0