GitHub topics: llm-evaluation-framework

Repositories

snowz123/team-agents

🐙 Team Agents unifies 82 AI specialists to solve challenges with intelligent chat, a requirements analyst, and document upload. A futuristic, modular platform.

Language: Python - Size: 126 KB - Last synced at: about 11 hours ago - Pushed at: about 14 hours ago - Stars: 0 - Forks: 0

cvs-health/langfair

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

Language: Python - Size: 32.1 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 231 - Forks: 39

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Language: TypeScript - Size: 285 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 8,334 - Forks: 691

msoedov/agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

Language: Python - Size: 21.5 MB - Last synced at: 4 days ago - Pushed at: 19 days ago - Stars: 1,658 - Forks: 256

confident-ai/deepeval

The LLM Evaluation Framework

Language: Python - Size: 98.4 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 10,626 - Forks: 912

nhsengland/evalsense

Tools for systematic large language model evaluations

Language: Python - Size: 1.46 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

artefactop/promptdev

A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.

Language: Python - Size: 1.12 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

lalitkpal/VerifyAI

VerifyAI is a simple UI application to test GenAI outputs

Language: Python - Size: 24.4 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

Language: Python - Size: 9.37 MB - Last synced at: 2 days ago - Pushed at: 10 months ago - Stars: 245 - Forks: 41

petmal/MindTrial

MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI), custom tasks in YAML, and HTML/CSV reports.

Language: Go - Size: 4.96 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0

zli12321/qa_metrics

An easy-to-use Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.

Language: Python - Size: 16.2 MB - Last synced at: 3 days ago - Pushed at: about 2 months ago - Stars: 59 - Forks: 6

flexpa/llm-fhir-eval

Benchmarking Large Language Models for FHIR

Language: TypeScript - Size: 176 KB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 40 - Forks: 6

honeyhiveai/realign

Realign is a testing and simulation framework for AI applications.

Language: Python - Size: 27.3 MB - Last synced at: 8 days ago - Pushed at: 9 months ago - Stars: 17 - Forks: 1

multinear/multinear

Develop reliable AI apps

Language: Python - Size: 1.19 MB - Last synced at: 25 days ago - Pushed at: 3 months ago - Stars: 43 - Forks: 2

yukinagae/genkitx-promptfoo

Community Plugin for Genkit to use Promptfoo

Language: TypeScript - Size: 553 KB - Last synced at: 14 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 0

dishant2009/android-llm-agent-eval

LLM agent evaluation framework achieving 89% step accuracy. Features OpenAI/Anthropic integration, Streamlit dashboard, memory buffer system. Benchmarks GPT-4 vs Claude-3 on Android navigation tasks. Built for QualGent Research challenge with comprehensive documentation.

Language: Python - Size: 22.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

jaaack-wang/multi-problem-eval-llm

Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

Language: Jupyter Notebook - Size: 48.8 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

Addepto/contextcheck

MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

Language: Python - Size: 464 KB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 80 - Forks: 9

NetoAI/Netoai-Llm-Eval

A modular Python framework for evaluating Large Language Models (LLMs) using another LLM as a judge. This framework generates responses to questions and evaluates them against ground truth answers, producing numerical scores.

Language: Jupyter Notebook - Size: 582 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

wassname/open_pref_eval

Hackable, simple, llm evals on preference datasets

Language: Python - Size: 15.8 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

parea-ai/parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: Python - Size: 5.48 MB - Last synced at: 15 days ago - Pushed at: 7 months ago - Stars: 78 - Forks: 8

MLD3/steerability

An open-source evaluation framework for measuring LLM steerability.

Language: Jupyter Notebook - Size: 73.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

ronniross/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 143 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

pyladiesams/eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 7 - Forks: 5

Fbxfax/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 96.7 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

zhuohaoyu/KIEval

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Language: Python - Size: 10.6 MB - Last synced at: 6 months ago - Pushed at: about 1 year ago - Stars: 36 - Forks: 2

yukinagae/promptfoo-sample

Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models

Size: 334 KB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

parea-ai/parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: TypeScript - Size: 2.94 MB - Last synced at: 8 days ago - Pushed at: 8 months ago - Stars: 4 - Forks: 1

Networks-Learning/prediction-powered-ranking

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

Language: Jupyter Notebook - Size: 4.74 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 9 - Forks: 1

stair-lab/melt

Multilingual Evaluation Toolkits

Language: Python - Size: 204 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 3 - Forks: 3

aws-samples/fm-leaderboarder

FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts

Language: Python - Size: 514 KB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 18 - Forks: 5

yukinagae/genkit-promptfoo-sample

Sample implementation demonstrating how to use Firebase Genkit with Promptfoo

Language: TypeScript - Size: 2.3 MB - Last synced at: 29 days ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

nagababumo/Building-and-Evaluating-Advanced-RAG

Language: Jupyter Notebook - Size: 51.8 KB - Last synced at: 6 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 1

Related Keywords

llm-evaluation-framework (33), llm-evaluation (25), llm (21), llm-eval (12), evaluation-framework (9), llm-evaluation-metrics (8), llm-evaluation-toolkit (7), evaluation (7), llmops (7), llms (6), prompt-testing (6), llms-benchmarking (6), prompt-engineering (5), testing (5), large-language-models (4), rag (4), prompts (4), promptfoo (3), llm-benchmarking (3), generative-ai (3), ai (3), llm-testing (3), ci-cd (2), ci (2), llm-vulnerabilities (2), red-teaming (2), genkit (2), llm-inference (2), evaluation-metrics (2), prompt (2), explainable-ai (2), llm-test (2), llm-security (2), llm-scanner (2), llms-reasoning (2), llm-guardrails (2), llms-evalution (2), llms-efficency (2), llm-training (2), datasets (2), dataset (2), python (2), llm-tools (2), fhir (1), fhir-llm (1), evals (1), rl-training (1), android-automation (1), anthropic-claude (1), openai-gpt4 (1), reward-modeling (1), llm-prompting (1), ai-chat (1), qa-automation-test (1), exact-matching (1), ai-agents (1), plugin (1), genkitx (1), genkit-plugin (1), firebase (1), fhir-resources (1), fhirpath (1), aiengineering (1), alignment (1), reliability (1), simulation (1), retrieval-augmented-generation (1), llamaindex (1), multilingual (1), ranking-algorithm (1), rank-sets (1), prediction-powered-inference (1), machine-learning (1), acl2024 (1), workshop (1), llm-monitoring (1), llm-evals (1), metrics (1), good-first-issue (1), transformer (1), rlhf (1), preference-learning (1), language-model (1), huggingface (1), testing-tools (1), testing-framework (1), summarization-testing (1), rag-testing (1), prompt-test (1), open-source (1), generative-ai-testing (1), chatbot-testing (1), chatbot-framework (1), ai-testing-tool (1), ai-testing (1), red-team (1), prompt-toolkit (1), llm-jailbreaks (1), llm-fuzzing (1), llm-fuzzer-aggregator (1)