Topic: "llm-evaluation"
langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Language: TypeScript - Size: 19.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 10,594 - Forks: 959
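
A minimal tracing sketch, assuming the v2 Python SDK's @observe decorator and credentials supplied via the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY environment variables:

```python
from langfuse.decorators import observe

@observe()  # captures inputs, outputs, and timing as a trace span
def generate_answer(question: str) -> str:
    # call your LLM of choice here; nested @observe functions become child spans
    return "..."

generate_answer("What is LLM observability?")
```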

comet-ml/opik
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Language: Python - Size: 145 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 6,668 - Forks: 481

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 342 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6,241 - Forks: 513

confident-ai/deepeval
The LLM Evaluation Framework
Language: Python - Size: 78.5 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 6,045 - Forks: 526
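
A minimal sketch following deepeval's documented quickstart shape; the default AnswerRelevancyMetric uses an LLM judge, so an OPENAI_API_KEY is assumed:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="Orders usually ship within 2-3 business days.",
)
# threshold sets the pass/fail cutoff for the metric's 0-1 relevancy score
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```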

Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 300 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 5,433 - Forks: 401
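
A sketch of the local entry point as shown in the project's quickstart; traces arrive via OpenTelemetry instrumentation once the app is running:

```python
import phoenix as px

# starts the local Phoenix UI and returns a session handle
session = px.launch_app()
print(session.url)  # open this in a browser to inspect traces and evals
```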

Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
Language: Python - Size: 175 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 4,486 - Forks: 318

NVIDIA/garak
the LLM vulnerability scanner
Language: Python - Size: 4.81 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 4,317 - Forks: 422

Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language: Python - Size: 70 MB - Last synced at: about 8 hours ago - Pushed at: about 2 months ago - Stars: 3,852 - Forks: 304

Helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Language: TypeScript - Size: 409 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 3,640 - Forks: 364
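
The "one line" is a proxy base URL plus an auth header on an existing OpenAI client; a sketch assuming the documented gateway endpoint and a HELICONE_API_KEY environment variable:

```python
import os
from openai import OpenAI

# all requests now flow through Helicone's proxy and show up in its dashboard
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
```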

PacktPublishing/LLM-Engineers-Handbook
The LLM engineer's practical guide: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
Language: Python - Size: 4.46 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2,838 - Forks: 572

Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Language: TypeScript - Size: 163 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,606 - Forks: 305

lmnr-ai/lmnr
Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, evals, datasets, labels. YC S24.
Language: TypeScript - Size: 30.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,866 - Forks: 113

msoedov/agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Language: Python - Size: 21.7 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1,283 - Forks: 200

microsoft/prompty
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
Language: Python - Size: 5.53 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 856 - Forks: 82

cyberark/FuzzyAI
A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
Language: Jupyter Notebook - Size: 14.4 MB - Last synced at: about 13 hours ago - Pushed at: about 14 hours ago - Stars: 520 - Forks: 53

onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of foundation LLMs, aiming to probe the technical limits of generative AI.
Size: 12.6 MB - Last synced at: 5 days ago - Pushed at: 6 months ago - Stars: 514 - Forks: 44

relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Language: Python - Size: 1.7 MB - Last synced at: 5 months ago - Pushed at: 8 months ago - Stars: 446 - Forks: 29

Value4AI/Awesome-LLM-in-Social-Science
Awesome papers involving LLMs in Social Science.
Size: 121 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 426 - Forks: 29

kimtth/awesome-azure-openai-llm
A curated list of 🌌 Azure OpenAI, 🦙 Large Language Models (incl. RAG, Agent), and references with memos.
Language: Python - Size: 285 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 351 - Forks: 43

palico-ai/palico-ai
Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework
Language: TypeScript - Size: 13.7 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 339 - Forks: 27

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.84 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 276 - Forks: 17

iMeanAI/WebCanvas
All-in-one Web Agent framework for post-training. Start building with a few clicks!
Language: Python - Size: 5.84 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 239 - Forks: 17

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 235 - Forks: 41

PetroIvaniuk/llms-tools
A list of LLMs Tools & Projects
Size: 187 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 222 - Forks: 26

cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Language: Python - Size: 30.1 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 201 - Forks: 32

JonathanChavezTamales/LLMStats
A comprehensive set of LLM benchmark scores and provider prices.
Language: JavaScript - Size: 167 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 130 - Forks: 10

raga-ai-hub/raga-llm-hub
Framework for LLM evaluation, guardrails and security
Language: Python - Size: 1.22 MB - Last synced at: about 19 hours ago - Pushed at: 8 months ago - Stars: 112 - Forks: 14

alopatenko/LLMEvaluation
A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for various use cases, promote best practices in LLM assessment, and critically examine the effectiveness of these methods.
Language: HTML - Size: 3.96 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 111 - Forks: 9

villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
Language: Python - Size: 11.2 MB - Last synced at: 23 days ago - Pushed at: 10 months ago - Stars: 108 - Forks: 3

kolenaIO/autoarena
Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation
Language: TypeScript - Size: 2.52 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 103 - Forks: 8
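
Head-to-head ranking of this kind typically reduces automated judge verdicts to Elo-style rating updates; a generic sketch of that arithmetic, not AutoArena's actual API:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# two models start at 1000; model A wins one automated judgment
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```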

hkust-nlp/dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
Language: Jupyter Notebook - Size: 4.18 MB - Last synced at: 18 days ago - Pushed at: 4 months ago - Stars: 100 - Forks: 4

rungalileo/hallucination-index
Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.
Size: 1.4 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 91 - Forks: 6

allenai/CommonGen-Eval
Evaluating LLMs with CommonGen-Lite
Language: Python - Size: 1.28 MB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 89 - Forks: 3

Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced at: about 22 hours ago - Pushed at: about 1 year ago - Stars: 85 - Forks: 6

open-compass/GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
Language: Python - Size: 9.98 MB - Last synced at: 16 days ago - Pushed at: 26 days ago - Stars: 82 - Forks: 7

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 76 - Forks: 6

loganrjmurphy/LeanEuclid
LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.
Language: Lean - Size: 3.59 MB - Last synced at: 5 months ago - Pushed at: 11 months ago - Stars: 76 - Forks: 5

Addepto/contextcheck
MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrates into CI pipelines for automated testing.
Language: Python - Size: 464 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 61 - Forks: 7

azminewasi/Awesome-LLMs-ICLR-24
It is a comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
Size: 821 KB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 60 - Forks: 3

PromptMixerDev/prompt-mixer-app-ce
A desktop application for comparing outputs from different Large Language Models (LLMs).
Language: TypeScript - Size: 3.2 MB - Last synced at: 14 days ago - Pushed at: 17 days ago - Stars: 50 - Forks: 6

deshwalmahesh/PHUDGE
Official repo for the paper "PHUDGE: Phi-3 as Scalable Judge". Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and much more.
Language: Jupyter Notebook - Size: 13.1 MB - Last synced at: 23 days ago - Pushed at: 10 months ago - Stars: 49 - Forks: 7
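
The underlying LLM-as-judge pattern, sketched generically with the OpenAI client (PHUDGE itself serves a fine-tuned Phi-3 judge; the model name and rubric here are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

def judge(question: str, reference: str, answer: str) -> str:
    # absolute grading against a rubric; relative mode would compare two answers
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model, not PHUDGE's
        messages=[
            {"role": "system", "content": "Score the answer 1-5 for factual accuracy "
             "against the reference. Reply with the score and one sentence of reasoning."},
            {"role": "user", "content": f"Question: {question}\nReference: {reference}\nAnswer: {answer}"},
        ],
    )
    return resp.choices[0].message.content
```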

Chainlit/literalai-cookbooks
Cookbooks and tutorials on Literal AI
Language: Jupyter Notebook - Size: 8.65 MB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 48 - Forks: 13

cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created with Gemini 2.5 Pro using a technique called "vibe coding".
Language: HTML - Size: 8.64 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 45 - Forks: 5

SajiJohnMiranda/DoCoreAI
DoCoreAI is a next-gen open-source AI profiler that optimizes reasoning, creativity, precision, and temperature in a single step, cutting token usage by 15-30% and lowering LLM API costs.
Language: Python - Size: 1.88 MB - Last synced at: 4 days ago - Pushed at: 7 days ago - Stars: 43 - Forks: 1

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.12 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 36 - Forks: 1

zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Language: Python - Size: 10.6 MB - Last synced at: 23 days ago - Pushed at: 9 months ago - Stars: 36 - Forks: 2

ZeroSumEval/ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
Language: Python - Size: 10.4 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 33 - Forks: 4

adithya-s-k/indic_eval
A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks
Language: Python - Size: 555 KB - Last synced at: 22 days ago - Pushed at: 11 months ago - Stars: 33 - Forks: 7

google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI application development. It provides a robust platform with a user-friendly UI that streamlines building and assessing the performance of your LLM-powered applications.
Language: Vue - Size: 303 MB - Last synced at: 4 days ago - Pushed at: 9 days ago - Stars: 31 - Forks: 4

fuxiAIlab/CivAgent
CivAgent is an LLM-based, human-like agent that acts as a digital player within the strategy game Unciv.
Language: Python - Size: 53.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 4

SapienzaNLP/ita-bench
A collection of Italian benchmarks for LLM evaluation
Language: Python - Size: 728 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 30 - Forks: 0

ChanLiang/CONNER
The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators".
Language: Python - Size: 15.8 MB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 30 - Forks: 2

kereva-dev/kereva-scanner
Code scanner to check for issues in prompts and LLM calls
Language: Python - Size: 7.12 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 29 - Forks: 2

Yifan-Song793/GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Language: Python - Size: 2.04 MB - Last synced at: 8 days ago - Pushed at: 9 months ago - Stars: 28 - Forks: 1
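
The paper's point lends itself to a small sketch: score several sampled generations per prompt and report the spread, rather than trusting a single greedy run (`generate` and `score` are hypothetical stand-ins):

```python
import statistics

def score_with_spread(generate, score, prompt: str, n_samples: int = 8) -> tuple[float, float]:
    # sample the same prompt repeatedly at nonzero temperature
    scores = [score(prompt, generate(prompt)) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.stdev(scores)
```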

alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
Language: Python - Size: 6.7 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 27 - Forks: 0
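
Asynchronous fan-out of this kind generally looks like the following; a stdlib-only sketch of the idea, not prompto's actual API:

```python
import asyncio

async def query_endpoint(prompt: str) -> str:
    # hypothetical stand-in for a real HTTP call to an LLM endpoint
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def run_batch(prompts: list[str]) -> list[str]:
    # issue all requests concurrently instead of awaiting them one by one
    return await asyncio.gather(*(query_endpoint(p) for p in prompts))

print(asyncio.run(run_batch(["a", "b", "c"])))
```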

mts-ai/rurage
Language: Python - Size: 3.85 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 27 - Forks: 0

dannylee1020/openpo
Language: Python - Size: 10.7 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 27 - Forks: 0

Praful932/llmsearch
Find better generation parameters for your LLM
Language: Python - Size: 5.04 MB - Last synced at: 23 days ago - Pushed at: 11 months ago - Stars: 27 - Forks: 1
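
Generation-parameter search is essentially hyperparameter tuning over sampling settings; a generic grid-search sketch with hypothetical `generate` and `score` callables, not llmsearch's API:

```python
from itertools import product

def grid_search(generate, score, prompts: list[str]) -> tuple[dict, float]:
    grid = {"temperature": [0.2, 0.7, 1.0], "top_p": [0.9, 1.0]}
    best, best_score = None, float("-inf")
    for temperature, top_p in product(grid["temperature"], grid["top_p"]):
        params = {"temperature": temperature, "top_p": top_p}
        outputs = [generate(p, **params) for p in prompts]
        avg = sum(score(p, o) for p, o in zip(prompts, outputs)) / len(prompts)
        if avg > best_score:
            best, best_score = params, avg
    return best, best_score
```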

CodeEval-Pro/CodeEval-Pro
Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"
Language: Python - Size: 4.1 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 26 - Forks: 2

LLM-Evaluation-s-Always-Fatiguing/leaf-playground
A framework for building scenario-simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.
Language: Python - Size: 868 KB - Last synced at: 19 days ago - Pushed at: 10 months ago - Stars: 24 - Forks: 0

VITA-Group/llm-kick
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. "Compressing LLMs: The Truth Is Rarely Pure and Never Simple."
Language: Python - Size: 7.11 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 23 - Forks: 5

AgiFlow/agiflow-sdks
LLM QA, Observability, Evaluation and User Feedback
Language: TypeScript - Size: 2.5 MB - Last synced at: 8 days ago - Pushed at: 9 months ago - Stars: 22 - Forks: 2

Supahands/llm-comparison
An open-source project for comparing two LLMs head to head on a given prompt. It supports a wide range of models, from open-source Ollama models to the likes of OpenAI and Claude.
Language: TypeScript - Size: 888 KB - Last synced at: 19 days ago - Pushed at: about 1 month ago - Stars: 20 - Forks: 1

kieranklaassen/leva
LLM Evaluation Framework for Rails apps to be used with production data.
Language: HTML - Size: 262 KB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 1

Babelscape/ALERT
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Language: Python - Size: 177 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 18 - Forks: 1

rhesis-ai/rhesis-sdk
Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.
Language: Python - Size: 420 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 17 - Forks: 0

Cybonto/OllaDeck
OllaDeck is a purple technology stack for Generative AI (text modality) cybersecurity. It provides a comprehensive set of tools for both blue team and red team operations in the context of text-based generative AI.
Language: Python - Size: 82.9 MB - Last synced at: about 11 hours ago - Pushed at: 7 months ago - Stars: 17 - Forks: 2

equinor/promptly
A prompt collection for testing and evaluation of LLMs.
Language: Jupyter Notebook - Size: 1.74 MB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 16 - Forks: 1

intuit-ai-research/DCR-consistency
DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
Language: Python - Size: 2.07 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 16 - Forks: 1

honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
Language: Python - Size: 27.3 MB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 15 - Forks: 1

EveripediaNetwork/grokit
Unofficial Python SDK for Grok, usable with any X Premium account.
Language: Python - Size: 34.2 KB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 15 - Forks: 5

aws-samples/fm-leaderboarder
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
Language: Python - Size: 511 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 14 - Forks: 4

Now-Join-Us/OmniEvalKit (fork of AIDC-AI/M3Bench)
The code repository for "OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions"
Language: Python - Size: 3.82 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 13 - Forks: 2

ksm26/Pretraining-LLMs
Master the essential steps of pretraining large language models (LLMs). Learn to create high-quality datasets, configure model architectures, execute training runs, and assess model performance for efficient and effective LLM pretraining.
Language: Jupyter Notebook - Size: 29.3 KB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 13 - Forks: 5

minnesotanlp/cobbler
Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 13 - Forks: 1

AI4Bharat/Anudesh-Frontend
Language: JavaScript - Size: 166 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 12 - Forks: 7

aigc-apps/PertEval
[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations
Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 10 days ago - Pushed at: 6 months ago - Stars: 12 - Forks: 2

VILA-Lab/Open-LLM-Leaderboard
Open-LLM-Leaderboard: Open-Style Question Evaluation. Paper at https://arxiv.org/abs/2406.07545
Language: Python - Size: 4.33 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 12 - Forks: 0

AI4Bharat/MILU
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.
Language: Python - Size: 1.73 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 4

gretelai/navigator-helpers 📦
Navigator Helpers
Language: Python - Size: 9.31 MB - Last synced at: 19 days ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

armingh2000/FactScoreLite
FactScoreLite is an implementation of the FactScore metric, designed for detailed accuracy assessment in text generation. This package builds upon the framework provided by the original FactScore repository, which is no longer maintained and contains outdated functions.
Language: Python - Size: 674 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 11 - Forks: 1
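
The metric itself is simple to state: decompose a generation into atomic facts and report the fraction a knowledge source supports. A sketch of that definition (`is_supported` is a hypothetical verifier; FactScoreLite uses GPT-based checks):

```python
def factscore(atomic_facts: list[str], is_supported) -> float:
    # fraction of atomic facts the knowledge source supports
    if not atomic_facts:
        return 0.0
    return sum(1 for fact in atomic_facts if is_supported(fact)) / len(atomic_facts)
```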

genia-dev/vibraniumdome
LLM Security Platform.
Language: Python - Size: 2.87 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 10 - Forks: 2

bowen-upenn/llm_token_bias
[EMNLP 2024] This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.
Language: Python - Size: 57.4 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 9 - Forks: 1

Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
Language: Jupyter Notebook - Size: 4.74 MB - Last synced at: about 8 hours ago - Pushed at: 6 months ago - Stars: 9 - Forks: 1

LRudL/sad
Situational Awareness Dataset
Language: HTML - Size: 551 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 9 - Forks: 0

danilop/llm-test-mate
A simple testing framework to evaluate and validate LLM-generated content using string similarity, semantic similarity, and model-based evaluation.
Language: Python - Size: 83 KB - Last synced at: about 9 hours ago - Pushed at: 3 months ago - Stars: 8 - Forks: 0
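
The string-similarity leg of such a framework can be illustrated with the stdlib alone; the function name and threshold are illustrative, not llm-test-mate's API:

```python
import difflib

def passes_string_similarity(expected: str, actual: str, threshold: float = 0.8) -> bool:
    # ratio() is 2*matches/(len_a+len_b), so 1.0 means identical strings
    ratio = difflib.SequenceMatcher(None, expected, actual).ratio()
    return ratio >= threshold

print(passes_string_similarity("ships in 2-3 days", "ships within 2-3 days"))  # True
```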

amazon-science/llm-code-preference
Training and Benchmarking LLMs for Code Preference.
Language: Python - Size: 156 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 8 - Forks: 0

brucewlee/nutcracker
Large Model Evaluation Experiments
Language: Python - Size: 431 KB - Last synced at: 12 days ago - Pushed at: 7 months ago - Stars: 7 - Forks: 1

prompt-foundry/python-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Python
Language: Python - Size: 20.7 MB - Last synced at: about 12 hours ago - Pushed at: 7 months ago - Stars: 7 - Forks: 0

yandex-research/mind-your-format
Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Language: Jupyter Notebook - Size: 9.12 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 0

HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-judge evals as Semantic Kernel Plugins
Language: C# - Size: 890 KB - Last synced at: 14 days ago - Pushed at: 4 months ago - Stars: 6 - Forks: 1

mtuann/llm-updated-papers
Papers related to Large Language Models in all top venues
Size: 553 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 1

langfuse/oss-llmops-stack
Modular, open source LLMOps stack that separates concerns: LiteLLM unifies LLM APIs, manages routing and cost controls, and ensures high-availability, while Langfuse focuses on detailed observability, prompt versioning, and performance evaluations.
Size: 316 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0

simranjeet97/Learn_RAG_from_Scratch_LLM
Learn Retrieval-Augmented Generation (RAG) from scratch using LLMs from Hugging Face, with LangChain or plain Python.
Language: Jupyter Notebook - Size: 425 KB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 5 - Forks: 3

AI4Bharat/Anudesh
An open-source platform to annotate data for large language models, at scale.
Size: 23.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 4 - Forks: 0

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 450 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 4 - Forks: 1

johnsonhk88/Deep-Research-With-Web-Scraping-by-LLM-And-AI-Agent
Use LLM/AI agents for web scraping (data collection) and data analysis with deep research.
Language: Jupyter Notebook - Size: 217 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 3 - Forks: 1

zabir-nabil/bangla-multilingual-llm-eval
Evaluation of Open and Closed-Source Multi-lingual LLMs for Low-Resource Bangla Language
Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: 21 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
Language: TypeScript - Size: 553 KB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

svilupp/Spehulak.jl
GenAI observability application in Julia
Language: Julia - Size: 1.9 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0
