GitHub topics: evaluation
DocAILab/XRAG
XRAG: eXamining the Core - Benchmarking Foundational Component Modules in Advanced Retrieval-Augmented Generation
Language: Python - Size: 11.8 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 100 - Forks: 17

tenemos/langwatch
The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨
Language: TypeScript - Size: 17.9 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 1 - Forks: 0

Kakz/prometheus-llm
PrometheusLLM is a unique transformer architecture inspired by dignity and recursion. This project aims to explore new frontiers in AI research and welcomes contributions from the community. 🐙🌟
Language: Python - Size: 257 KB - Last synced at: about 9 hours ago - Pushed at: about 10 hours ago - Stars: 0 - Forks: 0

EXP-Tools/steam-discount
Steam discounted-games leaderboard (auto-refreshing)
Language: Python - Size: 10.6 GB - Last synced at: about 16 hours ago - Pushed at: about 17 hours ago - Stars: 64 - Forks: 33

SatvikPraveen/Optimal-Demo-Selection-ICL
Implements and benchmarks optimal demonstration selection strategies for In-Context Learning (ICL) using LLMs. Covers IDS, RDES, Influence-based Selection, Se², and TopK+ConE across reasoning and classification tasks, analyzing the impact of example relevance, diversity, and ordering on model performance across multiple architectures.
Language: Jupyter Notebook - Size: 1.46 MB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 1 - Forks: 0

bibanyok89/karate-ui-test
This repository contains automated tests using Karate Test Automation. The tests include scenarios for validating different pages and actions on the NASA Hubble Mission website and image download functionality.
Language: Gherkin - Size: 12.7 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

yHenrique2/june-2025-coding-agent-report
Explore the June 2025 Coding Agent Report for insights on top coding agents, their performance, and implementation examples. Discover the best tools for developers! 🐙💻
Language: TypeScript - Size: 2.54 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

dustalov/evalica
Evalica, your favourite evaluation toolkit
Language: Python - Size: 599 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 38 - Forks: 5

langwatch/langwatch
The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨
Language: TypeScript - Size: 28.1 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2,120 - Forks: 184

langchain-ai/langsmith-sdk
LangSmith Client SDK Implementations
Language: Python - Size: 11.2 MB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 581 - Forks: 124

CCAFS/MARLO
Managing Agricultural Research for Learning and Outcomes
Language: Java - Size: 242 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 6 - Forks: 8

root-signals/rs-sdk
Root Signals SDK
Language: Python - Size: 1.84 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 12 - Forks: 1

EvolvingLMMs-Lab/lmms-eval
One for All Modalities Evaluation Toolkit - including text, image, video, audio tasks.
Language: Python - Size: 7.88 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2,680 - Forks: 318

langchain-ai/langsmith-docs
Documentation for LangSmith
Language: JavaScript - Size: 709 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 131 - Forks: 68

Kiln-AI/Kiln
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
Language: Python - Size: 20.3 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,838 - Forks: 270

natanim-haile/prompt-optimizer
Prompt Optimizer enhances your prompts for better results in AI applications. Join the community on GitHub and contribute to improving prompt efficiency! 🌟🌐
Language: TypeScript - Size: 3.48 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Rampati/MIT-6.S191-Lab3
In this lab, you will fine-tune a multi-billion parameter language model to generate specific style responses. You'll work with tokenization strategies, prompt templates, and a complete fine-tuning workflow to enhance LLM outputs. 🐙✨
Language: Jupyter Notebook - Size: 43 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

prometheus-eval/prometheus-eval
Evaluate your LLM's response with Prometheus and GPT4 💯
Language: Python - Size: 15.1 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 953 - Forks: 55

microsoft/promptbench
A unified evaluation framework for large language models
Language: Python - Size: 5.56 MB - Last synced at: 2 days ago - Pushed at: 29 days ago - Stars: 2,650 - Forks: 203

obss/jury
Comprehensive NLP Evaluation System
Language: Python - Size: 291 KB - Last synced at: 2 days ago - Pushed at: 11 months ago - Stars: 186 - Forks: 19

lmnr-ai/lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Language: TypeScript - Size: 33.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,098 - Forks: 125

gereon-t/trajectopy
Trajectopy - Trajectory Evaluation in Python
Language: Python - Size: 27.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 82 - Forks: 4

microsoft/genaiops-promptflow-template
GenAIOps with Prompt Flow is a "GenAIOps template and guidance" to help you build LLM-infused apps using Prompt Flow. It offers features including centralized code hosting, lifecycle management, variant and hyperparameter experimentation, A/B deployment, and reporting across all runs and experiments.
Language: Python - Size: 6.78 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 329 - Forks: 264

huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Language: Python - Size: 5.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,655 - Forks: 295

stack-rs/rattan
Rattan is a fast and extensible Internet path emulator framework
Language: Rust - Size: 2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6 - Forks: 4

explodinggradients/ragas
Supercharge Your LLM Application Evaluations 🚀
Language: Python - Size: 41 MB - Last synced at: 3 days ago - Pushed at: 5 days ago - Stars: 9,663 - Forks: 956

dnotitia/qllm-infer
A modular framework for evaluating quantization algorithms with reproducible and consistent benchmarks.
Language: Python - Size: 64.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 1

OpenBMB/UltraEval-Audio
An easy-to-use, fast, and easily integrable tool for evaluating audio LLMs
Language: Python - Size: 13.9 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 115 - Forks: 3

open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Language: Python - Size: 7.86 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,588 - Forks: 417

The-Focus-AI/june-2025-coding-agent-report
Comprehensive evaluation of 15 AI coding agents (Cursor, Copilot, Claude, Replit, v0, Warp, etc.) with implementations, screenshots, and professional scoring. Published on Turing Post.
Language: TypeScript - Size: 2.55 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language: Python - Size: 70.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4,060 - Forks: 320

rdnfn/feedback-forensics
Feedback Forensics: An open-source toolkit to measure AI personality
Language: Python - Size: 16.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 20 - Forks: 3

langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Language: TypeScript - Size: 22.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 13,048 - Forks: 1,189

joevincentgaltie/PropensityScoreMatching
📈 Tools for performing propensity score matching and visualizing the results. A method for ex-post evaluation of public policies.
Language: Jupyter Notebook - Size: 430 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

acconeer/acconeer-python-exploration
Acconeer Exploration Tool
Language: Python - Size: 81.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 192 - Forks: 67

plurai-ai/intellagent
A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions
Language: Python - Size: 14.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,071 - Forks: 133

zouharvi/subset2evaluate
Find informative examples for efficient (human) evaluation of NLG models.
Language: Python - Size: 6.77 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11 - Forks: 3

Wscats/compile-hero
🔰 Visual Studio Code extension for compiling languages
Language: TypeScript - Size: 35.7 MB - Last synced at: about 10 hours ago - Pushed at: over 1 year ago - Stars: 263 - Forks: 59

deepsense-ai/ragbits
Building blocks for rapid development of GenAI applications
Language: Python - Size: 10.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1,396 - Forks: 101

swallow-llm/leaderboard
Swallow LLM Leaderboard
Language: HTML - Size: 15.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

Abhisang3/xVerify
xVerify: Efficient Answer Verifier for Large Language Model Evaluations
Language: Python - Size: 806 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 1

time-series-machine-learning/tsml-eval
Evaluation tools for time series machine learning algorithms.
Language: Python - Size: 24.4 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 52 - Forks: 17

InfernoWasTaken2/kendin-denetle
Automated testing tool designed for developers to quickly and efficiently test their own code. Simplifies the process of checking for bugs and errors in software projects.
Size: 1000 Bytes - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

HowieHwong/TrustLLM
[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
Language: Python - Size: 10.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 574 - Forks: 54

Knetic/govaluate 📦
Arbitrary expression evaluation for golang
Language: Go - Size: 292 KB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 3,883 - Forks: 513

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 223 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 7,305 - Forks: 584

RichardObi/frd-score
Official implementation of the Fréchet Radiomic Distance
Language: Python - Size: 2.13 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5 - Forks: 1

stack-rs/mitosis
Mitosis: A Unified Transport Evaluation Framework
Language: Rust - Size: 1.62 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

Ryota-Kawamura/LangChain-for-LLM-Application-Development
In LangChain for LLM Application Development, you will gain essential skills in expanding the use cases and capabilities of language models in application development using the LangChain framework.
Language: Jupyter Notebook - Size: 274 KB - Last synced at: about 10 hours ago - Pushed at: about 2 years ago - Stars: 200 - Forks: 152

UnbrokenCocoon/BERTopic_Stability
This project is a practical, beginner-friendly guide to building stable and empirically justifiable BERTopic models
Language: Jupyter Notebook - Size: 8.04 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

1517005260/graph-rag-agent
拼好RAG: integrates GraphRAG, LightRAG, and Neo4j-llm-graph-builder for knowledge graph construction and search; combines DeepSearch for private-domain RAG reasoning; includes a custom evaluation framework for GraphRAG.
Language: Python - Size: 67.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 692 - Forks: 96

CUC-ZIHANG-LIU/AudioEvaluation
Reproductions of multiple audio evaluation methods
Language: Python - Size: 135 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

rungalileo/agent-leaderboard
Ranking LLMs on agentic tasks
Language: Jupyter Notebook - Size: 11 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 144 - Forks: 13

Schuch666/eva3dm
A package to evaluate 3d weather and air quality models
Language: R - Size: 10.7 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5 - Forks: 0

ItzAntonis2012/evalpatch
Evaluation Patch by ItzAntonis2012 | Check the evalpatch wiki regularly for upcoming releases
Language: Batchfile - Size: 1.11 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 11 - Forks: 0

sebastiansauer/hans
Evaluation of the Matomo data in the context of the BMBF project "HaNS"
Language: HTML - Size: 57.3 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Cloud-CV/EvalAI
:cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of the art in AI
Language: Python - Size: 63.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,870 - Forks: 868

Helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Language: TypeScript - Size: 500 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,995 - Forks: 396

bltlab/seqscore
SeqScore: Scoring for named entity recognition and other sequence labeling tasks
Language: Python - Size: 295 KB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 23 - Forks: 5

IBM/unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
Language: Python - Size: 96.4 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 199 - Forks: 58

open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Language: Python - Size: 6.2 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 5,556 - Forks: 604

carpentries-incubator/machine-learning-novice-python
Introduction to Machine Learning with Python
Language: Python - Size: 7.05 MB - Last synced at: 1 day ago - Pushed at: about 2 months ago - Stars: 5 - Forks: 15

oumi-ai/oumi
Easily fine-tune, evaluate and deploy Qwen3, DeepSeek-R1, Llama 4 or any open source LLM / VLM!
Language: Python - Size: 9.59 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8,200 - Forks: 613

etiams/optiscope
A Lévy-optimal lambda calculus reducer with a backdoor to C
Language: C - Size: 13.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 15 - Forks: 0

Xnhyacinth/Awesome-LLM-Long-Context-Modeling
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Size: 1.59 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,539 - Forks: 59

teilomillet/kushim
eval creator
Language: Python - Size: 505 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

modelscope/evalscope
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
Language: Python - Size: 59.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,192 - Forks: 131

ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Language: Python - Size: 29.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 490 - Forks: 56

fiddler-labs/fiddler-auditor
Fiddler Auditor is a tool to evaluate language models.
Language: Python - Size: 1.73 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 183 - Forks: 21

legml-ai/les-audits-affaires-eval-harness
Ultra-lightweight Python CLI for benchmark-testing French LLMs on business law: actions, deadlines, documents, costs, risks.
Language: Python - Size: 95.7 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

archersleeping72/CryptoFormalEval
We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.
Size: 2.93 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

k4black/codebleu
Pip compatible CodeBLEU metric implementation available for linux/macos/win
Language: Python - Size: 1.27 MB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 95 - Forks: 20

yassinelahdiy/page-language-model
Open-source framework for defining Page Language Models (PLMs) for intelligent app understanding and AI-assisted testing.
Language: Python - Size: 26.4 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

sepandhaghighi/pycm
Multi-class confusion matrix library in Python
Language: Python - Size: 11.5 MB - Last synced at: 2 days ago - Pushed at: 17 days ago - Stars: 1,478 - Forks: 125

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 3 days ago - Pushed at: 8 months ago - Stars: 243 - Forks: 41

huggingface/evaluation-guidebook
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
Language: Jupyter Notebook - Size: 1.01 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 1,435 - Forks: 85

tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Language: Jupyter Notebook - Size: 302 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 1,774 - Forks: 279

uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
Language: Python - Size: 36.9 MB - Last synced at: 8 days ago - Pushed at: 10 months ago - Stars: 2,282 - Forks: 199

onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to probe the technical boundaries of generative AI.
Size: 12.6 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 542 - Forks: 45

AmenRa/ranx
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
Language: Python - Size: 34.6 MB - Last synced at: 6 days ago - Pushed at: 15 days ago - Stars: 565 - Forks: 28

jianzfb/antgo
Machine Learning Experiment Management Platform
Language: Python - Size: 17.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 314 - Forks: 8

X-PLUG/CValues
Research on evaluating and aligning the values of Chinese large language models
Language: Python - Size: 4.2 MB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 524 - Forks: 20

EthicalML/xai
XAI - An eXplainability toolbox for machine learning
Language: Python - Size: 17.8 MB - Last synced at: 6 days ago - Pushed at: over 3 years ago - Stars: 1,179 - Forks: 179

FairRecKit/FairRecKitApp Fork of TheMinefreak23/fair-rec-kit-app
Web application for recommender system analysis. Designed to use the FairRecKitLib
Language: Vue - Size: 244 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

FairRecKit/FairRecKitLib Fork of TheMinefreak23/fairreckitlib
Library package designed for the FairRecKitApp
Language: Python - Size: 15.3 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

arthur-ai/arthur-engine
Make AI work for everyone - monitoring and governance for your AI/ML
Language: Python - Size: 20.6 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 39 - Forks: 4

lunary-ai/lunary
The production toolkit for LLMs. Observability, prompt management and evaluations.
Language: TypeScript - Size: 6.51 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 1,344 - Forks: 159

huggingface/evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Language: Python - Size: 2.03 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 2,241 - Forks: 281

NOAA-OWP/inundation-mapping
Flood inundation mapping and evaluation software configured to work with U.S. National Water Model.
Language: Python - Size: 28.2 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 113 - Forks: 33

corentin-ryr/MultiMedEval
A Python tool to evaluate the performance of VLMs in the medical domain.
Language: Python - Size: 10.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 70 - Forks: 4

SAILResearch/awesome-foundation-model-leaderboards
A curated list of awesome leaderboard-oriented resources for foundation models
Size: 1.01 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 271 - Forks: 36

nsourlos/LLM_evaluation_framework
Evaluate performance of LLM models for Q&A in any domain
Language: Jupyter Notebook - Size: 39.7 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

The-Strategy-Unit/1199_core20plus5
Quantitative analysis for the Core20PLUS5 evaluation. Initially an exploratory analysis of hypertension metrics from the CVDPREVENT audit site, informing whether a subsequent impact analysis is undertaken.
Language: R - Size: 5.18 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

kolenaIO/kolena
Python client for Kolena's machine learning testing platform
Language: Python - Size: 75.4 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 47 - Forks: 5

openheartmind/cOHMunity
We're building an open scientific system for estimating the relative value of community contributions through participatory feedback, enabling automated rewards for value created and coordination between aligned communities.
Size: 1.11 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 11 - Forks: 0

narutatsuri/Unbiased-Perspective-Summarization
[ACL 2025] Reranking-based Generation for Unbiased Perspective Summarization
Language: Python - Size: 49.8 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

jamesmcroft/document-data-extraction-prompt-flow-evaluation
This sample demonstrates how to use GPT-4o with Vision to extract structured JSON data from PDF documents and evaluate them with Azure AI Studio and Prompt Flow
Language: Bicep - Size: 1.17 MB - Last synced at: 3 days ago - Pushed at: 10 months ago - Stars: 4 - Forks: 3

rungalileo/sdk-examples
Examples on how to get started with the Galileo SDKs for AI Evaluation and Observability (both in Python and Typescript)
Language: Python - Size: 3.75 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 7 - Forks: 1

r-lib/evaluate
A version of eval for R that returns more information about what happened
Language: R - Size: 2.77 MB - Last synced at: 5 days ago - Pushed at: 11 days ago - Stars: 138 - Forks: 36

microsoft/genaiops-azureaisdk-template
A jumpstart template for implementing GenAIOps with Azure AI Foundry with ease
Language: Python - Size: 458 KB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 23 - Forks: 30
