GitHub topics: evaluation
DocAILab/XRAG
XRAG: eXamining the Core - Benchmarking Foundational Component Modules in Advanced Retrieval-Augmented Generation
Language: Python - Size: 11.8 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 100 - Forks: 17

tenemos/langwatch
The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨
Language: TypeScript - Size: 17.9 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 1 - Forks: 0

Kakz/prometheus-llm
PrometheusLLM is a unique transformer architecture inspired by dignity and recursion. This project aims to explore new frontiers in AI research and welcomes contributions from the community. 🐙🌟
Language: Python - Size: 257 KB - Last synced at: about 9 hours ago - Pushed at: about 10 hours ago - Stars: 0 - Forks: 0

EXP-Tools/steam-discount
Steam discounted-games leaderboard (auto-refreshing)
Language: Python - Size: 10.6 GB - Last synced at: about 16 hours ago - Pushed at: about 17 hours ago - Stars: 64 - Forks: 33

SatvikPraveen/Optimal-Demo-Selection-ICL
Implements and benchmarks optimal demonstration selection strategies for In-Context Learning (ICL) using LLMs. Covers IDS, RDES, Influence-based Selection, Se², and TopK+ConE across reasoning and classification tasks, analyzing the impact of example relevance, diversity, and ordering on model performance across multiple architectures.
Language: Jupyter Notebook - Size: 1.46 MB - Last synced at: about 18 hours ago - Pushed at: about 19 hours ago - Stars: 1 - Forks: 0

bibanyok89/karate-ui-test
This repository contains automated tests using Karate Test Automation. The tests include scenarios for validating different pages and actions on the NASA Hubble Mission website and image download functionality.
Language: Gherkin - Size: 12.7 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

yHenrique2/june-2025-coding-agent-report
Explore the June 2025 Coding Agent Report for insights on top coding agents, their performance, and implementation examples. Discover the best tools for developers! 🐙💻
Language: TypeScript - Size: 2.54 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

dustalov/evalica
Evalica, your favourite evaluation toolkit
Language: Python - Size: 599 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 38 - Forks: 5

langwatch/langwatch
The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and Prompt Optimization ✨
Language: TypeScript - Size: 28.1 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2,120 - Forks: 184

langchain-ai/langsmith-sdk
LangSmith Client SDK Implementations
Language: Python - Size: 11.2 MB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 581 - Forks: 124

CCAFS/MARLO
Managing Agricultural Research for Learning and Outcomes
Language: Java - Size: 242 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 6 - Forks: 8

root-signals/rs-sdk
Root Signals SDK
Language: Python - Size: 1.84 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 12 - Forks: 1

EvolvingLMMs-Lab/lmms-eval
One for All Modalities Evaluation Toolkit - including text, image, video, audio tasks.
Language: Python - Size: 7.88 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2,680 - Forks: 318

langchain-ai/langsmith-docs
Documentation for LangSmith
Language: JavaScript - Size: 709 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 131 - Forks: 68

Kiln-AI/Kiln
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
Language: Python - Size: 20.3 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,838 - Forks: 270

natanim-haile/prompt-optimizer
Prompt Optimizer enhances your prompts for better results in AI applications. Join the community on GitHub and contribute to improving prompt efficiency! 🌟🌐
Language: TypeScript - Size: 3.48 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Rampati/MIT-6.S191-Lab3
In this lab, you will fine-tune a multi-billion parameter language model to generate specific style responses. You'll work with tokenization strategies, prompt templates, and a complete fine-tuning workflow to enhance LLM outputs. 🐙✨
Language: Jupyter Notebook - Size: 43 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

prometheus-eval/prometheus-eval
Evaluate your LLM's response with Prometheus and GPT4 💯
Language: Python - Size: 15.1 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 953 - Forks: 55

microsoft/promptbench
A unified evaluation framework for large language models
Language: Python - Size: 5.56 MB - Last synced at: 2 days ago - Pushed at: 29 days ago - Stars: 2,650 - Forks: 203

obss/jury
Comprehensive NLP Evaluation System
Language: Python - Size: 291 KB - Last synced at: 2 days ago - Pushed at: 11 months ago - Stars: 186 - Forks: 19

lmnr-ai/lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Language: TypeScript - Size: 33.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,098 - Forks: 125

gereon-t/trajectopy
Trajectopy - Trajectory Evaluation in Python
Language: Python - Size: 27.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 82 - Forks: 4

microsoft/genaiops-promptflow-template
GenAIOps with Prompt Flow is a "GenAIOps template and guidance" to help you build LLM-infused apps using Prompt Flow. It offers features including centralized code hosting, lifecycle management, variant and hyperparameter experimentation, A/B deployment, and reporting across all runs and experiments.
Language: Python - Size: 6.78 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 329 - Forks: 264

huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Language: Python - Size: 5.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,655 - Forks: 295

stack-rs/rattan
Rattan is a fast and extensible Internet path emulator framework
Language: Rust - Size: 2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6 - Forks: 4

explodinggradients/ragas
Supercharge Your LLM Application Evaluations 🚀
Language: Python - Size: 41 MB - Last synced at: 3 days ago - Pushed at: 5 days ago - Stars: 9,663 - Forks: 956

dnotitia/qllm-infer
A modular framework for evaluating quantization algorithms with reproducible and consistent benchmarks.
Language: Python - Size: 64.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 1

OpenBMB/UltraEval-Audio
An easy-to-use, fast, and easily integrable tool for evaluating audio LLMs
Language: Python - Size: 13.9 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 115 - Forks: 3

open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Language: Python - Size: 7.86 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,588 - Forks: 417

The-Focus-AI/june-2025-coding-agent-report
Comprehensive evaluation of 15 AI coding agents (Cursor, Copilot, Claude, Replit, v0, Warp, etc.) with implementations, screenshots, and professional scoring. Published on Turing Post.
Language: TypeScript - Size: 2.55 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language: Python - Size: 70.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4,060 - Forks: 320

rdnfn/feedback-forensics
Feedback Forensics: An open-source toolkit to measure AI personality
Language: Python - Size: 16.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 20 - Forks: 3

langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Language: TypeScript - Size: 22.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 13,048 - Forks: 1,189

joevincentgaltie/PropensityScoreMatching
📈 Tools for performing propensity score matching and visualizing the results. A method for ex-post evaluation of public policies.
Language: Jupyter Notebook - Size: 430 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

acconeer/acconeer-python-exploration
Acconeer Exploration Tool
Language: Python - Size: 81.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 192 - Forks: 67

plurai-ai/intellagent
A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions
Language: Python - Size: 14.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,071 - Forks: 133

zouharvi/subset2evaluate
Find informative examples for efficient (human) evaluation of NLG models.
Language: Python - Size: 6.77 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11 - Forks: 3

Wscats/compile-hero
🔰 Visual Studio Code extension for compiling languages
Language: TypeScript - Size: 35.7 MB - Last synced at: about 10 hours ago - Pushed at: over 1 year ago - Stars: 263 - Forks: 59

deepsense-ai/ragbits
Building blocks for rapid development of GenAI applications
Language: Python - Size: 10.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1,396 - Forks: 101

swallow-llm/leaderboard
Swallow LLM Leaderboard
Language: HTML - Size: 15.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

Abhisang3/xVerify
xVerify: Efficient Answer Verifier for Large Language Model Evaluations
Language: Python - Size: 806 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 1

time-series-machine-learning/tsml-eval
Evaluation tools for time series machine learning algorithms.
Language: Python - Size: 24.4 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 52 - Forks: 17

InfernoWasTaken2/kendin-denetle
Automated testing tool designed for developers to quickly and efficiently test their own code. Simplifies the process of checking for bugs and errors in software projects.
Size: 1000 Bytes - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

HowieHwong/TrustLLM
[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
Language: Python - Size: 10.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 574 - Forks: 54

Knetic/govaluate 📦
Arbitrary expression evaluation for golang
Language: Go - Size: 292 KB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 3,883 - Forks: 513

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 223 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 7,305 - Forks: 584

RichardObi/frd-score
Official implementation of the Fréchet Radiomic Distance
Language: Python - Size: 2.13 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5 - Forks: 1

stack-rs/mitosis
Mitosis: A Unified Transport Evaluation Framework
Language: Rust - Size: 1.62 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3 - Forks: 0

Ryota-Kawamura/LangChain-for-LLM-Application-Development
In LangChain for LLM Application Development, you will gain essential skills in expanding the use cases and capabilities of language models in application development using the LangChain framework.
Language: Jupyter Notebook - Size: 274 KB - Last synced at: about 10 hours ago - Pushed at: about 2 years ago - Stars: 200 - Forks: 152

UnbrokenCocoon/BERTopic_Stability
This project is a practical, beginner-friendly guide to building stable and empirically justifiable BERTopic models
Language: Jupyter Notebook - Size: 8.04 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

1517005260/graph-rag-agent
拼好RAG: integrates GraphRAG, LightRAG, and Neo4j-llm-graph-builder for knowledge graph construction and search; combines DeepSearch for private-domain RAG reasoning; includes a custom evaluation framework for GraphRAG.
Language: Python - Size: 67.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 692 - Forks: 96

CUC-ZIHANG-LIU/AudioEvaluation
Reproductions of multiple audio evaluation methods
Language: Python - Size: 135 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

rungalileo/agent-leaderboard
Ranking LLMs on agentic tasks
Language: Jupyter Notebook - Size: 11 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 144 - Forks: 13

Schuch666/eva3dm
A package to evaluate 3d weather and air quality models
Language: R - Size: 10.7 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5 - Forks: 0

ItzAntonis2012/evalpatch
Evaluation Patch by ItzAntonis2012 | Check the evalpatch wiki regularly for upcoming releases
Language: Batchfile - Size: 1.11 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 11 - Forks: 0

sebastiansauer/hans
Evaluation of the Matomo data in the context of the BMBF project "HaNS"
Language: HTML - Size: 57.3 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Cloud-CV/EvalAI
:cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of the art in AI
Language: Python - Size: 63.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,870 - Forks: 868

Helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Language: TypeScript - Size: 500 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,995 - Forks: 396

bltlab/seqscore
SeqScore: Scoring for named entity recognition and other sequence labeling tasks
Language: Python - Size: 295 KB - Last synced at: 2 days ago - Pushed at: 4 months ago - Stars: 23 - Forks: 5

IBM/unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
Language: Python - Size: 96.4 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 199 - Forks: 58

open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Language: Python - Size: 6.2 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 5,556 - Forks: 604

carpentries-incubator/machine-learning-novice-python
Introduction to Machine Learning with Python
Language: Python - Size: 7.05 MB - Last synced at: 1 day ago - Pushed at: about 2 months ago - Stars: 5 - Forks: 15

oumi-ai/oumi
Easily fine-tune, evaluate and deploy Qwen3, DeepSeek-R1, Llama 4 or any open source LLM / VLM!
Language: Python - Size: 9.59 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8,200 - Forks: 613

etiams/optiscope
A Lévy-optimal lambda calculus reducer with a backdoor to C
Language: C - Size: 13.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 15 - Forks: 0

Xnhyacinth/Awesome-LLM-Long-Context-Modeling
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Size: 1.59 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,539 - Forks: 59

teilomillet/kushim
eval creator
Language: Python - Size: 505 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

modelscope/evalscope
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
Language: Python - Size: 59.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,192 - Forks: 131

ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Language: Python - Size: 29.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 490 - Forks: 56

fiddler-labs/fiddler-auditor
Fiddler Auditor is a tool to evaluate language models.
Language: Python - Size: 1.73 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 183 - Forks: 21

legml-ai/les-audits-affaires-eval-harness
Ultra-lightweight Python CLI for benchmark-testing French LLMs on business law: actions, deadlines, documents, costs, risks.
Language: Python - Size: 95.7 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

archersleeping72/CryptoFormalEval
We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.
Size: 2.93 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

k4black/codebleu
Pip compatible CodeBLEU metric implementation available for linux/macos/win
Language: Python - Size: 1.27 MB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 95 - Forks: 20

yassinelahdiy/page-language-model
Open-source framework for defining Page Language Models (PLMs) for intelligent app understanding and AI-assisted testing.
Language: Python - Size: 26.4 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

sepandhaghighi/pycm
Multi-class confusion matrix library in Python
Language: Python - Size: 11.5 MB - Last synced at: 2 days ago - Pushed at: 17 days ago - Stars: 1,478 - Forks: 125

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 3 days ago - Pushed at: 8 months ago - Stars: 243 - Forks: 41

huggingface/evaluation-guidebook
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
Language: Jupyter Notebook - Size: 1.01 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 1,435 - Forks: 85

tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Language: Jupyter Notebook - Size: 302 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 1,774 - Forks: 279

uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
Language: Python - Size: 36.9 MB - Last synced at: 8 days ago - Pushed at: 10 months ago - Stars: 2,282 - Forks: 199

onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to probe the technical boundaries of generative AI.
Size: 12.6 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 542 - Forks: 45

AmenRa/ranx
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
Language: Python - Size: 34.6 MB - Last synced at: 6 days ago - Pushed at: 15 days ago - Stars: 565 - Forks: 28

jianzfb/antgo
Machine Learning Experiment Management Platform
Language: Python - Size: 17.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 314 - Forks: 8

X-PLUG/CValues
Research on evaluating and aligning the values of Chinese large language models
Language: Python - Size: 4.2 MB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 524 - Forks: 20

EthicalML/xai
XAI - An eXplainability toolbox for machine learning
Language: Python - Size: 17.8 MB - Last synced at: 6 days ago - Pushed at: over 3 years ago - Stars: 1,179 - Forks: 179

FairRecKit/FairRecKitApp Fork of TheMinefreak23/fair-rec-kit-app
Web application for recommender system analysis. Designed to use the FairRecKitLib
Language: Vue - Size: 244 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

FairRecKit/FairRecKitLib Fork of TheMinefreak23/fairreckitlib
Library package designed for the FairRecKitApp
Language: Python - Size: 15.3 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

arthur-ai/arthur-engine
Make AI work for everyone - monitoring and governance for your AI/ML
Language: Python - Size: 20.6 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 39 - Forks: 4

lunary-ai/lunary
The production toolkit for LLMs. Observability, prompt management and evaluations.
Language: TypeScript - Size: 6.51 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 1,344 - Forks: 159

huggingface/evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Language: Python - Size: 2.03 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 2,241 - Forks: 281

NOAA-OWP/inundation-mapping
Flood inundation mapping and evaluation software configured to work with U.S. National Water Model.
Language: Python - Size: 28.2 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 113 - Forks: 33

corentin-ryr/MultiMedEval
A Python tool to evaluate the performance of VLMs in the medical domain.
Language: Python - Size: 10.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 70 - Forks: 4

SAILResearch/awesome-foundation-model-leaderboards
A curated list of awesome leaderboard-oriented resources for foundation models
Size: 1.01 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 271 - Forks: 36

nsourlos/LLM_evaluation_framework
Evaluate performance of LLM models for Q&A in any domain
Language: Jupyter Notebook - Size: 39.7 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

The-Strategy-Unit/1199_core20plus5
Quantitative analysis for the Core20PLUS5 evaluation. Initially an exploratory analysis of hypertension metrics from the CVDPREVENT audit site, informing whether a subsequent impact analysis is undertaken.
Language: R - Size: 5.18 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

kolenaIO/kolena
Python client for Kolena's machine learning testing platform
Language: Python - Size: 75.4 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 47 - Forks: 5

openheartmind/cOHMunity
We're building an open scientific system for estimating the relative value of community contributions through participatory feedback, enabling automated rewards for value created and coordination between aligned communities.
Size: 1.11 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 11 - Forks: 0

narutatsuri/Unbiased-Perspective-Summarization
[ACL 2025] Reranking-based Generation for Unbiased Perspective Summarization
Language: Python - Size: 49.8 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

jamesmcroft/document-data-extraction-prompt-flow-evaluation
This sample demonstrates how to use GPT-4o with Vision to extract structured JSON data from PDF documents and evaluate them with Azure AI Studio and Prompt Flow
Language: Bicep - Size: 1.17 MB - Last synced at: 3 days ago - Pushed at: 10 months ago - Stars: 4 - Forks: 3

rungalileo/sdk-examples
Examples on how to get started with the Galileo SDKs for AI Evaluation and Observability (both in Python and Typescript)
Language: Python - Size: 3.75 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 7 - Forks: 1

r-lib/evaluate
A version of eval for R that returns more information about what happened
Language: R - Size: 2.77 MB - Last synced at: 5 days ago - Pushed at: 11 days ago - Stars: 138 - Forks: 36

microsoft/genaiops-azureaisdk-template
A jumpstart template for implementing GenAIOps with Azure AI Foundry with ease
Language: Python - Size: 458 KB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 23 - Forks: 30
