An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: evaluation-framework

SJTUHaiyangYu/BackdoorMBTI

BackdoorMBTI is an open-source project that extends unimodal backdoor learning to a multimodal context. We hope BackdoorMBTI can facilitate the analysis and development of backdoor defense methods in multimodal settings.

Language: Python - Size: 5.55 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 20 - Forks: 1

EleutherAI/lm-evaluation-harness

A framework for few-shot evaluation of language models.

Language: Python - Size: 30.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 9,385 - Forks: 2,492
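
For a sense of usage, here is a minimal sketch of the harness's documented Python entry point, `lm_eval.simple_evaluate`; the checkpoint and task names below are illustrative placeholders, and the same options are exposed by the `lm_eval` CLI.

```python
# Minimal sketch of a few-shot evaluation run with lm-evaluation-harness.
# Assumes `pip install lm-eval`; the checkpoint and task below are
# illustrative placeholders, not recommendations.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM checkpoint
    tasks=["hellaswag"],                             # any registered task name
    num_fewshot=5,                                   # few-shot examples per prompt
)
print(results["results"])  # per-task metrics, e.g. accuracy
```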

kolenaIO/kolena

Python client for Kolena's machine learning testing platform

Language: Python - Size: 75.4 MB - Last synced at: about 12 hours ago - Pushed at: 6 days ago - Stars: 46 - Forks: 5

microsoft/eureka-ml-insights

A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

Language: Python - Size: 20.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 159 - Forks: 28

huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Language: Python - Size: 5.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1,655 - Forks: 295

kaiko-ai/eva

Evaluation framework for oncology foundation models (FMs)

Language: Python - Size: 15.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 106 - Forks: 15

confident-ai/deepeval

The LLM Evaluation Framework

Language: Python - Size: 85 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8,451 - Forks: 733
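
As a usage sketch, deepeval's pytest-style flow looks roughly like this; the texts and threshold are illustrative, and a configured judge-model API key (e.g. OPENAI_API_KEY) is assumed.

```python
# Rough sketch of a deepeval unit test (run with `deepeval test run test_app.py`).
# Assumes `pip install deepeval` and a configured judge-model API key;
# the input/output strings and threshold are illustrative only.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # Fails the test if the LLM-judged relevancy score drops below 0.7.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```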

sean-zw/SynthECG

This repository hosts advanced models for generating ECG signals using deep learning techniques. Contributions are welcome, so feel free to fork and submit your improvements! 🐙💻

Language: Python - Size: 11.7 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

EuroEval/EuroEval

The robust European language model benchmark.

Language: Python - Size: 89.3 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 106 - Forks: 26

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Language: TypeScript - Size: 223 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 7,305 - Forks: 584

stack-rs/mitosis

Mitosis: A Unified Transport Evaluation Framework

Language: Rust - Size: 1.62 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3 - Forks: 0

aidos-lab/rings

Relevant Information in Node features and Graph Structure

Size: 99.6 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

X-PLUG/WritingBench

WritingBench: A Comprehensive Benchmark for Generative Writing

Language: Python - Size: 33.8 MB - Last synced at: 4 days ago - Pushed at: 13 days ago - Stars: 89 - Forks: 10

pyrddlgym-project/pyRDDLGym

A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.

Language: Python - Size: 66.7 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 83 - Forks: 22

lapix-ufsc/lapixdl

Python package with Deep Learning utilities for Computer Vision

Language: Python - Size: 324 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 9 - Forks: 3

ServiceNow/AgentLab

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

Language: Python - Size: 3.19 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 349 - Forks: 71

teilomillet/kushim

eval creator

Language: Python - Size: 505 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

pedrodevog/SynthECG

The first systematic evaluation framework for synthetic 10-second 12-lead ECGs from diagnostic class-conditioned generative models

Language: Python - Size: 160 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

Language: Python - Size: 9.37 MB - Last synced at: 4 days ago - Pushed at: 8 months ago - Stars: 243 - Forks: 41

empirical-run/empirical

Test and evaluate LLMs and model configurations across all the scenarios that matter for your application

Language: TypeScript - Size: 1.58 MB - Last synced at: 4 days ago - Pushed at: 10 months ago - Stars: 158 - Forks: 12

eduardogr/evalytics

HR tool to orchestrate a company's employee performance review cycle.

Language: Python - Size: 803 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 11 - Forks: 2

alibaba-damo-academy/MedEvalKit

MedEvalKit: A Unified Medical Evaluation Framework

Language: Python - Size: 1.74 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 40 - Forks: 2

zli12321/long_form_rl

GRPO to train long-form QA and instruction following with a long-form reward model

Language: Python - Size: 22.8 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 3 - Forks: 0

lstrgar/SEG

Evaluation of label-free cell and nuclei segmentation

Language: Jupyter Notebook - Size: 19.1 MB - Last synced at: 2 days ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 3

nlp-uoregon/mlmm-evaluation

Multilingual Large Language Models Evaluation Benchmark

Language: Python - Size: 3.13 MB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 124 - Forks: 18

naomibaes/LSCD_method_evaluation

Companion repository with scripts for applying LSC-Eval, a three-stage evaluation framework, to: (1) create theory-driven, LLM-generated synthetic suites for LSC dimensions; (2) program experimental settings for comparative method evaluation on a synthetic change detection task; and (3) choose the most suitable method for the dimension and domain of interest.

Language: Jupyter Notebook - Size: 582 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

naomibaes/Synthetic-LSC_pipeline

Synthetic datasets to evaluate key dimensions of LSC (Sentiment, Intensity, Breadth), generated using LLMs and WordNet from the LSC-Eval framework.

Language: Jupyter Notebook - Size: 31.5 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

maximhq/maxim-cookbooks

Maxim is an end-to-end AI evaluation and observability platform that empowers modern AI teams to ship agents with quality, reliability, and speed.

Language: Jupyter Notebook - Size: 123 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 6

SYED-M-HUSSAIN/RECAP

RECAP (Review Engine for Critiquing and Advising Pitches) is an LLM-powered agentic system designed to help founders and entrepreneurs receive actionable, multi-perspective, and structured feedback on their startup pitch presentations

Language: Python - Size: 43 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

wassname/open_pref_eval

Hackable, simple LLM evals on preference datasets

Language: Python - Size: 15.8 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

HPAI-BSC/TuRTLe

A Unified Evaluation of LLMs for RTL Generation

Language: Python - Size: 409 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 15 - Forks: 1

NOAA-OWP/gval

A high-level Python framework to evaluate the skill of geospatial datasets by comparing candidates to benchmark maps, producing agreement maps and metrics.

Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 25 - Forks: 2

IAAR-Shanghai/GuessArena

[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

Language: Python - Size: 606 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 5 - Forks: 0

logic-star-ai/swt-bench

[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test generation

Language: Python - Size: 4.37 MB - Last synced at: 5 days ago - Pushed at: 24 days ago - Stars: 50 - Forks: 6

vcerqueira/modelradar

Aspect-based Forecasting Accuracy

Language: Jupyter Notebook - Size: 9.55 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 4 - Forks: 2

yukinagae/genkitx-promptfoo

Community Plugin for Genkit to use Promptfoo

Language: TypeScript - Size: 553 KB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

TonicAI/tonic_validate

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

Language: Python - Size: 6.2 MB - Last synced at: 22 days ago - Pushed at: about 1 month ago - Stars: 302 - Forks: 30

aiverify-foundation/moonshot

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

Language: Python - Size: 225 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 243 - Forks: 52

rosinality/halite

Acceleration framework for Human Alignment Learning

Language: Python - Size: 527 KB - Last synced at: 26 days ago - Pushed at: 27 days ago - Stars: 6 - Forks: 1

GoogleCloudPlatform/evalbench

EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database specific tasks.

Language: Python - Size: 1.41 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 3

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

Language: Python - Size: 1.92 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 497 - Forks: 35

alok-abhishek/BEATS_Dataset

BEATS (Bias Evaluation and Assessment Test Suite) is a research project focused on systematic analysis and empirical investigation of fairness and bias in GenAI models, aiming to develop an integrated framework for data governance in GenAI systems.

Language: Jupyter Notebook - Size: 5.03 MB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

Q-Aware-Labs/model-arena

Simple LLM Response Evaluation tool

Language: JavaScript - Size: 623 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 0 - Forks: 0

fmp453/erase-eval

Erasing with Precision: Evaluating Specific Concept Erasure from Text-to-Image Generative Models

Language: Python - Size: 1.18 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 3 - Forks: 0

VLMHyperBenchTeam/VLMHyperBench

VLMHyperBench is an open-source framework for evaluating the ability of vision-language models (VLMs) to recognize documents in Russian, in order to assess their potential for document workflow automation.

Language: Python - Size: 3.56 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 60 - Forks: 0

LIAAD/tieval

An Evaluation Framework for Temporal Information Extraction Systems

Language: Python - Size: 1.05 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 19 - Forks: 1

tohtsky/irspack

Train, evaluate, and optimize implicit feedback-based recommender systems.

Language: Python - Size: 733 KB - Last synced at: 26 days ago - Pushed at: about 2 years ago - Stars: 31 - Forks: 9

diningphil/PyDGN

A research library for automating experiments on Deep Graph Networks

Language: Python - Size: 10.7 MB - Last synced at: 4 days ago - Pushed at: 10 months ago - Stars: 223 - Forks: 13

kse-ElEvEn/MATEval

MATEval is the first multi-agent framework simulating human collaborative discussion for open-ended text evaluation.

Language: Python - Size: 9.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 27 - Forks: 2

cedrickchee/vibe-jet

A browser-based 3D multiplayer flying game with arcade-style mechanics, created using the Gemini 2.5 Pro through a technique called "vibe coding"

Language: HTML - Size: 8.65 MB - Last synced at: 26 days ago - Pushed at: 3 months ago - Stars: 51 - Forks: 9

HKUSTDial/NL2SQL360

🔥[VLDB'24] Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?”

Language: Python - Size: 8.61 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 118 - Forks: 10

laurawpaaby/EduChatEval

A structured pipeline and Python package for deploying and evaluating interactive LLM tutor systems in educational settings.

Language: Jupyter Notebook - Size: 5.44 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

BMW-InnovationLab/SORDI-AI-Evaluation-GUI

This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.

Language: Python - Size: 41.5 MB - Last synced at: 22 days ago - Pushed at: over 1 year ago - Stars: 76 - Forks: 5

act3-ace/CoRL

The Core Reinforcement Learning library is intended to enable scalable deep reinforcement learning experimentation in a manner extensible to new simulations and new ways for the learning agents to interact with them. The hope is that this makes RL research easier by removing lock-in to particular simulations. The work is released under the following APRS approval: Initial release of CoRL - Part #1 - Approved on 2022-05-2024 12:08:51 - PA Approval # AFRL-2022-2455. Documentation: https://act3-ace.github.io/CoRL/

Language: Python - Size: 3.41 MB - Last synced at: 11 days ago - Pushed at: about 1 year ago - Stars: 36 - Forks: 5

AstraBert/diRAGnosis

Diagnose the performance of your RAG🩺

Language: Python - Size: 214 KB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 36 - Forks: 3

nhsengland/evalsense

Tools for systematic large language model evaluations

Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

athina-ai/athina-evals

Python SDK for running evaluations on LLM generated responses

Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

bijington/expressive

Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.

Language: C# - Size: 3.74 MB - Last synced at: about 2 hours ago - Pushed at: 9 months ago - Stars: 171 - Forks: 27

rookie-littleblack/XpertEval

XpertEval: All-in-One Evaluation Framework for Multimodal Large Models

Language: Python - Size: 237 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

codefuse-ai/codefuse-evaluation

Industrial-level evaluation benchmarks for coding LLMs across the full life cycle of AI-native software development. (Enterprise-grade evaluation system for code LLMs, continuously being opened up.)

Language: Python - Size: 35.1 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 96 - Forks: 14

xmed-lab/UniEval

UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

Language: Python - Size: 26 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 1

symflower/eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Language: Go - Size: 18.8 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 171 - Forks: 8

powerflows/powerflows-dmn

Power Flows DMN - Powerful decisions and rules engine

Language: Java - Size: 545 KB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 52 - Forks: 6

cowjen01/repsys

Framework for Interactive Evaluation of Recommender Systems

Language: JavaScript - Size: 12.5 MB - Last synced at: 8 days ago - Pushed at: almost 2 years ago - Stars: 35 - Forks: 5

vectara/mirage-bench

Repository for multilingual generation, RAG evaluations, and surrogate judge training for the Arena RAG leaderboard (NAACL'25)

Language: Python - Size: 2.8 MB - Last synced at: 11 days ago - Pushed at: 3 months ago - Stars: 9 - Forks: 0

eth-lre/mathtutorbench

Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors

Language: Python - Size: 5.02 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 11 - Forks: 1

davidheineman/thresh

🌾 Universal, customizable and deployable fine-grained evaluation for text generation.

Language: Vue - Size: 87.5 MB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 23 - Forks: 4

neelabhsinha/lm-application-eval-kit

Implementation and analysis toolkit for language models across different task types, domains, and reasoning types using multiple prompt styles.

Language: Python - Size: 1.31 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

yupidevs/pactus

Framework to evaluate Trajectory Classification Algorithms

Language: Python - Size: 1.17 MB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 45 - Forks: 0

ics-unisg/aqudem

Activity and Sequence Detection Evaluation Metrics: A package to evaluate activity detection results, including the sequence of events given multiple activity types.

Language: Python - Size: 94.7 KB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 1

tsenst/CrowdFlow

Optical Flow Dataset and Benchmark for Visual Crowd Analysis

Language: Python - Size: 1.94 MB - Last synced at: about 2 months ago - Pushed at: almost 2 years ago - Stars: 115 - Forks: 22

ad-freiburg/elevant

Entity linking evaluation and analysis tool

Language: Python - Size: 142 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 23 - Forks: 1

tensorstax/agenttrace

AgentTrace is a lightweight observability library to trace and evaluate agentic systems.

Language: TypeScript - Size: 14.4 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 37 - Forks: 0

AstraBert/SenTrEv

Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs

Language: Python - Size: 2.52 MB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 26 - Forks: 1

RecSysEvaluation/RecSys_Evaluation

Revisiting the Performance of GNN Models for Session-based Recommendation

Language: Python - Size: 318 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

MinhVuong2000/LLMReasonCert

[ACL'24] Official implementation of the paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs" (https://aclanthology.org/2024.findings-acl.168)

Language: Python - Size: 671 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 42 - Forks: 9

mhamzaerol/Cost-of-Pass

Cost-of-Pass: An Economic Framework for Evaluating Language Models

Language: Python - Size: 939 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 3 - Forks: 0

monetjoe/ccmusic_eval

CCMusic, an open Chinese music database, integrates diverse datasets. It ensures data consistency via cleaning, label refinement, and structure unification. A unified evaluation framework is used for benchmark evaluations, supporting classification and detection tasks.

Language: Python - Size: 2.78 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 19 - Forks: 0

Eustema-S-p-A/SCARF

SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular evaluation framework for benchmarking deployed Retrieval Augmented Generation (RAG) applications. It offers end-to-end, black-box assessment across multiple configurations and supports automated testing with several vector databases and LLMs.

Language: Python - Size: 592 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0

match-cow/PoseTestBot

Robot-assisted evaluation of the accuracy of the 6D pose estimation for novel objects.

Language: Python - Size: 16.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

sb-ai-lab/Sim4Rec

Simulator for training and evaluation of Recommender Systems

Language: Jupyter Notebook - Size: 9.84 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 55 - Forks: 3

ash-hun/BERGEN-UP

An end-to-end evaluation pipeline for RAG only; a benchmark building on BERGEN from NAVER Labs (a.k.a. BERGEN UP✨)

Language: Python - Size: 17.8 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

jogihood/rrg-metric

A Python package for evaluating radiology report generation using multiple standard and medical-specific metrics.

Language: Python - Size: 79.1 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 1

AI21Labs/lm-evaluation

Evaluation suite for large-scale language models.

Language: Python - Size: 19.5 KB - Last synced at: 2 months ago - Pushed at: almost 4 years ago - Stars: 125 - Forks: 16

aigc-apps/PertEval

[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations

Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 27 days ago - Pushed at: 8 months ago - Stars: 12 - Forks: 2

pentoai/vectory

Vectory provides a collection of tools to track and compare embedding versions.

Language: Python - Size: 1.92 MB - Last synced at: 14 days ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 0

SAP-samples/llm-round-trip-correctness

This repo provides code for evaluating LLM round-trip correctness on text-to-process-model conversion and vice versa

Language: Jupyter Notebook - Size: 4.45 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 1

lartpang/PySODEvalToolkit

PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection

Language: Python - Size: 309 KB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 176 - Forks: 21

jinzhuoran/RWKU

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024

Language: Python - Size: 3.82 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 72 - Forks: 7

MaastrichtU-IDS/fair-enough-metrics

☑️ API to publish FAIR metrics tests written in python

Language: Python - Size: 168 KB - Last synced at: 16 days ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 3

Kaos599/BetterRAG

BetterRAG: Powerful RAG evaluation toolkit for LLMs. Measure, analyze, and optimize how your AI processes text chunks with precision metrics. Perfect for RAG systems, document processing, and embedding quality assessment.

Language: Python - Size: 104 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

adithya-s-k/indic_eval

A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks

Language: Python - Size: 555 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 33 - Forks: 7

GAIR-NLP/scaleeval

Scalable Meta-Evaluation of LLMs as Evaluators

Language: Python - Size: 3.89 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 42 - Forks: 3

ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge

An automation framework using LLM-as-a-judge to scale evaluation of GenAI solutions (RAG, multi-turn, query rewrite, Text2SQL, etc.) that serves as a good proxy for human judgment.

Language: Python - Size: 7.2 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 25 - Forks: 4

MaurizioFD/RecSys2019_DeepLearning_Evaluation

This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.

Language: Python - Size: 216 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 987 - Forks: 249

OPTML-Group/Diffusion-MU-Attack

The official implementation of the ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces a fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.

Language: Python - Size: 11.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 71 - Forks: 3

attackbench/attackbench.github.io

The AttackBench framework fairly compares gradient-based attacks against machine learning models, with the goal of finding the most reliable attack for assessing a model's robustness.

Language: HTML - Size: 25.6 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

yukinagae/promptfoo-sample

A sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models

Size: 334 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

jplane/llm-function-call-eval

Demonstrates a workflow for LLM function calling evaluation. Uses GitHub Copilot to generate synthetic eval data and Azure AI Foundry for handling results.

Language: Jupyter Notebook - Size: 524 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

aryan-jadon/Evaluation-Metrics-for-Recommendation-Systems

This repository contains implementations of evaluation metrics for recommendation systems. We compare the performance of similarity, candidate generation, rating, and ranking metrics on five datasets: MovieLens 100K, MovieLens 1M, MovieLens 10M, Amazon Electronics, and Amazon Movies and TV.

Language: Python - Size: 22.7 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 16 - Forks: 11

Related Keywords
evaluation-framework (190), evaluation (42), evaluation-metrics (34), llm (28), python (20), machine-learning (19), large-language-models (16), llm-evaluation (14), deep-learning (12), benchmark (12), llmops (12), llms (10), nlp (10), rag (9), llm-evaluation-framework (8), ai (8), testing (7), dataset (7), language-model (7), research (6), recommender-system (6), llm-eval (6), synthetic-data (6), benchmarking (5), metrics (5), visualization (5), retrieval-augmented-generation (5), eval (5), information-retrieval (5), natural-language-processing (5), pytorch (5), synthetic-dataset-generation (5), generative-ai (5), artificial-intelligence (4), time-series (4), prompt-engineering (4), prompt-testing (4), genai (4), computer-vision (4), prompts (4), deep-neural-networks (3), diffusion-models (3), openai (3), unlearning (3), earth-science (3), adversarial-attacks (3), agents (3), agent (3), trustworthy-ai (3), benchmark-framework (3), reinforcement-learning (3), transformers (3), docker (3), matrix-factorization (3), promptfoo (3), llm-inference (3), text2sql (3), reproducible-research (3), performance-evaluation (2), environment (2), recommender-systems (2), security (2), text-to-sql (2), medical-imaging (2), generative-model (2), forecasting (2), simulation (2), bias-detection (2), sentence-transformers (2), reinforcement-learning-algorithms (2), evaluator (2), cell-segmentation (2), image-processing (2), observability (2), evaluations (2), analysis-framework (2), web-application (2), prompting (2), recommendation-system (2), transformer (2), multimodal (2), chatbot (2), ant (2), rl-training (2), gan (2), nl2sql (2), large-language-model (2), llm-evaluation-metrics (2), qdrant (2), retrieval (2), vector-database (2), embeddings-extraction (2), science (2), llm-evaluation-toolkit (2), spatial-analysis (2), nlp-machine-learning (2), ecg (2), rag-evaluation (2), spatial-temporal (2), bpr (2)