GitHub topics: evaluation-framework
SJTUHaiyangYu/BackdoorMBTI
BackdoorMBTI is an open-source project extending unimodal backdoor learning to a multimodal context. We hope BackdoorMBTI facilitates the analysis and development of backdoor defense methods in multimodal settings.
Language: Python - Size: 5.55 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 20 - Forks: 1

EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
Language: Python - Size: 30.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 9,385 - Forks: 2,492
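
A minimal sketch of how a few-shot run might look with the harness's Python API; simple_evaluate and the "hf" backend are the documented entry points, while the checkpoint (pretrained=EleutherAI/pythia-160m), task (hellaswag), and batch size below are illustrative choices rather than anything prescribed by this listing.

    # Hedged sketch, assuming lm-evaluation-harness >= 0.4 installed locally;
    # the model checkpoint and task name are example choices.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                      # Hugging Face transformers backend
        model_args="pretrained=EleutherAI/pythia-160m",  # example checkpoint
        tasks=["hellaswag"],                             # example benchmark task
        num_fewshot=5,                                   # few-shot evaluation, per the description above
        batch_size=8,
    )
    print(results["results"])                            # per-task metrics keyed by task name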

kolenaIO/kolena
Python client for Kolena's machine learning testing platform
Language: Python - Size: 75.4 MB - Last synced at: about 12 hours ago - Pushed at: 6 days ago - Stars: 46 - Forks: 5

microsoft/eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
Language: Python - Size: 20.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 159 - Forks: 28

huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Language: Python - Size: 5.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1,655 - Forks: 295

kaiko-ai/eva
Evaluation framework for oncology foundation models (FMs)
Language: Python - Size: 15.3 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 106 - Forks: 15

confident-ai/deepeval
The LLM Evaluation Framework
Language: Python - Size: 85 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8,451 - Forks: 733
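
To illustrate the kind of test case this framework evaluates, here is a minimal sketch assuming deepeval's quickstart API (LLMTestCase, AnswerRelevancyMetric, evaluate); the question and answer strings are made up, and the metric relies on an LLM judge (an OpenAI API key by default) at run time.

    # Hedged sketch, assuming deepeval's quickstart API; strings are illustrative.
    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What is the capital of France?",           # prompt given to the model under test
        actual_output="Paris is the capital of France.",  # the model answer being judged
    )
    metric = AnswerRelevancyMetric(threshold=0.7)         # pass/fail threshold on the relevancy score

    evaluate(test_cases=[test_case], metrics=[metric])    # runs the judge and reports scores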

sean-zw/SynthECG
This repository hosts deep learning models for generating ECG signals. Contributions via forks and pull requests are welcome.
Language: Python - Size: 11.7 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

EuroEval/EuroEval
The robust European language model benchmark.
Language: Python - Size: 89.3 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 106 - Forks: 26

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 223 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 7,305 - Forks: 584

stack-rs/mitosis
Mitosis: A Unified Transport Evaluation Framework
Language: Rust - Size: 1.62 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3 - Forks: 0

aidos-lab/rings
Relevant Information in Node features and Graph Structure
Size: 99.6 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

X-PLUG/WritingBench
WritingBench: A Comprehensive Benchmark for Generative Writing
Language: Python - Size: 33.8 MB - Last synced at: 4 days ago - Pushed at: 13 days ago - Stars: 89 - Forks: 10

pyrddlgym-project/pyRDDLGym
A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.
Language: Python - Size: 66.7 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 83 - Forks: 22

lapix-ufsc/lapixdl
Python package with Deep Learning utilities for Computer Vision
Language: Python - Size: 324 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 9 - Forks: 3

ServiceNow/AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
Language: Python - Size: 3.19 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 349 - Forks: 71

teilomillet/kushim
eval creator
Language: Python - Size: 505 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

pedrodevog/SynthECG
The first systematic evaluation framework for synthetic 10-second 12-lead ECGs from diagnostic class-conditioned generative models
Language: Python - Size: 160 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 4 days ago - Pushed at: 8 months ago - Stars: 243 - Forks: 41

empirical-run/empirical
Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application
Language: TypeScript - Size: 1.58 MB - Last synced at: 4 days ago - Pushed at: 10 months ago - Stars: 158 - Forks: 12

eduardogr/evalytics
HR tool to orchestrate the Performance Review Cycle of the employees of a company.
Language: Python - Size: 803 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 11 - Forks: 2

alibaba-damo-academy/MedEvalKit
MedEvalKit: A Unified Medical Evaluation Framework
Language: Python - Size: 1.74 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 40 - Forks: 2

zli12321/long_form_rl
GRPO training for long-form QA and instruction following with a long-form reward model
Language: Python - Size: 22.8 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 3 - Forks: 0

lstrgar/SEG
Evaluation of label free cell and nuclei segmentation
Language: Jupyter Notebook - Size: 19.1 MB - Last synced at: 2 days ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 3

nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
Language: Python - Size: 3.13 MB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 124 - Forks: 18

naomibaes/LSCD_method_evaluation
Companion repository with scripts for applying LSC-Eval, a three-stage evaluation framework, to: (1) create theory-driven, LLM-generated synthetic suites for LSC dimensions; (2) set up experimental settings for comparative method evaluation on a synthetic change-detection task; and (3) choose the most suitable method for the dimension and domain of interest.
Language: Jupyter Notebook - Size: 582 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

naomibaes/Synthetic-LSC_pipeline
Synthetic datasets to evaluate key dimensions of LSC (Sentiment, Intensity, Breadth), generated using LLMs and WordNet from the LSC-Eval framework.
Language: Jupyter Notebook - Size: 31.5 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

maximhq/maxim-cookbooks
Cookbooks and example notebooks for Maxim, an end-to-end AI evaluation and observability platform that helps AI teams ship agents with quality, reliability, and speed.
Language: Jupyter Notebook - Size: 123 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 6

SYED-M-HUSSAIN/RECAP
RECAP (Review Engine for Critiquing and Advising Pitches) is an LLM-powered agentic system designed to help founders and entrepreneurs receive actionable, multi-perspective, and structured feedback on their startup pitch presentations
Language: Python - Size: 43 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

wassname/open_pref_eval
Hackable, simple LLM evals on preference datasets
Language: Python - Size: 15.8 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

HPAI-BSC/TuRTLe
A Unified Evaluation of LLMs for RTL Generation
Language: Python - Size: 409 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 15 - Forks: 1

NOAA-OWP/gval
A high-level Python framework to evaluate the skill of geospatial datasets by comparing candidate maps to benchmark maps, producing agreement maps and metrics.
Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 25 - Forks: 2

IAAR-Shanghai/GuessArena
[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Language: Python - Size: 606 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 5 - Forks: 0

logic-star-ai/swt-bench
[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test-generation
Language: Python - Size: 4.37 MB - Last synced at: 5 days ago - Pushed at: 24 days ago - Stars: 50 - Forks: 6

vcerqueira/modelradar
Aspect-based Forecasting Accuracy
Language: Jupyter Notebook - Size: 9.55 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 4 - Forks: 2

yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
Language: TypeScript - Size: 553 KB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Language: Python - Size: 6.2 MB - Last synced at: 22 days ago - Pushed at: about 1 month ago - Stars: 302 - Forks: 30

aiverify-foundation/moonshot
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
Language: Python - Size: 225 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 243 - Forks: 52

rosinality/halite
Acceleration framework for Human Alignment Learning
Language: Python - Size: 527 KB - Last synced at: 26 days ago - Pushed at: 27 days ago - Stars: 6 - Forks: 1

GoogleCloudPlatform/evalbench
EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database-specific tasks.
Language: Python - Size: 1.41 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 3

relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Language: Python - Size: 1.92 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 497 - Forks: 35

alok-abhishek/BEATS_Dataset
BEATS (Bias Evaluation and Assessment Test Suite) is a research project focused on the systematic analysis and empirical investigation of fairness and bias in GenAI models, aimed at developing an integrated framework for data governance in GenAI systems.
Language: Jupyter Notebook - Size: 5.03 MB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

Q-Aware-Labs/model-arena
Simple LLM Response Evaluation tool
Language: JavaScript - Size: 623 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 0 - Forks: 0

fmp453/erase-eval
Erasing with Precision: Evaluating Specific Concept Erasure from Text-to-Image Generative Models
Language: Python - Size: 1.18 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 3 - Forks: 0

VLMHyperBenchTeam/VLMHyperBench
VLMHyperBench is an open-source framework for evaluating the ability of vision-language models (VLMs) to recognize Russian-language documents, in order to assess their potential for document-workflow automation.
Language: Python - Size: 3.56 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 60 - Forks: 0

LIAAD/tieval
An Evaluation Framework for Temporal Information Extraction Systems
Language: Python - Size: 1.05 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 19 - Forks: 1

tohtsky/irspack
Train, evaluate, and optimize implicit feedback-based recommender systems.
Language: Python - Size: 733 KB - Last synced at: 26 days ago - Pushed at: about 2 years ago - Stars: 31 - Forks: 9

diningphil/PyDGN
A research library for automating experiments on Deep Graph Networks
Language: Python - Size: 10.7 MB - Last synced at: 4 days ago - Pushed at: 10 months ago - Stars: 223 - Forks: 13

kse-ElEvEn/MATEval
MATEval is the first multi-agent framework simulating human collaborative discussion for open-ended text evaluation.
Language: Python - Size: 9.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 27 - Forks: 2

cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created using the Gemini 2.5 Pro through a technique called "vibe coding"
Language: HTML - Size: 8.65 MB - Last synced at: 26 days ago - Pushed at: 3 months ago - Stars: 51 - Forks: 9

HKUSTDial/NL2SQL360
🔥[VLDB'24] Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?”
Language: Python - Size: 8.61 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 118 - Forks: 10

laurawpaaby/EduChatEval
A structured pipeline and Python package for deploying and evaluating interactive LLM tutor systems in educational settings.
Language: Jupyter Notebook - Size: 5.44 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

BMW-InnovationLab/SORDI-AI-Evaluation-GUI
This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.
Language: Python - Size: 41.5 MB - Last synced at: 22 days ago - Pushed at: over 1 year ago - Stars: 76 - Forks: 5

act3-ace/CoRL
The Core Reinforcement Learning (CoRL) library is intended to enable scalable deep reinforcement learning experimentation in a manner extensible to new simulations and new ways for learning agents to interact with them, with the hope of making RL research easier by removing lock-in to particular simulations. The work is released under the following APRS approval: initial release of CoRL, Part #1, approved on 2022-05-2024 12:08:51, PA Approval # [AFRL-2022-2455]. Documentation: https://act3-ace.github.io/CoRL/
Language: Python - Size: 3.41 MB - Last synced at: 11 days ago - Pushed at: about 1 year ago - Stars: 36 - Forks: 5

AstraBert/diRAGnosis
Diagnose the performance of your RAG🩺
Language: Python - Size: 214 KB - Last synced at: 4 days ago - Pushed at: 3 months ago - Stars: 36 - Forks: 3

nhsengland/evalsense
Tools for systematic large language model evaluations
Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

bijington/expressive
Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.
Language: C# - Size: 3.74 MB - Last synced at: about 2 hours ago - Pushed at: 9 months ago - Stars: 171 - Forks: 27

rookie-littleblack/XpertEval
XpertEval: All-in-One Evaluation Framework for Multimodal Large Models
Language: Python - Size: 237 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

codefuse-ai/codefuse-evaluation
Industrial-grade evaluation benchmarks for coding LLMs across the full life-cycle of AI-native software development; an enterprise-grade evaluation system for code LLMs, with more continuously being released.
Language: Python - Size: 35.1 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 96 - Forks: 14

xmed-lab/UniEval
UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation
Language: Python - Size: 26 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 1

symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Language: Go - Size: 18.8 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 171 - Forks: 8

powerflows/powerflows-dmn
Power Flows DMN - Powerful decisions and rules engine
Language: Java - Size: 545 KB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 52 - Forks: 6

cowjen01/repsys
Framework for Interactive Evaluation of Recommender Systems
Language: JavaScript - Size: 12.5 MB - Last synced at: 8 days ago - Pushed at: almost 2 years ago - Stars: 35 - Forks: 5

vectara/mirage-bench
Repository for multilingual generation, RAG evaluations, and surrogate judge training for the Arena RAG leaderboard (NAACL'25)
Language: Python - Size: 2.8 MB - Last synced at: 11 days ago - Pushed at: 3 months ago - Stars: 9 - Forks: 0

eth-lre/mathtutorbench
Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
Language: Python - Size: 5.02 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 11 - Forks: 1

davidheineman/thresh
🌾 Universal, customizable and deployable fine-grained evaluation for text generation.
Language: Vue - Size: 87.5 MB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 23 - Forks: 4

neelabhsinha/lm-application-eval-kit
Implementation and analysis toolkit for language models across different task types, domains, and reasoning types using multiple prompt styles.
Language: Python - Size: 1.31 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

yupidevs/pactus
Framework to evaluate Trajectory Classification Algorithms
Language: Python - Size: 1.17 MB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 45 - Forks: 0

ics-unisg/aqudem
Activity and Sequence Detection Evaluation Metrics: A package to evaluate activity detection results, including the sequence of events given multiple activity types.
Language: Python - Size: 94.7 KB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 1

tsenst/CrowdFlow
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
Language: Python - Size: 1.94 MB - Last synced at: about 2 months ago - Pushed at: almost 2 years ago - Stars: 115 - Forks: 22

ad-freiburg/elevant
Entity linking evaluation and analysis tool
Language: Python - Size: 142 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 23 - Forks: 1

tensorstax/agenttrace
AgentTrace is a lightweight observability library to trace and evaluate agentic systems.
Language: TypeScript - Size: 14.4 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 37 - Forks: 0

AstraBert/SenTrEv
Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs
Language: Python - Size: 2.52 MB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 26 - Forks: 1

RecSysEvaluation/RecSys_Evaluation
Revisiting the Performance of GNN Models for Session-based Recommendation
Language: Python - Size: 318 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

MinhVuong2000/LLMReasonCert
[ACL'24] Official Implementation of the paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs"(https://aclanthology.org/2024.findings-acl.168)
Language: Python - Size: 671 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 42 - Forks: 9

mhamzaerol/Cost-of-Pass
Cost-of-Pass: An Economic Framework for Evaluating Language Models
Language: Python - Size: 939 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 3 - Forks: 0

monetjoe/ccmusic_eval
CCMusic, an open Chinese music database, integrates diverse datasets. It ensures data consistency via cleaning, label refinement, and structure unification. A unified evaluation framework is used for benchmark evaluations, supporting classification and detection tasks.
Language: Python - Size: 2.78 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 19 - Forks: 0

Eustema-S-p-A/SCARF
SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular evaluation framework for benchmarking deployed Retrieval Augmented Generation (RAG) applications. It offers end-to-end, black-box assessment across multiple configurations and supports automated testing with several vector databases and LLMs.
Language: Python - Size: 592 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0

match-cow/PoseTestBot
Robot-assisted evaluation of the accuracy of the 6D pose estimation for novel objects.
Language: Python - Size: 16.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

sb-ai-lab/Sim4Rec
Simulator for training and evaluation of Recommender Systems
Language: Jupyter Notebook - Size: 9.84 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 55 - Forks: 3

ash-hun/BERGEN-UP
End-to-end evaluation pipeline exclusively for RAG; a benchmark counterpart to BERGEN from NAVER Labs (a.k.a. BERGEN UP✨)
Language: Python - Size: 17.8 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

jogihood/rrg-metric
A Python package for evaluating radiology report generation using multiple standard and medical-specific metrics.
Language: Python - Size: 79.1 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 1

AI21Labs/lm-evaluation
Evaluation suite for large-scale language models.
Language: Python - Size: 19.5 KB - Last synced at: 2 months ago - Pushed at: almost 4 years ago - Stars: 125 - Forks: 16

aigc-apps/PertEval
[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations
Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 27 days ago - Pushed at: 8 months ago - Stars: 12 - Forks: 2

pentoai/vectory
Vectory provides a collection of tools to track and compare embedding versions.
Language: Python - Size: 1.92 MB - Last synced at: 14 days ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 0

SAP-samples/llm-round-trip-correctness
This repo provides code for evaluating LLM round-trip correctness from text to process model and vice versa.
Language: Jupyter Notebook - Size: 4.45 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 1

lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Language: Python - Size: 309 KB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 176 - Forks: 21

jinzhuoran/RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
Language: Python - Size: 3.82 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 72 - Forks: 7

MaastrichtU-IDS/fair-enough-metrics
☑️ API to publish FAIR metrics tests written in python
Language: Python - Size: 168 KB - Last synced at: 16 days ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 3

Kaos599/BetterRAG
BetterRAG: a RAG evaluation toolkit for LLMs. Measure, analyze, and optimize how your pipeline processes text chunks with precise metrics; suited to RAG systems, document processing, and embedding quality assessment.
Language: Python - Size: 104 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

adithya-s-k/indic_eval
A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks
Language: Python - Size: 555 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 33 - Forks: 7

GAIR-NLP/scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
Language: Python - Size: 3.89 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 42 - Forks: 3

ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge
An automation framework using LLM-as-a-judge to scale evaluation of GenAI solutions (RAG, multi-turn, query rewrite, Text2SQL, etc.) as a good proxy for human judgement.
Language: Python - Size: 7.2 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 25 - Forks: 4

MaurizioFD/RecSys2019_DeepLearning_Evaluation
This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.
Language: Python - Size: 216 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 987 - Forks: 249

OPTML-Group/Diffusion-MU-Attack
The official implementation of ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces one fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.
Language: Python - Size: 11.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 71 - Forks: 3

attackbench/attackbench.github.io
The AttackBench framework aims to fairly compare gradient-based attacks against machine learning models, with the goal of finding the most reliable attack for assessing a model's robustness.
Language: HTML - Size: 25.6 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
Size: 334 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

jplane/llm-function-call-eval
Demonstrates a workflow for LLM function calling evaluation. Uses GitHub Copilot to generate synthetic eval data and Azure AI Foundry for handling results.
Language: Jupyter Notebook - Size: 524 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

aryan-jadon/Evaluation-Metrics-for-Recommendation-Systems
This repository contains implementations of evaluation metrics for recommendation systems. We compared the performance of similarity, candidate-generation, rating, and ranking metrics on five datasets: MovieLens 100k, MovieLens 1M, MovieLens 10M, the Amazon Electronics dataset, and the Amazon Movies and TV dataset.
Language: Python - Size: 22.7 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 16 - Forks: 11
