Topic: "evaluation-framework"
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
Language: Python - Size: 29.5 MB - Last synced at: about 14 hours ago - Pushed at: about 18 hours ago - Stars: 8,931 - Forks: 2,384

promptfoo/promptfoo
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
Language: TypeScript - Size: 361 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6,506 - Forks: 527

confident-ai/deepeval
The LLM Evaluation Framework
Language: Python - Size: 82.9 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 6,282 - Forks: 549

huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Language: Python - Size: 4.79 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,519 - Forks: 241

MaurizioFD/RecSys2019_DeepLearning_Evaluation
This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.
Language: Python - Size: 216 MB - Last synced at: 4 days ago - Pushed at: almost 2 years ago - Stars: 987 - Forks: 249

relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Language: Python - Size: 1.92 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 492 - Forks: 34

ServiceNow/AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
Language: Python - Size: 3.05 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 320 - Forks: 62

TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Language: Python - Size: 6.19 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 297 - Forks: 30

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 277 - Forks: 17

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 239 - Forks: 40

aiverify-foundation/moonshot
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
Language: Python - Size: 219 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 234 - Forks: 50

diningphil/PyDGN
A research library for automating experiments on Deep Graph Networks
Language: Python - Size: 10.7 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 222 - Forks: 13

zeno-ml/zeno 📦
AI Data Management & Evaluation Platform
Language: Svelte - Size: 51.6 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 215 - Forks: 11

lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Language: Python - Size: 309 KB - Last synced at: 30 days ago - Pushed at: 8 months ago - Stars: 176 - Forks: 21

symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Language: Go - Size: 18.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 171 - Forks: 8

bijington/expressive
Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.
Language: C# - Size: 3.74 MB - Last synced at: about 18 hours ago - Pushed at: 8 months ago - Stars: 169 - Forks: 27

empirical-run/empirical
Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application
Language: TypeScript - Size: 1.58 MB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 156 - Forks: 12

microsoft/eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
Language: Python - Size: 20.5 MB - Last synced at: about 13 hours ago - Pushed at: about 13 hours ago - Stars: 142 - Forks: 23

AI21Labs/lm-evaluation
Evaluation suite for large-scale language models.
Language: Python - Size: 19.5 KB - Last synced at: 22 days ago - Pushed at: over 3 years ago - Stars: 125 - Forks: 16

nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
Language: Python - Size: 3.13 MB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 122 - Forks: 17

tsenst/CrowdFlow
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
Language: Python - Size: 1.94 MB - Last synced at: 8 days ago - Pushed at: almost 2 years ago - Stars: 115 - Forks: 22

HKUSTDial/NL2SQL360
Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?” (VLDB'24)
Language: Python - Size: 8.61 MB - Last synced at: 18 days ago - Pushed at: about 2 months ago - Stars: 114 - Forks: 10

kaiko-ai/eva
Evaluation framework for oncology foundation models (FMs)
Language: Python - Size: 14.4 MB - Last synced at: about 19 hours ago - Pushed at: about 20 hours ago - Stars: 106 - Forks: 11

EuroEval/EuroEval
The robust European language model benchmark.
Language: Python - Size: 88.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 101 - Forks: 25

codefuse-ai/codefuse-evaluation
Industrial-grade evaluation benchmarks for coding LLMs across the full life cycle of AI-native software development; an enterprise-level code-LLM evaluation suite that is continuously being expanded.
Language: Python - Size: 35.1 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 94 - Forks: 14

haeyeoni/lidar_slam_evaluator
LiDAR SLAM comparison and evaluation framework
Language: Python - Size: 31.3 MB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 93 - Forks: 16

hpclab/rankeval
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
Language: Python - Size: 7.73 MB - Last synced at: 8 months ago - Pushed at: almost 5 years ago - Stars: 88 - Forks: 11

Borda/BIRL 📦
BIRL: Benchmark on Image Registration methods with Landmark validations
Language: Python - Size: 8.49 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 87 - Forks: 26

pyrddlgym-project/pyRDDLGym
A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.
Language: Python - Size: 64.3 MB - Last synced at: about 18 hours ago - Pushed at: 28 days ago - Stars: 83 - Forks: 22

BMW-InnovationLab/SORDI-AI-Evaluation-GUI
This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.
Language: Python - Size: 41.5 MB - Last synced at: 10 months ago - Pushed at: over 1 year ago - Stars: 75 - Forks: 3

nouhadziri/DialogEntailment
The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"
Language: Python - Size: 85 KB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 74 - Forks: 5

X-PLUG/WritingBench
WritingBench: A Comprehensive Benchmark for Generative Writing
Language: Python - Size: 18 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 73 - Forks: 8

jinzhuoran/RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
Language: Python - Size: 3.82 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 72 - Forks: 7

OPTML-Group/Diffusion-MU-Attack
The official implementation of ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces one fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.
Language: Python - Size: 11.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 71 - Forks: 3

pentoai/vectory
Vectory provides a collection of tools to track and compare embedding versions.
Language: Python - Size: 1.92 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 0

ashafaei/OD-test
OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)
Language: Python - Size: 3.5 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 61 - Forks: 11

sb-ai-lab/Sim4Rec
Simulator for training and evaluation of Recommender Systems
Language: Jupyter Notebook - Size: 9.84 MB - Last synced at: 16 days ago - Pushed at: about 2 months ago - Stars: 55 - Forks: 3

VLMHyperBenchTeam/VLMHyperBench
VLMHyperBench is an open-source framework for evaluating how well vision-language models (VLMs) recognize documents in Russian, in order to assess their potential for document-workflow automation.
Language: Python - Size: 2.82 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 54 - Forks: 0

powerflows/powerflows-dmn
Power Flows DMN - Powerful decisions and rules engine
Language: Java - Size: 545 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 51 - Forks: 6

SpikeInterface/spiketoolkit 📦
Python-based tools for pre-, post-processing, validating, and curating spike sorting datasets.
Language: Python - Size: 3.94 MB - Last synced at: about 2 months ago - Pushed at: over 3 years ago - Stars: 50 - Forks: 29

kolenaIO/kolena
Python client for Kolena's machine learning testing platform
Language: Python - Size: 75.4 MB - Last synced at: 7 days ago - Pushed at: 27 days ago - Stars: 48 - Forks: 5

logic-star-ai/swt-bench
[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test-generation
Language: Python - Size: 4.38 MB - Last synced at: 4 days ago - Pushed at: 27 days ago - Stars: 48 - Forks: 6

cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created using the Gemini 2.5 Pro through a technique called "vibe coding"
Language: HTML - Size: 8.65 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 8

yupidevs/pactus
Framework to evaluate Trajectory Classification Algorithms
Language: Python - Size: 1.17 MB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 45 - Forks: 0

MinhVuong2000/LLMReasonCert
[ACL'24] Official Implementation of the paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs"(https://aclanthology.org/2024.findings-acl.168)
Language: Python - Size: 671 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 42 - Forks: 9

GAIR-NLP/scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
Language: Python - Size: 3.89 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 42 - Forks: 3

srcclr/efda
Evaluation Framework for Dependency Analysis (EFDA)
Language: C - Size: 27.5 MB - Last synced at: 10 months ago - Pushed at: about 3 years ago - Stars: 39 - Forks: 48

tensorstax/agenttrace
AgentTrace is a lightweight observability library to trace and evaluate agentic systems.
Language: TypeScript - Size: 14.4 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 37 - Forks: 0

cowjen01/repsys
Framework for Interactive Evaluation of Recommender Systems
Language: JavaScript - Size: 12.5 MB - Last synced at: 27 days ago - Pushed at: almost 2 years ago - Stars: 36 - Forks: 5

AstraBert/diRAGnosis
Diagnose the performance of your RAG🩺
Language: Python - Size: 214 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 35 - Forks: 3

act3-ace/CoRL
The Core Reinforcement Learning library enables scalable deep reinforcement learning experimentation in a manner extensible to new simulations and new ways for learning agents to interact with them, with the aim of making RL research easier by removing lock-in to particular simulations. Released under the following APRS approval: initial release of CoRL, Part #1, approved on 2022-05-2024 12:08:51, PA Approval # [AFRL-2022-2455]. Documentation: https://act3-ace.github.io/CoRL/
Language: Python - Size: 3.41 MB - Last synced at: 30 days ago - Pushed at: 11 months ago - Stars: 35 - Forks: 5

adithya-s-k/indic_eval
A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks
Language: Python - Size: 555 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 33 - Forks: 7

encord-team/text-to-image-eval
Evaluate custom and HuggingFace text-to-image/zero-shot-image-classification models like CLIP, SigLIP, DFN5B, and EVA-CLIP. Metrics include Zero-shot accuracy, Linear Probe, Image retrieval, and KNN accuracy.
Language: Jupyter Notebook - Size: 14.2 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 30 - Forks: 0

tohtsky/irspack
Train, evaluate, and optimize implicit feedback-based recommender systems.
Language: Python - Size: 733 KB - Last synced at: 15 days ago - Pushed at: almost 2 years ago - Stars: 30 - Forks: 9

AstraBert/SenTrEv
Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs
Language: Python - Size: 2.52 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 26 - Forks: 2

NOAA-OWP/gval
A high-level Python framework to evaluate the skill of geospatial datasets by comparing candidate to benchmark maps, producing agreement maps and metrics.
Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 25 - Forks: 2
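
The core comparison such a framework performs can be sketched in a few lines: classify each cell pair, then derive a skill score from the counts. This is a hedged pure-Python illustration over binary rasters flattened to lists, not gval's actual API (the function names here are made up).

```python
def agreement_map(candidate, benchmark):
    """Cell-wise two-class agreement: label each cell tp/tn/fp/fn (1 = feature present)."""
    codes = {(1, 1): "tp", (0, 0): "tn", (1, 0): "fp", (0, 1): "fn"}
    return [codes[(c, b)] for c, b in zip(candidate, benchmark)]

def critical_success_index(agreement):
    """CSI = TP / (TP + FP + FN), a common skill metric for map-to-map comparison."""
    tp = agreement.count("tp")
    return tp / (tp + agreement.count("fp") + agreement.count("fn"))

cells = agreement_map([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(cells, critical_success_index(cells))
```

The agreement map retains the spatial pattern of errors (where the candidate over- or under-predicts), while the scalar metric summarizes overall skill.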

ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge
Automation framework using LLM-as-a-judge to scale evaluation of GenAI solutions (RAG, multi-turn, query rewrite, Text2SQL, etc.) as a good proxy for human judgment.
Language: Python - Size: 7.2 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 25 - Forks: 4

MorenoLaQuatra/ARCH
ARCH: Audio Representations benCHmark
Language: Python - Size: 649 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 23 - Forks: 2

vinid/quica
quica is a tool for running inter-coder agreement pipelines in an easy and effective way. Multiple measures are run and the results are collected in a single table that can be easily exported to LaTeX.
Language: Python - Size: 112 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 23 - Forks: 0

davidheineman/thresh
🌾 Universal, customizable and deployable fine-grained evaluation for text generation.
Language: Vue - Size: 87.5 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 22 - Forks: 4

ad-freiburg/elevant
Entity linking evaluation and analysis tool
Language: Python - Size: 142 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 21 - Forks: 1

monetjoe/ccmusic_eval
CCMusic, an open Chinese music database, integrates diverse datasets and ensures data consistency via cleaning, label refinement, and structure unification. A unified evaluation framework is used for benchmark evaluations, supporting classification and detection tasks.
Language: Python - Size: 2.78 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 19 - Forks: 0

SJTUHaiyangYu/BackdoorMBTI
BackdoorMBTI is an open-source project extending unimodal backdoor learning to a multimodal context, with the aim of facilitating the analysis and development of backdoor defense methods in multimodal settings.
Language: Python - Size: 5.55 MB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 19 - Forks: 1

ma7555/evalify
Evaluate your biometric verification models literally in seconds.
Language: Python - Size: 3.05 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 19 - Forks: 20

LIAAD/tieval
An Evaluation Framework for Temporal Information Extraction Systems
Language: Python - Size: 1.05 MB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 1

GoogleCloudPlatform/evalbench
EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database specific tasks.
Language: Python - Size: 1.39 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 17 - Forks: 2

aryan-jadon/Evaluation-Metrics-for-Recommendation-Systems
This repository contains implementations of evaluation metrics for recommendation systems, comparing the performance of similarity, candidate-generation, rating, and ranking metrics across five datasets: MovieLens 100K, MovieLens 1M, MovieLens 10M, Amazon Electronics, and Amazon Movies and TV.
Language: Python - Size: 22.7 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 16 - Forks: 11
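
As a concrete instance of the ranking-metric family such comparisons cover, NDCG@k can be sketched in plain Python. This is an illustrative sketch of the standard formula, not this repository's code (the function names are assumptions).

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG normalized by the DCG of the ideal (descending) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A ranking that places the most relevant item second scores below the ideal ordering.
print(ndcg_at_k([1, 3, 2, 0], k=4))
```

The logarithmic discount is what distinguishes ranking metrics from plain rating metrics: errors near the top of the list cost more than errors near the bottom.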

vishal-keshav/fast_prototype
This is a machine learning framework that enables developers to iterate fast over different ML architecture designs.
Language: Python - Size: 2.12 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 16 - Forks: 3

claws-lab/XLingEval
Code and Resources for the paper, "Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries"
Language: Python - Size: 29.1 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 14 - Forks: 2

ChristophReich1996/Yeast-in-Microstructures-Dataset
Official and maintained implementation of the dataset paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures" [EMBC 2023].
Language: Python - Size: 14.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 14 - Forks: 0

BorgwardtLab/ggme
Official repository for the ICLR 2022 paper "Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions" https://openreview.net/forum?id=tBtoZYKd9n
Language: Python - Size: 348 KB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 14 - Forks: 0

Now-Join-Us/OmniEvalKit Fork of AIDC-AI/M3Bench
The code repository for "OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions"
Language: Python - Size: 3.82 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 13 - Forks: 2

hlt-mt/subsonar
Evaluate the quality of SRT files using the multilingual multimodal SONAR model.
Language: Python - Size: 26.4 KB - Last synced at: 4 days ago - Pushed at: 12 months ago - Stars: 13 - Forks: 0

chziakas/redeval
Red-teaming LLM applications.
Language: Python - Size: 1.2 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 0

aigc-apps/PertEval
[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations
Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 12 - Forks: 2

MUSC-TBIC/etude-engine
ETUDE (Evaluation Tool for Unstructured Data and Extractions) is a Python-based tool that provides consistent evaluation options across a range of annotation schemata and corpus formats
Language: Jupyter Notebook - Size: 2.28 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 12 - Forks: 0

SayedNadim/Image-Quality-Evaluation-Metrics
Implementation of common image evaluation metrics by Sayed Nadim (sayednadim.github.io). The repo covers full-reference image quality metrics such as L1, L2, PSNR, SSIM, and LPIPS, and feature-level quality metrics such as FID and IS. It can be used for evaluating image denoising, colorization, inpainting, deraining, dehazing, etc., wherever ground truth is available.
Language: Python - Size: 70.3 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 12 - Forks: 4
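
Of the full-reference metrics listed, PSNR is the simplest: it is just the mean squared error rescaled to decibels. A pure-Python sketch over flattened 8-bit pixel lists (not this repository's implementation, which operates on image tensors):

```python
import math

def psnr(reference, candidate, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length 8-bit pixel sequences."""
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

# Small per-pixel errors (here +/-1) yield a high PSNR; larger errors drive it down.
print(round(psnr([100, 120, 140, 160], [101, 119, 141, 159]), 2))
```

Because PSNR is purely pixel-wise, it can disagree with perceptual metrics such as LPIPS, which is why toolkits like this one report several metrics side by side.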

eth-lre/mathtutorbench
Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
Language: Python - Size: 5.02 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 11 - Forks: 1

eduardogr/evalytics
HR tool to orchestrate the Performance Review Cycle of the employees of a company.
Language: Python - Size: 449 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 11 - Forks: 2

edgerun/galileo
🪐 A framework for distributed load testing experiments
Language: Python - Size: 636 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 3

OPTML-Group/Unlearn-WorstCase
"Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning" by Chongyu Fan*, Jiancheng Liu*, Alfred Hero, Sijia Liu
Language: Python - Size: 40.7 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 10 - Forks: 0

mariaangelapellegrino/Evaluation-Framework
An evaluation framework for comparing graph embedding techniques.
Language: Jupyter Notebook - Size: 46.5 MB - Last synced at: 9 months ago - Pushed at: almost 2 years ago - Stars: 10 - Forks: 8

ChenLiu-1996/GAN-evaluator
A pip-installable evaluator for GANs (IS and FID). Accepts either dataloaders or individual batches, supports on-the-fly evaluation during training, and includes a working DCGAN SVHN demo script.
Language: Python - Size: 778 KB - Last synced at: 2 days ago - Pushed at: about 2 years ago - Stars: 10 - Forks: 0
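
For intuition on what FID measures: it is the Fréchet distance between Gaussians fitted to real and generated feature statistics. In one dimension the formula collapses to a single line, shown here as a hedged illustration of the math rather than this package's API (the full FID uses multivariate means and covariances of Inception features).

```python
import math

def frechet_distance_1d(mu1, var1, mu2, var2):
    """Frechet distance between two 1-D Gaussians N(mu1, var1) and N(mu2, var2)."""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

# Identical distributions give zero; a mean shift contributes quadratically.
print(frechet_distance_1d(0.0, 1.0, 2.0, 1.0))
```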

lapix-ufsc/lapixdl
Python package with Deep Learning utilities for Computer Vision
Language: Python - Size: 324 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 3

Delphi-BFT/tool
Delphi-BFT automates large-scale simulations of unmodified BFT protocol implementations through the Phantom simulator, given a simple experimental description. For the first time, experiments with existing BFT protocol implementations can be effortlessly set up, configured, and fed into a simulation engine.
Language: JavaScript - Size: 9.4 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 9 - Forks: 1

andrewimpellitteri/llm_poli_compass
A program to automate testing open source LLMs for their political compass scores
Language: Python - Size: 996 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 0

GiovanniBaccichet/DNCS-HTTP3
Docker-based virtualized framework for analysing HTTP/3+QUIC performance and comparing it to HTTP/2 and TCP.
Language: Shell - Size: 35.7 MB - Last synced at: about 1 year ago - Pushed at: about 4 years ago - Stars: 9 - Forks: 1

maximhq/maxim-cookbooks
Maxim is an end-to-end AI evaluation and observability platform that empowers modern AI teams to ship agents with quality, reliability, and speed.
Language: Jupyter Notebook - Size: 122 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 8 - Forks: 5

fernando-aristizabal/gVal
A Python framework to evaluate geospatial datasets by comparing candidate and benchmark maps to compute agreement maps and statistics.
Language: Python - Size: 65.4 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 8 - Forks: 0

e0397123/AM-FM
Language: Python - Size: 1.52 MB - Last synced at: 5 months ago - Pushed at: almost 4 years ago - Stars: 8 - Forks: 0

vectara/mirage-bench
Repository for multilingual generation, RAG evaluations, and surrogate-judge training for the Arena RAG leaderboard (NAACL'25).
Language: Python - Size: 2.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

KID-22/Cocktail
Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
Language: Python - Size: 9.13 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 7 - Forks: 0

Daniel0214-oss/faculty-evaluation-system
faculty-evaluation-system
Language: PHP - Size: 2.34 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 4

TonicAI/tvallogging
A tool for evaluating and tracking your RAG experiments. This repo contains the Python SDK for logging to Tonic Validate.
Language: Python - Size: 306 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

rosinality/halite
Acceleration framework for Human Alignment Learning
Language: Python - Size: 368 KB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

lstrgar/SEG
Evaluation of label free cell and nuclei segmentation
Language: Jupyter Notebook - Size: 19.1 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 3

isi-nlp/boteval
Chat bot evaluation
Language: Python - Size: 626 KB - Last synced at: 22 days ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

ascr-ecx/etk
Evaluation Toolkit is a JavaScript/HTML/CSS set of modules that present different experimental designs via the Qualtrics API.
Language: JavaScript - Size: 3.8 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 6 - Forks: 1

feup-infolab/army-ant
An experimental information retrieval framework and a workbench for innovation in entity-oriented search.
Language: Jupyter Notebook - Size: 77.7 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 6 - Forks: 0

Eustema-S-p-A/SCARF
SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular evaluation framework for benchmarking deployed Retrieval Augmented Generation (RAG) applications. It offers end-to-end, black-box assessment across multiple configurations and supports automated testing with several vector databases and LLMs.
Language: Python - Size: 592 KB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 5 - Forks: 0
