An open API service providing repository metadata for many open source software ecosystems.

Topic: "evaluation-framework"

EleutherAI/lm-evaluation-harness

A framework for few-shot evaluation of language models.

Language: Python - Size: 29.5 MB - Last synced at: about 14 hours ago - Pushed at: about 18 hours ago - Stars: 8,931 - Forks: 2,384
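The harness is driven from the command line; a minimal invocation, assuming the `lm-eval` package is installed and with an illustrative model and task, looks like:

```shell
# Install the harness, then score a Hugging Face model on one task.
pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-neo-125m \
    --tasks hellaswag \
    --batch_size 8
```

Results are reported as a per-task table of metrics (e.g. accuracy), which makes the tool easy to wire into CI jobs.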

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Language: TypeScript - Size: 361 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6,506 - Forks: 527
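The declarative workflow promptfoo advertises can be sketched as follows, assuming Node.js is available; `init` scaffolds a `promptfooconfig.yaml` describing prompts, providers, and test cases:

```shell
# Scaffold a config, run the evaluation matrix, then inspect results in the browser.
npx promptfoo@latest init
npx promptfoo@latest eval
npx promptfoo@latest view
```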

confident-ai/deepeval

The LLM Evaluation Framework

Language: Python - Size: 82.9 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 6,282 - Forks: 549
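deepeval integrates with pytest-style test files; a minimal run, assuming the package is installed and `test_app.py` is a hypothetical test file using deepeval's metrics, might look like:

```shell
# Install the framework and execute an evaluation test file.
pip install deepeval
deepeval test run test_app.py
```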

huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Language: Python - Size: 4.79 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,519 - Forks: 241

MaurizioFD/RecSys2019_DeepLearning_Evaluation

Repository for the RecSys 2019 paper "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and several follow-up studies.

Language: Python - Size: 216 MB - Last synced at: 4 days ago - Pushed at: almost 2 years ago - Stars: 987 - Forks: 249

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

Language: Python - Size: 1.92 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 492 - Forks: 34

ServiceNow/AgentLab

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

Language: Python - Size: 3.05 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 320 - Forks: 62

TonicAI/tonic_validate

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

Language: Python - Size: 6.19 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 297 - Forks: 30

athina-ai/athina-evals

Python SDK for running evaluations on LLM-generated responses

Language: Python - Size: 1.82 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 277 - Forks: 17

JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

Language: Python - Size: 9.37 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 239 - Forks: 40

aiverify-foundation/moonshot

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

Language: Python - Size: 219 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 234 - Forks: 50

diningphil/PyDGN

A research library for automating experiments on Deep Graph Networks

Language: Python - Size: 10.7 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 222 - Forks: 13

zeno-ml/zeno 📦

AI Data Management & Evaluation Platform

Language: Svelte - Size: 51.6 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 215 - Forks: 11

lartpang/PySODEvalToolkit

PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection

Language: Python - Size: 309 KB - Last synced at: 30 days ago - Pushed at: 8 months ago - Stars: 176 - Forks: 21

symflower/eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Language: Go - Size: 18.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 171 - Forks: 8

bijington/expressive

Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.

Language: C# - Size: 3.74 MB - Last synced at: about 18 hours ago - Pushed at: 8 months ago - Stars: 169 - Forks: 27

empirical-run/empirical

Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application

Language: TypeScript - Size: 1.58 MB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 156 - Forks: 12

microsoft/eureka-ml-insights

A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

Language: Python - Size: 20.5 MB - Last synced at: about 13 hours ago - Pushed at: about 13 hours ago - Stars: 142 - Forks: 23

AI21Labs/lm-evaluation

Evaluation suite for large-scale language models.

Language: Python - Size: 19.5 KB - Last synced at: 22 days ago - Pushed at: over 3 years ago - Stars: 125 - Forks: 16

nlp-uoregon/mlmm-evaluation

Multilingual Large Language Models Evaluation Benchmark

Language: Python - Size: 3.13 MB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 122 - Forks: 17

tsenst/CrowdFlow

Optical Flow Dataset and Benchmark for Visual Crowd Analysis

Language: Python - Size: 1.94 MB - Last synced at: 8 days ago - Pushed at: almost 2 years ago - Stars: 115 - Forks: 22

HKUSTDial/NL2SQL360

Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?” (VLDB'24)

Language: Python - Size: 8.61 MB - Last synced at: 18 days ago - Pushed at: about 2 months ago - Stars: 114 - Forks: 10

kaiko-ai/eva

Evaluation framework for oncology foundation models (FMs)

Language: Python - Size: 14.4 MB - Last synced at: about 19 hours ago - Pushed at: about 20 hours ago - Stars: 106 - Forks: 11

EuroEval/EuroEval

The robust European language model benchmark.

Language: Python - Size: 88.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 101 - Forks: 25

codefuse-ai/codefuse-evaluation

Industrial-grade evaluation benchmarks for coding LLMs across the full life cycle of AI-native software development; an enterprise-grade evaluation system for code LLMs, continuously being expanded.

Language: Python - Size: 35.1 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 94 - Forks: 14

haeyeoni/lidar_slam_evaluator

LiDAR SLAM comparison and evaluation framework

Language: Python - Size: 31.3 MB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 93 - Forks: 16

hpclab/rankeval

Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.

Language: Python - Size: 7.73 MB - Last synced at: 8 months ago - Pushed at: almost 5 years ago - Stars: 88 - Forks: 11

Borda/BIRL 📦

BIRL: Benchmark on Image Registration methods with Landmark validations

Language: Python - Size: 8.49 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 87 - Forks: 26

pyrddlgym-project/pyRDDLGym

A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.

Language: Python - Size: 64.3 MB - Last synced at: about 18 hours ago - Pushed at: 28 days ago - Stars: 83 - Forks: 22

BMW-InnovationLab/SORDI-AI-Evaluation-GUI

This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.

Language: Python - Size: 41.5 MB - Last synced at: 10 months ago - Pushed at: over 1 year ago - Stars: 75 - Forks: 3

nouhadziri/DialogEntailment

The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"

Language: Python - Size: 85 KB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 74 - Forks: 5

X-PLUG/WritingBench

WritingBench: A Comprehensive Benchmark for Generative Writing

Language: Python - Size: 18 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 73 - Forks: 8

jinzhuoran/RWKU

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024

Language: Python - Size: 3.82 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 72 - Forks: 7

OPTML-Group/Diffusion-MU-Attack

The official implementation of ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces one fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.

Language: Python - Size: 11.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 71 - Forks: 3

pentoai/vectory

Vectory provides a collection of tools to track and compare embedding versions.

Language: Python - Size: 1.92 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 71 - Forks: 0

ashafaei/OD-test

OD-test: A Less Biased Evaluation of Out-of-Distribution (Outlier) Detectors (PyTorch)

Language: Python - Size: 3.5 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 61 - Forks: 11

sb-ai-lab/Sim4Rec

Simulator for training and evaluation of Recommender Systems

Language: Jupyter Notebook - Size: 9.84 MB - Last synced at: 16 days ago - Pushed at: about 2 months ago - Stars: 55 - Forks: 3

VLMHyperBenchTeam/VLMHyperBench

VLMHyperBench is an open-source framework for evaluating the ability of vision language models (VLMs) to recognize Russian-language documents, with the aim of assessing their potential for document workflow automation.

Language: Python - Size: 2.82 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 54 - Forks: 0

powerflows/powerflows-dmn

Power Flows DMN - Powerful decisions and rules engine

Language: Java - Size: 545 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 51 - Forks: 6

SpikeInterface/spiketoolkit 📦

Python-based tools for pre-, post-processing, validating, and curating spike sorting datasets.

Language: Python - Size: 3.94 MB - Last synced at: about 2 months ago - Pushed at: over 3 years ago - Stars: 50 - Forks: 29

kolenaIO/kolena

Python client for Kolena's machine learning testing platform

Language: Python - Size: 75.4 MB - Last synced at: 7 days ago - Pushed at: 27 days ago - Stars: 48 - Forks: 5

logic-star-ai/swt-bench

[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating repository-level test generation by LLMs

Language: Python - Size: 4.38 MB - Last synced at: 4 days ago - Pushed at: 27 days ago - Stars: 48 - Forks: 6

cedrickchee/vibe-jet

A browser-based 3D multiplayer flying game with arcade-style mechanics, created with Gemini 2.5 Pro through a technique called "vibe coding"

Language: HTML - Size: 8.65 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 8

yupidevs/pactus

Framework to evaluate Trajectory Classification Algorithms

Language: Python - Size: 1.17 MB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 45 - Forks: 0

MinhVuong2000/LLMReasonCert

[ACL'24] Official Implementation of the paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs"(https://aclanthology.org/2024.findings-acl.168)

Language: Python - Size: 671 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 42 - Forks: 9

GAIR-NLP/scaleeval

Scalable Meta-Evaluation of LLMs as Evaluators

Language: Python - Size: 3.89 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 42 - Forks: 3

srcclr/efda

Evaluation Framework for Dependency Analysis (EFDA)

Language: C - Size: 27.5 MB - Last synced at: 10 months ago - Pushed at: about 3 years ago - Stars: 39 - Forks: 48

tensorstax/agenttrace

AgentTrace is a lightweight observability library to trace and evaluate agentic systems.

Language: TypeScript - Size: 14.4 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 37 - Forks: 0

cowjen01/repsys

Framework for Interactive Evaluation of Recommender Systems

Language: JavaScript - Size: 12.5 MB - Last synced at: 27 days ago - Pushed at: almost 2 years ago - Stars: 36 - Forks: 5

AstraBert/diRAGnosis

Diagnose the performance of your RAG🩺

Language: Python - Size: 214 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 35 - Forks: 3

act3-ace/CoRL

The Core Reinforcement Learning (CoRL) library enables scalable deep reinforcement learning experimentation in a manner extensible to new simulations and to new ways for learning agents to interact with them, aiming to make RL research easier by removing lock-in to particular simulations. Released under public release approval # AFRL-2022-2455. Documentation: https://act3-ace.github.io/CoRL/

Language: Python - Size: 3.41 MB - Last synced at: 30 days ago - Pushed at: 11 months ago - Stars: 35 - Forks: 5

adithya-s-k/indic_eval

A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks

Language: Python - Size: 555 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 33 - Forks: 7

encord-team/text-to-image-eval

Evaluate custom and HuggingFace text-to-image/zero-shot-image-classification models like CLIP, SigLIP, DFN5B, and EVA-CLIP. Metrics include Zero-shot accuracy, Linear Probe, Image retrieval, and KNN accuracy.

Language: Jupyter Notebook - Size: 14.2 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 30 - Forks: 0

tohtsky/irspack

Train, evaluate, and optimize implicit feedback-based recommender systems.

Language: Python - Size: 733 KB - Last synced at: 15 days ago - Pushed at: almost 2 years ago - Stars: 30 - Forks: 9

AstraBert/SenTrEv

Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs

Language: Python - Size: 2.52 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 26 - Forks: 2

NOAA-OWP/gval

A high-level Python framework to evaluate the skill of geospatial datasets by comparing candidates to benchmark maps, producing agreement maps and metrics.

Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 25 - Forks: 2

ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge

An automation framework that uses LLM-as-a-judge to scale evaluation of generative AI solutions (RAG, multi-turn, query rewrite, text-to-SQL, etc.) as a proxy for human judgement.

Language: Python - Size: 7.2 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 25 - Forks: 4

MorenoLaQuatra/ARCH

ARCH: Audio Representations benCHmark

Language: Python - Size: 649 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 23 - Forks: 2

vinid/quica

quica is a tool for running inter-coder agreement pipelines in an easy and effective way. Multiple measures are run, and the results are collected in a single table that can be exported to LaTeX.

Language: Python - Size: 112 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 23 - Forks: 0

davidheineman/thresh

🌾 Universal, customizable and deployable fine-grained evaluation for text generation.

Language: Vue - Size: 87.5 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 22 - Forks: 4

ad-freiburg/elevant

Entity linking evaluation and analysis tool

Language: Python - Size: 142 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 21 - Forks: 1

monetjoe/ccmusic_eval

CCMusic, an open Chinese music database, integrates diverse datasets, ensuring data consistency via cleaning, label refinement, and structure unification. A unified evaluation framework supports benchmark evaluations for classification and detection tasks.

Language: Python - Size: 2.78 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 19 - Forks: 0

SJTUHaiyangYu/BackdoorMBTI

BackdoorMBTI is an open-source project that extends unimodal backdoor learning to a multimodal context, aiming to facilitate the analysis and development of backdoor defense methods in multimodal settings.

Language: Python - Size: 5.55 MB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 19 - Forks: 1

ma7555/evalify

Evaluate your biometric verification models literally in seconds.

Language: Python - Size: 3.05 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 19 - Forks: 20

LIAAD/tieval

An Evaluation Framework for Temporal Information Extraction Systems

Language: Python - Size: 1.05 MB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 1

GoogleCloudPlatform/evalbench

EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database specific tasks.

Language: Python - Size: 1.39 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 17 - Forks: 2

aryan-jadon/Evaluation-Metrics-for-Recommendation-Systems

This repository contains implementations of evaluation metrics for recommendation systems, comparing the performance of similarity, candidate generation, rating, and ranking metrics on five datasets: MovieLens 100k, MovieLens 1M, MovieLens 10M, Amazon Electronics, and Amazon Movies and TV.

Language: Python - Size: 22.7 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 16 - Forks: 11

vishal-keshav/fast_prototype

This is a machine learning framework that enables developers to iterate fast over different ML architecture designs.

Language: Python - Size: 2.12 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 16 - Forks: 3

claws-lab/XLingEval

Code and Resources for the paper, "Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries"

Language: Python - Size: 29.1 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 14 - Forks: 2

ChristophReich1996/Yeast-in-Microstructures-Dataset

Official and maintained implementation of the dataset paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures" [EMBC 2023].

Language: Python - Size: 14.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 14 - Forks: 0

BorgwardtLab/ggme

Official repository for the ICLR 2022 paper "Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions" https://openreview.net/forum?id=tBtoZYKd9n

Language: Python - Size: 348 KB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 14 - Forks: 0

Now-Join-Us/OmniEvalKit Fork of AIDC-AI/M3Bench

The code repository for "OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions"

Language: Python - Size: 3.82 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 13 - Forks: 2

hlt-mt/subsonar

Evaluate the quality of SRT files using the multilingual multimodal SONAR model.

Language: Python - Size: 26.4 KB - Last synced at: 4 days ago - Pushed at: 12 months ago - Stars: 13 - Forks: 0

chziakas/redeval

Red-teaming LLM applications.

Language: Python - Size: 1.2 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 0

aigc-apps/PertEval

[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations

Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 12 - Forks: 2

MUSC-TBIC/etude-engine

ETUDE (Evaluation Tool for Unstructured Data and Extractions) is a Python-based tool that provides consistent evaluation options across a range of annotation schemata and corpus formats

Language: Jupyter Notebook - Size: 2.28 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 12 - Forks: 0

SayedNadim/Image-Quality-Evaluation-Metrics

Implementation of common image evaluation metrics by Sayed Nadim (sayednadim.github.io). The repo is built on full-reference image quality metrics such as L1, L2, PSNR, SSIM, and LPIPS, and feature-level quality metrics such as FID and IS. It can be used to evaluate image denoising, colorization, inpainting, deraining, dehazing, etc., wherever ground truth is available.

Language: Python - Size: 70.3 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 12 - Forks: 4

eth-lre/mathtutorbench

Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors

Language: Python - Size: 5.02 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 11 - Forks: 1

eduardogr/evalytics

HR tool to orchestrate the Performance Review Cycle of the employees of a company.

Language: Python - Size: 449 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 11 - Forks: 2

edgerun/galileo

🪐 A framework for distributed load testing experiments

Language: Python - Size: 636 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 3

OPTML-Group/Unlearn-WorstCase

"Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning" by Chongyu Fan*, Jiancheng Liu*, Alfred Hero, Sijia Liu

Language: Python - Size: 40.7 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 10 - Forks: 0

mariaangelapellegrino/Evaluation-Framework

An evaluation framework for comparing graph embedding techniques

Language: Jupyter Notebook - Size: 46.5 MB - Last synced at: 9 months ago - Pushed at: almost 2 years ago - Stars: 10 - Forks: 8

ChenLiu-1996/GAN-evaluator

A pip-installable evaluator for GANs (IS and FID). Accepts either dataloaders or individual batches and supports on-the-fly evaluation during training. A working DCGAN SVHN demo script is provided.

Language: Python - Size: 778 KB - Last synced at: 2 days ago - Pushed at: about 2 years ago - Stars: 10 - Forks: 0

lapix-ufsc/lapixdl

Python package with Deep Learning utilities for Computer Vision

Language: Python - Size: 324 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 3

Delphi-BFT/tool

Delphi-BFT automates large-scale simulations of unmodified BFT protocol implementations through the Phantom simulator, given a simple experimental description. For the first time, experiments with existing BFT protocol implementations can be effortlessly set up, configured, and fed into a simulation engine.

Language: JavaScript - Size: 9.4 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 9 - Forks: 1

andrewimpellitteri/llm_poli_compass

A program to automate testing open source LLMs for their political compass scores

Language: Python - Size: 996 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 0

GiovanniBaccichet/DNCS-HTTP3

Docker-based virtualized framework for analysing HTTP/3+QUIC performance and compare it to HTTP/2 and TCP.

Language: Shell - Size: 35.7 MB - Last synced at: about 1 year ago - Pushed at: about 4 years ago - Stars: 9 - Forks: 1

maximhq/maxim-cookbooks

Maxim is an end-to-end AI evaluation and observability platform that empowers modern AI teams to ship agents with quality, reliability, and speed.

Language: Jupyter Notebook - Size: 122 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 8 - Forks: 5

fernando-aristizabal/gVal

A Python framework to evaluate geospatial datasets by comparing candidate and benchmark maps to compute agreement maps and statistics.

Language: Python - Size: 65.4 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 8 - Forks: 0

e0397123/AM-FM

Language: Python - Size: 1.52 MB - Last synced at: 5 months ago - Pushed at: almost 4 years ago - Stars: 8 - Forks: 0

vectara/mirage-bench

Repository for multilingual generation, RAG evaluation, and surrogate judge training for the Arena RAG leaderboard (NAACL'25)

Language: Python - Size: 2.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

KID-22/Cocktail

Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration

Language: Python - Size: 9.13 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 7 - Forks: 0

Daniel0214-oss/faculty-evaluation-system

faculty-evaluation-system

Language: PHP - Size: 2.34 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 4

TonicAI/tvallogging

A tool for evaluating and tracking your RAG experiments. This repo contains the Python SDK for logging to Tonic Validate.

Language: Python - Size: 306 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

rosinality/halite

Acceleration framework for Human Alignment Learning

Language: Python - Size: 368 KB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

lstrgar/SEG

Evaluation of label free cell and nuclei segmentation

Language: Jupyter Notebook - Size: 19.1 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 3

isi-nlp/boteval

Chat bot evaluation

Language: Python - Size: 626 KB - Last synced at: 22 days ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

ascr-ecx/etk

Evaluation Toolkit is a JavaScript/HTML/CSS set of modules that present different experimental designs in the Qualtrics API.

Language: JavaScript - Size: 3.8 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 6 - Forks: 1

feup-infolab/army-ant

An experimental information retrieval framework and a workbench for innovation in entity-oriented search.

Language: Jupyter Notebook - Size: 77.7 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 6 - Forks: 0

Eustema-S-p-A/SCARF

SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular evaluation framework for benchmarking deployed Retrieval Augmented Generation (RAG) applications. It offers end-to-end, black-box assessment across multiple configurations and supports automated testing with several vector databases and LLMs.

Language: Python - Size: 592 KB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 5 - Forks: 0

Related Topics
evaluation (41), evaluation-metrics (34), llm (26), machine-learning (19), python (17), large-language-models (14), llm-evaluation (14), llmops (12), benchmark (11), deep-learning (10), nlp (10), llms (10), rag (9), ai (8), llm-evaluation-framework (7), testing (7), recommender-system (6), dataset (6), research (6), language-model (6), information-retrieval (5), natural-language-processing (5), metrics (5), benchmarking (5), retrieval-augmented-generation (5), visualization (5), llm-eval (5), prompts (4), prompt-testing (4), computer-vision (4), artificial-intelligence (4), prompt-engineering (4), eval (4), genai (4), transformers (3), earth-science (3), text2sql (3), synthetic-dataset-generation (3), unlearning (3), benchmark-framework (3), llm-inference (3), trustworthy-ai (3), agents (3), matrix-factorization (3), reinforcement-learning (3), promptfoo (3), reproducible-research (3), pytorch (3), deep-neural-networks (3), docker (3), synthetic-data (3), adversarial-attacks (3), generative-ai (3), recommender-systems (2), image-processing (2), python3 (2), sentence-transformers (2), datasets (2), vector-database (2), pr-curve (2), retrieval (2), simulation (2), qdrant (2), python-package (2), text-to-sql (2), tensorflow (2), observability (2), recommendation-system (2), fastapi (2), security (2), llm-evaluation-metrics (2), javascript (2), spatial-analysis (2), spatial-temporal (2), statistics (2), science (2), huggingface (2), evaluations (2), geospatial (2), environment (2), mllm (2), rag-evaluation (2), ant (2), red-teaming (2), nl2sql (2), embeddings-extraction (2), medical-imaging (2), prompting (2), cell-segmentation (2), classification (2), performance-evaluation (2), agent (2), knn (2), llm-evaluation-toolkit (2), time-series (2), bpr (2), forecasting (2), analysis-framework (2), nlp-machine-learning (2), web-application (2)