An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-evaluation

tuitige/fijian-rag-app

Public-benefit GenAI platform for the Fijian language, combining Claude + RAG + OpenSearch to build verified datasets, preserve culture, and unlock AI access in the Pacific. Covers LLM fine-tuning, RAG, generative AI, and learning

Language: TypeScript - Size: 2.12 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 2 - Forks: 0

harshtiwari01/llm-heatmap-visualizer

A set of scripts to generate full attention-head heatmaps for transformer-based LLMs

Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: about 6 hours ago - Pushed at: about 6 hours ago - Stars: 2 - Forks: 0
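
The repo's own scripts are not shown in this listing; below is a minimal sketch of the general technique using Hugging Face transformers' `output_attentions` flag, with `gpt2` as an illustrative model choice.

```python
# Minimal attention-heatmap sketch (illustrative; not this repo's code).
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
attn = outputs.attentions[0][0, 0].detach().numpy()  # layer 0, head 0
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title("Layer 0, head 0 attention")
plt.tight_layout()
plt.show()
```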

ValueByte-AI/Awesome-LLM-in-Social-Science

Awesome papers involving LLMs in Social Science.

Size: 135 KB - Last synced at: about 17 hours ago - Pushed at: about 18 hours ago - Stars: 500 - Forks: 37

msoedov/agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

Language: Python - Size: 21.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1,481 - Forks: 225

onlyhouse/ai-cookbook

This repository offers practical code snippets and tutorials for building AI systems, making it easy to integrate AI into your projects. Explore the resources to enhance your skills and boost your freelancing career! 🐙💻

Language: Python - Size: 163 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

cvs-health/uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

Language: Python - Size: 11.7 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 741 - Forks: 68
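
The uqlm package's actual interface is not shown in this listing, so the following is a generic self-consistency sketch of the underlying idea only, not the package's API; `generate` is a hypothetical callable.

```python
# Generic self-consistency sketch of UQ-style hallucination flagging;
# illustrates the idea only, NOT the uqlm package API.
from collections import Counter

def consistency_score(generate, prompt: str, n: int = 5) -> float:
    """Sample n answers; return the fraction agreeing with the majority."""
    answers = [generate(prompt) for _ in range(n)]
    _, count = Counter(answers).most_common(1)[0]
    return count / n

# Usage idea: flag a possible hallucination when agreement is low.
# score = consistency_score(my_llm_call, "What year was X founded?")
# if score < 0.6: print("low confidence; verify before use")
```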

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Language: Python - Size: 171 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2,844 - Forks: 332

PrismBench/PrismBench

PrismBench: A comprehensive framework for evaluating Large Language Model capabilities through Monte Carlo Tree Search. Systematically maps model strengths, automatically discovers challenging concept combinations, and provides detailed performance analysis with containerized deployment and OpenAI-compatible API support.

Language: Python - Size: 5.16 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

lmnr-ai/lmnr

Laminar: an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, evals, datasets, labels. YC S24.

Language: TypeScript - Size: 32.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2,082 - Forks: 126

microsoft/prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

Language: Python - Size: 5.58 MB - Last synced at: 2 days ago - Pushed at: 19 days ago - Stars: 937 - Forks: 91

kieranklaassen/leva

An LLM evaluation framework for Rails apps, designed for use with production data.

Language: Ruby - Size: 279 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 52 - Forks: 1

confident-ai/deepeval

The LLM Evaluation Framework

Language: Python - Size: 84.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 8,321 - Forks: 719
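
A quickstart-style usage sketch following deepeval's documented pattern (worth verifying against the current docs); the input/output strings are placeholders.

```python
# Quickstart-style deepeval usage; strings are placeholders.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```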

trustyai-explainability/vllm_judge

A lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.

Language: Python - Size: 765 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 1
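
vLLM serves an OpenAI-compatible API, so a judge call can be as simple as the sketch below. This uses the plain openai client rather than the vllm_judge library's own interface; the endpoint, model name, and rubric are placeholders.

```python
# LLM-as-a-judge sketch against a vLLM OpenAI-compatible endpoint;
# uses the plain OpenAI client, NOT the vllm_judge library's API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge(question: str, answer: str, model: str = "my-judge-model") -> str:
    # model name is a placeholder for whatever vLLM is serving.
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Rate this answer 1-5 for correctness.\n"
                       f"Question: {question}\nAnswer: {answer}\n"
                       "Reply with only the number.",
        }],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()
```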

multinear/multinear

Develop reliable AI apps

Language: Svelte - Size: 1.13 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 39 - Forks: 2

langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Language: TypeScript - Size: 21.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 12,742 - Forks: 1,163
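
A minimal tracing sketch with the Langfuse Python SDK's `@observe` decorator; the import path follows the v2 SDK and has moved between major versions, so check your installed release.

```python
# Minimal Langfuse tracing sketch (v2-style import; the decorator's
# location has changed between SDK versions).
from langfuse.decorators import observe

@observe()  # records a trace/span for each call and sends it to Langfuse
def answer(question: str) -> str:
    # call your LLM here; the return value is captured as the output
    return "42"

answer("What is the meaning of life?")
# Requires LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY (and host) in the env.
```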

comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Language: Python - Size: 245 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 9,851 - Forks: 675

equinor/promptly

A prompt collection for testing and evaluation of LLMs.

Language: Jupyter Notebook - Size: 2.05 MB - Last synced at: about 23 hours ago - Pushed at: 20 days ago - Stars: 18 - Forks: 2

AI4Bharat/Anudesh-Frontend

Language: JavaScript - Size: 167 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 12 - Forks: 7

Arize-ai/phoenix

AI Observability & Evaluation

Language: Jupyter Notebook - Size: 337 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5,985 - Forks: 460

truera/trulens

Evaluation and Tracking for LLM Experiments

Language: Python - Size: 344 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,571 - Forks: 217

JonathanChavezTamales/llm-leaderboard

A comprehensive set of LLM benchmark scores and provider prices.

Language: JavaScript - Size: 332 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 227 - Forks: 21

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Language: TypeScript - Size: 223 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 7,226 - Forks: 575

adgomant/delean-batch-manager

A toolkit for managing OpenAI Batch API jobs to obtain Demand Level Annotations under ADeLe v1.0 framework — includes a Python API and CLI.

Language: Python - Size: 307 KB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

NVIDIA/garak

the LLM vulnerability scanner

Language: Python - Size: 4.91 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,587 - Forks: 458

Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Language: Python - Size: 70.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,033 - Forks: 318

Helicone/helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

Language: TypeScript - Size: 498 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,934 - Forks: 388

whitecircle-ai/circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

Language: Python - Size: 20.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 39 - Forks: 2

PetroIvaniuk/llms-tools

A list of LLMs Tools & Projects

Size: 242 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 246 - Forks: 28

Atorpor/TGCSM-CIRCUIT

The original containment framework for recursion-stable cognition, collapse-resistant logic, and LLM self-reflection.

Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

attogram/ai_test_zone

AI Test Zone - compare the same prompt against many open source LLMs

Language: HTML - Size: 2.57 MB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

ntts9990/ragas-test

🧪 RAG evaluation dashboard powered by RAGAS framework - Clean Architecture, 99.75% coverage, Docker ready

Language: Python - Size: 441 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0
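
A sketch of a RAGAS evaluation run; the metric and function names follow the ~0.1.x documentation and have shifted across releases, so verify against the version you install.

```python
# RAGAS evaluation sketch (~0.1.x-era API; names have shifted since).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["Who wrote Dune?"],
    "answer": ["Frank Herbert wrote Dune."],
    "contexts": [["Dune is a 1965 novel by Frank Herbert."]],
    "ground_truth": ["Frank Herbert"],
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores
```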

TranMinhChienn/awesome-ai-agent-testing

Awesome AI Agent Testing is your go-to resource for testing AI agents effectively. Explore frameworks, methodologies, and tools to ensure the reliability and performance of these advanced systems. 🚀🤖

Size: 1.93 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

ako1983/airops_integration_agent

A prototype AI agent that helps users configure integration actions using natural language requests and available context variables. Built with LangGraph, Claude AI, and includes observability through LangSmith and Weights & Biases.

Language: Python - Size: 228 KB - Last synced at: 4 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

attogram/ollama-multirun

A bash shell script to run a single prompt against any or all of your locally installed ollama models, saving the output and performance statistics as easily navigable web pages.

Language: Shell - Size: 4.02 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 6 - Forks: 1
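
The repo itself is a bash script; the following Python re-sketch of the same loop assumes the `ollama` CLI is installed locally and is illustrative only.

```python
# Python re-sketch of the idea (the repo itself is a bash script):
# run one prompt against every locally installed Ollama model via the CLI.
import subprocess

def local_models() -> list[str]:
    out = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    # `ollama list` prints a header row, then one model name per line.
    return [line.split()[0] for line in out.stdout.splitlines()[1:] if line.strip()]

prompt = "Explain attention in one sentence."
for model in local_models():
    run = subprocess.run(["ollama", "run", model, prompt],
                         capture_output=True, text=True)
    print(f"=== {model} ===\n{run.stdout}\n")
```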

rotationalio/parlance

An LLM evaluation tool that uses a model-to-model qualitative comparison metric.

Language: Python - Size: 5.07 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

cvs-health/langfair

LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments

Language: Python - Size: 30.7 MB - Last synced at: 6 days ago - Pushed at: 10 days ago - Stars: 215 - Forks: 33

Giskard-AI/giskard

🐢 Open-Source Evaluation & Testing for AI & LLM systems

Language: Python - Size: 176 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 4,614 - Forks: 327

stephen1cowley/doubt-injection

An LLM technique that stochastically injects tokens at inference time to redirect the chain-of-thought reasoning process

Language: HTML - Size: 2.33 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

stephen1cowley/additive-cad

An improved contrastive decoding approach that better resolves LLM knowledge conflicts between the model's prior beliefs and contextual information

Language: Python - Size: 1.26 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 2 - Forks: 0
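
The repo's specific "additive" improvement is not described in this listing; the sketch below shows the standard context-aware decoding (CAD) logit adjustment that such approaches build on.

```python
# Standard context-aware decoding (CAD) adjustment; the repo's own
# "additive" variant is not shown here.
import numpy as np

def cad_logits(logits_with_ctx: np.ndarray,
               logits_without_ctx: np.ndarray,
               alpha: float = 0.5) -> np.ndarray:
    # Amplify what the context changes: (1 + a) * l_ctx - a * l_plain.
    return (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy example: the context pushes probability toward token 2.
with_ctx = np.array([1.0, 2.0, 4.0])
without_ctx = np.array([1.0, 3.0, 2.0])
print(softmax(cad_logits(with_ctx, without_ctx)))
```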

Addepto/contextcheck

MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

Language: Python - Size: 464 KB - Last synced at: 11 days ago - Pushed at: 6 months ago - Stars: 72 - Forks: 9

HillPhelmuth/LlmAsJudgeEvalPlugins

LLM-as-judge evals as Semantic Kernel Plugins

Language: C# - Size: 2.04 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 1

AndreSchuck/EvaluatingLargeLanguageModelsforBrazilianPortugueseSentimentAnalysis

Comparative study of 23 LLMs for Brazilian Portuguese sentiment analysis via in-context learning. Evaluates multilingual vs Portuguese-specialized models across 12 datasets. Code and data included.

Size: 2.93 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

onejune2018/Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for LLM evaluation, aiming to explore the technical frontiers of generative AI.

Size: 12.6 MB - Last synced at: 9 days ago - Pushed at: 8 months ago - Stars: 540 - Forks: 45

mtuann/llm-updated-papers

Papers related to Large Language Models in all top venues

Size: 750 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 8 - Forks: 1

PromptMixerDev/prompt-mixer-app-ce

A desktop application for comparing outputs from different Large Language Models (LLMs).

Language: TypeScript - Size: 4.16 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 65 - Forks: 6

zackiles/ia-for-ai

Applied linguistic systems and information architecture for AI in the workplace

Language: TypeScript - Size: 3 MB - Last synced at: 7 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

ChanLiang/CONNER

[EMNLP 2023] Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators

Language: Python - Size: 15.8 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 2

yukinagae/genkitx-promptfoo

Community Plugin for Genkit to use Promptfoo

Language: TypeScript - Size: 553 KB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

Node0/hypercortex

A TUI-based LM Swiss-army knife and analysis tool

Size: 16.6 KB - Last synced at: 5 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

fkapsahili/EntRAG

EntRAG - Enterprise RAG Benchmark

Language: Python - Size: 168 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 2 - Forks: 0

palico-ai/palico-ai

Build your LLM application, improve its performance, and productionize it with an integrated framework

Language: TypeScript - Size: 13.7 MB - Last synced at: 10 days ago - Pushed at: 7 months ago - Stars: 340 - Forks: 27

mts-ai/rurage

Language: Python - Size: 3.85 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 32 - Forks: 0

danpozmanter/llm-comparative-eval

Compare how LLMs stack up

Language: Rust - Size: 43.9 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

kpavlov/quarkus-assistant-demo

Language: Kotlin - Size: 2.24 MB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

jaavier/boilerplate-gemini-golang

A boilerplate for building applications with Google's Gemini LLM. Includes examples.

Language: Go - Size: 43.9 KB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 3 - Forks: 0

cyberark/FuzzyAI

A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

Language: Jupyter Notebook - Size: 16.1 MB - Last synced at: 17 days ago - Pushed at: 18 days ago - Stars: 588 - Forks: 60

Samya-S/Building-LLMs-from-scratch

A hands-on guide to implementing Large Language Models from scratch

Language: Jupyter Notebook - Size: 47.6 MB - Last synced at: 15 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

Language: Python - Size: 1.92 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 497 - Forks: 35

MLD3/steerability

An open-source evaluation framework for measuring LLM steerability.

Language: Jupyter Notebook - Size: 73.9 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 0

aws-samples/tailoring-foundation-models-for-business-needs-guide-to-rag-fine-tuning-hybrid-approaches

A framework for evaluating various customization techniques for foundation models (FMs) using your own dataset. This includes approaches like RAG, fine-tuning, and a hybrid method that combines both

Language: Python - Size: 450 KB - Last synced at: 17 days ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

armingh2000/FactScoreLite

FactScoreLite is an implementation of the FactScore metric, designed for detailed accuracy assessment in text generation. This package builds upon the framework provided by the original FactScore repository, which is no longer maintained and contains outdated functions.

Language: Python - Size: 674 KB - Last synced at: 13 days ago - Pushed at: about 1 year ago - Stars: 13 - Forks: 1

SajiJohnMiranda/DoCoreAI

DoCoreAI is a next-gen open-source AI profiler that optimizes reasoning, creativity, precision, and temperature in a single step, cutting token usage by 15-30% and lowering LLM API costs

Language: Python - Size: 2.43 MB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 44 - Forks: 0

Q-Aware-Labs/Evaluating_AI_Web_Search

This repository contains a study comparing the web search capabilities of four AI assistants: Gemini 2.0 Flash, ChatGPT-4 Turbo, DeepSeekR1, and Grok 3

Language: Python - Size: 1.45 MB - Last synced at: 2 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

yingjiahao14/Dual-Eval

Repository for the paper "Disentangling Language Medium and Cultural Context for Evaluating Multilingual Large Language Models"

Language: Python - Size: 2.19 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

Language: Python - Size: 9.37 MB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 242 - Forks: 41

iMeanAI/WebCanvas

All-in-one Web Agent framework for post-training. Start building with a few clicks!

Language: Python - Size: 61 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 255 - Forks: 18

norikinishida/mllm-gesture-eval

Code and dataset for evaluating Multimodal LLMs on indexical, iconic, and symbolic gestures (Nishida et al., ACL 2025)

Language: Python - Size: 17.6 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

sergeyklay/factly

CLI tool to evaluate LLM factuality on MMLU benchmark.

Language: Python - Size: 651 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 2 - Forks: 0
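
factly's internals are not shown in this listing; this is a generic MMLU-style scoring loop, with `ask_model` as a hypothetical callable that returns the model's answer text.

```python
# Generic MMLU-style scoring loop (not factly's internals); `ask_model`
# is a hypothetical callable returning the model's answer text.
def score_mmlu(questions, ask_model) -> float:
    correct = 0
    for q in questions:  # each q: {"question", "choices" (4), "answer" ("A".."D")}
        letters = ["A", "B", "C", "D"]
        options = "\n".join(f"{l}. {c}" for l, c in zip(letters, q["choices"]))
        prompt = (f"{q['question']}\n{options}\n"
                  "Answer with a single letter (A-D).")
        reply = ask_model(prompt).strip().upper()
        if reply[:1] == q["answer"]:
            correct += 1
    return correct / len(questions)
```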

reddgr/chatbot-response-scoring-scbn-rqtl

Scoring LLM chatbot responses from the LMSYS Chatbot Arena with SCBN and RQTL metrics, unwrapping Chatbot Arena prompts, quick chatbot setup in a Jupyter notebook, and more. All things chatbot fit in this repo.

Language: Jupyter Notebook - Size: 28.6 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

chaosync-org/awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

Size: 0 Bytes - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

IBM/cora

Improving the score reliability of multiple-choice benchmarks with consistency evaluation and altered answer choices.

Language: Python - Size: 72.3 KB - Last synced at: 13 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

cedrickchee/vibe-jet

A browser-based 3D multiplayer flying game with arcade-style mechanics, created using the Gemini 2.5 Pro through a technique called "vibe coding"

Language: HTML - Size: 8.65 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 51 - Forks: 9

SapienzaNLP/ita-bench

A collection of Italian benchmarks for LLM evaluation

Language: Python - Size: 731 KB - Last synced at: 10 days ago - Pushed at: 26 days ago - Stars: 30 - Forks: 1

EveripediaNetwork/grokit

Unofficial Python SDK for Grok, for any X Premium account

Language: Python - Size: 34.2 KB - Last synced at: 16 days ago - Pushed at: 8 months ago - Stars: 16 - Forks: 4

ronniross/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 143 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 2 - Forks: 0

open-compass/GTA

[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

Language: Python - Size: 9.98 MB - Last synced at: 13 days ago - Pushed at: 3 months ago - Stars: 94 - Forks: 8

AgiFlow/agiflow-sdks

LLM QA, Observability, Evaluation and User Feedback

Language: TypeScript - Size: 2.5 MB - Last synced at: 19 days ago - Pushed at: 11 months ago - Stars: 23 - Forks: 1

dr-gareth-roberts/LLM-Dev

Python Tools for Developing with LLMs (cloud & offline)

Language: Python - Size: 286 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

ibra-kdbra/Echo_Assistant

Autonomous Agent Partner

Language: TypeScript - Size: 8.57 MB - Last synced at: 8 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

AmadeusITGroup/privatetestsetgenerationforLLMeval

A tool for generating evaluation sets for LLM-based chatbots in a diverse and private manner

Language: Python - Size: 124 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

LLM-Evaluation-s-Always-Fatiguing/leaf-playground

A framework for building scenario-simulation projects in which both humans and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.

Language: Python - Size: 868 KB - Last synced at: 13 days ago - Pushed at: about 1 year ago - Stars: 25 - Forks: 0

proxectonos/simil-eval

Multilingual toolkit for evaluating LLMs using embeddings

Language: Python - Size: 89.8 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 1

EthanManners/TGCSM-CIRCUIT

The original containment framework for recursion-stable cognition, collapse-resistant logic, and LLM self-reflection.

Size: 108 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

nhsengland/evalsense

Tools for systematic large language model evaluations

Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

athina-ai/athina-evals

Python SDK for running evaluations on LLM generated responses

Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

rhesis-ai/rhesis-sdk

Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.

Language: Python - Size: 420 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 0

kolenaIO/autoarena

Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation

Language: TypeScript - Size: 2.52 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 104 - Forks: 8
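
Automated head-to-head ranking typically rests on Elo-style rating updates; whether AutoArena uses exactly this scheme is an assumption, but the update rule below is the standard one.

```python
# Standard Elo update, the kind of rating used for head-to-head ranking
# (whether AutoArena uses exactly this scheme is an assumption).
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1500, 1500, 1.0))  # the winner gains what the loser drops
```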

ronniross/llm-heatmap-visualizer

A set of scripts to generate full attention-head heatmaps for transformer-based LLMs

Language: Jupyter Notebook - Size: 3.01 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

azminewasi/Awesome-LLMs-ICLR-24

A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.

Size: 821 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 62 - Forks: 3

cburst/LLMscripting

This is a series of Python scripts for zero-shot and chain-of-thought LLM scripting

Language: Python - Size: 7.43 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Mihir3009/GridPuzzle

An evaluation dataset comprising 274 grid-based puzzles of varying complexity

Size: 5.24 MB - Last synced at: 26 days ago - Pushed at: 12 months ago - Stars: 7 - Forks: 1

Orion-zhen/llm-throughput-eval

Evaluate LLM generation speed via API

Language: Python - Size: 35.2 KB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
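
A sketch of timing generation speed over an OpenAI-compatible endpoint; the base URL, model name, and reliance on the server-reported usage field are illustrative assumptions, not this repo's code.

```python
# Throughput-measurement sketch over an OpenAI-compatible API;
# endpoint and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="my-model",  # placeholder for the served model
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens  # server-reported token count
print(f"{tokens} tokens in {elapsed:.2f}s = {tokens / elapsed:.1f} tok/s")
```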

Rahul-Lashkari/LLM-Ecosystem-Enhancement

Fine-tuning and benchmarking of 12+ LLMs (Gemma family, Mistral, LLaMA, etc.) across 6+ datasets (GSM8K, BoolQ, IMDB, Alpaca-GPT4, and more), covering model training, evaluation, insights, DeepMind benchmark comparisons, and documentation. Also crafting a custom dataset from open-sourced V0 system prompts. 🛰

Size: 241 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

alopatenko/LLMEvaluation

A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these methods.

Language: HTML - Size: 4.14 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 116 - Forks: 8

mbayers6370/ALIGN-framework

Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.

Language: Python - Size: 15.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

loganrjmurphy/LeanEuclid

LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.

Language: Lean - Size: 3.57 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 98 - Forks: 8

ZeroSumEval/ZeroSumEval

A framework for pitting LLMs against each other in an evolving library of games ⚔

Language: Python - Size: 10.4 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 32 - Forks: 4

Fbxfax/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 96.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

kraina-ai/geospatial-code-llms-dataset

Companion repository for "Evaluation of Code LLMs on Geospatial Code Generation" paper

Language: Python - Size: 6.64 MB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 1

Supahands/llm-comparison

An open-source project for comparing two LLMs head-to-head on a given prompt. It supports a wide range of models, from open-source Ollama models to hosted ones like OpenAI and Claude

Language: TypeScript - Size: 888 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 23 - Forks: 1