GitHub topics: llm-evaluation
tuitige/fijian-rag-app
Public-benefit GenAI platform for the Fijian language — combining Claude + RAG + OpenSearch to build verified datasets, preserve culture, and unlock AI access in the Pacific. LLM Fine-Tuning, RAG, Generative AI and learning
Language: TypeScript - Size: 2.12 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 2 - Forks: 0

harshtiwari01/llm-heatmap-visualizer
A set of scripts to generate full attention-head heatmaps for transformer-based LLMs
Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: about 6 hours ago - Pushed at: about 6 hours ago - Stars: 2 - Forks: 0
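
A minimal sketch of the kind of heatmap generation this describes, assuming a Hugging Face GPT-2 model (the repo's own scripts may differ):

```python
# Hedged sketch: plot one attention head's weights for GPT-2.
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

layer, head = 0, 0
attn = outputs.attentions[layer][0, head].numpy()  # (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Layer {layer}, head {head}")
plt.colorbar()
plt.savefig("attention_head.png", bbox_inches="tight")
```

Full heatmaps, as the description suggests, would loop `layer` and `head` over `model.config.n_layer` and `model.config.n_head`.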

ValueByte-AI/Awesome-LLM-in-Social-Science
Awesome papers involving LLMs in Social Science.
Size: 135 KB - Last synced at: about 17 hours ago - Pushed at: about 18 hours ago - Stars: 500 - Forks: 37

msoedov/agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Language: Python - Size: 21.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1,481 - Forks: 225

onlyhouse/ai-cookbook
This repository offers practical code snippets and tutorials for building AI systems, making it easy to integrate AI into your projects. Explore the resources to enhance your skills and boost your freelancing career! 🐙💻
Language: Python - Size: 163 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

cvs-health/uqlm
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
Language: Python - Size: 11.7 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 741 - Forks: 68
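
One common UQ signal for hallucination detection is self-consistency across sampled answers. A minimal generic sketch (not uqlm's actual API), where `sample_llm` is a hypothetical temperature > 0 model call:

```python
# Hedged sketch: low agreement across samples suggests possible hallucination.
from difflib import SequenceMatcher

def agreement_score(answers):
    """Mean pairwise string similarity across sampled answers."""
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# `sample_llm` would be a hypothetical model call, e.g.:
# answers = [sample_llm("Who wrote Middlemarch?") for _ in range(5)]
answers = ["George Eliot", "George Eliot", "Mary Ann Evans (George Eliot)"]
print(f"agreement: {agreement_score(answers):.2f}")  # low score => flag for review
```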

Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Language: Python - Size: 171 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2,844 - Forks: 332

PrismBench/PrismBench
PrismBench: A comprehensive framework for evaluating Large Language Model capabilities through Monte Carlo Tree Search. Systematically maps model strengths, automatically discovers challenging concept combinations, and provides detailed performance analysis with containerized deployment and OpenAI-compatible API support.
Language: Python - Size: 5.16 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

lmnr-ai/lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Language: TypeScript - Size: 32.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2,082 - Forks: 126

microsoft/prompty
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
Language: Python - Size: 5.58 MB - Last synced at: 2 days ago - Pushed at: 19 days ago - Stars: 937 - Forks: 91

kieranklaassen/leva
LLM evaluation framework for Rails apps, designed for use with production data.
Language: Ruby - Size: 279 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 52 - Forks: 1

confident-ai/deepeval
The LLM Evaluation Framework
Language: Python - Size: 84.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 8,321 - Forks: 719

trustyai-explainability/vllm_judge
A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.
Language: Python - Size: 765 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 1
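
A hedged sketch of the LLM-as-a-Judge pattern against a vLLM OpenAI-compatible endpoint; the base URL, model name, and rubric here are assumptions, not this library's API:

```python
# Hedged sketch: score an answer with a judge model served by vLLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge(question, answer):
    prompt = (
        "Rate the answer from 1-5 for correctness and relevance. "
        "Reply with a single digit.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whatever vLLM is serving
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(judge("What is the capital of France?", "Paris"))
```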

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.13 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 39 - Forks: 2

langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Language: TypeScript - Size: 21.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 12,742 - Forks: 1,163

comet-ml/opik
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Language: Python - Size: 245 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 9,851 - Forks: 675

equinor/promptly
A prompt collection for testing and evaluation of LLMs.
Language: Jupyter Notebook - Size: 2.05 MB - Last synced at: about 23 hours ago - Pushed at: 20 days ago - Stars: 18 - Forks: 2

AI4Bharat/Anudesh-Frontend
Language: JavaScript - Size: 167 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 12 - Forks: 7

Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 337 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5,985 - Forks: 460

truera/trulens
Evaluation and Tracking for LLM Experiments
Language: Python - Size: 344 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,571 - Forks: 217

JonathanChavezTamales/llm-leaderboard
A comprehensive set of LLM benchmark scores and provider prices.
Language: JavaScript - Size: 332 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 227 - Forks: 21

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 223 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 7,226 - Forks: 575

adgomant/delean-batch-manager
A toolkit for managing OpenAI Batch API jobs to obtain Demand Level Annotations under ADeLe v1.0 framework — includes a Python API and CLI.
Language: Python - Size: 307 KB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

NVIDIA/garak
the LLM vulnerability scanner
Language: Python - Size: 4.91 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,587 - Forks: 458

Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language: Python - Size: 70.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,033 - Forks: 318

Helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Language: TypeScript - Size: 498 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 3,934 - Forks: 388

whitecircle-ai/circle-guard-bench
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Language: Python - Size: 20.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 39 - Forks: 2

PetroIvaniuk/llms-tools
A list of LLM tools & projects
Size: 242 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 246 - Forks: 28

Atorpor/TGCSM-CIRCUIT
The original containment framework for recursion-stable cognition, collapse-resistant logic, and LLM self-reflection.
Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

attogram/ai_test_zone
AI Test Zone - compare the same prompt against many open source LLMs
Language: HTML - Size: 2.57 MB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

ntts9990/ragas-test
🧪 RAG evaluation dashboard powered by RAGAS framework - Clean Architecture, 99.75% coverage, Docker ready
Language: Python - Size: 441 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

TranMinhChienn/awesome-ai-agent-testing
Awesome AI Agent Testing is your go-to resource for testing AI agents effectively. Explore frameworks, methodologies, and tools to ensure the reliability and performance of these advanced systems. 🚀🤖
Size: 1.93 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

ako1983/airops_integration_agent
A prototype AI agent that helps users configure integration actions using natural language requests and available context variables. Built with LangGraph, Claude AI, and includes observability through LangSmith and Weights & Biases.
Language: Python - Size: 228 KB - Last synced at: 4 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

attogram/ollama-multirun
A bash shell script to run a single prompt against any or all of your locally installed ollama models, saving the output and performance statistics as easily navigable web pages.
Language: Shell - Size: 4.02 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 6 - Forks: 1
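
The repo itself is a bash script; a rough Python equivalent of the core loop against Ollama's default REST API might look like this:

```python
# Hedged sketch: run one prompt against every locally installed Ollama
# model and record output plus wall-clock timing.
import time
import requests

BASE = "http://localhost:11434"
prompt = "Explain retrieval-augmented generation in one sentence."

models = [m["name"] for m in requests.get(f"{BASE}/api/tags").json()["models"]]
for name in models:
    start = time.time()
    r = requests.post(
        f"{BASE}/api/generate",
        json={"model": name, "prompt": prompt, "stream": False},
    ).json()
    print(f"{name}: {time.time() - start:.1f}s\n{r['response']}\n")
```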

rotationalio/parlance
An LLM evaluation tool that uses a model-to-model qualitative comparison metric.
Language: Python - Size: 5.07 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Language: Python - Size: 30.7 MB - Last synced at: 6 days ago - Pushed at: 10 days ago - Stars: 215 - Forks: 33

Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
Language: Python - Size: 176 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 4,614 - Forks: 327

stephen1cowley/doubt-injection
An LLM technique that stochastically injects tokens at inference time to redirect the chain-of-thought reasoning process
Language: HTML - Size: 2.33 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0
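
A hedged sketch of the idea as described (not the repo's code): during greedy decoding, occasionally splice a doubt marker into the sequence to redirect the reasoning.

```python
# Hedged sketch: stochastic token injection during incremental decoding.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("Q: 17 * 24 = ? Let's think step by step.", return_tensors="pt").input_ids
doubt_ids = tok(" Wait,", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(60):
        if random.random() < 0.05:  # injection probability p (an assumption)
            ids = torch.cat([ids, doubt_ids], dim=-1)  # splice in the doubt marker
        else:
            next_id = model(ids).logits[:, -1].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```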

stephen1cowley/additive-cad
An improved contrastive decoding approach that can better resolve LLM knowledge conflicts between prior belief and contextual information
Language: Python - Size: 1.26 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 2 - Forks: 0
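
For context, the standard context-aware decoding (CAD) baseline that such work builds on combines logits computed with and without the context, amplifying what the context adds. A toy sketch (the repo's additive variant differs in detail):

```python
# Hedged sketch of the CAD baseline: (1 + a) * with_ctx - a * without_ctx.
import torch

def cad_logits(logits_with_ctx, logits_without_ctx, alpha=0.5):
    """Amplify the contextual update over the model's prior belief."""
    return (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx

with_ctx = torch.tensor([2.0, 0.5, -1.0])     # next-token logits given context + question
without_ctx = torch.tensor([1.9, 1.2, -1.0])  # next-token logits given question only
print(torch.softmax(cad_logits(with_ctx, without_ctx), dim=-1))
```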

Addepto/contextcheck
MIT-licensed framework for testing LLMs, RAG systems, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
Language: Python - Size: 464 KB - Last synced at: 11 days ago - Pushed at: 6 months ago - Stars: 72 - Forks: 9

HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-judge evals as Semantic Kernel Plugins
Language: C# - Size: 2.04 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 1

AndreSchuck/EvaluatingLargeLanguageModelsforBrazilianPortugueseSentimentAnalysis
Comparative study of 23 LLMs for Brazilian Portuguese sentiment analysis via in-context learning. Evaluates multilingual vs Portuguese-specialized models across 12 datasets. Code and data included.
Size: 2.93 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aimed at exploring the technical boundaries of generative AI.
Size: 12.6 MB - Last synced at: 9 days ago - Pushed at: 8 months ago - Stars: 540 - Forks: 45

mtuann/llm-updated-papers
Papers related to Large Language Models in all top venues
Size: 750 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 8 - Forks: 1

PromptMixerDev/prompt-mixer-app-ce
A desktop application for comparing outputs from different Large Language Models (LLMs).
Language: TypeScript - Size: 4.16 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 65 - Forks: 6

zackiles/ia-for-ai
Applied linguistic systems and information architecture for AI in the workplace
Language: TypeScript - Size: 3 MB - Last synced at: 7 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

ChanLiang/CONNER
[EMNLP 2023] Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators
Language: Python - Size: 15.8 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 2

yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
Language: TypeScript - Size: 553 KB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

Node0/hypercortex
A TUI-based Swiss-army-knife and analysis tool for language models
Size: 16.6 KB - Last synced at: 5 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

fkapsahili/EntRAG
EntRAG - Enterprise RAG Benchmark
Language: Python - Size: 168 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 2 - Forks: 0

palico-ai/palico-ai
Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework
Language: TypeScript - Size: 13.7 MB - Last synced at: 10 days ago - Pushed at: 7 months ago - Stars: 340 - Forks: 27

mts-ai/rurage
Language: Python - Size: 3.85 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 32 - Forks: 0

danpozmanter/llm-comparative-eval
Compare how LLMs stack up
Language: Rust - Size: 43.9 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

kpavlov/quarkus-assistant-demo
Language: Kotlin - Size: 2.24 MB - Last synced at: 4 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

jaavier/boilerplate-gemini-golang
A boilerplate for building applications with Google's Gemini LLM. It includes examples.
Language: Go - Size: 43.9 KB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 3 - Forks: 0

cyberark/FuzzyAI
A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
Language: Jupyter Notebook - Size: 16.1 MB - Last synced at: 17 days ago - Pushed at: 18 days ago - Stars: 588 - Forks: 60

Samya-S/Building-LLMs-from-scratch
A hands-on guide to implementing Large Language Models from scratch
Language: Jupyter Notebook - Size: 47.6 MB - Last synced at: 15 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Language: Python - Size: 1.92 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 497 - Forks: 35

MLD3/steerability
An open-source evaluation framework for measuring LLM steerability.
Language: Jupyter Notebook - Size: 73.9 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 0

aws-samples/tailoring-foundation-models-for-business-needs-guide-to-rag-fine-tuning-hybrid-approaches
A framework for evaluating various customization techniques for foundation models (FMs) using your own dataset. This includes approaches like RAG, fine-tuning, and a hybrid method that combines both
Language: Python - Size: 450 KB - Last synced at: 17 days ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

armingh2000/FactScoreLite
FactScoreLite is an implementation of the FactScore metric, designed for detailed accuracy assessment in text generation. This package builds upon the framework provided by the original FactScore repository, which is no longer maintained and contains outdated functions.
Language: Python - Size: 674 KB - Last synced at: 13 days ago - Pushed at: about 1 year ago - Stars: 13 - Forks: 1
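
The FactScore recipe itself is simple to state: decompose a generation into atomic facts, verify each against a knowledge source, and report the supported fraction. A toy sketch with hypothetical stand-ins for the LLM-backed steps:

```python
# Hedged sketch of the FactScore recipe; `decompose` and `is_supported`
# stand in for LLM-backed decomposition and verification.
def factscore(generation, decompose, is_supported):
    """Fraction of atomic facts supported by the knowledge source."""
    facts = decompose(generation)
    return sum(map(is_supported, facts)) / len(facts) if facts else 0.0

# Toy stand-ins so the sketch runs end to end:
decompose = lambda text: [s.strip() for s in text.split(".") if s.strip()]
is_supported = lambda fact: "capital" not in fact  # hypothetical verifier
print(factscore("Paris is in France. Paris is the capital of Germany.", decompose, is_supported))
```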

SajiJohnMiranda/DoCoreAI
DoCoreAI is a next-gen open-source AI profiler that optimizes reasoning, creativity, precision, and temperature in a single step, cutting token usage by 15-30% and lowering LLM API costs
Language: Python - Size: 2.43 MB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 44 - Forks: 0

Q-Aware-Labs/Evaluating_AI_Web_Search
This repository contains a study comparing the web search capabilities of four AI assistants: Gemini 2.0 Flash, ChatGPT-4 Turbo, DeepSeekR1, and Grok 3
Language: Python - Size: 1.45 MB - Last synced at: 2 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

yingjiahao14/Dual-Eval
Repository for the paper "Disentangling Language Medium and Cultural Context for Evaluating Multilingual Large Language Models"
Language: Python - Size: 2.19 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 242 - Forks: 41

iMeanAI/WebCanvas
All-in-one Web Agent framework for post-training. Start building with a few clicks!
Language: Python - Size: 61 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 255 - Forks: 18

norikinishida/mllm-gesture-eval
Code and dataset for evaluating Multimodal LLMs on indexical, iconic, and symbolic gestures (Nishida et al., ACL 2025)
Language: Python - Size: 17.6 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

sergeyklay/factly
CLI tool to evaluate LLM factuality on MMLU benchmark.
Language: Python - Size: 651 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 2 - Forks: 0
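
A hedged sketch of the underlying loop (not factly's CLI): format an MMLU item as multiple choice and score the model's letter answer; `ask_llm` is a hypothetical helper wrapping a model API.

```python
# Hedged sketch: one MMLU item, formatted and scored.
from datasets import load_dataset

ds = load_dataset("cais/mmlu", "astronomy", split="test")
item = ds[0]
letters = "ABCD"
prompt = item["question"] + "\n" + "\n".join(
    f"{letters[i]}. {c}" for i, c in enumerate(item["choices"])
) + "\nAnswer with a single letter."

# predicted = ask_llm(prompt).strip()[0]   # hypothetical model call
predicted = "A"  # placeholder so the sketch runs
print("correct" if predicted == letters[item["answer"]] else "wrong")
```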

reddgr/chatbot-response-scoring-scbn-rqtl
Scoring LLM chatbot responses from LMSYS Chatbot Arena with SCBN and RQTL metrics, unwrapping Chatbot Arena prompts, quick chatbot setup in a Jupyter notebook, and more: all things chatbot fit in this repo.
Language: Jupyter Notebook - Size: 28.6 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

chaosync-org/awesome-ai-agent-testing
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
Size: 0 Bytes - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

IBM/cora
Improving score reliability of multiple choice benchmarks with consistency evaluation and altered answer choices.
Language: Python - Size: 72.3 KB - Last synced at: 13 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0
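
A hedged sketch of the consistency idea described (not cora's code): re-ask a question with shuffled answer choices and measure how often the model sticks with the same underlying answer; `ask_llm` is a hypothetical model call.

```python
# Hedged sketch: consistency under shuffled multiple-choice options.
import random

def consistency(question, choices, ask_llm, n_trials=5):
    """Fraction of trials agreeing with the modal pick under shuffled options."""
    picks = []
    for _ in range(n_trials):
        order = random.sample(choices, k=len(choices))
        letters = "ABCDEFGH"[: len(order)]
        prompt = question + "\n" + "\n".join(
            f"{l}. {c}" for l, c in zip(letters, order)
        ) + "\nAnswer with a single letter."
        letter = ask_llm(prompt)  # hypothetical model call returning e.g. "B"
        picks.append(order[letters.index(letter)])
    return picks.count(max(set(picks), key=picks.count)) / n_trials

# Toy stand-in that always answers "A", so consistency tracks the shuffle:
print(consistency("2 + 2 = ?", ["4", "5", "3", "22"], lambda p: "A"))
```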

cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created with Gemini 2.5 Pro through a technique called "vibe coding"
Language: HTML - Size: 8.65 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 51 - Forks: 9

SapienzaNLP/ita-bench
A collection of Italian benchmarks for LLM evaluation
Language: Python - Size: 731 KB - Last synced at: 10 days ago - Pushed at: 26 days ago - Stars: 30 - Forks: 1

EveripediaNetwork/grokit
Unofficial Grok Python SDK for any X Premium account
Language: Python - Size: 34.2 KB - Last synced at: 16 days ago - Pushed at: 8 months ago - Stars: 16 - Forks: 4

ronniross/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 143 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 2 - Forks: 0
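
One widely used confidence signal, shown here as a hedged sketch rather than this repo's method, is the mean token log-probability of the output, available from OpenAI-compatible APIs via `logprobs`:

```python
# Hedged sketch: mean token probability as a crude confidence estimate.
import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any logprobs-capable model works
    messages=[{"role": "user", "content": "Who wrote Middlemarch?"}],
    logprobs=True,
)
lps = [t.logprob for t in resp.choices[0].logprobs.content]
confidence = math.exp(sum(lps) / len(lps))  # geometric-mean token probability
print(f"{resp.choices[0].message.content!r} confidence={confidence:.2f}")
```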

open-compass/GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
Language: Python - Size: 9.98 MB - Last synced at: 13 days ago - Pushed at: 3 months ago - Stars: 94 - Forks: 8

AgiFlow/agiflow-sdks
LLM QA, Observability, Evaluation and User Feedback
Language: TypeScript - Size: 2.5 MB - Last synced at: 19 days ago - Pushed at: 11 months ago - Stars: 23 - Forks: 1

dr-gareth-roberts/LLM-Dev
Python Tools for Developing with LLMs (cloud & offline)
Language: Python - Size: 286 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

ibra-kdbra/Echo_Assistant
Autonomous Agent Partner
Language: TypeScript - Size: 8.57 MB - Last synced at: 8 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

AmadeusITGroup/privatetestsetgenerationforLLMeval
A tool for generating evaluation sets for LLM-based chatbots in a diverse and private manner
Language: Python - Size: 124 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

LLM-Evaluation-s-Always-Fatiguing/leaf-playground
A framework for building scenario-simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.
Language: Python - Size: 868 KB - Last synced at: 13 days ago - Pushed at: about 1 year ago - Stars: 25 - Forks: 0

proxectonos/simil-eval
Multilingual toolkit for evaluating LLMs using embeddings
Language: Python - Size: 89.8 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 1
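
A hedged sketch of embedding-based evaluation (the toolkit's exact method may differ): embed the model output and a reference, then score by cosine similarity.

```python
# Hedged sketch: cosine similarity between output and reference embeddings.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
output = "The mitochondria produce energy for the cell."
reference = "Mitochondria generate most of the cell's ATP."
vecs = embedder.encode([output, reference])
print(f"similarity: {cos_sim(vecs[0], vecs[1]).item():.2f}")
```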

EthanManners/TGCSM-CIRCUIT
The original containment framework for recursion-stable cognition, collapse-resistant logic, and LLM self-reflection.
Size: 108 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

nhsengland/evalsense
Tools for systematic large language model evaluations
Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

rhesis-ai/rhesis-sdk
Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.
Language: Python - Size: 420 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 0

kolenaIO/autoarena
Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation
Language: TypeScript - Size: 2.52 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 104 - Forks: 8
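
Head-to-head evaluation like this typically feeds judge verdicts into an Elo-style rating; a minimal sketch of that aggregation step (autoarena's internals may differ):

```python
# Hedged sketch: Elo updates from pairwise judge decisions.
def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo: move ratings toward the observed outcome."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1 - score_a) - (1 - expected_a))
    return r_a, r_b

ratings = {"model-a": 1000.0, "model-b": 1000.0}
for winner, loser in [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], True)
print(ratings)
```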

ronniross/llm-heatmap-visualizer
A set of scripts to generate full attention-head heatmaps for transformer-based LLMs
Language: Jupyter Notebook - Size: 3.01 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

azminewasi/Awesome-LLMs-ICLR-24
A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) 2024.
Size: 821 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 62 - Forks: 3

cburst/LLMscripting
A series of Python scripts for zero-shot and chain-of-thought LLM scripting
Language: Python - Size: 7.43 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Mihir3009/GridPuzzle
An evaluation dataset comprising 274 grid-based puzzles of varying complexity
Size: 5.24 MB - Last synced at: 26 days ago - Pushed at: 12 months ago - Stars: 7 - Forks: 1

Orion-zhen/llm-throughput-eval
Evaluate LLM generation speed via API
Language: Python - Size: 35.2 KB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
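
A hedged sketch of the measurement (not this repo's code): stream a completion from an OpenAI-compatible endpoint and approximate tokens per second by counting streamed chunks; the base URL and model name are assumptions.

```python
# Hedged sketch: rough tokens/s from a streaming chat completion.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
start, chunks = time.time(), 0
stream = client.chat.completions.create(
    model="my-model",  # assumed: whatever the endpoint serves
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # approximates one token per streamed chunk
print(f"~{chunks / (time.time() - start):.1f} tokens/s")
```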

Rahul-Lashkari/LLM-Ecosystem-Enhancement
Fine-tuning and benchmarking of 12+ LLMs (Gemma family, Mistral, LLaMA, etc.) across 6+ datasets (GSM8K, BoolQ, IMDB, Alpaca-GPT4, and more), delivering a research-level contribution: model training, evaluation, insights, DeepMind benchmark comparisons, and documentation. Also crafting a custom dataset from open-sourced V0 system prompts. 🛰
Size: 241 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

alopatenko/LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
Language: HTML - Size: 4.14 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 116 - Forks: 8

mbayers6370/ALIGN-framework
Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.
Language: Python - Size: 15.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

loganrjmurphy/LeanEuclid
LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.
Language: Lean - Size: 3.57 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 98 - Forks: 8

ZeroSumEval/ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
Language: Python - Size: 10.4 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 32 - Forks: 4

Fbxfax/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 96.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

kraina-ai/geospatial-code-llms-dataset
Companion repository for "Evaluation of Code LLMs on Geospatial Code Generation" paper
Language: Python - Size: 6.64 MB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 1

Supahands/llm-comparison
An open-source project for comparing two LLMs head to head on a given prompt. It supports a wide range of models, from open-source Ollama ones to the likes of OpenAI and Claude
Language: TypeScript - Size: 888 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 23 - Forks: 1
