An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-evaluation

whitecircle-ai/circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

Language: Python - Size: 19.9 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 32 - Forks: 1

PromptMixerDev/prompt-mixer-app-ce

A desktop application for comparing outputs from different Large Language Models (LLMs).

Language: TypeScript - Size: 2.35 MB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 59 - Forks: 6

NVIDIA/garak

the LLM vulnerability scanner

Language: Python - Size: 4.71 MB - Last synced at: about 9 hours ago - Pushed at: 1 day ago - Stars: 4,420 - Forks: 436

onejune2018/Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating LLMs and exploring the technical boundaries of generative AI.

Size: 12.6 MB - Last synced at: about 12 hours ago - Pushed at: 7 months ago - Stars: 526 - Forks: 44

Orion-zhen/llm-throughput-eval

Evaluate an LLM's generation speed via its API.

Language: Python - Size: 35.2 KB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 2 - Forks: 0
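
Generation-speed evaluation of this kind typically reduces to timing a streamed completion and dividing tokens by elapsed seconds. A minimal sketch (not this repo's code; `generate_stream` is a stub standing in for any streaming API client):

```python
import time
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    """Stub standing in for a streaming LLM API client."""
    for token in ["The", " quick", " brown", " fox", "."]:
        yield token

def measure_throughput(prompt: str) -> dict:
    """Time a streamed generation and report tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate_stream(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return {
        "tokens": n_tokens,
        "seconds": elapsed,
        "tokens_per_second": n_tokens / elapsed if elapsed > 0 else 0.0,
        "time_to_first_token": (first_token_at - start) if first_token_at else None,
    }

print(measure_throughput("Say something."))
```

Real harnesses report both tokens/second and time-to-first-token, since the two can diverge sharply under server load.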

JonathanChavezTamales/LLMStats

A comprehensive set of LLM benchmark scores and provider prices.

Language: JavaScript - Size: 300 KB - Last synced at: about 8 hours ago - Pushed at: 5 days ago - Stars: 198 - Forks: 17

Rahul-Lashkari/LLM-Ecosystem-Enhancement

Fine-tuning and benchmarking of 12+ LLMs (the Gemma family, Mistral, LLaMA, etc.) across 6+ datasets (GSM8K, BoolQ, IMDB, Alpaca-GPT4, and more), covering model training, evaluation, insights, DeepMind benchmark comparisons, and documentation. Also includes a custom dataset crafted from open-sourced V0 system prompts. 🛰

Size: 241 MB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 1 - Forks: 0

msoedov/agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

Language: Python - Size: 21.2 MB - Last synced at: about 13 hours ago - Pushed at: 9 days ago - Stars: 1,350 - Forks: 211

nhsengland/evalsense

Tools for systematic large language model evaluations

Language: Python - Size: 992 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

harshtiwari01/llm-heatmap-visualizer

A set of scripts to generate full attention-head heatmaps for transformer-based LLMs

Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

Helicone/helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

Language: TypeScript - Size: 463 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,740 - Forks: 370

alopatenko/LLMEvaluation

A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.

Language: HTML - Size: 4.14 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 116 - Forks: 8

Addepto/contextcheck

MIT-licensed framework for testing LLMs, RAG pipelines, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

Language: Python - Size: 464 KB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 67 - Forks: 9

mbayers6370/ALIGN-framework

Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.

Language: Python - Size: 15.6 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0
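
Semantic-alignment metrics like the one named above are usually cosine similarity between embedding vectors of a reference and a response. A toy sketch (the vectors here are made up; real ones would come from a sentence encoder):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; in practice these come from an encoder model.
ref = [0.1, 0.9, 0.2]
resp = [0.12, 0.85, 0.25]
print(cosine_similarity(ref, resp))  # close to 1.0 for similar vectors
```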

aws-samples/tailoring-foundation-models-for-business-needs-guide-to-rag-fine-tuning-hybrid-approaches

A framework for evaluating various customization techniques for foundation models (FMs) using your own dataset. This includes approaches like RAG, fine-tuning, and a hybrid method that combines both.

Language: Python - Size: 450 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

ChanLiang/CONNER

The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators"

Language: Python - Size: 15.8 MB - Last synced at: about 11 hours ago - Pushed at: over 1 year ago - Stars: 31 - Forks: 2

Giskard-AI/giskard

🐢 Open-Source Evaluation & Testing for AI & LLM systems

Language: Python - Size: 176 MB - Last synced at: 4 days ago - Pushed at: 10 days ago - Stars: 4,519 - Forks: 321

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Language: Python - Size: 166 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,682 - Forks: 312

langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Language: TypeScript - Size: 20.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 11,141 - Forks: 1,000

cvs-health/uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

Language: Python - Size: 11.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 14 - Forks: 7
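
One common UQ-based hallucination signal (not necessarily UQLM's exact method) is answer consistency: sample the same question several times at nonzero temperature and treat low agreement as low confidence. A toy sketch with the sampler stubbed out:

```python
from collections import Counter

def sample_answers(question: str, n: int = 5) -> list[str]:
    """Stub standing in for n temperature>0 samples from an LLM."""
    return ["Paris", "Paris", "Paris", "Lyon", "Paris"][:n]

def consistency_score(question: str, n: int = 5) -> float:
    """Fraction of samples agreeing with the majority answer.
    A low score suggests the model is uncertain (possible hallucination)."""
    answers = sample_answers(question, n)
    _majority, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

print(consistency_score("What is the capital of France?"))  # 0.8 with the stub
```

Production detectors typically cluster semantically equivalent answers before counting agreement, rather than comparing strings exactly.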

comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Language: Python - Size: 164 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 7,266 - Forks: 511

Arize-ai/phoenix

AI Observability & Evaluation

Language: Jupyter Notebook - Size: 301 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5,564 - Forks: 413

AI4Bharat/Anudesh-Frontend

Language: JavaScript - Size: 166 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 12 - Forks: 7

mtuann/llm-updated-papers

Papers related to Large Language Models in all top venues

Size: 690 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5 - Forks: 1

microsoft/prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

Language: Python - Size: 5.47 MB - Last synced at: 4 days ago - Pushed at: 17 days ago - Stars: 871 - Forks: 85

confident-ai/deepeval

The LLM Evaluation Framework

Language: Python - Size: 78 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6,185 - Forks: 536

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Language: TypeScript - Size: 356 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6,411 - Forks: 521

rotationalio/parlance

An LLM evaluation tool that uses a model-to-model qualitative comparison metric.

Language: Python - Size: 5.08 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

Language: Python - Size: 9.37 MB - Last synced at: 1 day ago - Pushed at: 6 months ago - Stars: 239 - Forks: 40

sergeyklay/factly

CLI tool to evaluate LLM factuality on the MMLU benchmark.

Language: Python - Size: 790 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2 - Forks: 0
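
MMLU-style factuality scoring boils down to multiple-choice exact match: prompt with a question and options A-D, compare the model's letter to the gold answer, and average. A sketch with the model call stubbed (item fields are illustrative, not MMLU's actual schema):

```python
def ask_model(question: str, options: list[str]) -> str:
    """Stub standing in for an LLM constrained to answer with a letter A-D."""
    return "B"

def mc_accuracy(items: list[dict]) -> float:
    """Exact-match accuracy over multiple-choice items."""
    correct = sum(
        ask_model(it["question"], it["options"]) == it["answer"] for it in items
    )
    return correct / len(items)

items = [
    {"question": "Boiling point of water at 1 atm?",
     "options": ["90 C", "100 C", "110 C", "120 C"], "answer": "B"},
    {"question": "Largest planet in the Solar System?",
     "options": ["Mars", "Jupiter", "Venus", "Earth"], "answer": "B"},
]
print(mc_accuracy(items))  # 1.0 with the stubbed model
```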

cyberark/FuzzyAI

A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

Language: Jupyter Notebook - Size: 16.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 539 - Forks: 56

lmnr-ai/lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

Language: TypeScript - Size: 30.7 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,953 - Forks: 117

kimtth/awesome-azure-openai-llm

A curated list of 🌌 Azure OpenAI, 🦙 Large Language Models (incl. RAG, Agent), and references with memos.

Language: Python - Size: 285 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 357 - Forks: 43

loganrjmurphy/LeanEuclid

LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.

Language: Lean - Size: 3.57 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 98 - Forks: 8

Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Language: Python - Size: 70 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3,892 - Forks: 305

cvs-health/langfair

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

Language: Python - Size: 30.4 MB - Last synced at: 8 days ago - Pushed at: 16 days ago - Stars: 202 - Forks: 31

ValueByte-AI/Awesome-LLM-in-Social-Science

Awesome papers involving LLMs in Social Science.

Size: 133 KB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 434 - Forks: 30

ZeroSumEval/ZeroSumEval

A framework for pitting LLMs against each other in an evolving library of games ⚔

Language: Python - Size: 10.4 MB - Last synced at: 5 days ago - Pushed at: 24 days ago - Stars: 32 - Forks: 4

ronniross/llm-heatmap-visualizer

A set of scripts to generate full attention-head heatmaps for transformer-based LLMs

Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

Fbxfax/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 96.7 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

kieranklaassen/leva

LLM Evaluation Framework for Rails apps to be used with production data.

Language: HTML - Size: 279 KB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 20 - Forks: 1

ronniross/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 0 Bytes - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

HillPhelmuth/LlmAsJudgeEvalPlugins

LLM-as-judge evals as Semantic Kernel Plugins

Language: C# - Size: 890 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 6 - Forks: 1
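
The LLM-as-judge pattern behind plugins like these: format the candidate output and a rubric into a judge prompt, then parse a numeric score from the judge's reply. A minimal sketch (the judge call is stubbed; the prompt wording is illustrative, not Semantic Kernel's):

```python
import re

JUDGE_PROMPT = """Rate the RESPONSE to the QUESTION on a 1-5 scale for accuracy.
QUESTION: {question}
RESPONSE: {response}
Reply with only: Score: <number>"""

def call_judge(prompt: str) -> str:
    """Stub standing in for a call to a judge LLM."""
    return "Score: 4"

def judge_score(question: str, response: str) -> int:
    """Ask the judge model for a rating and parse the 1-5 score."""
    reply = call_judge(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1))

print(judge_score("What is 2+2?", "4"))  # 4 with the stubbed judge
```

Parsing defensively matters in practice: judge models frequently add prose around the score, so a strict output format plus a regex fallback is the usual compromise.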

kraina-ai/geospatial-code-llms-dataset

Companion repository for "Evaluation of Code LLMs on Geospatial Code Generation" paper

Language: Python - Size: 6.64 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 1

Supahands/llm-comparison

An open-source project for comparing two LLMs head-to-head on a given prompt. It supports a wide range of models, from open-source Ollama models to the likes of OpenAI and Claude.

Language: TypeScript - Size: 888 KB - Last synced at: 3 days ago - Pushed at: about 2 months ago - Stars: 23 - Forks: 1

allenai/CommonGen-Eval

Evaluating LLMs with CommonGen-Lite

Language: Python - Size: 1.28 MB - Last synced at: about 12 hours ago - Pushed at: about 1 year ago - Stars: 90 - Forks: 3

cedrickchee/vibe-jet

A browser-based 3D multiplayer flying game with arcade-style mechanics, created using the Gemini 2.5 Pro through a technique called "vibe coding"

Language: HTML - Size: 8.65 MB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 8

equinor/promptly

A prompt collection for testing and evaluation of LLMs.

Language: Jupyter Notebook - Size: 1.75 MB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 17 - Forks: 1

athina-ai/athina-evals

Python SDK for running evaluations on LLM generated responses

Language: Python - Size: 1.82 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 277 - Forks: 17

Skripkon/llm_trainer

🤖 Train and evaluate LLMs with ease and fun 🦾

Language: Python - Size: 2.07 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 7 - Forks: 0

xhinini/LLM-Reasoning-Review

A curated collection of research papers on reasoning capabilities of Large Language Models (LLMs). This repository organizes and categorizes works that evaluate, benchmark, and analyze reasoning in LLMs, including methods, techniques, datasets, and survey papers.

Size: 26.4 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

fanny-jourdan/FairTranslate

Code for paper: "FairTranslate: an English-French Dataset for Gender Bias Evaluation in Machine Translation by Overcoming Gender Binarity" (accepted in FAccT 2025)

Language: Jupyter Notebook - Size: 1010 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

rhesis-ai/rhesis-sdk

Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.

Language: Python - Size: 420 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 17 - Forks: 0

azminewasi/Awesome-LLMs-ICLR-24

A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) 2024.

Size: 821 KB - Last synced at: 9 days ago - Pushed at: about 1 year ago - Stars: 61 - Forks: 3

VITA-Group/llm-kick

[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing llms: The truth is rarely pure and never simple.

Language: Python - Size: 7.11 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 23 - Forks: 5

dannylee1020/openpo

Building synthetic data for preference tuning

Language: Python - Size: 10.7 MB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 27 - Forks: 0

villagecomputing/superpipe

Superpipe - optimized LLM pipelines for structured data

Language: Python - Size: 11.2 MB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 110 - Forks: 3

prompt-foundry/typescript-sdk

The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.

Language: TypeScript - Size: 20.9 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 6 - Forks: 1

prompt-foundry/python-sdk

The prompt engineering, prompt management, and prompt evaluation tool for Python

Language: Python - Size: 20.7 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 7 - Forks: 0

SajiJohnMiranda/DoCoreAI

DoCoreAI is a next-gen open-source AI profiler that optimizes reasoning, creativity, precision, and temperature in a single step, cutting token usage by 15-30% and lowering LLM API costs.

Language: Python - Size: 1.88 MB - Last synced at: 21 days ago - Pushed at: 25 days ago - Stars: 43 - Forks: 1

alan-turing-institute/prompto

An open source library for asynchronous querying of LLM endpoints

Language: Python - Size: 6.74 MB - Last synced at: 6 days ago - Pushed at: 12 days ago - Stars: 27 - Forks: 1

multinear/multinear

Develop reliable AI apps

Language: Svelte - Size: 1.12 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 36 - Forks: 1

Node0/hypercortex

A TUI-based LM Swiss army knife and analysis tool

Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

cburst/LLMscripting

This is a series of Python scripts for zero-shot and chain-of-thought LLM scripting

Language: Python - Size: 7.42 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 1 - Forks: 0

reddgr/chatbot-response-scoring-scbn-rqtl

Scoring LLM chatbot responses from the LMSYS Chatbot Arena with SCBN and RQTL metrics, unwrapping Chatbot Arena prompts, quick chatbot setup in a Jupyter notebook, and more: all things chatbots fit in this repo.

Language: Jupyter Notebook - Size: 24.3 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 1 - Forks: 0

simranjeet97/Learn_RAG_from_Scratch_LLM

Learn Retrieval-Augmented Generation (RAG) from scratch using Hugging Face LLMs with LangChain or plain Python

Language: Jupyter Notebook - Size: 425 KB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 5 - Forks: 3

PetroIvaniuk/llms-tools

A list of LLMs Tools & Projects

Size: 187 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 222 - Forks: 26

google/litmus

Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI Application Development. It provides a robust platform with a user-friendly UI for streamlining the process of building and assessing the performance of your LLM-powered applications.

Language: Vue - Size: 303 MB - Last synced at: 22 days ago - Pushed at: 27 days ago - Stars: 31 - Forks: 4

SapienzaNLP/ita-bench

A collection of Italian benchmarks for LLM evaluation

Language: Python - Size: 728 KB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 30 - Forks: 0

mts-ai/rurage

Language: Python - Size: 3.85 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 27 - Forks: 0

seanivore/non-bio-life

Create a space that (A) makes sure people think critically about AI and don’t gloss over the details, and (B) can somehow attract and be consumed by AI themselves, so that they may better consider their place in the world and recognize which thoughts are their own and which are not.

Language: CSS - Size: 25.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

EveripediaNetwork/grokit

Grok Unofficial Python SDK for any X Premium account

Language: Python - Size: 34.2 KB - Last synced at: 7 days ago - Pushed at: 7 months ago - Stars: 15 - Forks: 4

PAIR-Systems-Inc/little-dorrit-editor

Multimodal benchmark for evaluating handwritten editorial correction in printed text.

Language: Python - Size: 13.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

parea-ai/parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: Python - Size: 5.48 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 76 - Forks: 6

aigc-apps/PertEval

[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations

Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 28 days ago - Pushed at: 6 months ago - Stars: 12 - Forks: 2

Cybonto/OllaDeck

OllaDeck is a purple technology stack for Generative AI (text modality) cybersecurity. It provides a comprehensive set of tools for both blue team and red team operations in the context of text-based generative AI.

Language: Python - Size: 82.9 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 17 - Forks: 2

palico-ai/palico-ai

Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework

Language: TypeScript - Size: 13.7 MB - Last synced at: 4 days ago - Pushed at: 6 months ago - Stars: 339 - Forks: 27

CodeEval-Pro/CodeEval-Pro

Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"

Language: Python - Size: 4.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 26 - Forks: 2

open-compass/GTA

[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

Language: Python - Size: 9.98 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 82 - Forks: 7

kereva-dev/kereva-scanner

Code scanner to check for issues in prompts and LLM calls

Language: Python - Size: 7.12 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 29 - Forks: 2

ibra-kdbra/Echo_Assistant

Autonomous Agent Partner

Language: TypeScript - Size: 9.04 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

hkust-nlp/dart-math

[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*

Language: Jupyter Notebook - Size: 4.18 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 100 - Forks: 4

johnsonhk88/Deep-Research-With-Web-Scraping-by-LLM-And-AI-Agent

Use an LLM/AI agent for web scraping (data collection) and data analysis with deep research

Language: Jupyter Notebook - Size: 217 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 1

honeyhiveai/realign

Realign is a testing and simulation framework for AI applications.

Language: Python - Size: 27.3 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 16 - Forks: 1

Praful932/llmsearch

Find better generation parameters for your LLM

Language: Python - Size: 5.04 MB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 27 - Forks: 1

iMeanAI/WebCanvas

All-in-one Web Agent framework for post-training. Start building with a few clicks!

Language: Python - Size: 5.84 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 239 - Forks: 17

IBM/cora

Improving score reliability of multiple choice benchmarks with consistency evaluation and altered answer choices.

Size: 25.4 KB - Last synced at: 4 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

mrigankpawagi/PropertyEval

PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing

Language: Python - Size: 6.44 MB - Last synced at: 28 days ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

F20CA-Health1/performance--benchmarking

Pipeline for performance benchmarking

Language: Python - Size: 3.39 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Re-Align/just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

Language: Python - Size: 17.7 MB - Last synced at: 18 days ago - Pushed at: over 1 year ago - Stars: 85 - Forks: 6

Yifan-Song793/GoodBadGreedy

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Language: Python - Size: 2.04 MB - Last synced at: 26 days ago - Pushed at: 10 months ago - Stars: 28 - Forks: 1

zhuohaoyu/KIEval

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Language: Python - Size: 10.6 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 36 - Forks: 2

kolenaIO/autoarena

Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation

Language: TypeScript - Size: 2.52 MB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 103 - Forks: 8
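
Head-to-head ranking tools commonly convert pairwise wins into ratings with an Elo-style update (autoarena's actual scheme may differ, e.g. Bradley-Terry). A minimal sketch of one Elo update:

```python
def elo_update(ra: float, rb: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update; score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ra_new = ra + k * (score_a - expected_a)
    rb_new = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return ra_new, rb_new

# Two equally rated models; A wins the head-to-head comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Because Elo is order-dependent, leaderboards usually shuffle match order or fit all comparisons at once (Bradley-Terry) for stability.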

Chainlit/literalai-cookbooks

Cookbooks and tutorials on Literal AI

Language: Jupyter Notebook - Size: 8.65 MB - Last synced at: 26 days ago - Pushed at: 6 months ago - Stars: 48 - Forks: 13

fuxiAIlab/CivAgent

CivAgent is an LLM-based Human-like Agent acting as a Digital Player within the Strategy Game Unciv.

Language: Python - Size: 53.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 31 - Forks: 4

shane-reaume/LLM-Finetuning-Sentiment-Analysis

A beginner-friendly project for fine-tuning, testing, and deploying language models for sentiment analysis with a strong emphasis on quality assurance and testing methodologies.

Language: HTML - Size: 603 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

proxectonos/simil-eval

Multilingual toolkit for evaluating LLMs using embeddings

Language: Python - Size: 89.8 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

adithya-s-k/indic_eval

A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks

Language: Python - Size: 555 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 33 - Forks: 7

gretelai/navigator-helpers 📦

Navigator Helpers

Language: Python - Size: 9.31 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

PacktPublishing/LLM-Engineers-Handbook

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

Language: Python - Size: 4.46 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2,838 - Forks: 572