GitHub topics: llm-evaluation
whitecircle-ai/circle-guard-bench
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Language: Python - Size: 19.9 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 32 - Forks: 1

PromptMixerDev/prompt-mixer-app-ce
A desktop application for comparing outputs from different Large Language Models (LLMs).
Language: TypeScript - Size: 2.35 MB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 59 - Forks: 6

NVIDIA/garak
The LLM vulnerability scanner
Language: Python - Size: 4.71 MB - Last synced at: about 9 hours ago - Pushed at: 1 day ago - Stars: 4,420 - Forks: 436

onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of foundation LLMs, aiming to explore the technical boundaries of generative AI.
Size: 12.6 MB - Last synced at: about 12 hours ago - Pushed at: 7 months ago - Stars: 526 - Forks: 44

Orion-zhen/llm-throughput-eval
Evaluate an LLM's generation speed via API
Language: Python - Size: 35.2 KB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 2 - Forks: 0

JonathanChavezTamales/LLMStats
A comprehensive set of LLM benchmark scores and provider prices.
Language: JavaScript - Size: 300 KB - Last synced at: about 8 hours ago - Pushed at: 5 days ago - Stars: 198 - Forks: 17

Rahul-Lashkari/LLM-Ecosystem-Enhancement
Executed Fine-tuning & Benchmarking, optimizing 12+ LLMs (Gemma-family, Mistral, LLaMA, etc) across 6+ datasets (GSM8K, BoolQ, IMDB, Alpaca-GPT4 & more). Delivered a research-level contribution—model training, evaluation, insights, DeepMind benchmark comparisons & documentation. Also crafting a custom dataset from open-sourced V0 system prompts.🛰
Size: 241 MB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 1 - Forks: 0

msoedov/agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Language: Python - Size: 21.2 MB - Last synced at: about 13 hours ago - Pushed at: 9 days ago - Stars: 1,350 - Forks: 211

nhsengland/evalsense
Tools for systematic large language model evaluations
Language: Python - Size: 992 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

harshtiwari01/llm-heatmap-visualizer
A set of scripts to generate full attention-head heatmaps for transformer-based LLMs
Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

Helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Language: TypeScript - Size: 463 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,740 - Forks: 370

alopatenko/LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
Language: HTML - Size: 4.14 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 116 - Forks: 8

Addepto/contextcheck
MIT-licensed framework for testing LLMs, RAG systems, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
Language: Python - Size: 464 KB - Last synced at: about 23 hours ago - Pushed at: 5 months ago - Stars: 67 - Forks: 9

mbayers6370/ALIGN-framework
Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.
Language: Python - Size: 15.6 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

aws-samples/tailoring-foundation-models-for-business-needs-guide-to-rag-fine-tuning-hybrid-approaches
A framework for evaluating various customization techniques for foundation models (FMs) using your own dataset. This includes approaches like RAG, fine-tuning, and a hybrid method that combines both
Language: Python - Size: 450 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

ChanLiang/CONNER
Implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators"
Language: Python - Size: 15.8 MB - Last synced at: about 11 hours ago - Pushed at: over 1 year ago - Stars: 31 - Forks: 2

Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
Language: Python - Size: 176 MB - Last synced at: 4 days ago - Pushed at: 10 days ago - Stars: 4,519 - Forks: 321

Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Language: Python - Size: 166 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,682 - Forks: 312

langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Language: TypeScript - Size: 20.1 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 11,141 - Forks: 1,000

cvs-health/uqlm
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
Language: Python - Size: 11.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 14 - Forks: 7

comet-ml/opik
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Language: Python - Size: 164 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 7,266 - Forks: 511

Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 301 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5,564 - Forks: 413

AI4Bharat/Anudesh-Frontend
Language: JavaScript - Size: 166 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 12 - Forks: 7

mtuann/llm-updated-papers
Papers related to Large Language Models in all top venues
Size: 690 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5 - Forks: 1

microsoft/prompty
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
Language: Python - Size: 5.47 MB - Last synced at: 4 days ago - Pushed at: 17 days ago - Stars: 871 - Forks: 85

confident-ai/deepeval
The LLM Evaluation Framework
Language: Python - Size: 78 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6,185 - Forks: 536

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 356 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6,411 - Forks: 521

rotationalio/parlance
An LLM evaluation tool that uses a model-to-model qualitative comparison metric.
Language: Python - Size: 5.08 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 1 day ago - Pushed at: 6 months ago - Stars: 239 - Forks: 40

sergeyklay/factly
CLI tool to evaluate LLM factuality on MMLU benchmark.
Language: Python - Size: 790 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2 - Forks: 0

cyberark/FuzzyAI
A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
Language: Jupyter Notebook - Size: 16.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 539 - Forks: 56

lmnr-ai/lmnr
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Language: TypeScript - Size: 30.7 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,953 - Forks: 117

kimtth/awesome-azure-openai-llm
A curated list of 🌌 Azure OpenAI, 🦙 Large Language Models (incl. RAG, Agent), and references with memos.
Language: Python - Size: 285 MB - Last synced at: 7 days ago - Pushed at: 10 days ago - Stars: 357 - Forks: 43

loganrjmurphy/LeanEuclid
LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.
Language: Lean - Size: 3.57 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 98 - Forks: 8

Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language: Python - Size: 70 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3,892 - Forks: 305

cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Language: Python - Size: 30.4 MB - Last synced at: 8 days ago - Pushed at: 16 days ago - Stars: 202 - Forks: 31

ValueByte-AI/Awesome-LLM-in-Social-Science
Awesome papers involving LLMs in Social Science.
Size: 133 KB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 434 - Forks: 30

ZeroSumEval/ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
Language: Python - Size: 10.4 MB - Last synced at: 5 days ago - Pushed at: 24 days ago - Stars: 32 - Forks: 4

ronniross/llm-heatmap-visualizer
A set of scripts to generate full attention-head heatmaps for transformer-based LLMs
Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

Fbxfax/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 96.7 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

kieranklaassen/leva
LLM Evaluation Framework for Rails apps to be used with production data.
Language: HTML - Size: 279 KB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 20 - Forks: 1

ronniross/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 0 Bytes - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-judge evals as Semantic Kernel Plugins
Language: C# - Size: 890 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 6 - Forks: 1

kraina-ai/geospatial-code-llms-dataset
Companion repository for "Evaluation of Code LLMs on Geospatial Code Generation" paper
Language: Python - Size: 6.64 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 1

Supahands/llm-comparison
An open-source project for comparing two LLMs head to head on a given prompt. It supports a wide range of models, from open-source Ollama models to the likes of OpenAI and Claude.
Language: TypeScript - Size: 888 KB - Last synced at: 3 days ago - Pushed at: about 2 months ago - Stars: 23 - Forks: 1

allenai/CommonGen-Eval
Evaluating LLMs with CommonGen-Lite
Language: Python - Size: 1.28 MB - Last synced at: about 12 hours ago - Pushed at: about 1 year ago - Stars: 90 - Forks: 3

cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created using Gemini 2.5 Pro through a technique called "vibe coding"
Language: HTML - Size: 8.65 MB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 8

equinor/promptly
A prompt collection for testing and evaluation of LLMs.
Language: Jupyter Notebook - Size: 1.75 MB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 17 - Forks: 1

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 277 - Forks: 17

Skripkon/llm_trainer
🤖 Train and evaluate LLMs with ease and fun 🦾
Language: Python - Size: 2.07 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 7 - Forks: 0

xhinini/LLM-Reasoning-Review
A curated collection of research papers on reasoning capabilities of Large Language Models (LLMs). This repository organizes and categorizes works that evaluate, benchmark, and analyze reasoning in LLMs, including methods, techniques, datasets, and survey papers.
Size: 26.4 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

fanny-jourdan/FairTranslate
Code for the paper "FairTranslate: an English-French Dataset for Gender Bias Evaluation in Machine Translation by Overcoming Gender Binarity" (accepted at FAccT 2025)
Language: Jupyter Notebook - Size: 1010 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

rhesis-ai/rhesis-sdk
Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.
Language: Python - Size: 420 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 17 - Forks: 0

azminewasi/Awesome-LLMs-ICLR-24
It is a comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
Size: 821 KB - Last synced at: 9 days ago - Pushed at: about 1 year ago - Stars: 61 - Forks: 3

VITA-Group/llm-kick
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The Truth Is Rarely Pure and Never Simple.
Language: Python - Size: 7.11 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 23 - Forks: 5

dannylee1020/openpo
Building synthetic data for preference tuning
Language: Python - Size: 10.7 MB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 27 - Forks: 0

villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
Language: Python - Size: 11.2 MB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 110 - Forks: 3

prompt-foundry/typescript-sdk
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
Language: TypeScript - Size: 20.9 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 6 - Forks: 1

prompt-foundry/python-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Python
Language: Python - Size: 20.7 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 7 - Forks: 0

SajiJohnMiranda/DoCoreAI
DoCoreAI is a next-gen open-source AI profiler that optimizes reasoning, creativity, precision and temperature in a single step—cutting token usage by 15-30% and lowering LLM API costs
Language: Python - Size: 1.88 MB - Last synced at: 21 days ago - Pushed at: 25 days ago - Stars: 43 - Forks: 1

alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
Language: Python - Size: 6.74 MB - Last synced at: 6 days ago - Pushed at: 12 days ago - Stars: 27 - Forks: 1

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.12 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 36 - Forks: 1

Node0/hypercortex
A TUI-based LM Swiss Army knife and analysis tool
Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

cburst/LLMscripting
This is a series of Python scripts for zero-shot and chain-of-thought LLM scripting
Language: Python - Size: 7.42 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 1 - Forks: 0

reddgr/chatbot-response-scoring-scbn-rqtl
Scoring LLM chatbot responses from LMSYS Chatbot Arena with SCBN and RQTL metrics, unwrapping Chatbot Arena prompts, quick chatbot setup on Jupyter notebook, and more... all things chatbots fit in this repo.
Language: Jupyter Notebook - Size: 24.3 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 1 - Forks: 0

simranjeet97/Learn_RAG_from_Scratch_LLM
Learn Retrieval-Augmented Generation (RAG) from Scratch using LLMs from Hugging Face and Langchain or Python
Language: Jupyter Notebook - Size: 425 KB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 5 - Forks: 3

PetroIvaniuk/llms-tools
A list of LLMs Tools & Projects
Size: 187 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 222 - Forks: 26

google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI Application Development. It provides a robust platform with a user-friendly UI for streamlining the process of building and assessing the performance of your LLM-powered applications.
Language: Vue - Size: 303 MB - Last synced at: 22 days ago - Pushed at: 27 days ago - Stars: 31 - Forks: 4

SapienzaNLP/ita-bench
A collection of Italian benchmarks for LLM evaluation
Language: Python - Size: 728 KB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 30 - Forks: 0

mts-ai/rurage
Language: Python - Size: 3.85 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 27 - Forks: 0

seanivore/non-bio-life
Create a space that (A) makes sure people think critically about AI and don’t gloss over the details, and (B) can somehow attract and be consumed by AI themselves, so that they may better consider their place in the world and recognize which thoughts are their own and which are not.
Language: CSS - Size: 25.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

EveripediaNetwork/grokit
Unofficial Grok Python SDK for any X Premium account
Language: Python - Size: 34.2 KB - Last synced at: 7 days ago - Pushed at: 7 months ago - Stars: 15 - Forks: 4

PAIR-Systems-Inc/little-dorrit-editor
Multimodal benchmark for evaluating handwritten editorial correction in printed text.
Language: Python - Size: 13.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 76 - Forks: 6

aigc-apps/PertEval
[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations
Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 28 days ago - Pushed at: 6 months ago - Stars: 12 - Forks: 2

Cybonto/OllaDeck
OllaDeck is a purple technology stack for Generative AI (text modality) cybersecurity. It provides a comprehensive set of tools for both blue team and red team operations in the context of text-based generative AI.
Language: Python - Size: 82.9 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 17 - Forks: 2

palico-ai/palico-ai
Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework
Language: TypeScript - Size: 13.7 MB - Last synced at: 4 days ago - Pushed at: 6 months ago - Stars: 339 - Forks: 27

CodeEval-Pro/CodeEval-Pro
Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"
Language: Python - Size: 4.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 26 - Forks: 2

open-compass/GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
Language: Python - Size: 9.98 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 82 - Forks: 7

kereva-dev/kereva-scanner
Code scanner to check for issues in prompts and LLM calls
Language: Python - Size: 7.12 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 29 - Forks: 2

ibra-kdbra/Echo_Assistant
Autonomous Agent Partner
Language: TypeScript - Size: 9.04 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

hkust-nlp/dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
Language: Jupyter Notebook - Size: 4.18 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 100 - Forks: 4

johnsonhk88/Deep-Research-With-Web-Scraping-by-LLM-And-AI-Agent
Use an LLM/AI agent for web scraping (data collection) and data analysis with deep research
Language: Jupyter Notebook - Size: 217 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 1

honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
Language: Python - Size: 27.3 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 16 - Forks: 1

Praful932/llmsearch
Find better generation parameters for your LLM
Language: Python - Size: 5.04 MB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 27 - Forks: 1

iMeanAI/WebCanvas
All-in-one Web Agent framework for post-training. Start building with a few clicks!
Language: Python - Size: 5.84 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 239 - Forks: 17

IBM/cora
Improving score reliability of multiple choice benchmarks with consistency evaluation and altered answer choices.
Size: 25.4 KB - Last synced at: 4 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

mrigankpawagi/PropertyEval
PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing
Language: Python - Size: 6.44 MB - Last synced at: 28 days ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

F20CA-Health1/performance--benchmarking
Pipeline for Performance Benchmarking
Language: Python - Size: 3.39 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced at: 18 days ago - Pushed at: over 1 year ago - Stars: 85 - Forks: 6

Yifan-Song793/GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Language: Python - Size: 2.04 MB - Last synced at: 26 days ago - Pushed at: 10 months ago - Stars: 28 - Forks: 1

zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Language: Python - Size: 10.6 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 36 - Forks: 2

kolenaIO/autoarena
Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation
Language: TypeScript - Size: 2.52 MB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 103 - Forks: 8

Chainlit/literalai-cookbooks
Cookbooks and tutorials on Literal AI
Language: Jupyter Notebook - Size: 8.65 MB - Last synced at: 26 days ago - Pushed at: 6 months ago - Stars: 48 - Forks: 13

fuxiAIlab/CivAgent
CivAgent is an LLM-based Human-like Agent acting as a Digital Player within the Strategy Game Unciv.
Language: Python - Size: 53.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 31 - Forks: 4

shane-reaume/LLM-Finetuning-Sentiment-Analysis
A beginner-friendly project for fine-tuning, testing, and deploying language models for sentiment analysis with a strong emphasis on quality assurance and testing methodologies.
Language: HTML - Size: 603 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

proxectonos/simil-eval
Multilingual toolkit for evaluating LLMs using embeddings
Language: Python - Size: 89.8 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

adithya-s-k/indic_eval
A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks
Language: Python - Size: 555 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 33 - Forks: 7

gretelai/navigator-helpers 📦
Navigator Helpers
Language: Python - Size: 9.31 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

PacktPublishing/LLM-Engineers-Handbook
The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
Language: Python - Size: 4.46 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2,838 - Forks: 572
