GitHub topics: llm-as-a-judge
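
A quick orientation before the list: every project under this topic implements some variant of the LLM-as-a-judge pattern, in which a judge model is prompted with an evaluation rubric plus a candidate response and returns a score or verdict. A minimal, generic sketch of that pattern is shown below; it uses the openai Python client purely for illustration, and the prompt, model name, and 1-5 scoring scale are assumptions for this sketch, not the API of any repository listed here.

    # Generic LLM-as-a-judge sketch (illustrative only; not tied to any repo on this page).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def judge(question: str, answer: str) -> str:
        """Ask a judge model to rate an answer on a 1-5 scale with a short rationale."""
        rubric = (
            "You are an impartial evaluator. Rate the answer to the question "
            "on a scale of 1 (poor) to 5 (excellent) for correctness and helpfulness. "
            "Reply with the score on the first line, then a one-sentence rationale."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable judge model could be substituted here
            temperature=0,        # deterministic judgments are easier to compare across runs
            messages=[
                {"role": "system", "content": rubric},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        return response.choices[0].message.content

    print(judge("What is 2 + 2?", "4"))
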
saichandrapandraju/vllm_judge
A lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.
Language: Python - Size: 738 KB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 0 - Forks: 0

haizelabs/verdict
Scale your LLM-as-a-judge.
Language: Jupyter Notebook - Size: 10 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 236 - Forks: 16

IAAR-Shanghai/xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Language: Python - Size: 826 KB - Last synced at: about 5 hours ago - Pushed at: about 2 months ago - Stars: 109 - Forks: 7

Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Language: Python - Size: 171 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2,795 - Forks: 330

trotacodigos/Rubric-MQM
Rubric-MQM: LLM-as-a-judge in machine translation (MT) for high-end models
Language: Python - Size: 2.06 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

root-signals/rs-python-sdk
Root Signals Python SDK
Language: Python - Size: 1.46 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 12 - Forks: 1

docling-project/docling-sdg
A set of tools to create synthetically generated data from documents
Language: Python - Size: 3.47 MB - Last synced at: about 13 hours ago - Pushed at: about 1 month ago - Stars: 15 - Forks: 5

prometheus-eval/prometheus-eval
Evaluate your LLM's responses with Prometheus and GPT-4 💯
Language: Python - Size: 15.1 MB - Last synced at: 14 days ago - Pushed at: about 2 months ago - Stars: 946 - Forks: 54

root-signals/root-signals-mcp
MCP for Root Signals Evaluation Platform
Language: Python - Size: 235 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 5 - Forks: 1

HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-judge evals as Semantic Kernel Plugins
Language: C# - Size: 949 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 6 - Forks: 1

IAAR-Shanghai/xFinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Language: Python - Size: 1.36 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 169 - Forks: 7

metauto-ai/agent-as-a-judge
⚖️ The First Coding Agent-as-a-Judge
Language: Python - Size: 18.5 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 495 - Forks: 76

whitecircle-ai/circle-guard-bench
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Language: Python - Size: 19.9 MB - Last synced at: 29 days ago - Pushed at: 30 days ago - Stars: 32 - Forks: 1

Largonarco/Eval
LLM system evaluations for a mock system
Language: Python - Size: 88.9 KB - Last synced at: 21 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Non-NeutralZero/LLM-EvalSys
Automated evaluation of LLM-generated responses on AWS
Language: Python - Size: 36.1 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

KID-22/LLM-IR-Bias-Fairness-Survey
Repository for the survey of bias and fairness in information retrieval (IR) with LLMs.
Size: 919 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 52 - Forks: 3

aws-samples/genai-system-evaluation
A set of examples demonstrating how to evaluate generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques.
Language: Jupyter Notebook - Size: 551 KB - Last synced at: 5 days ago - Pushed at: 9 months ago - Stars: 8 - Forks: 1

OussamaSghaier/CuREV
Harnessing Large Language Models for Curated Code Reviews
Language: Python - Size: 592 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 12 - Forks: 1

MJ-Bench/MJ-Bench
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Language: Jupyter Notebook - Size: 218 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 43 - Forks: 5

PKU-ONELab/Themis
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Language: Python - Size: 354 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 18 - Forks: 0

PKU-ONELab/LLM-evaluator-reliability
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
Language: Python - Size: 7.95 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 7 - Forks: 1

djokester/groqeval
Use Groq for evaluations
Language: Python - Size: 98.6 KB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 2 - Forks: 0

UMass-Meta-LLM-Eval/llm_eval
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
Language: Python - Size: 1.88 MB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 8 - Forks: 1

aws-samples/model-as-a-judge-eval
Notebooks for evaluating LLM-based applications using the model (LLM) as a judge pattern.
Language: Jupyter Notebook - Size: 53.7 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

martin-wey/CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences
Language: Python - Size: 12.4 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 54 - Forks: 2

rafaelsandroni/antibodies
Antibodies for LLM hallucinations (combining LLM-as-a-judge, NLI, and reward models)
Language: Python - Size: 3.91 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

minnesotanlp/cobbler
Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 1
