GitHub topics: llm-as-a-judge
IAAR-Shanghai/xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Language: Python - Size: 826 KB - Last synced at: about 11 hours ago - Pushed at: 4 days ago - Stars: 71 - Forks: 5

Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Language: TypeScript - Size: 163 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2,606 - Forks: 305

root-signals/root-signals-mcp
MCP for Root Signals Evaluation Platform
Language: Python - Size: 201 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 0

Non-NeutralZero/LLM-EvalSys
Automated evaluation of LLM-generated responses on AWS
Language: Python - Size: 36.1 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

docling-project/docling-sdg
A set of tools to create synthetically generated data from documents
Language: Python - Size: 3.29 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6 - Forks: 3

KID-22/LLM-IR-Bias-Fairness-Survey
The repository for the survey of bias and fairness in information retrieval (IR) with LLMs.
Size: 919 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 52 - Forks: 3

prometheus-eval/prometheus-eval
Evaluate your LLM's responses with Prometheus and GPT-4 💯
Language: Python - Size: 15 MB - Last synced at: 17 days ago - Pushed at: about 1 month ago - Stars: 898 - Forks: 55
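
Prometheus-style evaluation grades a single response against a user-supplied rubric and returns both written feedback and a 1-5 score. A minimal sketch of that absolute-grading pattern, using the OpenAI client as a stand-in judge (the prompt wording, model name, and score parsing are illustrative assumptions, not the prometheus-eval package's API):

```python
# Sketch of rubric-based absolute grading in the Prometheus style.
# Assumptions: OpenAI-compatible client; the model name and prompt
# wording are illustrative, not the prometheus-eval API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score 1: response is off-topic or factually wrong.
Score 3: response is partially correct but incomplete.
Score 5: response is correct, complete, and well explained."""

def absolute_grade(instruction: str, response: str) -> tuple[str, int]:
    """Return (feedback, score in 1-5) for a single response."""
    prompt = (
        "You are a strict evaluator. Grade the response against the rubric.\n"
        f"Instruction: {instruction}\nResponse: {response}\nRubric:\n{RUBRIC}\n"
        'Reply with feedback, then a final line "Score: <1-5>".'
    )
    out = client.chat.completions.create(
        model="gpt-4o",  # assumption: any strong judge model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    feedback, _, score_part = out.rpartition("Score:")
    return feedback.strip(), int(score_part.strip()[0])
```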

IAAR-Shanghai/xFinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Language: Python - Size: 1.36 MB - Last synced at: 15 days ago - Pushed at: about 2 months ago - Stars: 162 - Forks: 7

root-signals/rs-python-sdk
Root Signals Python SDK
Language: Python - Size: 331 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 11 - Forks: 0

OussamaSghaier/CuREV
Harnessing Large Language Models for Curated Code Reviews
Language: Python - Size: 592 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 12 - Forks: 1

metauto-ai/agent-as-a-judge
🤠 Agent-as-a-Judge and the DevAI dataset
Language: Python - Size: 5.49 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 331 - Forks: 44

MJ-Bench/MJ-Bench
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Language: Jupyter Notebook - Size: 218 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 43 - Forks: 5

PKU-ONELab/Themis
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Language: Python - Size: 354 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 18 - Forks: 0

PKU-ONELab/LLM-evaluator-reliability
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
Language: Python - Size: 7.95 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 1

djokester/groqeval
Use Groq for evaluations
Language: Python - Size: 98.6 KB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 2 - Forks: 0
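
Groq's Python SDK exposes an OpenAI-compatible chat endpoint, which is all a simple pairwise judge needs. A hedged sketch (the model id and prompt are assumptions; this is not groqeval's own interface):

```python
# Pairwise LLM-as-a-judge sketch using Groq's OpenAI-compatible SDK.
# Assumptions: the "llama-3.3-70b-versatile" model id and the prompt
# wording are illustrative; this is not groqeval's interface.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer is better; returns 'A', 'B', or 'TIE'."""
    prompt = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Which answer is better? Reply with exactly one token: A, B, or TIE."
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

Running the comparison a second time with A and B swapped is a common mitigation for the position bias that pairwise judges are known to exhibit.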

HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-a-judge evals as Semantic Kernel plugins
Language: C# - Size: 890 KB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 1

UMass-Meta-LLM-Eval/llm_eval
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
Language: Python - Size: 1.88 MB - Last synced at: 6 months ago - Pushed at: 7 months ago - Stars: 8 - Forks: 1

aws-samples/genai-system-evaluation
A set of examples demonstrating how to evaluate generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques
Language: Jupyter Notebook - Size: 548 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0
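
Evaluating a retrieval-augmented system typically pairs a classical IR metric with a judge pass over the generated answer. A minimal sketch of recall@k alongside a hypothetical judge hook (`judge_faithfulness` is a placeholder name for illustration, not an API from this repo):

```python
# Sketch: classical IR metric (recall@k) alongside an LLM-judge check,
# the two-sided evaluation style this repo's notebooks demonstrate.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k retrieved list."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def judge_faithfulness(answer: str, context: str) -> bool:
    """Hypothetical judge call: ask an LLM whether the answer is
    supported by the retrieved context and parse a yes/no verdict."""
    ...

# Example: retrieval quality reported alongside answer faithfulness.
retrieved = ["doc3", "doc7", "doc1", "doc9"]
relevant = {"doc1", "doc3", "doc5"}
print(recall_at_k(retrieved, relevant, k=3))  # 2 of 3 relevant -> ~0.67
```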

martin-wey/CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences
Language: Python - Size: 12.4 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 54 - Forks: 2

rafaelsandroni/antibodies
Antibodies for LLM hallucinations (combining LLM-as-a-judge, NLI, and reward models)
Language: Python - Size: 3.91 KB - Last synced at: 14 days ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

aws-samples/model-as-a-judge-eval
Notebooks for evaluating LLM-based applications using the model (LLM) as a judge pattern.
Language: Jupyter Notebook - Size: 45.9 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

minnesotanlp/cobbler
Code and data for the ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 13 - Forks: 1
