GitHub topics: llm-as-a-judge
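
A quick orientation before the list: every project under this topic implements some variant of the LLM-as-a-judge pattern, in which a judge model is prompted with an evaluation rubric plus a candidate response and returns a score or verdict. A minimal, generic sketch of that pattern is shown below; it uses the openai Python client purely for illustration, and the prompt, model name, and 1-5 scoring scale are assumptions for this sketch, not the API of any repository listed here.

    # Generic LLM-as-a-judge sketch (illustrative only; not tied to any repo on this page).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def judge(question: str, answer: str) -> str:
        """Ask a judge model to rate an answer on a 1-5 scale with a short rationale."""
        rubric = (
            "You are an impartial evaluator. Rate the answer to the question "
            "on a scale of 1 (poor) to 5 (excellent) for correctness and helpfulness. "
            "Reply with the score on the first line, then a one-sentence rationale."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable judge model could be substituted here
            temperature=0,        # deterministic judgments are easier to compare across runs
            messages=[
                {"role": "system", "content": rubric},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        return response.choices[0].message.content

    print(judge("What is 2 + 2?", "4"))
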
saichandrapandraju/vllm_judge
A lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.
Language: Python - Size: 738 KB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 0 - Forks: 0

haizelabs/verdict
Scale your LLM-as-a-judge.
Language: Jupyter Notebook - Size: 10 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 236 - Forks: 16

IAAR-Shanghai/xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Language: Python - Size: 826 KB - Last synced at: about 5 hours ago - Pushed at: about 2 months ago - Stars: 109 - Forks: 7

Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Language: Python - Size: 171 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2,795 - Forks: 330

trotacodigos/Rubric-MQM
Rubric-MQM: LLM-as-a-judge in machine translation (MT) for high-end models
Language: Python - Size: 2.06 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

root-signals/rs-python-sdk
Root Signals Python SDK
Language: Python - Size: 1.46 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 12 - Forks: 1

docling-project/docling-sdg
A set of tools to create synthetically generated data from documents
Language: Python - Size: 3.47 MB - Last synced at: about 13 hours ago - Pushed at: about 1 month ago - Stars: 15 - Forks: 5

prometheus-eval/prometheus-eval
Evaluate your LLM's responses with Prometheus and GPT-4 💯
Language: Python - Size: 15.1 MB - Last synced at: 14 days ago - Pushed at: about 2 months ago - Stars: 946 - Forks: 54

root-signals/root-signals-mcp
MCP for Root Signals Evaluation Platform
Language: Python - Size: 235 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 5 - Forks: 1

HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-judge evals as Semantic Kernel Plugins
Language: C# - Size: 949 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 6 - Forks: 1

IAAR-Shanghai/xFinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Language: Python - Size: 1.36 MB - Last synced at: 17 days ago - Pushed at: 3 months ago - Stars: 169 - Forks: 7

metauto-ai/agent-as-a-judge
⚖️ The First Coding Agent-as-a-Judge
Language: Python - Size: 18.5 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 495 - Forks: 76

whitecircle-ai/circle-guard-bench
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Language: Python - Size: 19.9 MB - Last synced at: 29 days ago - Pushed at: 30 days ago - Stars: 32 - Forks: 1

Largonarco/Eval
LLM system evaluations for a mock system
Language: Python - Size: 88.9 KB - Last synced at: 21 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Non-NeutralZero/LLM-EvalSys
Automated evaluation of LLM-generated responses on AWS
Language: Python - Size: 36.1 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

KID-22/LLM-IR-Bias-Fairness-Survey
Repository for the survey of bias and fairness in information retrieval (IR) with LLMs.
Size: 919 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 52 - Forks: 3

aws-samples/genai-system-evaluation
A set of examples demonstrating how to evaluate generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques.
Language: Jupyter Notebook - Size: 551 KB - Last synced at: 5 days ago - Pushed at: 9 months ago - Stars: 8 - Forks: 1

OussamaSghaier/CuREV
Harnessing Large Language Models for Curated Code Reviews
Language: Python - Size: 592 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 12 - Forks: 1

MJ-Bench/MJ-Bench
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Language: Jupyter Notebook - Size: 218 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 43 - Forks: 5

PKU-ONELab/Themis
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Language: Python - Size: 354 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 18 - Forks: 0

PKU-ONELab/LLM-evaluator-reliability
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
Language: Python - Size: 7.95 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 7 - Forks: 1

djokester/groqeval
Use Groq for evaluations
Language: Python - Size: 98.6 KB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 2 - Forks: 0

UMass-Meta-LLM-Eval/llm_eval
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
Language: Python - Size: 1.88 MB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 8 - Forks: 1

aws-samples/model-as-a-judge-eval
Notebooks for evaluating LLM-based applications using the model (LLM) as a judge pattern.
Language: Jupyter Notebook - Size: 53.7 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

martin-wey/CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences
Language: Python - Size: 12.4 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 54 - Forks: 2

rafaelsandroni/antibodies
Antibodies for LLM hallucinations (combining LLM-as-a-judge, NLI, and reward models)
Language: Python - Size: 3.91 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

minnesotanlp/cobbler
Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 1
